Frostflake - a Snowflake inspired high-performance unique identifier generator

hassila · November 17, 2022, 12:48pm

Just a micro framework for unique identifier generation that might be of use to someone else - inspired by Snowflake but with slightly different tradeoffs. Generates ~125M identifiers per second on an M1.

Usage

The generatorIdentifier must be unique among the generators in use at any given point in time. Either it is
set with a configuration file / persisted, or a global broker assigns it at runtime to the components
that need flake generators, such that the same identifier is never used concurrently.

import Frostflake

func testFrostflake() {
  let frostflakeFactory = Frostflake(generatorIdentifier: 1)
  let frostflake = frostflakeFactory.generate()
  let description = frostflake.frostflakeDescription()
  print(description)
}

There's also an optional shared class generator (which gives approx. 1/2 the performance):

Frostflake.setup(generatorIdentifier: 1)
let frostflake1 = Frostflake.generate()
let frostflake2 = Frostflake.generate()

Karl · November 17, 2022, 1:24pm

Looks interesting!

I'm slightly concerned about this though:

One key difference compared to Snowflake is that Frostflake uses a frozen point in time repeatedly until running out of generation identifier for it, which avoids getting the current time for every id generated - it will update that frozen time point for every 1K generated identifiers (by default)

If a single timestamp can be used for up to 1000 IDs, no matter how far apart they are in time, wouldn't that significantly increase the probability of a collision? Especially if the clocks are not very high precision.

I think the disclaimer could be clearer about that. For average developers who have never considered how unique IDs are generated, the higher performance may not be a good trade-off compared to the increased (and less predictable) risk of collisions. It is something that must be considered very carefully.

hassila · November 17, 2022, 1:36pm

A requirement is the assignment of a unique generatorIdentifier, which fundamentally namespaces the identifiers - so there is no risk of a collision between generators. The intended usage is e.g. a cluster system with a few hundred nodes, where such an identifier is assigned by a central authority at startup. This is mentioned in the doc, so as long as the generatorIdentifier is appropriately assigned, collisions shouldn't be an issue.
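To illustrate the namespacing argument, here is a minimal sketch. It is not Frostflake's actual implementation: `SketchFlake`, `SketchGenerator`, the injected `clock` closure and the refresh logic are all hypothetical names and assumptions made for the example. It only shows why reusing a frozen timestamp does not cause collisions as long as the sequence number keeps counting and each live generator has its own identifier.

```swift
// Illustrative only: every name here is hypothetical, not Frostflake's API.
struct SketchFlake: Hashable {
    let seconds: UInt32    // frozen timestamp
    let sequence: UInt32   // counter within that timestamp
    let generator: UInt16  // unique per live generator
}

struct SketchGenerator {
    let generator: UInt16
    let refreshInterval: Int   // ids handed out between clock reads
    let clock: () -> UInt32    // injected so the sketch is deterministic
    private var frozenSeconds: UInt32
    private var sequence: UInt32 = 0
    private var sinceRefresh = 0

    init(generator: UInt16, refreshInterval: Int = 1_000, clock: @escaping () -> UInt32) {
        self.generator = generator
        self.refreshInterval = refreshInterval
        self.clock = clock
        self.frozenSeconds = clock()
    }

    mutating func generate() -> SketchFlake {
        // Only consult the clock once per refreshInterval ids.
        if sinceRefresh == refreshInterval {
            sinceRefresh = 0
            let now = clock()
            // The sequence restarts only when the timestamp actually advances;
            // otherwise it keeps counting, so (seconds, sequence) never repeats.
            if now > frozenSeconds {
                frozenSeconds = now
                sequence = 0
            }
        }
        let flake = SketchFlake(seconds: frozenSeconds, sequence: sequence, generator: generator)
        sequence += 1
        sinceRefresh += 1
        return flake
    }
}
```

Two such generators reading the very same clock value still produce disjoint identifiers, because the generator field differs; within one generator, the sequence counter separates identifiers that share a timestamp.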

Karl · November 17, 2022, 2:14pm

I see. So when you say:

The generatorIdentifier must be unique among the generators in use at any given point in time. Either it is set with a configuration file / persisted, or a global broker assigns it at runtime to the components that need flake generators, such that the same identifier is never used concurrently.

By "unique" and "global", you mean across all workers generating IDs.

hassila · November 17, 2022, 2:19pm

Yep, exactly - at a given point in time (+- for clock sync) a given generatorIdentifier may only be used by one worker.

ktoso · November 25, 2022, 5:57am

Oh sweet! I'm very excited to see a Snowflake implementation in Swift.

I’ve added it to my backlog to give it a deeper look. Such libs are super important for distributed systems so it’s awesome seeing other folks invest there too already

ktoso · November 25, 2022, 6:01am

Quick question. It might be good to provide more details about bit usage; Snowflake uses a node id (or anything, 10 bits), some bits for the timestamp and more for the sequence number to avoid conflicts. Might be good to precisely explain how you allocate the bits in use in this lib.

hassila · November 25, 2022, 8:16am

Thanks @ktoso - I just updated the readme with some clarification:

Implementation notes

The Frostflake is a 64-bit value just like Snowflake, but the bit allocation differs a little bit.
Frostflake by default allocates 32 bits for the timestamp (~136 years span), 21 bits for the sequence number (allowing for up to 2,097,152 identifiers per second for a given generator) and 11 bits for the generator identifier (allowing for up to 2,048 unique workers/nodes in a system).
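As a rough illustration of the 32/21/11 split, the sketch below packs the three fields into a `UInt64` and unpacks them again. The field order (timestamp in the high bits, then sequence, then generator identifier in the low bits) is an assumption made for the example, as are the names `SketchLayout`, `pack` and `unpack`; the actual library may order the fields differently.

```swift
// Assumed layout, high to low bits: 32 timestamp | 21 sequence | 11 generator.
// Hypothetical helper for illustration, not part of Frostflake.
enum SketchLayout {
    static let sequenceBits: UInt64 = 21
    static let generatorBits: UInt64 = 11
    static let sequenceMask: UInt64 = (1 << sequenceBits) - 1    // 2^21 - 1
    static let generatorMask: UInt64 = (1 << generatorBits) - 1  // 2^11 - 1

    static func pack(seconds: UInt32, sequence: UInt32, generator: UInt16) -> UInt64 {
        // Reject values that do not fit in their field instead of silently truncating.
        precondition(UInt64(sequence) <= sequenceMask, "sequence needs more than 21 bits")
        precondition(UInt64(generator) <= generatorMask, "generator needs more than 11 bits")
        return (UInt64(seconds) << (sequenceBits + generatorBits))
            | (UInt64(sequence) << generatorBits)
            | UInt64(generator)
    }

    static func unpack(_ value: UInt64) -> (seconds: UInt32, sequence: UInt32, generator: UInt16) {
        let seconds = UInt32(value >> (sequenceBits + generatorBits))
        let sequence = UInt32((value >> generatorBits) & sequenceMask)
        let generator = UInt16(value & generatorMask)
        return (seconds, sequence, generator)
    }
}
```

The masks make the capacity figures explicit: `sequenceMask + 1` is 2,097,152 sequence numbers per timestamp and `generatorMask + 1` is 2,048 generator identifiers.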

A possible future direction would be to leave the allocation of bits between the sequence identifier and the generator identifier up to the user, to more easily allow for different use cases. As long as the bits are reallocated during a service window (which just needs to be longer than the clock difference between the two most out-of-sync nodes in the cluster), the timestamp portion will continue to ensure uniqueness.

The current bit allocation is just tuned for our use case (a large system for us would be < 200 nodes, so we allocated a factor of 10x there).

Currently we'll abort if generating more than 1 identifier per 477ns, i.e. more than 2,097,152 per second, which is completely unrealistic for how we use it. If you have a use case where that would be even remotely realistic, we'd recommend reallocating the bit assignment if possible, or allocating multiple generator identifiers and using a wrapper that cycles through them round-robin. Only an issue for synthetic tests for us at least, but YMMV.
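The 477ns figure follows from the sequence field: 1,000,000,000 ns / 2,097,152 ≈ 476.8 ns per identifier. A round-robin wrapper over N generator identifiers raises that ceiling to roughly N × 2,097,152 identifiers per second. The sketch below shows one way such a wrapper could look; `SketchIDSource`, `RoundRobinSource` and `CountingSource` are hypothetical names, and `CountingSource` is only a stand-in for a real per-identifier generator.

```swift
// Hypothetical abstraction over "something that hands out 64-bit ids".
protocol SketchIDSource {
    mutating func generate() -> UInt64
}

// Cycles through several sources, each assumed to own a distinct generator identifier.
struct RoundRobinSource<Source: SketchIDSource>: SketchIDSource {
    private var sources: [Source]
    private var next = 0

    init(_ sources: [Source]) {
        precondition(!sources.isEmpty, "need at least one source")
        self.sources = sources
    }

    mutating func generate() -> UInt64 {
        let id = sources[next].generate()
        next = (next + 1) % sources.count  // advance to the next source
        return id
    }
}

// Stand-in generator for the example: sequence above an 11-bit generator field.
struct CountingSource: SketchIDSource {
    let generator: UInt64
    var sequence: UInt64 = 0

    mutating func generate() -> UInt64 {
        defer { sequence += 1 }
        return (sequence << 11) | generator
    }
}
```

Each underlying source only sees every Nth request, so its own sequence space is consumed N times more slowly. Like a single generator, this wrapper is not safe to share between threads without external synchronization.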