Store your epoch times as 64-bit floats
Get quarter-microsecond granularity right now!
UNIX, since the 1970s, has had an internal notion of time that is the number of seconds after 1 Jan 1970 UTC.
This is often expressed as a signed integer. Many other APIs specify fractional time, also as integers: clock_getres expresses seconds and nanoseconds as a pair of integers, Java expresses time in milliseconds as a 64-bit integer, a JavaScript Date internally keeps track of milliseconds since 1970, PHP returns time in microseconds, and Ruby keeps Time as nanoseconds in arbitrary-precision integers.
Instead of inventing a complex data structure yourself, use one implemented in hardware: the 64-bit float!
The float64 format has a sign bit, 11 exponent bits (representing exponents from ≈-1000 to ≈1000), and 52 explicit mantissa bits (representing the mantissa to a relative precision of 2⁻⁵² ≈ 2×10⁻¹⁶), as visualized by User:Codekaizen, such that 1620620620 (a time in May 2021) is represented as 0b0100000111011000001001100010110101010011000000000000000000000000, or 0x41D8262D53000000.
The next larger floating-point number is 0x41D8262D53000001, or 1620620620 + 2⁻²²: a granularity of a quarter of a microsecond. Instead of juggling many different APIs to represent fractional time, keep time as a float64; it adequately represents time with granularity of well under a microsecond for the next several decades, and you can compute directly on this representation of epoch time.
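As a quick check, here is a minimal Python sketch (Python ≥ 3.9, for math.ulp) that reproduces the bit pattern and the quarter-microsecond granularity:

```python
import math
import struct

t = 1620620620.0  # epoch seconds in May 2021, as a float64
bits = struct.unpack(">Q", struct.pack(">d", t))[0]
print(hex(bits))    # 0x41d8262d53000000
print(math.ulp(t))  # ≈ 2.38e-07 s: quarter-microsecond granularity
```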
Y2038 non-problem
Part of the problem with storing time as a 32-bit signed integer number of seconds after 1 Jan 1970: after 19 Jan 2038, the count no longer fits in 32 bits!
When signed integers overflow, they roll over and turn negative. When floats outgrow their current precision, they merely become half as precise.
In 2038, float64s that represent time will degrade to a granularity of half a microsecond.
On 7 February 2106, when seconds after 1970 will exceed 2³², the floating point representation will have the precision of one microsecond, and maintain exactly the same bit structure.
At the extinction of the dinosaurs, 65 million years ago, when the epoch time was about negative two quadrillion (-2051244000000000 seconds for 65 Mya), the granularity is a quarter of a second.
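To make these numbers concrete, a small Python sketch that prints the granularity (the unit in the last place) of a float64 epoch time at each of the dates above:

```python
import math

# Granularity of float64 epoch times at the dates discussed above
for label, t in [
    ("May 2021", 1620620620.0),
    ("19 Jan 2038 (2**31 s)", 2.0**31),
    ("7 Feb 2106 (2**32 s)", 2.0**32),
    ("65 Mya (dinosaurs)", -2051244000000000.0),
]:
    print(f"{label:>22}: {math.ulp(t):.3g} s")
```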
Why am I not using float64 time already???
Even through the 90s, long after many system calls had been formalized, floating-point math was much more expensive than integer math. And while some of the earliest computers had floating-point support (C has a float and a double because it initially ran on a computer that did!), there was no standard for what you could expect from a “float” or a “double”: K&R C explicitly warns that a “double” could be 72 bits wide, and only in 1985 did a floating-point standard arrive that people could ask for by name (IEEE 754), by which point many system APIs had long since settled.
Computing with float64 time
Floating point, especially when you least expect it, can be surprising: 0.1 (as expressed in the base 2 of a float64) + 0.2 (as expressed in the base 2 of a float64) equals 0.30000000000000004 (in float64 representations, 0.1 is almost 2⁻⁵⁷ greater, and 0.2 almost 2⁻⁵⁶ greater, than its exact base-10 value).
For this reason, financial computations in floating point are strongly discouraged.
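A quick illustration in Python; Decimal(x) prints the exact base-10 value of the float64 that x actually stores:

```python
from decimal import Decimal

print(0.1 + 0.2)     # 0.30000000000000004
print(Decimal(0.1))  # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.2))  # 0.200000000000000011102230246251565404236316680908203125
```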
Time is not money!
Whereas money can be contractually expressed as hundredths or millionths of a base currency ($, €, et cetera), time is not exact! Facebook increased the accuracy of their computers’ time from milliseconds to within hundreds of microseconds and it was a big deal.
Whereas you can reasonably divide a financial sum 3 ways and need the parts to sum exactly to the whole, you will generally not be multiplying the time after 1970 by anything and making sense of the result, because 1970 is just an arbitrary zero point.
Generally, to compute durations, you will be subtracting one time from another. On computers whose system clocks get adjusted by multiple microseconds at a time, sub-microsecond precision is entirely sufficient.
Furthermore, float64s are entirely adequate for storing both the number of seconds after 1970 and the number of seconds in a particular duration, and as these numbers get smaller, the granularity gets finer: the granularity near one is about a billionth of the granularity near a billion, so continuing to compute in float64 is a great idea, no type conversions required.
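A small sketch of duration arithmetic in float64, showing that a subtracted duration is represented far more finely than the absolute times were:

```python
import math

start = 1620620620.0       # exactly representable: an integer number of seconds
end = start + 0.000250     # 250 µs later, rounded to the nearest 2**-22 s tick
duration = end - start     # subtracting two nearby float64s is exact
print(duration)            # ≈ 0.00025010 s, off only by the endpoint quantization
print(math.ulp(duration))  # ≈ 5.4e-20 s: small durations get very fine granularity
```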
Case study: 128-bit UUIDs
Time stored as a float64 makes a lot of sense, especially when used in a fixed-length id!
Simple: float64+random64
Let us say that you want (probably) unique ids, which you can sort lexicographically (run through sort) and get a rough ordering in time.
The big-endian representation of float64 supports this sort order: recall that 1620620620 (May 2021) in a float64 is 0x41D8262D53000000, and 0x41D8262D53000001 is 1620620620 + 2⁻²². All positive floats sort in ascending order byte-by-byte (negative floats sort in reverse, but every epoch time you mint today is positive).
When time is accurate to hundreds of microseconds, time storage at sub-microsecond precision is entirely adequate.
If you use all 128 bits of the UUID, disregarding UUID’s backwards compatibility built in for 1980s computers, you have 4M different float64s per second, and you have 64 full bits of randomness.
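A minimal sketch of this scheme in Python; the function name float64_random64_uuid is my own, and, as the text suggests, this deliberately ignores the RFC 4122 version and variant bits:

```python
import os
import struct
import time
import uuid

def float64_random64_uuid() -> uuid.UUID:
    """16 bytes: the big-endian float64 epoch time, then 8 random bytes."""
    return uuid.UUID(bytes=struct.pack(">d", time.time()) + os.urandom(8))

print(float64_random64_uuid())  # ids minted later sort lexicographically later
```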
Based on the math powering the Birthday Problem, you would need roughly 5 billion random 64-bit strings within a single quarter-microsecond timeslice to reach a 50% chance that two of them are equal.
If you are okay with a quarter of a percent chance of any of these float64+random64 UUIDs colliding in twenty years, then the probability of collision per timeslice needs to be about one in a quintillion, 10⁻¹⁸: (1-10⁻¹⁸)^(4000000 * 86400 * 365 * 20) ≈ 99.75%. That is, the odds of not colliding per timeslice, 1-10⁻¹⁸, multiplied together over the timeslices in a second, the seconds in a day, and the days in twenty years.
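Checking the arithmetic in Python (note that 1-10⁻¹⁸ rounds to exactly 1.0 in float64, a nice illustration of the precision limits above, so we use the equivalent exp(-p·n) form):

```python
import math

p_tick = 1e-18                        # allowed collision probability per timeslice
ticks = 4_000_000 * 86400 * 365 * 20  # quarter-microsecond ticks in twenty years
print(math.exp(-p_tick * ticks))      # ≈ 0.99748: about a 0.25% chance of collision
# Birthday bound: ids per tick, with 64 random bits, that keep p under p_tick
print(math.sqrt(2 * 2**64 * p_tick))  # ≈ 6.07 ids per timeslice
```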
If you are making 6 of these UUIDs every quarter-microsecond, the space to store the ids alone is 16 bytes/id * 6 ids/tick * 4M ticks/s * 86400 s/day * 30 day/month ≈ one petabyte per month, just for UUIDs.
If these UUIDs are connected to event data, and your event data is at least 10x the size of the id of the event, that is over 2PB/week.
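These back-of-the-envelope figures check out in a couple of lines (a sketch, using the same constants as above):

```python
id_bytes_per_month = 16 * 6 * 4_000_000 * 86400 * 30  # bytes/id * ids/tick * ...
print(id_bytes_per_month / 1e15)         # ≈ 1.0 PB/month for ids alone
week = id_bytes_per_month * 11 / 30 * 7  # ids plus 10x event data, per week
print(week / 1e15)                       # ≈ 2.55: over 2PB/week
```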
Most use cases do not have 2PB/week of new data! This float64+random64 scheme is entirely sufficient to identify most types of events as they happen, with a very low chance of collision.
Fancier: float56 + random72
The float64 corresponding to the current epoch time will have its highest-order byte equal to 0x41, from 2 Jan 1970 until 16 Mar 2242. If we only store the lower 56 bits, we can have 8 more bits of randomness per timeslice.
The number of random72s that we can make every quarter-microsecond tick while keeping the odds of collision per tick at 10⁻¹⁸ is 97: √(2 × 2⁷² × -ln(1-10⁻¹⁸)) ≈ 97.
This is sixteen times as many as the float64+random64 scheme allows, so it corresponds to at least 30PB/week of event data. That is over an exabyte a year, well over $20M in storage costs alone.
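A sketch of this fancier scheme, again with a hypothetical function name, dropping the constant 0x41 top byte and adding a ninth random byte:

```python
import os
import struct
import time
import uuid

def float56_random72_uuid() -> uuid.UUID:
    """16 bytes: the low 7 bytes of the float64 epoch time, then 9 random bytes."""
    t8 = struct.pack(">d", time.time())
    assert t8[0] == 0x41  # holds from 2 Jan 1970 until 16 Mar 2242
    return uuid.UUID(bytes=t8[1:] + os.urandom(9))

print(float56_random72_uuid())
```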
Bonus: visualization strategies!
Kudos to Evan Wallace’s Float Toy for visualizations of the binary float16/float32/float64 formats! Kudos to Bartek Szopka’s ieee-754-visualization for a slightly more math-oriented approach!
Store your epoch times as 64-bit floats
Computing with a float64 is cheap; you get sub-microsecond precision nowadays; you don’t need to pre-coordinate about milliseconds versus microseconds versus (second, nanosecond) pairs, et cetera; as long as you’re not counting individual nanoseconds, you should be great.
Also obviously store your human times as ISO 8601 strings (among many other reasons: the list of time zones is unbounded).