Diffusion Local Time is a functional timepiece that explores surrealism in generative AI. It was on display in the Art Hack Day exhibition in San Francisco.
In this installation, the underlying technology and the politics remained the same (a Latent Consistency Model ControlNet running locally), but the local computer was upgraded to a high-end desktop, for sub-minute render latency. With the higher compute budget, the resolution increased to 1440x810. For a more cohesive visual, the random seed used to generate images changed only once an hour, so most of the digits kept the same boulders and sky halos from minute to minute. Further, the number of clock faces was reduced to one, the desert night, to avoid large luminance shifts from afternoon cliffs to desert night to desert sunrise.
The reception was widely positive: visitors remarked on how calming it was (under a thousandth the framerate of other video art) and how beautiful the images were, and wondered why there were so many Milky Ways.
Diffusion Local Time is a timepiece with a Generative AI display. A Raspberry Pi locally generates and displays pareidolic clock faces every several minutes, using open-source code, open-source typography, and freely-available models.
The clock faces are generated from four text prompts that default to California landscapes, and the prompts are easily field-serviceable for new clock faces, like “kittens in the park”, for a 3PM viewing.
Using a latent consistency model derived from Stable Diffusion 1.5 and the Monster Labs QR Code Monster ControlNet, a Raspberry Pi 4 can generate a 480x360 image in 9.5 minutes, suitable for a residential display. The newer Raspberry Pi 5 can generate a 480x360 image in under 6 minutes. On a Mac Studio with an M1 Ultra chip, the ControlNet pipeline takes 1.1 seconds on the GPU.
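For reference, a minimal sketch of this pipeline with Huggingface Diffusers, assuming the LCM-LoRA distillation of Stable Diffusion 1.5 and a pre-rendered image of the clock digits as the control image (the model ids are the public ones; the control image path and prompt are illustrative):

```python
import torch
from diffusers import ControlNetModel, LCMScheduler, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# QR Code Monster ControlNet on top of Stable Diffusion 1.5, in float32 for CPU.
controlnet = ControlNetModel.from_pretrained(
    "monster-labs/control_v1p_sd15_qrcode_monster", torch_dtype=torch.float32
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float32
)
# Latent consistency distillation: very few steps, low guidance.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

control = load_image("digits-0830.png")  # hypothetical pre-rendered clock digits
image = pipe(
    "a desert night in California, boulders, the Milky Way",
    image=control,
    num_inference_steps=4,
    guidance_scale=1.0,
    controlnet_conditioning_scale=1.2,  # see the legibility discussion below
).images[0]
image.save("clock.png")
```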
Diffusion Local Time was designed to use a greyscale 4:3 e-ink HDMI monitor, and can easily be adapted to other dimensions and other display technologies.
While the defaults in software and hardware for Diffusion Local Time prioritize fast and cheap at the expense of image quality, 22.1 seconds of runtime on the GPU of an M1 Ultra yields a much more detailed and beautiful result:
The Raspberry Pi is a well-tested deployment target, and 8 GB of memory is enough to run the Huggingface Diffusers code unchanged.1 But math takes time, and adding up smaller numbers is faster. A Raspberry Pi has no GPU to run 16-bit floating point math, so the easiest way to reduce precision is quantized 8-bit integer math. That is 4x less data to read from main memory into each layer: on the Pi 4 this gave a 10% speedup, 480 to 430 seconds, even with the overhead of dequantizing back into floating point, but on the Pi 5 it led to no time savings, potentially because of improved memory bandwidth, so it is off by default.
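A sketch of one way to do this with PyTorch's dynamic quantization, which stores linear-layer weights as int8 and dequantizes them on the fly (whether this matches the timepiece's exact quantization path is an assumption):

```python
import torch

# Quantize the UNet's linear layers to int8: weights are read from memory
# at a quarter the size, then dequantized to float for each matmul.
pipe.unet = torch.ao.quantization.quantize_dynamic(
    pipe.unet, {torch.nn.Linear}, dtype=torch.qint8
)
```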
The image on the bottom is in 8-bit precision, while the one on top is in full 32-bit precision.
This is an especially egregious example of a common phenomenon, quantization or not. Not all of the results are physically plausible: water reflections often obey the control image over optics, sea foam and surf often obey the control image over wave dynamics, rocks often obey the control image over gravity. As with ChatGPT and all current language models, the output of these generative models is something that seems statistically plausible without actual fact, and we do the work of upholding both sides of the conversation in our dialogue.
This is challenging as an artistic project, because the tuning knob for how much the control image affects image synthesis is a real-valued number, not a measure of the legibility of the control image: a starry image of a lake on a hot desert night tends to have more contrast than redwoods, so the conditioning scale differs, whereas the artistic intent is equivalent legibility among all clock faces.
The images on the bottom come from just 1GB of statistical models.
These models are powerful. The imminent danger is how they can facilitate generating misinformation and erode the idea of consensus reality, which damages our collective ability to fight many of our modern problems.2 An urgent danger is their capacity to supplant paid stock art, built on the work of photographers and artists, often the very people who currently rely on sales of that work to sustain themselves. Another urgent danger is the many biases and problems in the training data. Society is generally unprepared for the effects of this technology.
This artwork tries to grapple with these problems. There is minimal harm in a landscape altered with pillars and cairns and seafoam and extra Milky Ways. There are few pareidolic artists to displace, and every image that Diffusion Local Time generates starts with a different random seed, based on creation time. The biases of mainstream photography of the American Southwest are unfortunately replicated in this work (for example, these landscapes are gorgeous for forest bathing, but people have lived there for millennia, and these landscapes rarely include people), and counteracting this is an ongoing effort.
The tooling that made this artwork is powerful, and part of the point of this work is to raise awareness of its potential for creative use and malignant abuse. The “local” in Diffusion Local Time, especially on a Raspberry Pi, was a choice to indicate the broad distribution of this power: not only mediated by paid internet services, but local, unmoderated, even at sub-$100 price points. This power is widely available, in such abundance that it is usable for artistic purposes, and we deserve to be appropriately cautious.
In a world where most people’s primary interaction with a timepiece is through a mobile phone, which defaults to a 3 or 4 digit time display (8:30 instead of 🕣), a numerical display of time is common, but also ridiculous, in exactly the sense of “ape-descended life forms … so amazingly primitive that they still think digital wristwatches are a pretty neat idea”: to use the explanation of Douglas Adams himself, in rejecting an editorial change from digital wristwatches to cellular phones:
there is something inherently ridiculous about digital watches, and not about cellular phones. Now this is obviously a matter of opinion, but I think it’s worth explaining. Digital watches came along at a time that, in other areas, we were trying to find ways of translating purely numeric data into graphic form so that the information leapt easily to the eye. For instance, we noticed that pie charts and bar graphs often told us more about the relationships between things than tables of numbers did. So we worked hard to make our computers capable of translating numbers into graphic displays. At the same time, we each had the world’s most perfect pie chart machines strapped to our wrists, which we could read at a glance, and we suddenly got terribly excited at the idea of translating them back into numeric data, simply because we suddenly had the technology to do it. So digital watches were mere technological toys rather than significant improvements on anything that went before. I don’t happen to think that’s true of cellular comms technology. So that’s why I think that digital watches (which people still do wear) are inherently ridiculous, whereas cell phones are steps along the way to more universal communications. They may seem clumsy and old-fashioned in twenty years time because they will have been replaced by far more sophisticated pieces of technology that can do the job better, but they will not, I think, seem inherently ridiculous. 3
Let me know what you think! leebutterman@gmail.com
By replacing the new default scaled dot product attention with the existing Sliced Attention Processor, runtime goes down from 9.5 minutes to 6 minutes. Interestingly, the default attention processor changed during development of this timepiece, which surfaced this regression on this relatively rare deployment platform for large attention-based models. ↩
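In current Diffusers, opting back into sliced attention is one call, reusing the pipe object from the sketches above:

```python
# Re-enable the sliced attention processor, which was faster on the Pi.
pipe.enable_attention_slicing()
```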
If we cannot come together for a problem that is recent, and has impact within days, and has a clearly known inexpensive solution (the US government spent $32B in the three decades before the pandemic through March 2022, about the cost of a low-end phone for each adult in the United States), how will we come together for a problem that is hundreds of years old, impacts systems with huge inertia, and has many unknown solutions that will be extremely expensive for each of us individually? ↩
That quote was from 1992, and in under twenty years there was an iPhone with an app store in roughly its modern shape, while many fewer people (especially as a percentage!) wear a digital wristwatch whose primary function is to keep time. ↩
QR Code ControlNet + Stable Diffusion, a hyperlegible font from the Braille Institute, a g5.xlarge that generates an image in 5.7 seconds, and some prompts.
A few months ago nhciao posted several pieces of art that scanned as QR codes.
Recently, Angry Penguin made a Huggingface space that implemented that workflow for arbitrary prompts and images, with fully available source.
The luminance controlnet model and the upscaler are both publicly available on Huggingface, so it is possible to run those locally with only the cost of keeping the box running.
The numbers on the clock face need to be legible, and typographers have investigated this exact problem! This is the Atkinson Hyperlegible font, which was designed for letterform visibility, and its numbers are very legible as well. Even at the 1024x1024 final resolution the forms stay legible, filtered through the various prompts.
The relationship that one has with a clock depends on how fast it goes. Running this ControlNet on a Mac Studio in CPU mode (I could not get MPS to be stable) takes a few minutes (at residential energy prices). Running on a g5.xlarge on AWS takes 5.78 seconds in float16 precision (at $1/hr). This latency is low enough to generate several types of images for a few instances of each clock face per minute.
detail of a new hieronymus bosch painting
Hieronymus Bosch artwork can be blobby and weird with light figures on a dark background, and the images often fit the control image and are similarly spooky at first glance.
beautiful detail of a train map
The train maps can be fascinating to look at, sometimes installed in an actual train, sometimes a low depth-of-field close up, always nonsensical after thorough observation. Need to work on this more. (Very open to suggestions!)
watercolor of a leafy pedestrian mall at golden hour with multiracial genderqueer joggers and bicyclists and wheelchair users talking and laughing
This comes from the sort of city I want to live in: leafy, close-knit, welcoming, without danger from wheeled vehicles.
There are current living artists who are making art in these styles and they would love you to support their work and their vision! (Send me more suggestions!)
All of the pieces come together in https://github.com/lsb/stable-diffusion-clock. The viewer could probably be improved from a static HTML page that refreshes itself every few seconds. The images are mostly deterministic: their seed comes from the epoch time at generation.
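A sketch of that determinism, assuming the seed is simply the epoch time at generation, and reusing the pipe, prompt, and control image from the sketches above:

```python
import time
import torch

# Derive the seed from the creation time, so re-renders are reproducible.
seed = int(time.time())
generator = torch.Generator().manual_seed(seed)
image = pipe(prompt, image=control, generator=generator).images[0]
```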
Apologies to people using this as a clock who are not also in Pacific time.
Some open questions I’ve been thinking of:
Tell me what you think! leebutterman@gmail.com
This is a browser-based search engine for Wikipedia, where you can search for “the reddish tall trees on the san francisco coast” and find results like “Sequoia sempervirens” (a name of a redwood tree). The browser downloads the database, and search happens offline. The database of two million Wikipedia pages with their titles is roughly a 100MB download, and the final results appear in under 50 milliseconds. This uses sentence transformers to embed documents, product quantization to compress embeddings, pq.js to run distance computation in the browser, and transformers.js to run sentence transformers in the browser for queries.
Yes.
Search over millions of documents happens in real time, completely offline. Results stream back every 10ms on a mobile device, and search results update gradually as the database is sequentially scanned.
The distance computation over 2M embeddings takes 250ms in total, over 20 iterations, and a faceted top-10 computation takes 8ms. To display intermediate results, we run batches of 100k distance computations at a time, then compute the top-k and repaint after a (30ms) timer runs out.
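The browser code is JavaScript, but the scan structure is easy to sketch in numpy: score one batch, fold it into a running top-k, and hand control back to the UI between batches (the distances() helper is the product-quantization sketch further down):

```python
import numpy as np

def streaming_topk(codes, table, k=10, batch=100_000):
    """Scan PQ codes in batches, yielding the running top-k after each batch."""
    best_d = np.full(k, np.inf)
    best_i = np.full(k, -1)
    for start in range(0, len(codes), batch):
        d = distances(codes[start:start + batch], table)
        cand_d = np.concatenate([best_d, d])
        cand_i = np.concatenate([best_i, np.arange(start, start + len(d))])
        keep = np.argpartition(cand_d, k)[:k]  # cheap partial sort
        best_d, best_i = cand_d[keep], cand_i[keep]
        yield best_i[np.argsort(best_d)]  # intermediate result; the UI repaints here
```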
We order embeddings by compressed page size: more information-dense pages are the first to be analyzed and returned in a top-10 ranking, and might be more useful in a search result. Note that the search results continue to stream in and update the top results, but most of the lower-page-size pages do not rank in the top 10, so the search appears faster than if we did not update the UI until everything returned.
70% of the final search results were in the first 670K embeddings, which in total rendered in 116 milliseconds (note the topk timing at the bottom left, which counts distance calculations as positive times and topk calculations as negative times):
Note that changing the facet for the onomatopoeia search (changing the first letter of the page to return) avoided running a new embedding, and returned in under 25ms. Changing the number of results from top 10 to top 20 or top 100 is similarly instantaneous.
The database is small enough to support casual use cases of up to a million embeddings without special treatment.
Note that, for high performance, we use Arrow instead of JSON. Arrow can store our 8-bit integer product quantization arrays compactly, and Arrow can store an array of strings as an array of indexes into one buffer, which is a significant savings over a million Javascript string objects.
There is no GPU acceleration, only WebAssembly, so far. ONNX is a convenient compile target. WebGPU is still very new, and is an eagerly-anticipated future direction.
There are a lot of sentence transformers to choose from! There is a leaderboard of sentence embeddings: https://huggingface.co/blog/mteb
The all-MiniLM-L6-v2 model has reasonable performance (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) and is small and available in ONNX weights (https://huggingface.co/Xenova/all-MiniLM-L6-v2/) for transformers.js (https://github.com/xenova/transformers.js).
6M pages * 384-dimension embeddings * 32-bit floats is over 9GB. Even a million embeddings in float16 precision is 800MB. This is too large for casual usage.
As a first approximation, to choose the top million, one approach is to choose the pages with the most information: compress each page and count the bytes that come out. Lists will be overrepresented (lists are less compressible than general text), and there is no appreciation of the link structure of webpages, but it is cheap to compute and easy to start with.
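A sketch of that heuristic, with gzip standing in for whatever compressor one prefers (the pages structure is hypothetical):

```python
import gzip

def information_density(text: str) -> int:
    # Compressed byte count as a cheap proxy for information content.
    return len(gzip.compress(text.encode("utf-8")))

# pages: hypothetical list of (title, text) tuples for all 6M articles.
top_million = sorted(pages, key=lambda p: information_density(p[1]), reverse=True)[:1_000_000]
```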
FAISS (https://faiss.ai) is a highly popular embedding search engine serverside, with a lot of tuning knobs for creating different styles of search indices. Autofaiss (https://github.com/criteo/autofaiss) will usually recommend using Product Quantization, after creating IVF indices or HNSW indices (Pinecone has a great intro to vector indexing https://www.pinecone.io/learn/vector-indexes/).
Product quantization is exceptionally simple to implement: creating a ‘distance table’ is under 5 lines of numpy and using that to find distances is a one-liner.
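A sketch of both pieces, assuming 48 subquantizers with 256 centroids each (matching the 48-dimensional codes mentioned later):

```python
import numpy as np

def distance_table(query, codebook):
    # query: (48, subdim), the query split into subvectors;
    # codebook: (48, 256, subdim). Squared L2 distance to every centroid.
    return ((codebook - query[:, None, :]) ** 2).sum(axis=-1)  # (48, 256)

def distances(codes, table):
    # codes: (n_docs, 48) uint8. Look up and sum each document's 48 table entries.
    return table[np.arange(codes.shape[1]), codes].sum(axis=1)  # (n_docs,)
```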
Oftentimes you will want to search in some product subcategories, like finding only PDFs in a web search, or results in ancient Latin. Splitting the distance computation from the top-10 ranking allows us to fudge the distances in flight before ranking. For million-scale search, this is highly feasible. In this search of Wikipedia, there is one search facet: the first character of the page. Because the top-k ranking is separate from the distance computation, we can avoid recomputing query embeddings and distances to explore different facet values in real time.
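A sketch of that fudging, masking out pages whose first letter does not match before the top-k pass, so distances never need recomputing:

```python
import numpy as np

def faceted_topk(dists, first_letters, facet, k=10):
    # Push non-matching pages to infinity before ranking.
    masked = np.where(first_letters == facet, dists, np.inf)
    idx = np.argpartition(masked, k)[:k]
    return idx[np.argsort(masked[idx])]
```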
ONNX has a specific opcode that does exactly the product quantization step! That opcode is GatherElements. Unfortunately, the PyTorch ONNX export does not use this special opcode for the model as written. Thankfully, there is abundant support for reading and writing ONNX outside of a PyTorch compilation step.
A useful graphical editing tool for ONNX is ONNX-modifier, at https://github.com/ZhangGe6/onnx-modifier , which presents a friendly interface to add elements into the dataflow graph of any exported ONNX model.
By taking the multiple ONNX operations that the PyTorch model compiles into, and replacing them with this single opcode, distance computation is roughly 4x faster.
As mentioned, the Arrow format is much more compact, in memory and on disk, for storing the embeddings and the metadata (page titles).
Because the Arrow array format only stores one-dimensional data, and we have 48 dimensions of embedding data that we do not want to wrap in another data format, we need two separate schemas: one for the metadata (a hundred thousand rows each), and one for the embeddings (a hundred thousand × 48 rows each), and we reshape the embeddings at load time.
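A sketch with pyarrow, assuming uint8 PQ codes and 48 subquantizers (the field names are illustrative):

```python
import numpy as np
import pyarrow as pa

titles = pa.table({"title": ["Sequoia sempervirens", "Onomatopoeia"]})  # metadata schema
flat_codes = np.random.randint(0, 256, size=100_000 * 48, dtype=np.uint8)
codes = pa.table({"code": flat_codes})  # embeddings schema, one row per code

# At load time, reshape the flat column back into (n_docs, 48).
pq_codes = codes["code"].to_numpy().reshape(-1, 48)
```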
Storing the product quantization codebook in JSON is under 1.0MB, so it is less crucial to optimize this part.
Lots of the library functions in the full Wikipedia search app should migrate into reusable pq.js components. A lot of the ONNX shapes are pre-baked, so it would be useful to support different quantization levels and different embedding dimensions. Give a shout!
Try it out at https://huggingface.co/lsb/wav2vec2-base-pemlsb-la!
This is the first Latin speech recognition system! It is powered by a new voice dataset of 88.3 hours of Latin speech, mostly synthetically generated from Poeta Ex Machina. The self-supervision of using speech synthesis to train speech recognition (like SynthASR) offers a few exciting new directions (like examining the inductive biases of neural language models with artificial language).
Modern deep neural network statistical modeling relies on fewer hand-engineered features and larger piles of data. Self-supervised learning is increasingly useful in many applications, where a task can be framed as learning mechanically-generated labels. These labels are usually generated at much lower cost and much greater scale than human-generated labels. Self-supervised learning often amounts to learning the inverse of a mechanical process: image recoloring for black-and-white photographs is learned as the inverse of stripping images of their color. Super-resolution is learned as the inverse of downsampling images. Language modeling is learned as the inverse of deleting a word in a sequence (at the end (‘causal’) or in the middle (‘masked’)). A self-supervised speech recognition approach would be to start with a pile of text, generate synthetic speech, and learn to recognize human speech based on that synthetic speech, similar to SynthASR.
Many speech recognition systems rely on meticulously labeled sound files, with accurate timing data for each letter. The relatively-new wav2vec uses (text, sound) pairs without timing labels, which allows it to consume much more data (per $ of acquired sample data, like Common Voice). However, spoken Latin is rare, and much more challenging1 to acquire than (say) Spanish or Japanese, so this self-supervised approach is crucial.
So: can Latin speech recognition learn from Latin speech synthesis? We can first create a dataset of Latin text, and then we can create a dataset of that text synthesized into speech, and we can try.
High-quality synthetic Latin speech in a classical pronunciation comes from Poeta ex Machina. Poeta ex Machina has a full database of scansions of single words, and we can use it to synthesize lines of (for example) Vergil for a multi-word corpus; Vergil’s extant works are all in dactylic hexameter, comprising over 21 hours of text.
ancient-latin-passages
The ancient-latin-passages dataset is a compendium of 19MB of Latin text, written roughly between 50BC and 150AD, from a wide variety of Classical authors on the Latin Library. This dataset was used to create poetaexmachina-mp3-recitations, and we can synthesize much more poetry and add to that dataset. This is publicly available at https://huggingface.co/datasets/lsb/ancient-latin-passages .
poetaexmachina-mp3-recitations
All of poetaexmachina-mp3-recitations is divided into three parts: the 1-grams, individual words from Poeta ex Machina’s internal database of word scansions, comprising 66.9 hours of recited speech; the lines of dactylic hexameter, all from Vergil, comprising 21.4 hours of recited speech; and recitations from yours truly of Cicero and Catullus, comprising half a minute of recited speech. This is publicly available at https://github.com/lsb/poetaexmachina-mp3-recitations , with one recitation per text file + mp3 file.
In contrast to older speech recognition systems that require speech waveforms expensively annotated with timing data per letter, wav2vec2 is designed to learn timing data from unannotated pairs of an entire waveform and an entire text (usually under 10 seconds of audio).
The community and infrastructure around wav2vec2 mean that there are many wav2vec2 models trained on various modern languages. We can take a large pre-trained model whose training data is close to the target data distribution, and use it as a foundational starting point, instead of starting training from scratch. Poeta ex Machina uses an Italian voice, partly for its phonetic inventory (English, for instance, does not have sufficient phonetic inventory: we believe that ancient Latin trilled or flapped its Rs (medi(us)-dies = meridies, like British English rhyming edible with terrible)), partly for sentimental/aesthetic reasons (would Spanish work? Russian? Xhosa?). For similar phonetic and sentimental reasons, and availability, we use a wav2vec2 model trained on the Italian dataset of Vox Populi, and fine-tune from there. Informal test results found that the word error rate improved faster when fine-tuning from this Italian-trained model, compared to the English-trained model. An obvious future direction is starting from other initial monolingual models, or multilingual models. Another obvious future direction is upgrading from the 5-gram post-processing model to other text models (transformers? sub-word tokenization strategies?).
We can make our prediction task by normalizing the orthography of the Latin text: stripping punctuation and macrons, normalizing letters invented after 500AD (“j”, “v”, “w”) by substituting “j” with “i” and “v” with “u”, and using only lower case.
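A sketch of that normalization (the post does not spell out a “w” substitution, so “w” is left alone here):

```python
import re
import unicodedata

def normalize_latin(text: str) -> str:
    # Lowercase, strip macrons (combining marks), fold post-classical letters.
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.replace("j", "i").replace("v", "u")
    return re.sub(r"[^a-z ]", "", text)  # strip punctuation

assert normalize_latin("Jūlius veniet.") == "iulius ueniet"
```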
Wav2vec2 uses Connectionist Temporal Classification to infer its transcription: at each 20ms timestep we predict a token, either a letter or a special character like a break, and merge identical predictions between breaks. The Huggingface wav2vec2 library has built-in support for an additional 1-to-5-gram language model, for post-processing the audio predictions with a stochastic 🦜. Tuning the post-processing model is very much an open question, especially for Latin.
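A toy illustration of the CTC collapse rule, assuming “|” as the blank token:

```python
def ctc_collapse(tokens, blank="|"):
    out = []
    for t in tokens:
        if out and out[-1] == t:
            continue  # merge identical adjacent predictions
        out.append(t)
    return "".join(t for t in out if t != blank)  # drop the blanks

assert ctc_collapse(list("aa|rr|mm|aa")) == "arma"
```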
At the end of initial training, the word error rate was 4.13% on the validation set of data, only slightly more than 1 in every 25 words incorrect.
Given the rate of improvement when including even a small amount of human-generated training data, this is very much a work in progress, especially when experimenting with data augmentation.
There are examples of using artificial languages to examine the inductive biases of neural language models (https://arxiv.org/pdf/2106.01044.pdf), and using artifically generated speech can be similarly useful here. By varying the voice pitch or timbre, or experimenting with background acoustics, or by introducing speech disfluencies, it would be possible to compare the inductive biases of speech recognition for different types of speakers (and then (ideally!) engineer those away). Using generated speech takes away one variable for trans-linguistic comparison of the model (“how well does this perform against English versus against Polish/Sanskrit/Tagalog/Toki Pona/etc”).
The careful observer will ask: why does one need speech recognition at all, if spoken Latin is very rare? I have a truly marvelous rationale which this endnote is too space-constrained to contain. ↩
Foretell the future through the text of a book in a few easy steps!
People have used Vergil’s Aeneid, the Bible, and many other books for exactly this purpose.
We will use a similar concept: instead of divination through books, we will perform divination through fingerprints: dactyloglyphomancy.
From https://text.bargains/amulet/:
An amulet is a kind of poem that depends on language, code, and luck. To qualify, a poem must satisfy these criteria:
Its complete Unicode text is 64 bytes or less.
The hexadecimal SHA-256 hash of the text includes four or more 8s in a row.
…
And, while this isn’t part of the formal definition, it’s important to say that an amulet of any rarity should be judged by its overall effect, with consideration for both its linguistic and typographic qualities. In particular, an amulet’s whitespace, punctuation, and diacritics should all be “load bearing”.
This is presumably so that an amulet will not be stuffed with zero-width nonjoiners, or be something like “lol butts 61140978758” (02c9ecef4bfda53a315201bcb728128888888888eed3b65d7bc0bcf5dae0ec2e) or just “2021-10-31T09:52:20.358328” (e61d881e1299dd7927c588888888884ac755686ace0a92012c89b5b6f46494c0).
Turn this on its head! Choose a phrase of importance, ask a question in the form of a meaningful hexadecimal string, find which emoji you can interpolate into the phrase to unlock that hexadecimal string in the fingerprint, and declare the choice of emoji significant to the question you have asked.
First, we choose an Oblique Strategy.
Bridges:
-build
-burn
Then, we ask a question.
What will be relevant for 2020, and 2021?
So we will search for a hash that matches 2020 and 2021.
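A minimal single-threaded sketch of the search (the real run used several parallel Rust threads; the emoji pool here is illustrative, and a real search needs a much larger pool and some luck):

```python
import hashlib
from itertools import product

EMOJI = ["💕", "📱", "🔭", "🦖", "⛺", "🌘", "🌄"]  # hypothetical candidate pool

def divine(template, targets):
    # Try every emoji assignment until every target hex string appears in the digest.
    for combo in product(EMOJI, repeat=template.count("{}")):
        text = template.format(*combo)
        digest = hashlib.sha256(text.encode()).hexdigest()
        if all(t in digest for t in targets):
            return text, digest
    return None

print(divine("Bridges:\n-build {}\n-burn {}", ["2020", "2021"]))
```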
Then we run several parallel threads in Rust to discover the hash ed7f23f8436df202021e1fce3d2f42627d5a7f1a01a23557c8c05ff6d1063e16, which corresponds to
Bridges:
-build 💕
-burn 📱
Thus, the theme for 2020 and 2021 is build bridges of 💕, and burn bridges with one’s 📱. Fair enough.
Next, we choose another Oblique Strategy, for areas of inquiry.
Consult other sources: promising, unpromising
Then, we ask for near-term suggestions: search for 202106.
With our resulting hash e68419dd6ba7e410a9c21509b6522f747b202106b41ace99623a800d9c87a5fa we find
Consult other sources: 🔭 promising, 🦖 unpromising
advising astronomy/astrology and ornithology/palaeontology.
Perhaps we require more timeless advice about defense against harmful desires. We can interpolate into Jenny Holzer’s famous
💆 PROTECT 🌄 ME FROM WHAT I 🏍 WANT 🧞
and perceive massages and mountain sunrises as a guard against motorcycles and bottled spirits, in an amulet that is legendary (2b40eb46e603c68049485e8a3df8888888862eebfab56ea00553305f506ec819).
We can use older poetry as well. Sappho’s “Some men say an army on horseback” is beautiful, and Anne Carson’s translation begins:
Some men say an army of horse and some men say an army on foot
and some men say an army of ships is the most beautiful thing
on the black earth. But I say it is
what you love.
The original Greek for the last “it is what you love” is κῆν’ ὄττω τις ἔραται which we can interpolate into
κῆν’ ὄττω ⛺ τις 🌘 ἔραται 🌄
for a timely amulet celebrating camping and mountain sunrises and night skies (bed5d962d2d2db72c4acf46799620210601e582e3bb65ff84698d5e5ee019915).
Try out https://github.com/lsb/dactyloglyphomancy yourself and divine new insights about your future!
In 407 BCE, about 300 years after writing came back to Ancient Greece, two sheets of papyrus to write an expense ledger cost 1⅓ drakhmai, roughly six grams of silver. We currently have “ledger”-sized paper (~A3 paper), and two sheets of normal-weight (80 gsm) ledger paper weigh ~20g. Papyrus cost a third of its weight in silver!
But! As for the ancient basket of goods, the going salary of an architect was a drakhma a day: a day's pay for your writing material! Architects nowadays can work at $50/hr, for (say) a $500/day salary, and gold is ~$50/g nowadays: at $500 for 20g of paper, that is comparable to modern copy paper being worth half its weight in gold.
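The arithmetic, spelled out, using only the figures above:

```python
papyrus_silver_g = 6          # 1⅓ drakhmai of silver for two sheets
sheets_g = 20                 # two sheets of 80 gsm ledger paper
print(papyrus_silver_g / sheets_g)       # 0.3: a third of its weight in silver

day_pay = 500                 # modern architect, $/day, one day's pay per two sheets
gold_per_g = 50               # $/g
print(day_pay / sheets_g / gold_per_g)   # 0.5: half its weight in gold
```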
(Sources say that this high cost came from Egyptian monopoly pricing, with exorbitant taxes at every step, and it’s understandable why: it’s hard to administer an empire keeping debt records only in oral-formulaic poetry, or on wax tablets, or clay, or on the beach like Archimedes, and Egypt, on the decline after fighting off the Sea Peoples, probably didn’t want the competition.)
In late medieval England, in 1379, 126 books cost £113, and each £ was roughly 250g of silver; hardcovers these days weigh around half a kilo, so the books were roughly worth half their weight in silver. Nowadays, you can upload a 500-page pdf and get a ream of A4 paper printed as an individually-produced bound hardcover for $24, which is 2.5kg; silver is ~$1/g, so roughly $25 instead of $2500, only 100x cheaper now compared to several decades before movable type.
The copy-paper-worth-half-its-weight-in-gold metaphor got me thinking about other examples of 25000x price improvement. The hard drive space to store one gigabyte of information cost $100k in 1985, $1k in 1995, and $1 in 2005. Tech twenty years in the future would literally have been worth its weight in gold, had it been available. (Also, it took a year or two to hand-copy a bible; a $500 printer these days can print a page a second, and finish a thousand pages of a bible in twenty minutes, a 25000x time reduction.)
(a quick pechakucha talk I gave regarding ancient Greece losing writing was well received)
UNIX, since the 1970s, has had an internal notion of time that is the number of seconds after 1 Jan 1970 UTC.
This is often expressed as an integer, a signed integer. Many other APIs specify fractional time, also as integers: clock_getres expresses seconds and nanoseconds as 32-bit integers, Java expresses time in milliseconds as a 64-bit integer, a Date in JavaScript internally keeps track of milliseconds since 1970, and PHP returns time in microseconds. Ruby keeps Time as nanoseconds and uses arbitrary-precision integers.
Instead of inventing a complex data structure yourself, use one implemented in hardware: the 64-bit float!
The float64 format has a sign bit, 11 exponent bits (representing exponents from ≈-1000 to ≈1000), and 52 explicit mantissa bits (representing a mantissa with precision of ≈ a quintillionth), as visualized by User:Codekaizen:
such that 1620620620 (in May 2021) is represented as 0b0100000111011000001001100010110101010011000000000000000000000000, or 0x41D8262D53000000.
The next largest floating point number is 0x41D8262D53000001, or 1620620620 + 2⁻²². This is a granularity of a quarter of a microsecond. Instead of many different APIs to try to represent fractional time, keep time as a float64, to adequately represent time with granularity of well under a microsecond for the next several decades, and only compute on this representation of epoch time.
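You can check both claims in a couple of lines of Python:

```python
import math
import struct

# The big-endian bit pattern of 1620620620.0 as a float64.
bits = struct.unpack(">Q", struct.pack(">d", 1620620620.0))[0]
assert hex(bits) == "0x41d8262d53000000"

# The unit in the last place at this magnitude: 2**-22 ≈ 0.24 microseconds.
assert math.ulp(1620620620.0) == 2**-22
```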
Part of the problem of storing time as a 32-bit signed integer number of seconds after 1 Jan 1970: we have no more integers after 19 Jan 2038 that fit in 32 bits!
Signed integers roll over and turn negative when they overflow their current precision. Floats merely get half as precise when they overflow their current precision.
In 2038, float64s that represent time will degrade to a granularity of half a microsecond.
On 7 February 2106, when seconds after 1970 will exceed 2³², the floating point representation will have the precision of one microsecond, and maintain exactly the same bit structure.
At the extinction of the dinosaurs, 65 million years ago, when the epoch time was negative 2 quadrillion (-2051244000000000 for 65Mya), the precision is a quarter of a second.
Even through the 90s, long after many system calls became formalized, floating point math was much more expensive than integer math. Also, while some of the earliest computers had floating-point support (C has a float and a double, because it initially ran on a computer that did!), there was no standard for what you could expect from a “float” or a “double”: K&R C explicitly warns you that a “double” could be 72 bits, and only in 1984 was there a floating point standard that people could ask for by name (IEEE-754), at which point many system APIs had settled.
Floating point, especially when you least expect it, can be surprising: 0.1 (as expressed in the base 2 of a float64) + 0.2 (as expressed in the base 2 of a float64) generally equals 0.30000000000000004 (both 0.1 and 0.2, in float64 representations, are almost 2⁻⁵⁷ greater than their exact base 10 representations).
For this reason, financial computations in floating point are strongly discouraged.
Time is not money!
Whereas money can be contractually expressed as hundredths or millionths of a base currency ($, €, et cetera), time is not exact! Facebook increased the accuracy of their computers’ time from milliseconds to within hundreds of microseconds and it was a big deal.
Whereas you can reasonably divide a financial sum 3 ways, and you want to ensure that the parts sum to the whole, you will generally not be multiplying the time after 1970 by a number and making sense out of it, because 1970 is just an arbitrary zero-point.
Generally, to compute durations, you will be performing arithmetic on times. On computers that can adjust the system clock multiple microseconds at a time, sub-microsecond precision is entirely sufficient.
Furthermore, float64s are entirely adequate for storing both the number of seconds after 1970 and the number of seconds of a particular duration, and smaller numbers get finer granularity: the step size at one is a billionth of the step size at a billion, so continuing to compute in float64 is a great idea, no type conversions required.
Time stored as a float64 makes a lot of sense, especially when used in a fixed-length id!
Let us say that you want (probably) unique ids, which you can sort lexicographically (run through sort) and get a rough ordering in time.
The big-endian representation of float64 supports this sort order: recall that 1620620620 (May 2021) in a float64 is 0x41D8262D53000000, and 0x41D8262D53000001 is 1620620620 + 2⁻²². All positive numbers sort in ascending order, as do all negative numbers.
When time is accurate to hundreds of microseconds, time storage at sub-microsecond precision is entirely adequate.
If you use all 128 bits of the UUID, disregarding UUID’s backwards compatibility built in for 1980s computers, you have 4M different float64s per second, and you have 64 full bits of randomness.
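A sketch of such an id, using the big-endian float64 bytes followed by 8 random bytes, hex-encoded for sortability:

```python
import os
import struct
import time

def float64_random64_id() -> str:
    # Big-endian float64 sorts lexicographically for positive times,
    # so these hex ids sort roughly by creation time.
    return (struct.pack(">d", time.time()) + os.urandom(8)).hex()

print(float64_random64_id())
```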
Based on the math powering the Birthday Problem, for a 50% chance that two 64-bit random strings are equal, you would need roughly 5 billion 64-bit random strings, every quarter of a microsecond.
If you are okay with a quarter of a percent chance of any of these float64+random64 UUIDs colliding in twenty years, then the probability of collision per timeslice needs to be one in a quintillion, 10⁻¹⁸: (1-10⁻¹⁸)^(4000000 * 86400 * 365 * 20) ≈ 99.75% , which is to say, the odds of not colliding per timeslice, 1-10⁻¹⁸, multiplied together for the timeslices in a second for the seconds in a day for the days in a year for twenty years.
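Checking that arithmetic (using exp and log1p, since 1 - 10⁻¹⁸ rounds to exactly 1.0 in float64, which is itself a nice illustration of the precision discussion above):

```python
import math

ticks = 4_000_000 * 86400 * 365 * 20   # quarter-microsecond slices in twenty years
p_per_tick = 1e-18                     # per-timeslice collision probability
print(math.exp(ticks * math.log1p(-p_per_tick)))  # ≈ 0.99748
```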
If you are making 6 of these UUIDs every quarter-microsecond, the space to store only the ids is 16 bytes/id * 6 ids/tick * 4M ticks/s * 86400 s/day * 30 day/month ≈ one petabyte per month, only for UUIDs.
If these UUIDs are connected to event data, and your event data is at least 10x the size of the id of the event, that is over 2PB/week.
Most use cases do not have 2PB/week of new data! Using this float64+random64 scheme is entirely enough to identify most types of events as they happen, with a very low chance of collision.
The float64 corresponding to the current epoch time will have its highest-order byte equal to 0x41, from 2 Jan 1970 until 16 Mar 2242. If we only store the lower 56 bits, we can have 8 more bits of randomness per timeslice.
The number of random72s that we can make every quarter-microsecond tick to retain the odds of collision at 10⁻¹⁸ is 97: √(2 × 2⁷² × -ln(1-10⁻¹⁸)) ≈ 97.
This is sixteen times as many as the float64+random64 scheme, so this corresponds to at least 30PB/week of event data. This is over an exabyte a year, well over $20M in storage costs alone.
Kudos to Evan Wallace’s Float Toy for visualizations of the binary float16/float32/float64 formats! Kudos to Bartek Szopka’s ieee-754-visualization for a slightly more math-oriented approach!
Computing with a float64 is cheap, you get sub-microsecond precision nowadays, and you don’t need to pre-coordinate about milliseconds versus microseconds versus (second, nanosecond) pairs et cetera: as long as you’re not counting individual nanoseconds you should be great.
Also obviously store your human times as ISO 8601 strings (among many other reasons: the list of time zones is unbounded).