How does a 16MB LLM behave?

There’s a Parameter Golf competition, where people compete to train LLMs that fit in 16MB; like code golf, but for the parameters of an arbitrary ML model. (I’d entered a diffusion language model.)

The output of a language model is a probability distribution, and that distribution can be used both generatively (sampling new text) and discriminatively (scoring how likely existing text is).

We can lay out small multiples of various entries in the 16MB parameter golf and compare them to each other, as well as to other known baselines. This is an exploration of two parameter golf entries, one larger LLM from last year, and a common compressor drafted into service as a generative model.

Try it out!

[visualization: bits-per-byte computation for the parameter golf models]

[visualization: generation for the parameter golf models]

Four Golfers

gzip

One golf participant is gzip, which compresses text based on 1970s-era cutting-edge research: replace a repeated phrase with a back-pointer and a length, then encode everything, repeated and literal text alike, with a basic statistical coder that represents common English letters like ‘e’ in fewer bits than rare ones like ‘z’.

Using gzip to compute the bits per byte for a particular text is straightforward: feed the text through the compressor and count how many bytes come out.
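A minimal sketch with Python’s standard zlib module (which speaks the same deflate format as gzip, minus a few header bytes; the function name and the level-9 setting are my choices, not the contest’s exact scoring harness):

```python
import zlib

def gzip_bits_per_byte(text: bytes) -> float:
    # Compress the whole text and compare sizes:
    # 8 bits per output byte, divided by the input length.
    compressed = zlib.compress(text, 9)
    return 8 * len(compressed) / len(text)

sample = b"The quick brown fox jumps over the lazy dog. " * 100
print(f"{gzip_bits_per_byte(sample):.3f} bits/byte")
```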

Using gzip for generative purposes is tricky! We can enumerate some potential continuations of the text, as an LLM would, assign each a probability based on how well it compresses, and sample from that distribution. The resulting text is not brilliant, but it is a minimum baseline: a coarse probability model of the sample text, used generatively.
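A rough sketch of that idea, assuming a byte-level candidate set and a softmax over the extra compressed length each candidate costs (the candidate set and temperature are my guesses at the details, not the entry’s actual scheme):

```python
import math
import random
import zlib

def sample_next_byte(context: bytes, candidates: bytes, temperature: float = 1.0) -> int:
    # Each candidate's "surprise" is the extra compressed bytes it costs.
    base = len(zlib.compress(context, 9))
    costs = [len(zlib.compress(context + bytes([c]), 9)) - base for c in candidates]
    weights = [math.exp(-8 * cost / temperature) for cost in costs]
    return random.choices(list(candidates), weights=weights, k=1)[0]

text = b"The quick brown fox jumps over the lazy dog. The quick brown "
for _ in range(40):
    text += bytes([sample_next_byte(text, b"abcdefghijklmnopqrstuvwxyz .")])
print(text.decode())
```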

Auto-regressive baseline

This is the 16MB LLM provided as the starting point for further development: the default train_gpt.py in the repository, very similar in technique to modded-nanogpt. It stores its 16M parameters in 8-bit precision and generates one token after another, like most popular LLMs.
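As a reminder of what “one token after another” means in code, here is a minimal PyTorch-flavored sampling loop (not the repository’s actual sampler; it assumes the model returns logits of shape (batch, seq, vocab)):

```python
import torch

@torch.no_grad()
def generate(model, tokens: torch.Tensor, n_new: int, temperature: float = 0.8) -> torch.Tensor:
    # Autoregressive decoding: sample one token, append it, repeat.
    for _ in range(n_new):
        logits = model(tokens)[:, -1, :]               # logits for the next position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```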

Masked Diffusion Language Model

This is an ✨original✨ adventure in making a minuscule masked diffusion language model. We started with an architecture similar to the autoregressive baseline, with a few tweaks to improve performance (like a lower learning rate, and fp8e4m3 precision instead of int8).
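The core training objective, in sketch form (a simplification of masked-diffusion ELBO losses; our actual masking schedule and loss weighting differ):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens: torch.Tensor, mask_id: int) -> torch.Tensor:
    b, t = tokens.shape
    # Draw a masking rate per sequence, then mask tokens i.i.d. at that rate.
    rate = torch.rand(b, 1, device=tokens.device).clamp(min=1e-3)
    masked = torch.rand(b, t, device=tokens.device) < rate
    corrupted = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
    logits = model(corrupted)                           # (b, t, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    # Score only masked positions; the 1/rate factor reweights the loss
    # toward a bound on log-likelihood (simplified here).
    return (ce * masked / rate).sum() / masked.sum().clamp(min=1)
```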

We tried a variety of architectural and bit-precision variations, testing each in 3-minute single-H100 “scout” runs to see what trained best, before moving to slightly longer runs, and then several long 8-H100 runs.

Initially, larger models running for more steps performed worse than smaller models trained on more steps. Spending more diffusion steps at generation time, at a significant cost in wall-clock time, also increased quality. Once we extended from 3-minute runs to 10-minute runs, we found that model depth mattered more than embedding size, given a sufficient training budget. We hit a few dead ends with low-bit quantization (int4/int2 and quantization-aware training were less efficient than simply training on more data, for a given time budget) and ended up using fp8e4m3 for storage and bf16 for compute; compressing the checkpoint further didn’t help much because the weights are high-entropy.

We also tried a variety of evaluation methodologies, including the negative evidence lower bound as well as an autoregressive mode using the chain rule, and different diffusion strategies. Even training for multiple epochs on all of fineweb did not produce results that beat the autoregressive model.
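The “autoregressive mode with the chain rule” deserves a sketch: unmask left to right, one forward pass per position, and read off each token’s log-probability. This yields an exact likelihood from the diffusion model, at the cost of a full forward pass per token (batching omitted; mask_id and the output shape are assumptions):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def chain_rule_nll_bits(model, tokens: torch.Tensor, mask_id: int) -> float:
    # tokens: (t,). At step i, everything from position i onward is masked,
    # and we score the model's prediction for position i.
    (t,) = tokens.shape
    total_nats = 0.0
    for i in range(t):
        x = tokens.clone()
        x[i:] = mask_id
        logits = model(x.unsqueeze(0))[0]               # (t, vocab)
        logp = F.log_softmax(logits[i], dim=-1)
        total_nats -= logp[tokens[i]].item()
    return total_nats / math.log(2)                     # nats -> bits
```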

We exhausted our tiny research budget for this little experiment across several productive and unproductive threads. Beating the autoregressive baseline (which benefits from lots of the rest of the field’s research) will have to wait for another day. After training for significantly longer, the diffusion model gets almost as good, and that non-record result is what is shown here.

More research is needed to evaluate the comparative value of small diffusion models versus small autoregressive models. Tiny models are often less general-purpose, while even a basic 135M-parameter model can work well for a variety of tasks.

SmolLM2-135M-Instruct

This is a small but powerful model from Hugging Face. SmolLM2 comes in multiple sizes (135M, 360M, and 1.7B parameters). Unlike the parameter golf entries, which train only on fineweb, its training data spans more datasets with greater diversity.

SmolLM2 ties the weights of the token embedding and the language-modeling head. We take special pains to ensure that the quantized token embedding is identical to the quantized language-model head (using tricks so that the protobuf data of both quantized weights overlap, even though they are used in different operators), which gets the model from the 110MB public release down to a 70MB bundle. We also take special pains to ensure that the version of the onnx runtime supports the common but outside-the-ONNX-standard 4-bit quantized gather.
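A hedged sketch of checking the tying after quantization, using the onnx Python package (the file name is hypothetical; a properly tied weight shows up as a single initializer consumed by both the embedding Gather and the head MatMul):

```python
from collections import Counter

import onnx

model = onnx.load("smollm2-135m-int4.onnx")  # hypothetical file name
init_names = {init.name for init in model.graph.initializer}

# Count how many nodes consume each initializer; a tied embedding/head
# weight should be referenced by two different operators.
uses = Counter(inp for node in model.graph.node
               for inp in node.input if inp in init_names)
for name, count in uses.most_common():
    if count > 1:
        print(f"{name}: shared by {count} nodes")
```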

(This problem, quantizing an LLM to 4-bit precision while ensuring the weight count is exactly what we expect, is the sort of well-scoped problem a coding agent can tackle in half an hour.)

First course: entropy measurement

This is the main way that the parameter golf is scored: measuring entropy on fineweb.

We can see, broadly speaking, that gzip starts from zero and has to learn every detail of English as it processes the text, while SmolLM2 has the most weights and the lowest entropy: it is best at predicting what is going to be said.
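For the neural models, converting the usual training loss to the leaderboard’s bits-per-byte metric is just a change of units (a sketch; it assumes the score is the total negative log-likelihood divided by the byte count of the text):

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    # Cross-entropy losses are in nats per token; the leaderboard metric
    # normalizes by bytes of raw text, not by tokens.
    return total_nll_nats / (math.log(2) * n_bytes)
```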

Notably, SmolLM2 inference uses a kv cache to reuse attention computations from token to token, so on a desktop, inference for the 135M-parameter model can actually run faster than for the 16M model, which lacks this kv cache, especially near the far end of the 1024-token context window.
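A toy sketch of what the kv cache buys: each decode step appends one position’s keys and values, so attention for the new token is a dot product against the cached prefix rather than a full recomputation (this is the general pattern, not SmolLM2’s actual implementation):

```python
import torch

class KVCache:
    def __init__(self):
        self.k = None  # (batch, heads, seq, head_dim), grows by 1 per step
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (batch, heads, 1, head_dim) for the newest token only.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

# Per decode step, attending against the cache costs O(seq), instead of
# O(seq^2) to recompute attention over the whole prefix.
```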

We’re training the diffusion model with a pretty basic objective, and diffusion training schedules are still an active research area, much as BERT-style masked training strategies once were.

Second course: generation

This is more frivolous 🙂 but tries to give a comparative sense of how the tiny language models behave in the common use case of text generation.

Without any external data and with a context window of only a few kilobytes, gzip has no interesting probability distribution to generate text from, which is why its output is mostly repetitive garbage.

The autoregressive and diffusion language models both produce recognizable English but not useful text. SmolLM2, in 80MB, produces more useful text. As before, the more text we generate, the more the autoregressive models can rely on the attention cache, unlike the diffusion model: it decodes the most likely token anywhere in the window, which may change the attention for previous positions that are still masked.

Let me know what you think!

This project entailed everything from architecture design to training to visualization, all to fit a few tiny language models into a ~150MB page. It’s exciting to see new architectures for LLMs, and running many tiny experiments quickly is an interesting and productive way forward, especially with tiny language models.