Stable Diffusion can run in ONNX in the browser at under FIXME compressed, FIXME uncompressed, without significant quality loss. This small size comes from quantizing weights to 6-bit and 8-bit palettes, performing all computation in 16-bit floating point precision, and implementing the dequantization with ONNX operators so that it runs inside a serialized ONNX model.
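
As a rough illustration, here is what palette quantization of a single weight tensor can look like in NumPy. This is a sketch, not the project's pipeline: the uniform palette and the layer shape are placeholder assumptions, and a real pipeline would fit the palette per layer (for example with k-means).

```python
import numpy as np

def palette_quantize(w: np.ndarray, bits: int):
    """Map each weight to the index of the nearest entry in a 2**bits palette.

    Uniform palette for brevity; a real pipeline would fit it per layer.
    """
    palette = np.linspace(w.min(), w.max(), 2 ** bits, dtype=np.float16)
    indices = np.abs(w[..., None] - palette).argmin(axis=-1).astype(np.uint8)
    return palette, indices

def palette_dequantize(palette: np.ndarray, indices: np.ndarray) -> np.ndarray:
    # A table lookup recovers fp16 weights for computation.
    return palette[indices]

w = np.random.randn(768, 768).astype(np.float16)   # hypothetical linear-layer weights
palette, idx = palette_quantize(w, bits=6)          # 6-bit palette: 64 fp16 entries
w_hat = palette_dequantize(palette, idx)
print("max reconstruction error:", np.abs(w - w_hat).max())
```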

As a demo, here is the Stable Diffusion ControlNet compressed to under a quarter of a gigabyte, generating images of surreal landscapes in the browser with WebGPU (if available) or WASM: Diffusion Local Time, extra small

This was tricky for three reasons: Stable Diffusion is sensitive to quantized activations, which is what ONNX's default quantization produces; convolutional layers are less compressible than linear layers; and (some!) ONNX runtimes require models to be topologically sorted.

Compressing convolutional layers and linear layers

MatMulNBits

Matrix multiplication is foundational to many operations in current ML models, and there are three different operators for performing it in ONNX: MatMul, QLinearMatMul, and MatMulNBits. The first two are part of the standard ONNX opset; MatMulNBits is a contrib operator provided by ONNX Runtime.
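
For reference, here is a minimal sketch of constructing a MatMulNBits node with the onnx Python helpers. MatMulNBits lives in ONNX Runtime's com.microsoft contrib domain rather than the standard opset, and the 4-bit block layout, shapes, and dimensions below are my own illustrative assumptions rather than values from this project.

```python
import numpy as np
from onnx import TensorProto, helper

# Hypothetical dimensions, chosen only for illustration.
K, N, bits, block_size = 768, 768, 4, 32
n_blocks = (K + block_size - 1) // block_size     # blocks per output column
blob_size = block_size * bits // 8                # bytes per block (two 4-bit values per byte)

# Packed 4-bit weights and one fp16 scale per (column, block); the exact
# layout here follows my reading of the contrib-op docs and may need adjusting.
b_packed = np.zeros((N, n_blocks, blob_size), dtype=np.uint8)
scales = np.zeros((N * n_blocks,), dtype=np.float16)

node = helper.make_node(
    "MatMulNBits",
    inputs=["A", "B_packed", "B_scales"],
    outputs=["Y"],
    domain="com.microsoft",          # contrib domain, not the standard ONNX opset
    K=K, N=N, bits=bits, block_size=block_size,
)

graph = helper.make_graph(
    [node], "matmulnbits_demo",
    inputs=[helper.make_tensor_value_info("A", TensorProto.FLOAT16, [1, K])],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.FLOAT16, [1, N])],
    initializer=[
        helper.make_tensor("B_packed", TensorProto.UINT8, b_packed.shape, b_packed.tobytes(), raw=True),
        helper.make_tensor("B_scales", TensorProto.FLOAT16, scales.shape, scales.tobytes(), raw=True),
    ],
)
model = helper.make_model(graph, opset_imports=[
    helper.make_opsetid("", 17),
    helper.make_opsetid("com.microsoft", 1),
])
```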

ONNX Topological Sorting

In the ONNX runtime in Javascript and C (but not Python!), executing a model means walking the directed graph of computation nodes, like "add these two vectors", from inputs to outputs. If there is a node that appears in the serialized graph before the nodes that produce its inputs, these runtimes refuse to execute the model.
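
Reordering the nodes offline is straightforward. The sketch below is my own, not this project's code: it rewrites graph.node in place using Kahn's algorithm so that every node comes after the producers of its inputs.

```python
import copy
from collections import defaultdict, deque

import onnx

def toposort_nodes(graph: onnx.GraphProto) -> None:
    """Reorder graph.node in place so every node follows its input producers."""
    # Values that exist before any node runs: graph inputs and initializers.
    available = {i.name for i in graph.input} | {init.name for init in graph.initializer}
    producer = {out: idx for idx, node in enumerate(graph.node) for out in node.output}

    indegree = [0] * len(graph.node)
    dependents = defaultdict(list)
    for idx, node in enumerate(graph.node):
        for name in node.input:
            if name and name not in available and name in producer:
                indegree[idx] += 1
                dependents[producer[name]].append(idx)

    queue = deque(i for i, deg in enumerate(indegree) if deg == 0)
    order = []
    while queue:
        i = queue.popleft()
        order.append(i)
        for j in dependents[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                queue.append(j)

    if len(order) != len(graph.node):
        raise ValueError("graph contains a cycle")

    reordered = [copy.deepcopy(graph.node[i]) for i in order]
    del graph.node[:]
    graph.node.extend(reordered)

model = onnx.load("model.onnx")   # hypothetical path
toposort_nodes(model.graph)
onnx.save(model, "model_sorted.onnx")
```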

In short, the approach comes down to quantizing the weights of convolutional layers and linear layers to an 8-bit palette and a 6-bit palette, respectively, and dequantizing at runtime into 16-bit floating point precision using ONNX operators.
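
One way to express that dequantization with stock ONNX operators (a sketch built on my own assumptions, with the 6-bit indices stored one per byte rather than bit-packed) is to Cast the uint8 indices to int64, Gather into a small fp16 palette, and feed the reconstructed weights into an ordinary MatMul:

```python
import numpy as np
import onnx
from onnx import TensorProto, helper

# Hypothetical linear layer: 128 inputs, 64 outputs, 6-bit palette (64 entries).
in_f, out_f, palette_size = 128, 64, 64

palette = np.linspace(-1.0, 1.0, palette_size).astype(np.float16)            # fit offline in practice
indices = np.random.randint(0, palette_size, (in_f, out_f)).astype(np.uint8)

nodes = [
    # Gather requires integer indices, so cast the compact uint8 storage first.
    helper.make_node("Cast", ["indices"], ["indices_i64"], to=TensorProto.INT64),
    # Table lookup: reconstruct the fp16 weight matrix at runtime.
    helper.make_node("Gather", ["palette", "indices_i64"], ["W_f16"], axis=0),
    # Use the reconstructed weights as usual.
    helper.make_node("MatMul", ["X", "W_f16"], ["Y"]),
]
graph = helper.make_graph(
    nodes, "palette_dequant_demo",
    inputs=[helper.make_tensor_value_info("X", TensorProto.FLOAT16, [1, in_f])],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.FLOAT16, [1, out_f])],
    initializer=[
        helper.make_tensor("palette", TensorProto.FLOAT16, palette.shape, palette.tobytes(), raw=True),
        helper.make_tensor("indices", TensorProto.UINT8, indices.shape, indices.tobytes(), raw=True),
    ],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
onnx.checker.check_model(model)
```

Only the small fp16 palette and the uint8 indices need to ship in the file; the full-size fp16 weights exist only transiently at runtime.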