The front end is a straightforward Backbone.js GUI.
Those models and views rely on querying for the existence of 1-grams, 2-grams, and 3-grams in pre-loaded data structures.
The backend needs knowledge of roughly 10 million 2-grams and roughly 40 million 3-grams.
For easiest delivery, a browser-based app makes the most sense.
We don't need complicated metadata; we just need basic set membership, and approximate set membership at that: a false positive (an ngram is in a low-entropy set when it isn't) is fine.
A probabilistic data structure like a Bloom filter is a good bet.
(Short intro to Bloom filters: you have a bit array, and keys to add to your set. Hash each key to one or more bits in the bit array, and at query time, if all the bits are set then the key is said to be in the set; if one bit isn't, then it's guaranteed not in the set.)
We can compress the Bloom filter, though, and make it simpler. Instead of using a multiple of hash functions, just use one, so for a 1 in 100 false positive rate, you'd need 100N bits.
Instead of storing the bits literally, run-length encode.
The bits that are set is pseudo-random (with a good cryptographic hash function), so the differences are distributed unformly, and Golomb coding is optimal for that. (This is described in vastly greater detail in Cache-, Hash- and Space-Efficient Bloom Filters, by Putze, Sanders and Singler.)
To deliver the Golomb-compressed sequence to the browser, I base64-encoded the binary data, and wrapped it up with some metadata in a JSON object. (For decoding, you need the number of bits used for the binary part of the unary/binary pair in a Golomb code, and it's convenient to have a count of total keys encoded, and for encoding you'd need to know the total number of bits used.)
A good number of moving pieces, a fair number of edgecases to get correct, a strong requirement to deal with large amount of input (that get hashed down), and a strong requirement for querying to be fast.
The Golomb querying ends up being very fast. The code takes a bytestring, tail-recursively pulls off 8-bit words from the front, and does the binary decoding. Because there's tail-call optimization, the recursive calls don't use linear stack space, and the referential transparency means that it's very easy to debug.
The ease of debuggability that comes with referential transparency is hard to overstate. There's a lot of state involved with Golomb decoding, and with pattern matching, each of the different states is literally a line of code, which reduces the complexity of the query code substantially.
After finishing version 1, I looked over the JS, and realized I'd written out code I would never have thought up from scratch. There's a similar story about Jonathan Sobel doing that with Scheme to C, using gotos for the first time. Similarly, I'd never written an enormous while loop, with destructive assignment to over half a dozen variables, but that was the logical translation of the Haskell code, and it worked the first time.
—Lee Butterman, San Francisco, 2012.