By: Jerry Gagelman and Jem Young

Hashing is a notoriously difficult problem, but the challenge doesn’t lie where you might think. We want to share a few insights gained from a recent excursion into this domain.

First, a little background: by definition, a hash function takes as input a string of raw bytes, and returns an integer. The “size” of the hash function refers to the size of its return type. For instance, the .hashCode() method in Java returns a 32-bit integer; 64-bit and 128-bit return types are also common. For the purposes of this post, we consider only non-cryptographic hash functions.

Cryptographic hashing obfuscates data so that it is impossible, or at least extremely difficult, to recreate the original contents from the result; it's used for everything from securing sensitive data to authenticating messages. Non-cryptographic hashing is primarily used to quickly compare or index large sets of data, but isn't considered "secure." A couple of general-purpose hash functions are the MurmurHash family and CityHash. The documentation for each of these is a great place to start reading to get a sense of the state of the art in non-cryptographic hashing.
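
As a rough illustration of the indexing use case, here is a minimal sketch; the bucketFor helper, the bucket count, and the key are made up for illustration, and fnv1a refers to the 32-bit function shown later in this post.

var bucketFor = function(key, numBuckets, hashFn) {
    /* map the key's hash value to one of numBuckets slots */
    return hashFn(key) % numBuckets;
};

/* e.g. bucketFor('user:42', 1024, fnv1a) picks one of 1024 buckets */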

The Chartbeat Engineering team recently engaged in a friendly competition to produce a non-cryptographic hash function that meets the following criteria:

  • its return type must be 128 bits
  • it must avoid collisions on at least one billion ASCII strings of arbitrary length, usually at most one kilobyte in size
  • it must be written in JavaScript (ouch!)
  • the “minified” JavaScript function must be as small as possible, ideally under 512 bytes
  • it should be reasonably fast

Almost all discussions of how to design non-cryptographic hash functions mention two common desired characteristics: avoiding collisions and performance. When we started, most of us thought the real challenge would be creating a lightweight hash function that could hash a billion items quickly; let's face it, a billion is a lot of things, and it seemed like a solid obstacle to pit our various hash functions against. As it turns out, hashing a billion strings is fairly trivial, but creating an algorithm with enough entropy to hash a trillion strings, well, that turned out to be the real challenge.
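
To make the collision requirement concrete, here is a rough sketch of the kind of check an entry has to pass; the countCollisions helper and the key generator are stand-ins for the real test corpus, and at a full billion keys an in-memory Set becomes impractical, so the actual test needs a more memory-conscious approach.

var countCollisions = function(hashFn, numKeys) {
    var seen = new Set();       /* hash values observed so far */
    var collisions = 0;
    for (var i = 0; i < numKeys; i++) {
        var key = 'key-' + i;   /* placeholder for a real ASCII string */
        var h = hashFn(key);
        if (seen.has(h)) {
            collisions++;
        }
        seen.add(h);
    }
    return collisions;
};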

The FNV Variant

The obvious first pass for our submission was to start with an existing simple hash function and modify it in some way. Fortunately, many non-cryptographic hash functions are in the public domain; one such family is FNV. Pseudo-code for the FNV-1a algorithm is:

hash = offset_basis
for each octet_of_data to be hashed
    hash = hash xor octet_of_data
    hash = hash * FNV_Prime
return hash

Heuristically, the XOR statement inside the loop "mixes" the current hash state with each successive byte of the input string, and the multiplication statement that follows "shifts" this state. In JavaScript, this looks like:

var fnv1a = function(str) {
    var hash = 0x811c9dc5;                  /* 32-bit FNV offset basis */
    for (var i = 0; i < str.length; i++) {
        hash ^= str.charCodeAt(i);
        /* 32-bit multiply by the FNV prime; a plain * would lose
           precision once the product exceeds 2^53 */
        hash = Math.imul(hash, 0x01000193);
    }
    return hash >>> 0;                      /* unsigned 32-bit integer */
};
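
Calling it is straightforward; the input string here is arbitrary:

var h = fnv1a('the quick brown fox');
console.log(h.toString(16));   /* hex representation of the 32-bit hash */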

The magic constants for offset_basis and FNV_prime are available directly from the above link. The 64- and 128-bit variants use the same pseudo-code, and the constants for those sizes are given as well. However, the details of implementing the arithmetic inside the loop get a lot messier; for instance, the reference C code released by the authors looks like…

/* multiply by the lowest order digit base 2^16 */
tmp[0] = val[0] * FNV_64_PRIME_LOW;
tmp[1] = val[1] * FNV_64_PRIME_LOW;
tmp[2] = val[2] * FNV_64_PRIME_LOW;
tmp[3] = val[3] * FNV_64_PRIME_LOW;
/* multiply by the other non-zero digit */
tmp[2] += val[0] << FNV_64_PRIME_SHIFT;
tmp[3] += val[1] << FNV_64_PRIME_SHIFT;
/* propagate carries */
tmp[1] += (tmp[0] >> 16);
val[0] = tmp[0] & 0xffff;
tmp[2] += (tmp[1] >> 16);
val[1] = tmp[1] & 0xffff;
val[3] = tmp[3] + (tmp[2] >> 16);
val[2] = tmp[2] & 0xffff;

…just for the inner loop. Converting this to JavaScript is already a pain and will put pressure on the “as small as possible” constraint.
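
For a sense of what that translation involves, here is a sketch of the inner-loop multiplication ported to JavaScript, with the 64-bit state kept as four 16-bit limbs (lowest first) as in the reference code; the mulFnv64Prime name and limb layout are ours, not part of the reference implementation.

var FNV_64_PRIME_LOW = 0x1b3;   /* low-order non-zero digit of the prime 0x100000001b3 */
var FNV_64_PRIME_SHIFT = 8;     /* the other non-zero digit is 0x100, in the third limb */

var mulFnv64Prime = function(val) {
    var tmp = [0, 0, 0, 0];
    /* multiply by the lowest order digit base 2^16 */
    tmp[0] = val[0] * FNV_64_PRIME_LOW;
    tmp[1] = val[1] * FNV_64_PRIME_LOW;
    tmp[2] = val[2] * FNV_64_PRIME_LOW;
    tmp[3] = val[3] * FNV_64_PRIME_LOW;
    /* multiply by the other non-zero digit */
    tmp[2] += val[0] << FNV_64_PRIME_SHIFT;
    tmp[3] += val[1] << FNV_64_PRIME_SHIFT;
    /* propagate carries and truncate each limb to 16 bits */
    tmp[1] += tmp[0] >>> 16;
    val[0] = tmp[0] & 0xffff;
    tmp[2] += tmp[1] >>> 16;
    val[1] = tmp[1] & 0xffff;
    val[3] = (tmp[3] + (tmp[2] >>> 16)) & 0xffff;  /* drop overflow past 64 bits */
    val[2] = tmp[2] & 0xffff;
    return val;
};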

Our thought was, what if we just skip correctness? What if, representing the hash state as an array of integers, we applied the pseudo-code successively to each element in turn? Hash functions are intended to inject randomness into data, and this idea sounds pretty random. Representing 128 bits as an array of four integers, we get something like this:

var fnv_128_attempt = function(str) {
    var h = [0x6295c58d, 0x62b82175, 0x07bb0142, 0x6c62272e];
    for (var i = 0; i < str.length; i++) {
        h[i % 4] ^= str.charCodeAt(i);
        h[i % 4] *= 0x01000193;
    }
    /* returns 4 concatenated hex representations */
    return h[0].toString(16) + h[1].toString(16) + h[2].toString(16) + h[3].toString(16);
}

Victory?

It turns out that our hacky algorithm managed to hash a billion unique strings without a collision and squeezed in at a meager 214 bytes minified and zipped. Not bad at all. In fact, every one of the hacky and not-so-hacky algorithms we submitted for our competition managed to hash a billion strings without collisions. Picking the winner based on just the size of a compressed JavaScript function seemed kind of anticlimactic, so we kicked it up a notch and introduced another obstacle: measuring the "entropy" that the hash function adds to each input string.

A simplified measure of this entropy is the avalanche property. The idea comes from probability theory: given an input string S, changing exactly one bit of S and reapplying the hash function yields a new hash value. If we write p_ij for the probability that flipping the i-th bit of the input flips the j-th bit of the output, the function has "good avalanche characteristics" if p_ij is about 0.5 for all i and j.

For a more visual way of understanding it, imagine the bits of input and output as dominoes along the edges of a (randomly arranged) domino formation. Knocking over one domino from the input edge should result in about half of the dominoes in the output edge falling over.

Getting back to bits and bytes, it's fairly easy to measure the avalanche characteristic of a hash function with a stochastic simulation: for a single input string, flip each bit in turn, reapply the hash function, and tally the bits that flip in the output value. Performing this for a large sample of input strings, all of the same length, and averaging the tallies yields an estimate of the probabilities p_ij. Finally, visualizing this probability matrix as a heat map, you can easily see zones of good and bad avalanche behavior.
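
The repository mentioned below is written in C, but the procedure itself fits in a few lines. Here is a sketch for a 32-bit hash such as fnv1a above; the avalancheMatrix helper and its arguments are our own illustration, not the code in that repository.

var avalancheMatrix = function(hashFn, samples) {
    var outBits = 32;
    var inBits = samples[0].length * 8;   /* all samples share the same length */
    /* counts[i][j] = how often flipping input bit i flipped output bit j */
    var counts = [];
    for (var i = 0; i < inBits; i++) {
        counts.push(new Array(outBits).fill(0));
    }
    samples.forEach(function(str) {
        var base = hashFn(str);
        for (var i = 0; i < inBits; i++) {
            /* flip the i-th bit of the input string */
            var chars = str.split('');
            var idx = Math.floor(i / 8);
            chars[idx] = String.fromCharCode(str.charCodeAt(idx) ^ (1 << (i % 8)));
            var diff = (base ^ hashFn(chars.join(''))) >>> 0;
            for (var j = 0; j < outBits; j++) {
                if ((diff >>> j) & 1) {
                    counts[i][j] += 1;
                }
            }
        }
    });
    /* average the tallies to estimate the probabilities p_ij */
    return counts.map(function(row) {
        return row.map(function(c) { return c / samples.length; });
    });
};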

Results

Some code for this kind of simulation is in the avalanche repository. Everything is written in C for performance. The original standard 32-bit FNV algorithm and our hacky variant are provided in the examples folder. The probability "matrix" p[i][j] that the simulation script computes is returned as a table in text format. We used R to produce heat maps for visualizing the avalanche characteristics of the FNV algorithm and its variant. The test data for this simulation was 100,000 keys prototypical of a target application, in this case web logs.

Avalanche Characteristic Visualization

Each block in the map is one entry of the probability matrix. In the color scale, red is 0.0, white is 1.0, and the mid-point is orangish. So an algorithm with good hashing properties would be basically orange throughout, with minor variations. The original FNV algorithm on the left is good, although there are zones of deterministic behavior corresponding to the chunks. Our hacky variant is thoroughly bad: the red stripes denote regions where the bits of output move in lockstep with the corresponding bits of input, i.e., deterministic instead of "entropic."

Our conclusion is, well, hashing algorithms are harder than they look. Merely separating ASCII strings with unique integer identifiers is pretty easy; peering under the "mathematical hood" at what the algorithm is actually doing can be a humbling, and enlightening, exercise.