Archive for the ‘Uncategorized’ Category

There’s a funny saying I recently came across about statisticians: A statistician is someone who knows that 4 is sometimes equal to 7.  Besides being incredibly geeky — and, let’s face it, kind of stupid — this saying captures a fundamental concept in the data sciences: any quantity we measure is drawn from a distribution.

I’ve often found that this concept is difficult to convey to people, but it is fundamentally important for understanding data.  What does it mean that 4 is sometimes equal to 7, and what does it mean when quantities are drawn from a distribution?  What are some ways to visualize distributions, and how can we use this concept to powerfully reason from data?

Fundamentals

Let’s start with some quantity we’re trying to measure. This can be any number you’re interested in — the number of concurrents on your homepage, the engaged time of users, the heights of people in New York, the arrival time of the N train on your morning commute. When I say that these numbers are drawn from a distribution, I mean, quite directly, that the quantity varies each time you measure it.  The 8:03 N will usually show up at 8:03, but sometimes at 8:00, sometimes at 8:06, and more rarely at 8:15. If you watch the number of concurrents or the engaged times on your Chartbeat dashboard, you’re watching this variation in real time.

FIG1

Were we to plot these numbers on a simple graph, as I’ve done for 10 numbers in our imaginary data pull above, we’d get our first insight into what a distribution is. It is simply a spread of numbers. Now, if someone asked you “what number did you measure?”, looking at your measurements you’d probably be inclined to pick some number in the middle — likely the average (10.5, in this case) — although you, in fact, measured an entire range of numbers.  But is your report of the average an accurate response? How do we know this number is representative of the quantity we’re trying to measure?

At this point, we’ve only observed 10 numbers, so we really shouldn’t be very confident in the number we report. So we wait a much longer time, and our data builds up. To view this in our graph, let’s just pile up points where they fall, so that the darker the color, the more points have piled up there.

FIG2

We start to see that our report of the center value was actually quite poor.  We are now truly seeing how this number is distributed. If asked for a single number to represent our measurements, we’d likely be more inclined to say something closer to 5, since this is where a lot of the points are piling up.
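
To make this concrete, here is a minimal sketch of “measuring” the same quantity over and over and then reporting the average. The skewed, exponential-like shape of the hypothetical measurement is purely an illustrative assumption, chosen to mimic the kind of spread in the figures; a real quantity has whatever distribution it has.

var measure = function() {
    // hypothetical measurement: a skewed draw (illustrative assumption only)
    return -8 * Math.log(1 - Math.random());   // most values are small, a few are large
};

var samples = [];
for (var i = 0; i < 1000; i++) {
    samples.push(measure());
}

var sum = 0;
for (var j = 0; j < samples.length; j++) {
    sum += samples[j];
}
console.log('average:', sum / samples.length);   // lands near 8, above where most of the points pile up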

Enter the Boxplot

But visualizing our distribution as overlaid points isn’t always very insightful.  Sometimes we want specific landmarks to help us reason using the data. What kind of landmarks are useful? Well, the average can be a very useful landmark of central tendency.  But in our example above, the average sits a bit too far away from the largest density of points to give us a feel for what is “typical”. Also, we often would rather know how any particular measurement compares to the rest of the measurements.  Are 90% of values below this value?  75%? 10%?  These are the so-called quantiles of the data, also known as percentiles.  One way to show these kinds of landmarks is to draw a box that spans from the 25th percentile to the 75th percentile.  The bounds of the box are somewhat arbitrary, but these are convenient: we expect half of our data to fall within them, so the box gives us a feel for where the “meat” of the distribution is, and its spread.

FIG3

This tells us that half of our measurements fall between 4 and 12.

So now we have a simple visualization of how our data is distributed that lets us reason about approximately how often we are likely to measure different values.   We can improve this by adding a line in the box for a measure of central tendency.  To keep our landmarks consistent, we’ll stick with percentiles and mark the median (50th percentile).  But we still don’t really know what happens outside that box, so, just to get some insight into the range of our data, let’s draw lines out to the minimum and maximum values we measured.

FIG4

This visualization is known as a boxplot (or a box-and-whisker plot).  In my opinion, it is one of the most powerful ways to visualize data, especially when trying to compare quantities. It gives us a pretty comprehensive view of our quantity of interest, both in terms of what values are typical and how spread out those values are.
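
If you want to compute those landmarks yourself, here is a minimal JavaScript sketch of the five-number summary behind a boxplot. The linear-interpolation rule in the percentile helper is just one common convention; statistics packages offer several slightly different quantile definitions.

var percentile = function(sortedValues, p) {
    // p is a fraction between 0 and 1 (e.g., 0.25 for the 25th percentile)
    var idx = (sortedValues.length - 1) * p;
    var lo = Math.floor(idx);
    var hi = Math.ceil(idx);
    var frac = idx - lo;
    return sortedValues[lo] * (1 - frac) + sortedValues[hi] * frac;
};

var boxplotSummary = function(values) {
    var sorted = values.slice().sort(function(a, b) { return a - b; });
    return {
        min: sorted[0],                    // end of the lower whisker
        q1: percentile(sorted, 0.25),      // bottom of the box
        median: percentile(sorted, 0.5),   // line inside the box
        q3: percentile(sorted, 0.75),      // top of the box
        max: sorted[sorted.length - 1]     // end of the upper whisker
    };
};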

From Boxplots to Densities

Boxplots, while awesome, can sometimes mask subtleties in the data.  A canonical example is the case where you have some behavior that is multi-modal — for instance, day/night behavior or weekday/weekend behavior. Your measurements cluster around one value when measured in one situation, and around another value when measured in another.  Also, many times we’re interested in answering a question about the probability of measuring certain numbers, not just their relationship to other numbers.  This is where the concept of a probability distribution comes in.  Let’s go back to the second figure, where we had overlaid a bunch of points.  We’ll start simply by counting up the values and making a histogram of how often we measured values in a certain range.

FIG5

Histograms let us know the raw counts of values grouped into bins.  But raw counts aren’t always useful.  We often want more of a ratio, a probability.  For quantities that are discrete, this is pretty straightforward.  For instance, let’s say we were interested in whether a die was loaded. After some large number of rolls, we could make a distribution and could ask: what’s the probability that we roll a four? For each of the six bins, we’d simply divide the number of counts by the total number of times we rolled the die.  The resulting histogram, which shows the probability of discrete quantities, is called a probability mass function.  Now, if we have a value that isn’t discrete, like height or engaged time, we can make our bins really small, so instead of a bin corresponding to 1 unit, it corresponds to a range of 0.01 or 0.00001. Then, using the magic of calculus, we can make the bin size go to zero (wat?) and the histogram becomes a smooth curve.  The concept of the mass function still holds, but now it is the area under the curve between two numbers that gives the probability that a number will fall in that range.  Well, the area always gave the probability, but in the mass function, it just so happened that the bin sizes were equal to 1.
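
As a quick sketch of the loaded-die example (the rolls here would be whatever you recorded; the data is hypothetical), turning counts into a probability mass function is just a division by the total:

var pmfFromRolls = function(rolls) {
    var counts = [0, 0, 0, 0, 0, 0];
    for (var i = 0; i < rolls.length; i++) {
        counts[rolls[i] - 1] += 1;           // faces are numbered 1 through 6
    }
    return counts.map(function(c) {
        return c / rolls.length;             // raw count -> probability
    });
};

A fair die should give each face a probability near 1/6; a loaded die will show one face sitting well above that.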

FIG6

This smoothed version of a histogram is called a probability density.  The curve depicts the density of where our measurements fell.  More points close together means a larger density, and hence a greater probability that any particular measurement will fall in the vicinity of that peak.  The y-axis has funny units — it is not quite a probability — but when you glance at one of these plots you can get a good feel for what the most probable values are.  Just look for peaks.  You also get a good feel for how spread out your data are.  In fact, I’m sure you are all familiar with the most popular probability distribution, the Normal distribution, or bell curve.   Much of the field of statistics is centered around reasoning about data based on how the data is distributed.  Distributions are likely the most important concept in the data sciences.
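
To see where those funny y-axis units come from, here is a small sketch (the bin width is an assumption you’d tune for your data): divide each bin’s count by the total number of points times the bin width, so that the areas of the bars — not their heights — sum to 1. As the bin width shrinks and the amount of data grows, the tops of these bars trace out the smooth density curve.

var densityEstimate = function(values, binWidth) {
    var min = Math.min.apply(null, values);
    var counts = {};
    for (var i = 0; i < values.length; i++) {
        var bin = Math.floor((values[i] - min) / binWidth);
        counts[bin] = (counts[bin] || 0) + 1;
    }
    var density = [];
    Object.keys(counts).forEach(function(bin) {
        density.push({
            binStart: min + Number(bin) * binWidth,
            height: counts[bin] / (values.length * binWidth)   // probability per unit of x, not a probability
        });
    });
    return density;
};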

Violins?

We can actually combine the ideas of boxplots and densities into a single, aesthetically interesting (but perhaps somewhat confusing) plot called a violin plot.  These aren’t used very often, but I find them pretty cool.  We take a probability density, flip it on its side, and reflect it around its axis. Then, we draw some lines for where our quartiles sit, like in a boxplot. We get the power of a boxplot with the insight of a probability density.

FIG7

Neat huh?

Comparing Quantities

I’ve shown a variety of ways to visualize a single variable, but the real power comes when comparing multiple variables. Below, I show how each of these visualizations can be used to compare the same measurement applied to three different groups.  All three groups have different averages and medians, but only one of the groups is truly different from the others, since the other two have distributions that overlap quite a bit.  Notice how the spread of the data is one of the most important, and actionable, pieces of information here.  I leave it up to you to decide which visualization most clearly depicts the differences.

FIG8B

The Takeaway

Why is this useful?  When you hear a number reported, you should always keep in mind that it is drawn from a distribution.  Pay attention to phrases like “margin of error,” “confidence interval,” “variance,” or “standard deviation.”  These are often more important than the average value when reasoning with data, especially when we are comparing two or more similar numbers.  Distributions help us reason about whether similarities or differences are significant.  Four, in fact, might equal seven.

If you have any questions or comments, feel free to tweet them to me @dpvalente or email me dan@chartbeat.

We’ve been ramping up our use of Riak, so coming across Charlie von Metzradt of Graphite telling us How to use Riak Incorrectly was incredibly useful.

ECMAScript 6 modules! Looking forward to the future of JavaScript?  So are we!  Take an early look at what the module syntax will look like in ES6.  Get prepared early, and maybe even start playing around!

$1,000,000 isn’t cool. You know what’s cool? Distributed systems. Murat Demirbas tells us about ZooKeeper and convenient patterns for distributed coordination.

Happy Friday!

Scott Murray’s D3 Tutorials are so good that he eventually turned them into a book. Luckily for you, he kept the originals up on the web.

An Engineer’s Guide to Stock Options by Alex MacCaw breaks down what every engineer should know.

You say you’re clustering data but don’t know how many clusters to expect? Applying a Dirichlet process to infinite mixture models is one elegant solution.


Opinionated AngularJS styleguide for teams – Angular style and architecture are evolving constantly but this is a very reasonable approach and guide from Todd Motto, with excellent justifications for each decision.

Revolutionary Technique That Quietly Changed Machine Vision Forever – Deep neural networks are steadily marching toward dominance when measured by their performance in image recognition competitions, and the article points to the obvious: at this rate, they will soon outperform humans.

A Fun Crypto Puzzle – A brain teaser that’s so deviously simple it’s almost impossible to figure out. Give it some deep thought before you google a solution!

Happy Friday!

Medium’s CSS is actually pretty f***ing good – @fat tells a good story about the evolution of how Medium writes its CSS. Oh, and while he was evolving it, he created a new font.

The Robo-Chemist – Researchers are now taking leaps towards building a robotic chemist. Using a combination of software, robotics, and lots of data, teams are trying to solve some fundamental questions about how to build a chemical brain.

Utopian for Beginners – Joshua Foer’s look at John Quijada’s conlang (an invented language) called Ithkuil, which attempts to eliminate all ambiguity. There are shades of Einstein in the story, as Quijada, a 53-year-old DMV worker, completed a monumental feat of language design which spontaneously generated an international community and netted him adoring fans. He just didn’t anticipate the Russian super-race movement led by a Ukrainian terrorist that would adopt his work.

Happy Friday!

Linux Performance Tools – If you’re one of those poor souls oncall and want to get a quick rundown of what tools you have available to you to troubleshoot different types of performance issues on Linux, take a look at Brendan Gregg’s talk from LinuxCon.

3D CGI With Ikea – Turns out Ikea furniture might be more ephemeral than you thought! Did you figure those immaculately designed rooms Ikea catalogs showcase were built on sets? Well, in some cases they’re built in CAD, even before the products go into manufacturing.

How We Built the Next Generation Online Travel Agency using Amazon, Clojure, and a Comically Small Team – Think it’s a drag having to deal with a giant monolithic app? Well, try compiling your database into your deploy binary. Think you won’t be able to convince the suits to use a “toy language” called Clojure? Well, try doing it after firing your entire engineering team. Colin Steel’s raucous essay on what transpired as Hotelicopter rebooted their tech, all while trying to get acquired, will leave you wishing for daisies and puppy kisses.

Happy Friday!

By: Jerry Gagelman and Jem Young

Hashing is a notoriously difficult problem, but the challenge doesn’t lie where you might think. We want to share a few insights gained from a recent excursion into this domain.

First, a little background: by definition, a hash function takes as input a string of raw bytes, and returns an integer. The “size” of the hash function refers to the size of its return type. For instance, the .hashCode() method in Java returns a 32-bit integer; 64-bit and 128-bit return types are also common. For the purposes of this post, we consider only non-cryptographic hash functions.

Cryptographic hashing is used to obfuscate data in such a way that it is impossible or extremely difficult to recreate the original contents from the result, and it’s used for everything from securing sensitive data to authenticating messages. Non-cryptographic hashing is primarily used to quickly compare or index large sets of data but isn’t considered “secure.” A couple of general-purpose hash functions are the MurmurHash family and CityHash. The documentation around each of these functions is a great place to start reading to get a sense of the state of the art of non-cryptographic hashing.

The Chartbeat Engineering team recently engaged in a friendly competition to produce a non-cryptographic hash function that meets the following criteria:

  • its return type needs to be 128 bits
  • it must avoid collisions on at least one billion ASCII strings of arbitrary length, usually at most one kilobyte in size
  • it must be written in JavaScript (ouch!)
  • the “minified” JavaScript function must be as small as possible, ideally under 512 bytes
  • it should be reasonably fast

Almost all discussions of how to design non-cryptographic hash functions mention two common desired characteristics: avoiding collisions and performance. When we started, most of us thought the real challenge would be creating a lightweight hash function that could successfully hash a billion items quickly because, let’s face it, a billion is a lot of things and seemed to represent a solid challenge that we could pit our various hash functions against. As it turns out, hashing a billion strings is fairly trivial, but creating an algorithm that has enough entropy to hash a trillion strings, well, that turned out to be the real challenge.

The FNV Variant

The obvious first pass for our submission was to start with an existing simple hash function and modify it in some way. Fortunately, many non-cryptographic functions are in the public domain. One example is the FNV family. Pseudo-code for the FNV-1a algorithm is:

hash = offset_basis
for each octet_of_data to be hashed
    hash = hash xor octet_of_data
    hash = hash * FNV_Prime
return hash

Heuristically, the XOR statement inside the loop “mixes” the current hash state with each successive byte of the input string, and the multiplication statement that follows “shifts” this state. In JavaScript, this looks like:

var fnv1a = function(str) {
    var hash = 0x811c9dc5;                // 32-bit FNV offset basis
    for (var i = 0; i < str.length; i++) {
        hash ^= str.charCodeAt(i);        // mix in the next byte
        hash *= 0x01000193;               // 32-bit FNV prime (note: JS numbers are doubles,
                                          // so this multiply is not exact 32-bit arithmetic)
    }
    return hash;
};

The magic constants for offset_basis and FNV_prime are available directly from the above link. The 64-bit and 128-bit variants use the same pseudo-code, and constants for these sizes are given too. However, the details for implementing the arithmetic inside the loop get a lot messier; for instance, the reference C code released by the authors looks like…

/* multiply by the lowest order digit base 2^16 */
tmp[0] = val[0] * FNV_64_PRIME_LOW;
tmp[1] = val[1] * FNV_64_PRIME_LOW;
tmp[2] = val[2] * FNV_64_PRIME_LOW;
tmp[3] = val[3] * FNV_64_PRIME_LOW;
/* multiply by the other non-zero digit */
tmp[2] += val[0] << FNV_64_PRIME_SHIFT;
tmp[3] += val[1] << FNV_64_PRIME_SHIFT;
/* propagate carries */
tmp[1] += (tmp[0] >> 16);
val[0] = tmp[0] & 0xffff;
tmp[2] += (tmp[1] >> 16);
val[1] = tmp[1] & 0xffff;
val[3] = tmp[3] + (tmp[2] >> 16);
val[2] = tmp[2] & 0xffff;

…just for the inner loop. Converting this to JavaScript is already a pain and will put pressure on the “as small as possible” constraint.

Our thought was: what if we just skip correctness? What if, representing the hash state as an array of integers, we applied the pseudo-code to each element in turn? Hash functions are intended to inject randomness into data, and this idea sounds pretty random. Representing 128 bits as an array of four 32-bit integers, we get something like this:

var fnv_128_attempt = function(str) {
    // four 32-bit lanes stand in for a single 128-bit state
    var h = [0x6295c58d, 0x62b82175, 0x07bb0142, 0x6c62272e];
    for (var i = 0; i < str.length; i++) {
        h[i % 4] ^= str.charCodeAt(i);    // round-robin the input bytes across the lanes
        h[i % 4] *= 0x01000193;           // same 32-bit FNV prime in every lane
    }
    /* returns 4 concatenated hex representations */
    return h[0].toString(16) + h[1].toString(16) + h[2].toString(16) + h[3].toString(16);
};

Victory?

It turns out that our hacky algorithm managed to hash a billion unique strings without collision and squeezed in at a meager 214 bytes minified and zipped. Not bad at all. In fact, every one of the hacky and not-so-hacky algorithms we submitted for our competition managed to hash a billion strings without collisions. Picking the winner based on just the size of a compressed JavaScript function seemed kind of anticlimactic, so we kicked it up a notch and introduced another obstacle: measuring the “entropy” that the hash function adds to each input string.

A simplified measure of this entropy is the avalanche property. The idea involves probability theory. Given an input string S and its hashed value, changing exactly one bit in S and reapplying the hash function will return a new hash value. If we write p_ij for the probability that flipping the i-th input bit results in a flip of the j-th bit of the output, the function has “good avalanche characteristics” if p_ij is about 0.5 for all values of i and j.

For a more visual way of understanding it, imagine the bits of input and output as dominoes along the edges of a (randomly arranged) domino formation. Knocking over one domino from the input edge should result in about half of the dominoes in the output edge falling over.

Getting back to bits and bytes, it’s fairly easy to measure the avalanche characteristic of a hash function using stochastic simulation: for a single input string, flip each bit in turn, reapply the hash function, and tally the bits that flip in the output value. Performing this for a large sample of input strings — all the same length — and averaging the tallies will yield an estimate for the probabilities p_ij. Finally, visualizing this probability matrix as a heat map, you can easily see zones of good and bad avalanche behavior.
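
As a rough illustration of what that simulation might look like (a sketch only, not the C harness in the avalanche repository described below), here is a JavaScript version that can reuse the 32-bit fnv1a function from earlier; the sample strings and the bit ordering are assumptions made for the sketch:

var avalanche = function(hashFn, samples, numOutputBits) {
    var inputBits = samples[0].length * 8;   // all sample strings must have the same length
    var tally = [];                          // tally[i][j]: times flipping input bit i flipped output bit j
    for (var i = 0; i < inputBits; i++) {
        tally.push([]);
        for (var j = 0; j < numOutputBits; j++) {
            tally[i].push(0);
        }
    }
    for (var s = 0; s < samples.length; s++) {
        var str = samples[s];
        var base = hashFn(str);
        for (var i = 0; i < inputBits; i++) {
            var byteIdx = Math.floor(i / 8);
            var bitIdx = i % 8;
            // flip one bit of one character, then re-hash
            var flippedChar = String.fromCharCode(str.charCodeAt(byteIdx) ^ (1 << bitIdx));
            var flipped = str.slice(0, byteIdx) + flippedChar + str.slice(byteIdx + 1);
            var diff = (base ^ hashFn(flipped)) >>> 0;   // output bits that changed
            for (var j = 0; j < numOutputBits; j++) {
                if ((diff >>> j) & 1) {
                    tally[i][j] += 1;
                }
            }
        }
    }
    // average over all samples to estimate the probabilities p_ij
    for (var i = 0; i < inputBits; i++) {
        for (var j = 0; j < numOutputBits; j++) {
            tally[i][j] /= samples.length;
        }
    }
    return tally;
};

Running something like avalanche(fnv1a, sampleStrings, 32) over a large set of equal-length strings yields an estimate of p_ij; values far from 0.5 mark the deterministic zones that show up as stripes in the heat maps.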

Results

Some code for this kind of simulation is in the avalanche repository. Everything is written in C for performance. The original standard 32-bit FNV algorithm and our hacky variant are provided in the examples folder. The probability “matrix” p[i][j] that the simulation script computes is returned as a table in text format. We used R to produce heat maps for visualizing the avalanche characteristics of the FNV algorithm and its variant. The test data used for this simulation was 100,000 keys prototypical for a target application — in this case, web logs.

Avalanche Characteristic Visualization

Each block in the map is one entry from the array. In the color scale, red is 0.0, white is 1.0, and the mid-point is orangish. So an algorithm with good hashing properties would be basically orange throughout with minor variations. The original FNV algorithm on the left is good, although there are zones of deterministic behavior corresponding to the chunks. Our hacky variant is thoroughly bad: the red stripes denote regions where the bits of output move in lockstep with the corresponding bits of input, i.e., deterministic instead of “entropic.”

Our conclusion is, well, hashing algorithms are harder than they look. Separating ASCII strings with unique integer identifiers is pretty easy. Peering under the “mathematical hood” of what the algorithm is actually doing can be a humbling — and enlightening — exercise.

We’re back, after a month-long hiatus! Did you miss us?

Yak Shaving – We’ve all encountered a bug wherein you find another bug, which is due to another bug, which is due to yet another bug, and before you know it, you’re knee-deep trying to fix something that at first glance bears no relation to the original problem. This is known colloquially as “yak shaving,” and it’s a situation many engineers (especially in Ops) are familiar with.

Priceonomics – Started as a pricing guide for various consumer items, but the team quickly realized that they were very good at writing blog posts. Their articles are well researched and cover a variety of topics, from how Slurpees were invented to why expensive hotels charge for wifi.

Code’s Worst Enemy – Classic Steve Yegge on code bloat, priceless for its irreverence and arguing for a future programming language that’s Lisp-like but based on the JVM, years before Clojure made its public appearance.

And why this post was late:

[photo: Romy]

Happy Friday!

Continuing our posts from last time, we wanted to share a few more weekly links from the reading lists of Chartbeat engineers:

  • Optimizing a system for performance or scalability or reliability is something we all eventually have to face, but we are always warned about the ills of premature optimization. In contrast, the Mature Optimization Handbook [pdf] is a quick read with lots of wisdom around striking this balance.
  • Since no blog or forum these days is complete without a few — preferably contrasting — strong opinions about the Julia programming language, we think Evan Miller’s Why I’m betting on Julia is worth another read. We don’t have anything in production running Julia yet, but some recent hackweek work has sparked some interest.
  • Finally, we wanted to round out this week with a post about the Evolution of Open Redirect Vulnerability, a reminder of how something as seemingly simple as incorrect string parsing can be exploited for evil.

Happy Friday!
