There’s a funny saying I recently came across about statisticians: a statistician is someone who knows that 4 is sometimes equal to 7. Besides being incredibly geeky — and, let’s face it, kind of stupid — this points to a very fundamental concept in the data sciences: any quantity we measure is drawn from a distribution.
I’ve often found that this concept is difficult to convey to people, but it is fundamentally important for understanding data. What does it mean that 4 is sometimes equal to 7, and what does it mean when quantities are drawn from a distribution? What are some ways to visualize distributions, and how can we use this concept to powerfully reason from data?
Let’s start with some quantity we’re trying to measure. This can be any number you’re interested in — the number of concurrents on your homepage, the engaged time of users, the heights of people in New York, the arrival time of the N train on your morning commute. When I say that these numbers are drawn from a distribution, I mean, quite directly, that the quantity varies each time you measure it. The 8:03 N will usually show up at 8:03, but sometimes at 8:00, sometimes at 8:06, and more rarely at 8:15. If you watch the number of concurrents or the engaged times on your Chartbeat dashboard, you’re watching this variation in real time.
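This variation is easy to simulate. Here’s a minimal sketch with made-up numbers (the “8:03” train and its spread are hypothetical, not real MTA data), using only Python’s standard library:

```python
import random

random.seed(42)  # make the "measurements" reproducible

# Hypothetical: minutes past 8:00 at which the "8:03" N actually arrives.
# Each observation is a fresh draw from the same underlying distribution.
arrivals = [round(random.gauss(mu=3.0, sigma=2.0), 1) for _ in range(10)]
print(arrivals)
```

Run it with different seeds and the scheduled 8:03 turns out to be just one point in a whole spread of values.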
Were we to plot these numbers on a simple graph, as I’ve done for 10 numbers in our imaginary data pull above, we’d get our first insight into what a distribution is: it is simply a spread of numbers. Now, if someone asked you “what number did you measure?”, looking at your measurements you’d probably be inclined to pick some number in the middle — likely the average (10.5, in this case) — although you, in fact, measured an entire range of numbers. But is your report of the average an accurate response? How do we know this number is representative of the quantity we’re trying to measure?
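As a sketch (these are made-up stand-ins, not the figure’s actual data), ten measurements can average to 10.5 even though most of them sit well below it:

```python
from statistics import mean

# Hypothetical stand-ins for the figure's ten measurements: most values
# cluster around 5-7, but a few large ones drag the average up.
measurements = [4, 5, 5, 6, 6, 7, 8, 14, 22, 28]

print(mean(measurements))  # 10.5
```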
At this point, we’ve only observed 10 numbers and so we really shouldn’t be very confident in that number we report. So we wait a much longer time, and our data builds up. To view this in our graph, let’s just pile up points where they fall, so the darker the color, the more points have been piled there.
We start to see that our report of the center value was actually quite poor; now we’re truly seeing how this number is distributed. If asked for a single number to represent our measurements, we’d likely be more inclined to say something closer to 5, since this is where a lot of the points are piling up.
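A quick simulation shows why the average can be a poor summary. This sketch draws hypothetical right-skewed data (a shape common for things like engaged time): the mean gets pulled up by the long tail, while the median sits where the points actually pile up.

```python
import random
from statistics import mean, median

random.seed(0)

# Hypothetical right-skewed measurements: most of the mass piles up
# near 5, with an occasional long tail of large values.
data = [random.lognormvariate(mu=1.6, sigma=0.6) for _ in range(10_000)]

print(round(mean(data), 1))    # pulled upward by the tail
print(round(median(data), 1))  # sits where the points pile up
```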
Enter the Boxplot
But visualizing our distribution as overlaid points isn’t always very insightful. Sometimes we want specific landmarks to help us reason using the data. What kinds of landmarks are useful? Well, the average can be a very useful landmark of central tendency, but in our example above the average sits a bit too far away from the largest density of points to give us a feel for what is “typical.” Also, we often would rather know how any particular measurement compares to the rest of the measurements. Are 90% of values below this value? 75%? 10%? These are the so-called quantiles of the data, also known as percentiles. One way to show these kinds of landmarks is to draw a box that spans from the 25th percentile to the 75th percentile. The bounds of the box are somewhat arbitrary, but they are convenient: we expect half of our data to fall within them, so the box gives us a feel for where the “meat” of the distribution is, and its spread.
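With Python’s `statistics` module (and the same hypothetical ten measurements as before), the box’s bounds look like this. Note that with a small sample and tied values, “half the data inside the box” is only approximate:

```python
from statistics import quantiles

# Hypothetical measurements from the running example.
measurements = [4, 5, 5, 6, 6, 7, 8, 14, 22, 28]

# quantiles(..., n=4) returns the three cut points between quartiles:
# the 25th, 50th, and 75th percentiles.
q1, _, q3 = quantiles(measurements, n=4)

inside = sum(q1 <= x <= q3 for x in measurements)
print(q1, q3)                      # bounds of the box
print(inside / len(measurements))  # roughly half the data
```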
This tells us that we’re likely to find data points between 4 and 12.
So now we have a simple visualization of how our data is distributed that lets us reason about approximately how often we are likely to measure different values. We can improve this by adding a line in the box for a measure of central tendency. To keep our landmarks consistent, we’ll stick with percentiles and mark the median (50th percentile). But we still don’t really know what happens outside that box, so, just to get some insight into the range of our data, let’s draw lines out to the minimum and maximum values we measured.
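Those five landmarks — minimum, 25th percentile, median, 75th percentile, maximum — are exactly what this simple boxplot draws. A sketch, again with hypothetical numbers:

```python
from statistics import quantiles

measurements = [4, 5, 5, 6, 6, 7, 8, 14, 22, 28]  # hypothetical data

q1, med, q3 = quantiles(measurements, n=4)
five_numbers = {
    "min": min(measurements),  # end of the lower whisker
    "25th": q1,                # bottom of the box
    "median": med,             # line inside the box
    "75th": q3,                # top of the box
    "max": max(measurements),  # end of the upper whisker
}
print(five_numbers)
```

Plotting libraries will draw this for you (e.g. matplotlib’s `plt.boxplot`), though their default whiskers typically stop at 1.5× the box height rather than the min and max.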
This visualization is known as a boxplot (or a box-and-whisker plot). In my opinion, it is one of the most powerful ways to visualize data, especially when trying to compare quantities. It gives us a pretty comprehensive view of our quantity of interest, both in terms of what values are typical and how spread out those values are.
From Boxplots to Densities
Boxplots, while awesome, can sometimes mask subtleties in the data. A canonical example is the case where you have some behavior that is multi-modal: for instance, day/night behavior or weekday/weekend behavior. Your measurements cluster around one value when measured in one situation, and around another value when measured in another. Also, many times we’re interested in answering a question about the probability of measuring certain numbers, not just their relationship to other numbers. This is where the concept of a probability distribution comes in. Let’s go back to the second figure, where we had overlaid a bunch of points. We’ll start simply by counting up the values and making a histogram of how often we measured values in a certain range.
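To see the masking in action, here’s a sketch with hypothetical bimodal day/night data. The median lands in the gap between the two clusters, a value almost no measurement actually takes:

```python
import random
from statistics import median

random.seed(3)

# Hypothetical day/night behavior: half the measurements cluster near 2,
# the other half near 10.
day = [random.gauss(2.0, 0.5) for _ in range(500)]
night = [random.gauss(10.0, 0.5) for _ in range(500)]
data = day + night

# The median falls between the two modes, where the data is sparsest --
# a boxplot centered there would hide the two clusters entirely.
print(round(median(data), 1))
in_the_gap = sum(4.0 < x < 8.0 for x in data)
print(in_the_gap)  # almost no points actually live near the median
```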
Histograms let us know the raw counts of values grouped into bins. But raw counts aren’t always useful; we often want more of a ratio, a probability. For quantities that are discrete, this is pretty straightforward. For instance, let’s say we were interested in whether a die was loaded. After some large number of rolls, we could make a distribution and ask: what’s the probability that we roll a four? For each of the six bins, we’d simply divide the number of counts by the total number of times we rolled the die. The resulting histogram, which shows the probability of each discrete outcome, is called a probability mass function. Now, if we have a value that isn’t discrete, like height or engaged time, we can make our bins really small, so instead of a bin corresponding to 1 unit, it corresponds to a range of 0.01 or 0.00001. Then, using the magic of calculus, we can make the bin size go to zero (wat?) and the histogram becomes a smooth curve. The concept of the mass function still holds, but now it is the area under the curve between two numbers that gives the probability that a measurement will fall in that range. Well, the area always gave the probability; in the mass function, it just so happened that the bin sizes were equal to 1.
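For the die example, a probability mass function is just a normalized histogram. A sketch with a simulated fair die (so all six probabilities should come out near 1/6 ≈ 0.167):

```python
import random
from collections import Counter

random.seed(7)

rolls = [random.randint(1, 6) for _ in range(60_000)]
counts = Counter(rolls)

# Divide each bin's count by the total number of rolls to turn the
# raw-count histogram into a probability mass function.
pmf = {face: counts[face] / len(rolls) for face in range(1, 7)}

for face in range(1, 7):
    print(face, round(pmf[face], 3))

# The probabilities of all possible outcomes must sum to 1.
print(round(sum(pmf.values()), 10))
```

A loaded die would show up as one face’s probability drifting well away from 1/6 as the number of rolls grows.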
This smoothed version of a histogram is called a probability density. The curve depicts the density of where our measurements fell: more points closer together means a larger density, and hence a greater probability that any particular measurement will fall in the vicinity of that peak. The y-axis has funny units — it is not quite a probability — but when you glance at one of these plots you can get a good feel for what the most probable values are: just look for peaks. You also get a good feel for how spread out your data are. In fact, I’m sure you are all familiar with the most popular probability distribution, the Normal distribution, or bell curve. Much of the field of statistics is centered around reasoning about data based on how the data is distributed; distributions are likely the most important concept in the data sciences.
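That “area under the curve” rule is easy to check numerically. A sketch using the standard Normal density: the area between −1 and +1 standard deviations should come out near the familiar 68%.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the Normal (bell curve) distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Approximate the area between -1 and +1 with a Riemann sum of many
# tiny bins -- the "bin size goes to zero" idea from above.
step = 0.001
area = sum(normal_pdf(-1.0 + i * step) * step for i in range(2000))

print(round(area, 3))  # ~0.683: about 68% of measurements fall within
                       # one standard deviation of the mean
```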
We can actually combine the ideas of boxplots and densities into a single, aesthetically interesting (but perhaps somewhat confusing) plot called a violin plot. These aren’t used very often, but I find them pretty cool. We take a probability density, flip it on its side, and reflect it around its axis. Then we draw some lines for where our quartiles sit, just like in a boxplot. We get the power of a boxplot with the insight of a probability density.
I’ve shown a variety of ways to visualize a single variable, but the real power comes when comparing multiple variables. Below, I show how each of these visualizations can be used to compare the same measurement applied to three different groups. All three groups have different averages and medians, but only one of the groups is truly different from the others, since the other two have distributions that overlap quite a bit. Notice how the spread of the data is one of the most important, and actionable, pieces of information here. I leave it up to you to decide which visualization most clearly depicts the differences.
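Here’s how that comparison might look in code, with three hypothetical groups standing in for the figure’s data: two whose distributions overlap heavily, and one that is genuinely different.

```python
import random
from statistics import mean, quantiles

random.seed(1)

# Hypothetical groups: A and B have different averages but heavily
# overlapping spreads; C is shifted far enough to be truly different.
groups = {
    "A": [random.gauss(5.0, 2.0) for _ in range(1_000)],
    "B": [random.gauss(6.0, 2.0) for _ in range(1_000)],
    "C": [random.gauss(15.0, 2.0) for _ in range(1_000)],
}

for name, data in groups.items():
    q1, med, q3 = quantiles(data, n=4)
    print(f"{name}: mean={mean(data):.1f}  box=[{q1:.1f}, {q3:.1f}]")
```

A’s and B’s boxes overlap, so the difference in their averages may not mean much; C’s box sits far from both.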
Why is this useful? When you hear a number reported, you should always keep in mind that it is drawn from a distribution. Pay attention to phrases like “margin of error,” “confidence interval,” “variance,” or “standard deviation.” These are often more important than the average value when reasoning with data, especially when we are comparing two or more similar numbers. Distributions help us reason about whether similarities or differences are significant. Four, in fact, might equal seven.
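To close the loop on 4 equaling 7, here’s a rough sketch (hypothetical measurements, and a crude mean ± 2 standard errors as the interval) of how two different-looking averages can be statistically indistinguishable:

```python
from math import sqrt
from statistics import mean, stdev

# Two hypothetical sets of noisy measurements.
a = [4, 9, 2, 7, 1, 6, 3, 8, 2, 6]    # averages to 4.8
b = [7, 3, 11, 5, 9, 6, 12, 4, 8, 5]  # averages to 7.0

def rough_interval(data, z=2.0):
    """Crude mean +/- 2 standard errors (roughly a 95% interval)."""
    se = stdev(data) / sqrt(len(data))
    return mean(data) - z * se, mean(data) + z * se

lo_a, hi_a = rough_interval(a)
lo_b, hi_b = rough_interval(b)

print(f"a: {lo_a:.1f} to {hi_a:.1f}")
print(f"b: {lo_b:.1f} to {hi_b:.1f}")
print(hi_a > lo_b)  # True: the intervals overlap, so the data
                    # can't tell these two averages apart
```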