Archive for the ‘Uncategorized’ Category

Michael Jordan, who co-authored the first paper on Latent Dirichlet Allocation, talks about the relationship between frequentist and Bayesian statistics.

The New Yorker’s bio of Yitang Zhang, who recently proved the Bounded Gaps Conjecture — the most important mathematical discovery so far this decade.

Tali Garsiel’s “How Browsers Work” is an epic exploration of the internals of modern web browsers, and the culmination of years of research (including reading the Webkit and Gecko codebases). While almost 4 years old now, it is still super-relevant for any front-end developer. The web browser is your platform — get to know it better!

Feeling a little older? Maybe the universe didn’t start 13.8 billion years ago, maybe it was hanging around forever as a quantum potential waiting to make a big bang.

Alvaro Videla schools us on the origin of harmful gotos and other programming myths.

Happy Friday!

 

Schrodinger’s unlucky Cat can be alive! The Cat has been used to represent possible quantum states of a particle. In Schrodinger’s conceptual model, the unlucky Cat is placed in a container with a vial of cyanuric acid that breaks open if the container is opened. It is thus assigned both dead and alive states since we are unable to observe it’s state without opening the container. Similarly, quantum particles are assigned multiple states, represented by probabilities, and cannot be determined because the act of measuring perturbs the state. Berkeley researchers have in part been able to tease out this uncertainty in part by accurately predicting quantum particle trajectories. Synopsis here.

Tired of manually choosing mates through Tinder? By analyzing a small set of previous decisions about which people he liked, Justin Long constructed an averaged face that met his tastes and automatically liked/disliked future pictures. With this tool, you can rest easy knowing that no more potential booty calls will slip through your fingers.

Happy cursed Friday everyone!

Don’t Use Bandits. Multi-armed bandit testing is very popular these days, but it’s critically important that it be applied correctly. Chris Stucchio does a great job of summing up some of the caveats that needs to be kept in mind when applying bandit algorithms.

Jeff Carter postulates that large banks can and should be replaced by smaller companies and give he gives several examples of startups that are doing finance much better than most of the big players.

The best computer vision algorithms can classify objects in real world imagery with impressively high accuracy. But what if you were to feed them carefully crafted garbage? Researchers used evolutionary algorithms to concoct a set of abstract images that tricked the best computers. (Read the summary at KDnuggets if you don’t have time for the paper.)

Happy Friday!

 

dnn-example

Chartbeat Engineering internally came up with a set of best practices on naming things (watch for a future blog post on this!). Here’s a follow up that points out that tests need sane names too.

SoundCloud published details on their open-source service monitoring system, Prometheus, which is doing some interesting stuff with multi-dimensional querying.

The folks at Atlassian have come up with a better pull request: compare changes against the tip of master and see conflicts as they occur. File that in the department of “duh why weren’t we doing that before?”

Good to know that simple, easy to guess passwords are tapering off. Still shocking to see passwords such as “12345” or “qwerty” are still common.
This is the weekly puzzle at io9. They ask a couple variations of the “two jugs sizes 3 gallons and 5 gallons, and you want to measure out 4 gallons exactly”.  (You know, like in Die Hard 3.)
Baidu has supposedly topped Google in Image Classification. Their approach is “better” data and training their nets via GPU (speed improvements).
Happy Friday!

Think of the boy bands when scheduling your developer conference.

Really awesome overview of different TCP implementations.

‘Why Clojure?’ answered by a Pythoneer. Immutability encourages pushing application state to the db, the REPL and clojure class loader, and benefits of the JVM. Chartbeat has traditionally been predominantly a Python shop, but we’re introducing Clojure in certain parts of our code base, so this is particularly interesting to us.

Rendezvous Hashing accomplishes the same goal as consistent hashing and was developed independently around the same time. The algorithm is intuitively simpler and has a straightforward implementation, though its performance characteristics are dependent on object size. That said, it might be a good replacement for some consistent hashing use cases.

Happy Friday everyone!

Welcome back! This Weekliest Links is a doozy, including some things we read over break. Get yourself a cistern of coffee and cozy up to a blazing fire if you’re getting the same chill we are in NYC and check ’em out:

“[Haskell] is the most advanced of the obsolete languages,” starts Gerald Sussman in We Really Don’t Know How To Compute. And then he promptly shames the expressiveness of our tools as compared to biology.

Sometimes “boring” can be a strength. The Boring Front-End Developer specifically targets the Frontend but its concepts can be applied to a much wider audience. As Chartbeat continues to grow, it will be increasingly important to find the correct balance between “cool” and “boring” when it comes to our engineering practices.

Michael Nielsen answers, from the technical perspective, how the Bitcoin protocol actually works. This long read debunks the “anonymous myth”, reveals what miners are actually doing, and explains “double spending”.

Medium’s CSS is actually pretty f***ing good. @fat tells a good story about medium’s evolution of how they write css. Oh, and while he was evolving it he created a new font.

Researchers are now taking leaps towards building an robotic chemist. Using a combination of software, robotics, and lots of data teams are trying to solve some fundamental questions about how to build a chemical brain.

Given a list of thousands of patterns and a text string, how fast can you find matches in the text?

This is the problem you need to solve if you want do basic robot-traffic filtering in real time using a list of known robots and browsers. We did this at Chartbeat as part of the many requirements for receiving accreditation by Media Rating Council (MRC), an independent body that is responsible for setting and implementing measurement standards. They let everyone know that our metrics are valid, reliable, and effective. Specifically during this process, we looked at how robot/browser traffic real-time filtering was implemented using Aho-Corasick string matching, and how to modify the algorithm to deal with exception patterns.

What Counts as Robot Traffic?

Every month the Interactive Advertising Bureau (IAB) an organization that sets standards for the online ads industry, releases a list of known robots and browsers with string matching rules, which when applied to a user agent string of an HTTP request, decides if the request is from a robot or a browser. Some string-matching rules only apply at the beginning of the UA string and others anywhere within the string. Also, some patterns have exceptions like ‘bigbot’ is a robot string except if it’s a substring of ‘bigbottle’ or ‘bluebigbottle’.

The MRC requires our metrics to be 100% clean of robot traffic, following the guidelines set by the IAB. We require it of ourselves too.

Filtering Traffic in Real Time

Our real-time metrics are computed by memoryfly, a custom in-memory data store that receives pings (HTTP GET requests) from people viewing content on the web.

This traffic regularly peaks past 250K requests/second.

Before filtering for robot traffic, memoryfly only looked at User-Agent (UA) strings to count how many people are on mobile, tablet, or desktop devices. With a growing list of robot patterns — to potentially thousands of patterns — and the high throughput of requests, it’s best to keep the user-agent processing time down.

So, for this work, our goal was to keep the user agent string processing to O(n + M) time, where n is the length of the UA string and M is number of robot/browser patterns.

String Matching

There are many known string matching algorithms, and a good number that can match in O(n) time. Most string matching algorithms preprocess the pattern once and built up a data structure that makes future lookups fast. Knuth-Morris-Pratt (KMP) does this using a partial match table that is built early on by preprocessing the pattern string. The table tells you where the next place to go once a mismatch occurs and you never move backwards in the text string. Using KMP or any other single pattern matching linear time algorithm, the runtime would be O(n * M). With a small number of patterns this would be ok, but it’s questionable when we have hundreds and potentially thousands of patterns to match against.

Here’s a KMP failure table for the pattern ”bobobotbot”

Pattern : b | o | b | o | b | o | t | b | o | t
-----------------------------------------------
Failure : 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 | 2 | 0

Prefix trees

Some of the robot/browser matching rules count a match only at the beginning of the UA string – a prefix match. A prefix tree or trie of patterns would be perfect for this case, having UA string match runtime of O(n + M) time matching against multiple patterns. But most of the robot pattern rules didn’t fit this (less than 5% of robot patterns were prefix matches only).

Aho-Corasick

Aho-Corasick is a multi-pattern string matching algorithm that runs in O(n + M) time. It’s kind of a combination of using a prefix tree and KMP. It uses a state machine containing the patterns, and runs the text string through the state machine in a single pass to find any matches.

  • The state machine consists of states, and goto and failure transitions.
  • The states and goto transitions/edges are the same as a prefix tree.
  • The failure transitions/edges tell you what state to go to next in order to continue matching.
  • These failure edges help skip unnecessary character match attempts based on the current state of the automaton.

Like KMP, you never move backwards for matches and the state machine’s transition table of failure and goto edges act like KMP’s partial match table, which tells you where the next best place to go once a mismatch occurs. KMP is equivalent to AC using a single pattern.

Goto edges are just prefix tree edges, where the current character in the text matches a character in a pattern. A failure edge of any state s, is an edge to state t, such that the string that represents the path from the start state to t is the longest prefix that is also the suffix of the string that represents the path from the start state to s. Like with patterns “fasten” and “astor”:

[0] -'f'- [1] -'a'- [2] -'s'- [3] -'t'- [4] -'e'- [5] -'n'- [6]
|                                      //
|                                    //
|                                  //
|                                //
|                              //
|- -'a'- [7] -'s'- [8] -'t'- [9] -'o'- [10] -'r'- [11]

After matching “fast”, but failing to match the ‘e’, you can jump to state right after the ‘t’ in astor and continue from there.

find(pattern_state_machine, text_string)
{
    current_state = pattern_state_machine.start
    list matches
    foreach current_letter in text_string {

        // look for a trie edge we can continue
        // matching along
        while(current_state.goto[current_letter] == NULL) {

            // failure state transition
            current_state = current_state.failure
        }

        // make the goto state transition
        current_state = current_state.goto[current_letter]

        // if there are any matches at this state,
        // add them to the list
        if current_state.matches
            matches.add(current_state.matches)
    }
    return matches
}

Okay, there are two nested loops, how is the runtime linear? The state machine at any point makes either a goto transition or a failure transition. When matching a substring of the text against the pattern set, a goto transition will move the substring end forward by one and a failure transition will move the substring start forward by at least one. The outer for-loop iterates the goto transitions and the inner while-loop iterates failure transitions. Since there are no backwards moves when matching the text string, the number of goto and failure transitions are each at most n, where n is the length of the text string, making the total number of transitions 2n.

Did I lose you? It helps to look at this example:

With these robot patterns: “bot”, “otis”, “ott”, “otto”, and “tea”, we get this prefix tree representing the state machine containing patterns with just goto edges.

[0] -'b'- [1] -'o'- [2] -'t'- [3]
|
| --'o'- [4] -'t'- [5] -'i'- [6] -'s'- [7]
|         |
|         | --'t'- [8] -'o'- [9]
|
| --'t'- [10] -'e'- [11] -'a'- [12]

The failure edges are defined by this function f of type state -> state.

f 2 = 4
f 3 = 5
f 5 = 10
f 8 = 10
f 9 = 4
f _ = 0

Running the text string ”botttea” through the state machine we go through these transitions:

state text       output
0     []botttea
1     [b]otttea
2     [bo]tttea
3     [bot]ttea  bot
5      b[ot]ttea
8      b[ott]tea ott
10     bot[t]tea
0      bott[]tea
10     bott[t]ea
11     bott[te]a
12     bott[tea] tea

In each transition above, the substring within the brackets is the current match candidate. You can see how neither bracket moves backwards. The resulting matched patterns are “bot”, “ott”, and “tea”.

Handling Pattern Exceptions

The IAB robot/browser lists had some pattern matching rules with exceptions:

pattern | exceptions
------------------------------
bo      | bottle, robot
tle     | bottle
irob    |

The string “irobottles”, matches “irob”, “bot”, and “tle”, but also exceptions “bottle”, and “robot”. “bot”, and “tle” get canceled out by its exceptions, but “irob” is left, which makes “irobottles” a robot match.

To be fair, this seems like a very specific and probably rare case, but it does happen.

The original Aho-Corasick algorithm outputs all matched patterns. So if the state machine contains exception strings, they will be returned too.

To handle the exception patterns, we can change the above algorithm to return only a set of maximal pattern matches, so we avoid returning pattern matches that are substrings of other patterns. The above rules matched with text string “irobottles” will return matches “irob”, “robot”, and “bottle”, leaving out “bot” and “tle”. This helps with exception patterns because matched exception patterns will cover their substring robot/browser patterns. Then we can filter out the matches that are exception strings, to only leave valid robot pattern matches, like “irob”.

find_maximal(pattern_state_machine, text_string)
{
    current_state = pattern_state_machine.start
    list matches
    longest_match

    foreach current_letter in text_string {

        while(current_state.goto[current_letter] == NULL) {

            // after failing a goto, output our current
            // longest match. the current longest match
            // cannot be a substring of the next match
            // by definition of failure. our next valid
            // goto'd state represents at most a proper
            // suffix of the prefix representing our
            // current state.
            if (longest_match) {
                matches.add(longest_match)
                longest_match = NULL
            }

            // failure state transition
            current_state = current_state.failure
        }

        // make the goto state transition
        current_state = current_state.goto[current_letter]

        // if a pattern ends at this state, the last
        // pattern we saved is a substring of it so it's
        // ok to overwrite since it follows a goto.
        if (current_state.is_end_of_pattern)
            longest_match = current_state.pattern
    }

    // we may have saved the last pattern at the end
    if (longest_match)
        matches.add(longest_match)

    return matches
}

The longest match saved is the full pattern represented by that state, ignoring substring pattern matches because they’re not maximal. The longest match is saved on successive goto transitions and then added to the list of results once the automaton makes a failure transition, since there’s no way for the next longest match to be a superstring of the current one. Iteration over the text string is unchanged, so the runtime remains linear.

The original Aho and Corasick paper, Efficient string matching: an aid to bibliographic search, goes into more details about how to build the failure table, a few formal proofs, and some further optimizations. But hopefully this gives you a rundown of how we practically do the same in real time.

We found our overall robot traffic to be only about 0.3%, which is pretty great news. Our numbers are not skewed by robot traffic and the robots haven’t taken over…yet.

Please feel free to share thoughts and ideas, especially if you have your own solutions to this kind of problem — would be great to hear and learn from what you’ve learned.

There’s a funny saying I recently came across about statisticians: A statistician is someone who knows that 4 is sometimes equal to 7.  Besides being incredibly geeky — and, let’s face it, kind of stupid — this underlies a very fundamental concept in the data sciences: any quantity we measure is drawn from a distribution.

I’ve often found that this concept is difficult to convey to people, but it is fundamentally important for understanding data.  What does it mean that 4 is sometimes equal to 7, and what does it mean when quantities are drawn from a distribution?  What are some ways to visualize distributions, and how can we use this concept to powerfully reason from data?

Fundamentals

To begin, Let’s start with some quantity we’re trying to measure.. This can be any number you are interested in — the number of concurrents on your homepage, the engaged time of users, the heights of people in New York, the arrival time of the N train on your morning commute. When I say that these numbers are drawn from a distribution, I mean quite directly, that the quantity varies each time you measure it.  The 8:03 N will show up at 8:03, but sometimes 8, sometimes 8:06, more rarely 8:15. If you watch the number of concurrents or the engaged times on your Chartbeat dashboard, you’re watching this variation in real time.

FIG1

Were we to plot these numbers on a simple graph, as I’ve done for 10 numbers in our imaginary data pull above, we’d get our first insight into what a distribution is. It is simply a spread of numbers. Now if someone asked you “what number did you measure?”, looking at your measurements you’d probably be inclined to pick some number in the middle — likely the average (10.5, in this case) —  although you, in fact, measured a entire range of numbers.  But is your report of the average an accurate response? How do we know this number is representative of the quantity we’re trying to measure?

At this point, we’ve only observed 10 numbers and so we really shouldn’t be very confident in that number we report. So we wait a much longer time, and our data builds up. To view this in our graph, let’s just pile up points where they fall, so the darker the color, the more points have been piled there.

FIG2

We start to see that our report of the center value was actually quite poor.  We are seeing truly how this number is distributed. If asked for a single number to represent our measurements, we’d be likely more inclined to say something closer to 5 since this is where a lot of points are piling up.

Enter the Boxplot

But visualizing our distribution as overlayed points isn’t always very insightful.  Sometimes we want specific landmarks to help us reason using the data. What kind of landmarks are useful? Well, the average can be a very useful landmark of central tendency.  But in our example above the average sits a bit too far away from the the largest density of points to give us a feel for what is “typical”. Also, we often would rather know how any particular measurement compares to the rest of the measurements.  Are 90% of values below this value?  75%? 10%?  These are the so-called quantiles of the data, also known as percentiles.  One way to show these kind of landmarks is to draw a box that spans from the 25th percentile to the 75th percentile.  The bounds of the box are somewhat arbitrary, but these bounds are convenient; we expect half of our data to fall within these bounds, so it gives us a feel for where the “meat” of the distribution is, and it’s spread.

FIG3

This represents that we’re likely to find data points between 4 and 12.

So now we have a simple visualization of how our data is distributed that lets us reason about approximately how often we are likely to measure different values.   We can improve this by adding a line in the box for a measure of central tendency.  To keep our landmarks consistent, we’ll stick with percentiles and mark the median (50th percentile).  But we still don’t really know what happens outside that box, so, just to get some insight into the range of our data, let’s draw lines out to the minimum and maximum values we measured.

FIG4

This visualization is known as a boxplot (or a box-and-whisker plot).  In my opinion, it is one of the most powerful ways to visualize data, especially when trying to compare quantities. It gives is a pretty comprehensive view of our quantity of interest, both in terms of what values are typical and how spread out these values are.

From Boxplots to Densities

Boxplots, while awesome, can sometimes mask subtleties in the data.  A canonical example is the case where you have some behavior that is multi-modal.  For instance, a day/night behavior or weekday/weekend behavior. Your measurements sometimes cluster around one value when measured in one situation, and around another value when measured in another.  Also, many times we’re interested in answering a question about the probability of measuring certain numbers, not just their relationship to other numbers.  This is where the concept of a probability distribution comes in.  Let’s go back to to second figure, where we had overlayed a bunch of points.  We’ll start simply by counting up the values and make a histogram of how often we measured values in a certain range.

FIG5

Histograms let us know the raw counts of certain values grouped into bins.  But raw counts aren’t always useful.  We often want more of a ratio, a probability.   For quantities that are discrete, this is pretty straightforward.  For instance, let’s say we were interested in whether a die was loaded. After some large number of rolls, we could make a distribution and could ask: what’s the probability that we roll a four? For each of the six bins, we’d simply divide the number of counts by the total number of times we rolled the die.  The resulting histogram, which shows the probability of discrete quantities, is called a probability mass function.  Now, if we had a value that wasn’t discrete, like height, or engaged time, we can make our bins really small, so instead of a bin corresponding to 1 unit, it corresponds to a range of 0.01 or 0.00001. Then, using the magic of calculus, we can make the bin size go to zero (wat?) and the histogram becomes a smooth curve.  The concept of the mass function still holds, but now, it is the area under the curve between two numbers that gives the probability that a number will fall in that range.  Well, the area always gave the probability, but in the mass function, it just so happened that bin sizes were equal to 1.

FIG6

This smoothed version of a histogram is called a probability density.  The curve depicts the density of where our measurements fell.  More points closer together equal a larger density, and hence a greater probability that any particular measurement will fall in the vicinity of that peak.  The y-axis has funny units — it is not quite a probability — but when you glance at one of these plots you can get a good feel for what the most probable values are.  Just look for peaks.  You also get a good feel for how spread out your data are.  In fact, I’m sure you are all familiar with the most popular probability distribution, the Normal distribution, or bell curve.   Much of the entire field of statistics is centered around reasoning about data based on how the data is distributed.  They are likely the most important concept in the data sciences.

Violins?

We can actually combine the ideas of boxplots and densities into a single, aesthetically interesting (but perhaps somewhat confusing) plot called a violin plot.  These aren’t used very often, but I find them pretty cool.  We take a probability distribution, flip it on its side, and reflect it around its axis. Then, we draw some lines for where our quartiles sit, like in a box plot. We get the power of a box plot with the insight of a probability density.

FIG7

Neat huh?

Comparing Quantities

I’ve showed a variety of ways to visualize a single variable, but the real power comes when comparing multiple variables. Below, I show how each of these visualizations can be used to compare the same measurement applied to three different groups.  All three groups have different averages and medians, but only one of the groups is truly different from the others, since the other two have distributions that overlap quite a bit.  Notice how the spread of the data is one of the most important, and actionable, pieces of information here.  I leave it up to you to decide which visualization most clearly depicts the differences.

FIG8B

The Takeaway

Why is this useful?  When you hear a number reported, you should always keep in mind that it is drawn from a distribution.  Pay attention to phrases like “margin of error,” “confidence interval”,  “variance” or “standard deviation”.  These are often more important that than the average value when reasoning with data, especially when we are comparing two or more similar numbers.  Distributions help us reason whether similarities or differences are significant.  Four, in fact, might equal seven.

If you have any questions or comments, feel free to tweet them to me @dpvalente or email me dan@chartbeat.