Archive for the ‘Uncategorized’ Category

The Weekliest Links, Thanks to You Humans No Longer Read Maps Edition

March 6th, 2015 by Wes

Sometimes building the better product isn’t what you need for success. Sometimes all you need is a little gimmick that works well enough to get users in the door, and then let the product keep them there. Read about Google Maps’ origin story, and how it took some luck and a fun gimmick to make it a success.

A mentor-in-a-book for developers of all levels that provides insight into those day-to-day, non-technical challenges and questions.

Netflix performance engineer Brendan Gregg gives a great talk on tunings they make to their EC2 instances to squeeze every ounce of performance out of them.

Happy Friday!

The Weekliest Links, Michael Jordan Edition

February 27th, 2015 by Wes

Michael Jordan, who co-authored the first paper on Latent Dirichlet Allocation, talks about the relationship between frequentist and Bayesian statistics.

The New Yorker’s bio of Yitang Zhang, who recently proved the Bounded Gaps Conjecture — the most important mathematical discovery so far this decade.

Tali Garsiel’s “How Browsers Work” is an epic exploration of the internals of modern web browsers, and the culmination of years of research (including reading the Webkit and Gecko codebases). While almost 4 years old now, it is still super-relevant for any front-end developer. The web browser is your platform — get to know it better!

The Weekliest Links, Geriatric Edition

February 20th, 2015 by Wes

Feeling a little older? Maybe the universe didn’t start 13.8 billion years ago, maybe it was hanging around forever as a quantum potential waiting to make a big bang.

Alvaro Videla schools us on the origin of harmful gotos and other programming myths.

Happy Friday!

The Weekliest Links, the Luckiest Cat Edition

February 13th, 2015 by Wes

Schrodinger’s unlucky Cat can be alive! The Cat has been used to represent possible quantum states of a particle. In Schrodinger’s conceptual model, the unlucky Cat is placed in a container with a vial of cyanuric acid that breaks open if the container is opened. It is thus assigned both dead and alive states since we are unable to observe it’s state without opening the container. Similarly, quantum particles are assigned multiple states, represented by probabilities, and cannot be determined because the act of measuring perturbs the state. Berkeley researchers have in part been able to tease out this uncertainty in part by accurately predicting quantum particle trajectories. Synopsis here.

Tired of manually choosing mates through Tinder? By analyzing a small set of previous decisions about which people he liked, Justin Long constructed an averaged face that met his tastes and automatically liked/disliked future pictures. With this tool, you can rest easy knowing that no more potential booty calls will slip through your fingers.

Happy cursed Friday everyone!

The Weekliest Links, Bandito Edition

February 6th, 2015 by Wes

Don’t Use Bandits. Multi-armed bandit testing is very popular these days, but it’s critically important that it be applied correctly. Chris Stucchio does a great job of summing up some of the caveats that needs to be kept in mind when applying bandit algorithms.

Jeff Carter postulates that large banks can and should be replaced by smaller companies and give he gives several examples of startups that are doing finance much better than most of the big players.

The best computer vision algorithms can classify objects in real world imagery with impressively high accuracy. But what if you were to feed them carefully crafted garbage? Researchers used evolutionary algorithms to concoct a set of abstract images that tricked the best computers. (Read the summary at KDnuggets if you don’t have time for the paper.)

Happy Friday!

dnn-example

The Weekliest Links, Learn to Love the Tests Edition

January 30th, 2015 by Wes

Chartbeat Engineering internally came up with a set of best practices on naming things (watch for a future blog post on this!). Here’s a follow up that points out that tests need sane names too.

SoundCloud published details on their open-source service monitoring system, Prometheus, which is doing some interesting stuff with multi-dimensional querying.

The folks at Atlassian have come up with a better pull request: compare changes against the tip of master and see conflicts as they occur. File that in the department of “duh why weren’t we doing that before?”

The Weekliest Links, 12345 Edition

January 23rd, 2015 by Wes

Good to know that simple, easy to guess passwords are tapering off. Still shocking to see passwords such as “12345” or “qwerty” are still common.

This is the weekly puzzle at io9. They ask a couple variations of the “two jugs sizes 3 gallons and 5 gallons, and you want to measure out 4 gallons exactly”. (You know, like in Die Hard 3.)

Baidu has supposedly topped Google in Image Classification. Their approach is “better” data and training their nets via GPU (speed improvements).

Happy Friday!

The Weekliest Links, Boy Band Edition

January 16th, 2015 by Wes

Think of the boy bands when scheduling your developer conference.

Really awesome overview of different TCP implementations.

‘Why Clojure?’ answered by a Pythoneer. Immutability encourages pushing application state to the db, the REPL and clojure class loader, and benefits of the JVM. Chartbeat has traditionally been predominantly a Python shop, but we’re introducing Clojure in certain parts of our code base, so this is particularly interesting to us.

Rendezvous Hashing accomplishes the same goal as consistent hashing and was developed independently around the same time. The algorithm is intuitively simpler and has a straightforward implementation, though its performance characteristics are dependent on object size. That said, it might be a good replacement for some consistent hashing use cases.

Happy Friday everyone!

Weekliest Links, You Know Knothing Ben Bitdiddle Edition

January 9th, 2015 by Wes

Welcome back! This Weekliest Links is a doozy, including some things we read over break. Get yourself a cistern of coffee and cozy up to a blazing fire if you’re getting the same chill we are in NYC and check ’em out:

“[Haskell] is the most advanced of the obsolete languages,” starts Gerald Sussman in We Really Don’t Know How To Compute. And then he promptly shames the expressiveness of our tools as compared to biology.

Sometimes “boring” can be a strength. The Boring Front-End Developer specifically targets the Frontend but its concepts can be applied to a much wider audience. As Chartbeat continues to grow, it will be increasingly important to find the correct balance between “cool” and “boring” when it comes to our engineering practices.

Michael Nielsen answers, from the technical perspective, how the Bitcoin protocol actually works. This long read debunks the “anonymous myth”, reveals what miners are actually doing, and explains “double spending”.

Medium’s CSS is actually pretty f***ing good. @fat tells a good story about medium’s evolution of how they write css. Oh, and while he was evolving it he created a new font.

Researchers are now taking leaps towards building an robotic chemist. Using a combination of software, robotics, and lots of data teams are trying to solve some fundamental questions about how to build a chemical brain.

Robot Traffic Filtering in Real Time

October 17th, 2014 by Immanuel

Given a list of thousands of patterns and a text string, how fast can you find matches in the text?

This is the problem you need to solve if you want do basic robot-traffic filtering in real time using a list of known robots and browsers. We did this at Chartbeat as part of the many requirements for receiving accreditation by Media Rating Council (MRC), an independent body that is responsible for setting and implementing measurement standards. They let everyone know that our metrics are valid, reliable, and effective. Specifically during this process, we looked at how robot/browser traffic real-time filtering was implemented using Aho-Corasick string matching, and how to modify the algorithm to deal with exception patterns.

What Counts as Robot Traffic?

Every month the Interactive Advertising Bureau (IAB) an organization that sets standards for the online ads industry, releases a list of known robots and browsers with string matching rules, which when applied to a user agent string of an HTTP request, decides if the request is from a robot or a browser. Some string-matching rules only apply at the beginning of the UA string and others anywhere within the string. Also, some patterns have exceptions like ‘bigbot’ is a robot string except if it’s a substring of ‘bigbottle’ or ‘bluebigbottle’.

The MRC requires our metrics to be 100% clean of robot traffic, following the guidelines set by the IAB. We require it of ourselves too.

Filtering Traffic in Real Time

Our real-time metrics are computed by memoryfly, a custom in-memory data store that receives pings (HTTP GET requests) from people viewing content on the web.

This traffic regularly peaks past 250K requests/second.

Before filtering for robot traffic, memoryfly only looked at User-Agent (UA) strings to count how many people are on mobile, tablet, or desktop devices. With a growing list of robot patterns — to potentially thousands of patterns — and the high throughput of requests, it’s best to keep the user-agent processing time down.

So, for this work, our goal was to keep the user agent string processing to O(n + M) time, where n is the length of the UA string and M is number of robot/browser patterns.

String Matching

There are many known string matching algorithms, and a good number that can match in O(n) time. Most string matching algorithms preprocess the pattern once and built up a data structure that makes future lookups fast. Knuth-Morris-Pratt (KMP) does this using a partial match table that is built early on by preprocessing the pattern string. The table tells you where the next place to go once a mismatch occurs and you never move backwards in the text string. Using KMP or any other single pattern matching linear time algorithm, the runtime would be O(n * M). With a small number of patterns this would be ok, but it’s questionable when we have hundreds and potentially thousands of patterns to match against.

Here’s a KMP failure table for the pattern ”bobobotbot”

Pattern : b | o | b | o | b | o | t | b | o | t
-----------------------------------------------
Failure : 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 | 2 | 0

Prefix trees

Some of the robot/browser matching rules count a match only at the beginning of the UA string – a prefix match. A prefix tree or trie of patterns would be perfect for this case, having UA string match runtime of O(n + M) time matching against multiple patterns. But most of the robot pattern rules didn’t fit this (less than 5% of robot patterns were prefix matches only).

Aho-Corasick

Aho-Corasick is a multi-pattern string matching algorithm that runs in O(n + M) time. It’s kind of a combination of using a prefix tree and KMP. It uses a state machine containing the patterns, and runs the text string through the state machine in a single pass to find any matches.

The state machine consists of states, and goto and failure transitions.
The states and goto transitions/edges are the same as a prefix tree.
The failure transitions/edges tell you what state to go to next in order to continue matching.
These failure edges help skip unnecessary character match attempts based on the current state of the automaton.

Like KMP, you never move backwards for matches and the state machine’s transition table of failure and goto edges act like KMP’s partial match table, which tells you where the next best place to go once a mismatch occurs. KMP is equivalent to AC using a single pattern.

Goto edges are just prefix tree edges, where the current character in the text matches a character in a pattern. A failure edge of any state s, is an edge to state t, such that the string that represents the path from the start state to t is the longest prefix that is also the suffix of the string that represents the path from the start state to s. Like with patterns “fasten” and “astor”:

[0] -'f'- [1] -'a'- [2] -'s'- [3] -'t'- [4] -'e'- [5] -'n'- [6]
|                                      //
|                                    //
|                                  //
|                                //
|                              //
|- -'a'- [7] -'s'- [8] -'t'- [9] -'o'- [10] -'r'- [11]

After matching “fast”, but failing to match the ‘e’, you can jump to state right after the ‘t’ in astor and continue from there.

find(pattern_state_machine, text_string)
{
    current_state = pattern_state_machine.start
    list matches
    foreach current_letter in text_string {

        // look for a trie edge we can continue
        // matching along
        while(current_state.goto[current_letter] == NULL) {

            // failure state transition
            current_state = current_state.failure
        }

        // make the goto state transition
        current_state = current_state.goto[current_letter]

        // if there are any matches at this state,
        // add them to the list
        if current_state.matches
            matches.add(current_state.matches)
    }
    return matches
}

Okay, there are two nested loops, how is the runtime linear? The state machine at any point makes either a goto transition or a failure transition. When matching a substring of the text against the pattern set, a goto transition will move the substring end forward by one and a failure transition will move the substring start forward by at least one. The outer for-loop iterates the goto transitions and the inner while-loop iterates failure transitions. Since there are no backwards moves when matching the text string, the number of goto and failure transitions are each at most n, where n is the length of the text string, making the total number of transitions 2n.

Did I lose you? It helps to look at this example:

With these robot patterns: “bot”, “otis”, “ott”, “otto”, and “tea”, we get this prefix tree representing the state machine containing patterns with just goto edges.

[0] -'b'- [1] -'o'- [2] -'t'- [3]
|
| --'o'- [4] -'t'- [5] -'i'- [6] -'s'- [7]
|         |
|         | --'t'- [8] -'o'- [9]
|
| --'t'- [10] -'e'- [11] -'a'- [12]

The failure edges are defined by this function f of type state -> state.

f 2 = 4
f 3 = 5
f 5 = 10
f 8 = 10
f 9 = 4
f _ = 0

Running the text string ”botttea” through the state machine we go through these transitions:

state text       output
0     []botttea
1     [b]otttea
2     [bo]tttea
3     [bot]ttea  bot
5      b[ot]ttea
8      b[ott]tea ott
10     bot[t]tea
0      bott[]tea
10     bott[t]ea
11     bott[te]a
12     bott[tea] tea

In each transition above, the substring within the brackets is the current match candidate. You can see how neither bracket moves backwards. The resulting matched patterns are “bot”, “ott”, and “tea”.

Handling Pattern Exceptions

The IAB robot/browser lists had some pattern matching rules with exceptions:

pattern | exceptions
------------------------------
bo      | bottle, robot
tle     | bottle
irob    |

The string “irobottles”, matches “irob”, “bot”, and “tle”, but also exceptions “bottle”, and “robot”. “bot”, and “tle” get canceled out by its exceptions, but “irob” is left, which makes “irobottles” a robot match.

To be fair, this seems like a very specific and probably rare case, but it does happen.

The original Aho-Corasick algorithm outputs all matched patterns. So if the state machine contains exception strings, they will be returned too.

To handle the exception patterns, we can change the above algorithm to return only a set of maximal pattern matches, so we avoid returning pattern matches that are substrings of other patterns. The above rules matched with text string “irobottles” will return matches “irob”, “robot”, and “bottle”, leaving out “bot” and “tle”. This helps with exception patterns because matched exception patterns will cover their substring robot/browser patterns. Then we can filter out the matches that are exception strings, to only leave valid robot pattern matches, like “irob”.

find_maximal(pattern_state_machine, text_string)
{
    current_state = pattern_state_machine.start
    list matches
    longest_match

    foreach current_letter in text_string {

        while(current_state.goto[current_letter] == NULL) {

            // after failing a goto, output our current
            // longest match. the current longest match
            // cannot be a substring of the next match
            // by definition of failure. our next valid
            // goto'd state represents at most a proper
            // suffix of the prefix representing our
            // current state.
            if (longest_match) {
                matches.add(longest_match)
                longest_match = NULL
            }

            // failure state transition
            current_state = current_state.failure
        }

        // make the goto state transition
        current_state = current_state.goto[current_letter]

        // if a pattern ends at this state, the last
        // pattern we saved is a substring of it so it's
        // ok to overwrite since it follows a goto.
        if (current_state.is_end_of_pattern)
            longest_match = current_state.pattern
    }

    // we may have saved the last pattern at the end
    if (longest_match)
        matches.add(longest_match)

    return matches
}

The longest match saved is the full pattern represented by that state, ignoring substring pattern matches because they’re not maximal. The longest match is saved on successive goto transitions and then added to the list of results once the automaton makes a failure transition, since there’s no way for the next longest match to be a superstring of the current one. Iteration over the text string is unchanged, so the runtime remains linear.

The original Aho and Corasick paper, Efficient string matching: an aid to bibliographic search, goes into more details about how to build the failure table, a few formal proofs, and some further optimizations. But hopefully this gives you a rundown of how we practically do the same in real time.

We found our overall robot traffic to be only about 0.3%, which is pretty great news. Our numbers are not skewed by robot traffic and the robots haven’t taken over…yet.

Please feel free to share thoughts and ideas, especially if you have your own solutions to this kind of problem — would be great to hear and learn from what you’ve learned.

Chartbeat Engineering Blog

Archive for the ‘Uncategorized’ Category

The Weekliest Links, Thanks to You Humans No Longer Read Maps Edition

The Weekliest Links, Michael Jordan Edition

The Weekliest Links, Geriatric Edition

The Weekliest Links, the Luckiest Cat Edition

The Weekliest Links, Bandito Edition

The Weekliest Links, Learn to Love the Tests Edition

The Weekliest Links, 12345 Edition

The Weekliest Links, Boy Band Edition

Weekliest Links, You Know Knothing Ben Bitdiddle Edition

Robot Traffic Filtering in Real Time

What Counts as Robot Traffic?

Filtering Traffic in Real Time

String Matching

Prefix trees

Aho-Corasick

Handling Pattern Exceptions

Company

Our products

Say hello