Author Archive

We’re moving our Engineering Blog to Medium! Some of the old content has been migrated, but all the tasty freshness will be appearing there going forward. And do expect tastiness — we’ve been working hard at building up a nice queue of thought provoking and illustrative content that I guarantee you will write your mother about.

You’re doing data replication wrong. There. I said it.

Suppose you’ve got a six-node database cluster (e.g. HDFS, Cassandra, Riak), and you’d like to store three copies of some piece of data in the cluster. Which three nodes should you pick?

One possibility is to pick three nodes at random. This is the strategy HDFS takes by default.

Another possibility is that you order the nodes and then randomly pick three consecutive nodes. This is the strategy Facebook’s modified HDFS takes, as well as Riak.

A third possibility is you set up your nodes in two sets of three: [A, B, C] might be one set, and [D, E, F] might be another. The data then gets randomly assigned to one of those sets, so all data that lives on A is also on B and C, and all data that lives on D is also on E and F. This is how you might set up data replication and sharding with MySQL or MongoDB.

Question: what happens in the case that three random nodes fail? Cidon et al explored this in their paper introducing copysets. In it, they introduce the notion of a copyset (the places you copy data to) and a randomized algorithm, called Copysets, that computes them. Forget about the Copysets algorithm, though, because Cidon et al followed up a few years later with a paper introducing Tiered Replication, which solves the copyset generation problem and then some. This post is about Tiered Replication and our Python implementation, trepl.

So what does happen when a random three nodes fail? Let’s go back and look at the three different strategies and do a little bit of math.

Random Replication

If you have #bigdata and you’re randomly picking your replication sets, chances are you’ll have data that resides on every single possible set. To see why this is the case, we can easily compute the number of possible three-node configurations in a six-node cluster. This is a combinatorics 101 function: nCr, or “6 choose 3,” which equals 20:
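    from math import comb

    # "6 choose 3": the number of distinct 3-node copysets in a 6-node cluster
    print(comb(6, 3))   # -> 20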

Note that we’re counting combinations, not permutations, so we don’t care about the ordering of the sets.

Since there are only 20 different sets, but we have #bigdata, chances are lots of data gets assigned to each one of those sets. Now if any three nodes fail, we’re guaranteed to have lost data! In fact, we’ve lost roughly 1/20, or 5%, of our data.

Ring Placement

Now suppose instead we go the Riak route and place our data on sequential nodes in a ring. Our sets then consist of the following (labeling the nodes A through F as before):
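    [A, B, C]
    [B, C, D]
    [C, D, E]
    [D, E, F]
    [E, F, A]
    [F, A, B]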

So all data gets copied to one of those six sets. Now what happens when three random nodes fail? Well, we know that there are 20 possible combinations of three nodes, so the probability that one of the six sets is taken out is 6/20, or 30%, and when it happens, we lose 1/6 of our data. The expected data loss, then, is the probability of loss times the magnitude of loss: 6/20 × 1/6 = 1/20, or 5%.

Pair Placement

Here we’ve got exactly two sets.

The probability that one of those two sets gets wiped out when three nodes fail is 2/20, or 10%. That’s nice and low, but when it happens, we lose half our data! So the expected amount of loss is 2/20 * 1/2 = 5%.

Starting to see a pattern? The expected amount of data loss in each of these scenarios is the same; however, we can control the probability of failure by varying the configuration of the sets. This is super interesting in my opinion, because not all failures are the same. If we have to go to our customers and fess up that we’ve lost 5% of their data, that might, from a PR perspective, be just as bad as admitting that we lost 17%. In that case, we should prefer ring placement over random replication, because ring placement (though its expected loss is the same as random’s) has a lower failure probability.

Restoring Data

It turns out that the replication strategy also dictates the speed at which you can restore from a failed node. This is easiest to see when comparing pair versus random replication. With pair placement, suppose that A fails and we bring it back up from a blank slate. All of the data that belongs on A is on B and C, so A can copy half of it from B, and the other half from C. The number of nodes A can restore from is called the scatter width, in this case two.

With random replication, A participates in a set with every other node. When we bring it back up, it can restore its data from every other node, so it has a scatter width of five. Not only does this mean that each node undergoes less strain, but the restoring process is also quite a bit faster since it can be transferring data from the replicas in parallel.

Tiered Replication

The three parameters of note in the example I gave are: six nodes, three copies, and some scatter width. We can parameterize these as N, R, and S. The Tiered Replication algorithm computes a set of copysets each of which is size R, the sets are drawn from the N nodes, and each node has a minimum scatter width of S. This gives the programmer the ability to pick a comfortable failure and restore scenario: do we want to minimize probability of failure but suffer from slow restores? Or minimize damage done and have fast restores?

TR is a straightforward greedy algorithm. I won’t go into detail here (read the paper or the code), but the basic intuition is this (a rough sketch in Python follows the list):

  • All nodes start out with a scatter width of zero.
  • To create a copyset, we pick the R nodes with the lowest scatter width.
  • Repeat this process until every node has a scatter width of at least S.
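Here’s a rough Python sketch of that greedy loop, with a pluggable constraint hook like the tiering described next. This is just our illustration of the idea, not trepl’s actual code or API:

    import itertools

    def tiered_replication(nodes, R, S, constraint=lambda copyset: True):
        # Greedy copyset generation sketch: N nodes, R copies per copyset,
        # and every node ends up with a scatter width of at least S.
        scatter = {n: set() for n in nodes}   # peers each node shares a copyset with
        copysets = []
        while any(len(peers) < S for peers in scatter.values()):
            # Consider nodes in order of lowest current scatter width...
            candidates = sorted(nodes, key=lambda n: len(scatter[n]))
            # ...and take the first new R-sized combination that passes the constraint.
            for copyset in itertools.combinations(candidates, R):
                if constraint(copyset) and set(copyset) not in copysets:
                    break
            else:
                raise ValueError('constraint cannot be satisfied')
            copysets.append(set(copyset))
            for n in copyset:
                scatter[n].update(set(copyset) - {n})
        return copysets

    print(tiered_replication(list('ABCDEF'), R=3, S=2))
    # e.g. [{'A', 'B', 'C'}, {'D', 'E', 'F'}]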

The “tiered” part of Tiered Replication allows you to throw in an arbitrary constraint when creating a copyset. This is useful for implementing strategies such as rack or availability zone awareness, or “tiering” your storage (R-1 copies of data sit on fast expensive nodes, and then one copy sits on a slow but cheap node only used for restoring from failures).

Trepl

Trepl is our Python implementation of TR, available on PyPI.

You can supply a constraint function as well, and in fact we ship with one useful for rack awareness, as well as tiering.
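For example, a rack-awareness constraint for the sketch above might look like the following (the rack assignments are made up, and trepl’s real constraint interface may differ):

    RACKS = {'A': 'rack1', 'B': 'rack1', 'C': 'rack2',
             'D': 'rack2', 'E': 'rack3', 'F': 'rack3'}

    def rack_aware(copyset):
        # Require every replica in a copyset to live on a different rack.
        return len({RACKS[n] for n in copyset}) == len(copyset)

    print(tiered_replication(list('ABCDEF'), R=3, S=2, constraint=rack_aware))
    # e.g. [{'A', 'C', 'E'}, {'B', 'D', 'F'}]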

We currently use Trepl to determine partition placements with Kafka (sorry, that tool isn’t open source!), and we’ll also be extending the tool to set up smarter layer 7 load balancing with haproxy. We also use it in our beloved WADE, and can also envision possible applications for Mesos. Let us know if you come up with another novel use!

Swailing is a handy little Python logging library we created in the process of developing WADE. It’s usually helpful to put in lots of logging info while developing a system (say, an API service), and what generally ends up happening is that there’s some piece of info that gets output on every request. In your dev environment, where you’re perhaps sending just a few requests per second, that’s not an issue. But when you push this thing out and it starts handling thousands of requests per second in prod, it’s like you just opened the floodgates, and all of a sudden you’re drowning in supposedly useful messages.

The general “best practice” is to use different priority levels, such as WARNING, ERROR, INFO, etc. Critical errors get written out with high priorities (ERROR), while development debugging info gets written out at low priorities (INFO). Then when you deploy, you set a high threshold and let only errors through. There are several issues with this approach.

The first is that sometimes you get the priority levels wrong. If you accidentally emit a log at INFO when it should have been at ERROR, then when the system breaks in production, you’re missing important info for analysis. Conversely, you might emit a log at ERROR when in fact it should be INFO, and then get flooded by all sorts of useless information. The worst thing, though, is that it’s often simply not clear while developing an application which level you should be using for some particular error condition. The internet is a noisy and chaotic input source, so we have to deal with, and sometimes accept, all sorts of errors.

The second issue is that even if you get the level exactly right, you still might have to deal with a flood of errors. Sometimes services exhibit a manageable trickle of errors, but sometimes you hit a condition (the database being down, for example) that causes every single operation to fail. In that situation, all you really care about is getting some of the log output; most of it is redundant.

What we really want is a system that gives you complete log information under a manageable stream of errors, but then throttles when the gates of logging hell open up.

The Swailing library deals with this using the Token Bucket algorithm. It goes like this:

  1. You’ve got a bucket full of tokens. It has a fixed capacity of C tokens. In order to output a log line, you have to spend a token. If you have no tokens, no log outputting for you!
  2. Your bucket fills at a constant rate of R tokens per second. But the bucket can never fill above C tokens.

This is simple and elegant, both great properties of an algorithm. A useful reframing of the parameters is this: you can burst log up to C times over any period of time; if you stay under R lines per second, you get to output everything you want.
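A minimal sketch of a token bucket throttle (our own illustration of the algorithm, not Swailing’s actual internals):

    import time

    class TokenBucket:
        def __init__(self, capacity, rate):
            self.capacity = capacity        # C: maximum burst size
            self.rate = rate                # R: tokens refilled per second
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            # Refill at R tokens per second, but never above C.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    bucket = TokenBucket(capacity=100, rate=10)
    if bucket.allow():
        print('log line goes out')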

The reverse of Token Bucket is interesting as well. It’s called the Leaky Bucket: suppose you start with an empty bucket of size C which leaks at a fixed rate of R per second. Every time something occurs, you fill it, and if it ever reaches capacity C, you trigger a special condition. (Note: Swailing doesn’t use this algorithm.)

A concrete example is error detection. Suppose that you expect there will naturally be some small number of timeouts per second to your reverse proxy, because again the internet is the wild wild west, and sometimes browsers are slow or lose connectivity. You might set a threshold of, say, 100 errors in one second to alert on. But what if the errors are bursty and you usually see 1000 occur, but only once every minute? If you rejigger your threshold to 1000, then it’s possible that a constant 999 errors per second goes undetected.

With the Leaky Bucket, you’d set C to 1000 and R to 100. A burst to 1000 would immediately fill the bucket and trigger the condition. A burst of less than that would not, so long as there aren’t too many consecutive bursts. And a constant error rate of less than 100 goes undetected, since the bucket leaks at 100 per second.
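A leaky bucket detector with those numbers might be sketched like this (again, just an illustration):

    import time

    class LeakyBucket:
        def __init__(self, capacity, leak_rate):
            self.capacity = capacity        # C: level that triggers the alert
            self.leak_rate = leak_rate      # R: how fast the bucket drains per second
            self.level = 0.0
            self.last = time.monotonic()

        def record(self):
            now = time.monotonic()
            self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
            self.last = now
            self.level += 1
            return self.level >= self.capacity   # True means "fire the alert"

    errors = LeakyBucket(capacity=1000, leak_rate=100)
    if errors.record():
        print('error rate alert!')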

The second thing that the Swailing library does is provide a handy facility for emitting PostgreSQL style debug lines, which adds some nuance to standard severity levels. In addition to how *bad* something is, we now have a way to control how much *detail* is associated with the log output.

There are three detail levels: primary, detail, and hint. Primary log lines always get emitted if higher than the severity threshold, but they’re meant to be very basic information. The detail line provides specifics as to what caused the error. The hint line is a speculative corrective action. If this were a web proxy, for instance, a complete log message might be:
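    ERROR:  upstream request failed
    DETAIL: {"upstream": "api-03:8080", "status": 503, "latency_ms": 4012}
    HINT:   Check whether the upstream host is overloaded or failing health checks.

(The specifics above are invented; the point is the three levels of detail.)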

With Swailing, the detail field is usually JSON so that we can programmatically parse the log file. The primary line is meant for quick visual scanning (or graphing in a system like ELK), so sometimes that’s all you really need. Usually the hint is only shown during development, but it may be useful for production as well.

Swailing has been pretty useful in WADE, and is starting to see more adoption with internal services at Chartbeat. We have a disgusting amount of logs coming out of our system, on the order of terabytes per day. Providing a tool that automatically cuts this down will go a long way towards allowing us to better manage and debug our systems.

Hey peeps, I’ll be presenting at the O’Reilly Software Architecture conference here in NY in a few days! The talk will be on various sharding / hashing algorithms for data replication and load balancing. It’s more interesting than it sounds, I swear. Consistent hashing is just the tip of the iceberg 🙂

We’ll be sure to put up some blog posts on pieces of the talk in the coming weeks, so don’t fret if you’re missing the conference. But if you’re in NYC for SA, be sure to hit me up on Twitter: @weschow.

We’re a small team, about 30-ish engineers, and we process over 250,000 writes per second into our system on an average day. At this traffic level, we routinely see mysterious performance problems, so it’s important that our engineers have a better handle on how a system performs than the average web developer.

Since we emphasize a learning culture, we run regular training sessions and have processes for pushing around information and experience. One such document that came out of this is a quick and dirty guide to system performance analysis, which may be of use (or entertainment value) to some of you poor souls wondering how Mongo could be claiming more memory than exists in the machine, and whether or not you should be worried.

So you want a rigorous, somewhat sane way to debug your application’s performance? Look no further; this post is for you.

Operating Systems & Process Accounting

It’s useful to have an idea of how the operating system manages processes in order to understand the metrics that it outputs.

At a super basic level, the operating system (more precisely, the kernel) is a system for coordinating resources (disk, RAM, CPU, network) between programs. You can think of it as an accounting system for what programs want to run, and an arbiter that decides who gets what and when. A process is the basic unit that can do work.

A process is the manifestation of a program, much as, in OOP, an instance is the manifestation of a class. There is some data that’s static, such as the program binary and the libraries it uses, and then data that is created and operated on at runtime, such as the stack and heap.

A CPU can only execute one process at a time, and that includes the kernel, so how does the kernel take control of the machine and give time to another process? There are two mechanisms for doing so. Either the process relinquishes control (syscalls), or the CPU forcibly stops execution and hands control to the kernel (interrupts).

Syscalls are generally functions that perform some sort of resource allocation for the process. If a process wants to open a file, then it calls the C open function, which relinquishes control to the kernel. The kernel does its filesystem magic, and returns to the process — maybe. It is possible that the kernel may decide that the best use of the system is not to return control to the process, but rather let another process have a chance at running. Kernels are typically designed to return control back to their processes in a fair way.

Interrupts are usually triggered by hardware events. The most common peripheral interrupts would be keyboard, network, and mouse events. Anytime you hit a key, the CPU stops the current process and runs a special function that the kernel has registered with the CPU for handling keypresses.

In addition to this, a super important interrupt is the kernel scheduler interrupt. In order to prevent a process from going into an infinite loop and never making a syscall, the kernel sets up periodic interrupts at an interval called the scheduler time slice or quantum. This value is typically on the order of milliseconds to tens of milliseconds.

Once the kernel has control, it needs to decide what to do next. Interactions with hardware usually happen in some asynchronous fashion, so if a process makes an open call, the kernel may send a bunch of commands to the disk controller, then have to wait for the controller to respond (which it does via an interrupt). In the meantime, the kernel may as well put another process onto the CPU, or perhaps send some data out on the network, or draw something on the screen.

To do its job, the kernel then needs to keep track of the state of every process. Not all processes actually need to be running on the CPU all the time. In fact, most don’t. They’re waiting for input either from a human or from some hardware resource. The kernel models this as states that a process can be in: running, runnable, uninterruptible wait, sleep, zombie (I’m probably missing something). A process that’s running is actually on the CPU, whereas a process that’s runnable is not on a CPU, but could absolutely do work if it got CPU time. In other words, it’s not waiting on a syscall to return. A process in uninterruptible wait is usually waiting for disk I/O, so I will interchangeably call this disk wait; network I/O is considered interruptible, and thus does not count towards uninterruptible wait. Finally a process that’s sleeping is not runnable and not waiting on disk, so is typically waiting for user input, in a sleep loop, or waiting on network.

 

A Detour Into Memory Management

There are two major memory utilization optimizations that modern kernels do. The first is that they recognize that a lot of code is shared. For example, two programs that load in the same library don’t both have to be holding copies of their code, since that’s static data. The second thing is that the sum total of memory all processes need can exceed the amount of actual RAM in the machine if the kernel is clever about how it allocates those physical pages of memory.

The memory model that the kernel and CPU expose to programs is that they have a never ending contiguous block of bytes starting at address 0. This is a problem if address 0 is actually RAM’s first byte, because many processes would think they have access to that location. So what the kernel does is this sleight of hand. It maintains a map of (process, virtual memory location) → physical location in RAM. So it might say: process #2’s virtual addresses 0 through 4095 map to physical RAM locations 26624 to 30719. When the process tries to access byte 4000 in its own memory space, it actually ends up accessing 30624 (26624 + 4000). But how does it accomplish this? Remember that once a program is on the CPU, it stays on the CPU until it relinquishes it or there’s an interrupt. It would be a large amount of overhead if the process were to make a syscall for every single memory access.

Instead, the CPU provides some hardware support in the form of a Translation Lookaside Buffer. The idea is that the kernel tells the CPU, ok I’m going to run process #2 and I want you to map its virtual memory access for bytes 0 – 4095 to physical memory at 26624 – 30719. That set of mappings is called the TLB, and when the process accesses location 4000, the CPU does the work of mapping that to physical RAM, completely bypassing the kernel.

This only works, though, if the TLB can account for all of a process’s memory. In reality, the TLB isn’t very big, so when the process accesses a location that isn’t in the TLB, called a miss or minor page fault, the CPU generates an interrupt and asks the kernel to give it a new mapping that satisfies the process’s access.

Actually, that’s a lie. Modern CPUs have the ability to hold the page mapping themselves, called a page table, in an area that’s not as fast as the TLB, but also not as slow as calling the kernel. (The page table itself lives in regular RAM, and the CPU walks it in hardware on a TLB miss.)

Ok so what’s this clever thing the kernel can do? If a process accesses memory that isn’t in the TLB and also isn’t in the page table, this generates what’s called a major page fault. It’s simply another interrupt that runs a little bit of kernel code. Now the kernel has a chance to modify the page table and then tell the CPU to retry the memory access. What the kernel can do is maintain a mapping of virtual memory locations to blocks on a disk, so that on a major page fault, it can load in a block of data from disk into memory, update the page table, then tell the CPU to retry the access. There’s nothing that says the number of blocks the kernel could potentially allocate to memory has to be constrained by the size of memory, so the kernel can in essence fake a large addressable space by backing some of it by disk. That’s what swap is.

So that’s how paging works. The reason for the detour should be clear: page faults are an important metric because accessing the page table is orders of magnitude slower than accessing the TLB, and accessing disk is in turn orders of magnitude slower than accessing the page table.
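If you want to watch a process’s fault counters directly, Linux exposes them in /proc. A quick sketch (field positions follow proc(5); minflt and majflt are fields 10 and 12 of the stat line):

    def fault_counts(pid):
        with open('/proc/%d/stat' % pid) as f:
            stat = f.read()
        # Split off everything after the closing paren of the comm field,
        # since the process name can itself contain spaces and parens.
        fields = stat.rsplit(')', 1)[1].split()
        return int(fields[7]), int(fields[9])   # (minor faults, major faults)

    print(fault_counts(1))   # e.g. the fault counts for PID 1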

 

Shared Memory Regions & Memory Mapped Files

This isn’t too important for performance debugging purposes, but it’s nice to know. The kernel is smart enough to know that some programs share a bunch of data that can’t be modified, normally binary object code like the executable itself. Processes can get assigned read-only chunks of memory (called pages) that are shared with other processes. In some cases, these pages can be marked copy-on-write, which means that the kernel expects the processes won’t modify the pages, but if one tries to, it makes a copy of the page before allowing the process to start messing with it.

When you run top, you’ll see VIRT, RES, and SHR columns. VIRT is the total size of the virtual address space assigned to a process, which is not necessarily the amount of physical RAM it’s taking. RES is the amount of physical RAM it’s taking. SHR is the amount of physical RAM that’s shared between processes.

Memory mapping is a really handy technique for when a process wants to load in a really big file. The way file access naively works is that the process will “open” a file, getting a file descriptor, then make a series of “read” calls, getting data one chunk at a time. The “open” and each of the “reads” are syscalls, which cause the process to lose control and the kernel to take over and reschedule things. Then the kernel reads the data into a buffer, and then copies the data from that buffer into a buffer that’s owned by the process. Lots of wasted work here.

A memory mapped file is a way for the process to say, “hey kernel why don’t you just map the entire file into my virtual address space.” So the kernel then uses the same mechanism that it uses for major page faults and swap to assign all the data to the process in one go. That’s a single syscall, and requires no copying of data from buffer to buffer.
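In Python, that looks something like this (the path is made up):

    import mmap

    with open('/var/data/huge.dat', 'rb') as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        header = data[:4096]   # touching bytes pages them in on demand; no read() copies
        data.close()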

Mongo (prior to WiredTiger) works this way, as does Varnish. This is why you’ll see Mongo taking enormous amounts of VIRT memory, far exceeding the physical RAM on the machine. Do not be alarmed. The kernel treats Mongo’s access to data that’s not in RAM the same way it handles all its major page faults, so this happens seamlessly. For Mongo, page faults are a good indicator of high load. The mongostat tool, in fact, reports page faults.

 

Filesystem Cache & Free Memory

If the machine has available RAM, the kernel does something smart: it keeps copies of data from disk in RAM, so that if a program accesses that data again, it’s already there. This is called the filesystem cache. If a process needs the RAM that has been allocated to the cache, the kernel simply evicts cached data.

The swap line in “top” on our systems usually says 0k total, meaning we have swap disabled on our machines. We don’t want our machines ever hitting swap, since a machine that does is probably unresponsive. The only other useful number on that line is “cached,” which is the fs cache just described.

In top, the mem line shows: total, used, free, and buffers. Total is the total amount of physical RAM, used is what’s used by all the programs (note this isn’t the sum of RES and SHR, because SHR is shared), free is free RAM, and buffers is memory the kernel uses for block device I/O buffers.

 

Threads & Context Switching

Processes get separate address spaces, so by design they cannot interfere with each other. But sometimes you want them to interfere, say to coordinate through some shared memory region, so the kernel provides a specialization of a process called a thread. A thread, to the (Linux) kernel, is a process that shares its page tables with another process, typically a child. I say (Linux) because in other OS implementations there may be other distinctions, like how threads get scheduled. Linux has a 1:1 scheduler, which means that it handles putting threads on a CPU in the same way as processes.

The act of moving a process or thread onto a CPU is called context switching. A bunch of memory-related things happen. First, the context switch is triggered by an interrupt or syscall to the kernel. The kernel runs under a special privileged mode so that it can handle interrupts, manipulate the TLB, and mess with hardware. There’s some cost associated with the switch into this mode. Next, when the kernel puts a new thread or process onto a CPU, it has to invalidate the old TLB. The CPU also evicts instructions from its fast L1 and L2 caches. L caches are small amounts of super fast memory that sit on the CPU chip. They’re faster than regular RAM because data doesn’t have to travel down the same pathway that data in RAM does to get to the CPU. So if you have a small amount of code that sits entirely in L1 or L2 cache, that code can potentially run blazingly fast (many data compressors, such as lz4, snappy, and blosc, are designed to fit in L1 cache). Context switches flush these caches.

Thus there is a high cost to context switches and you want to avoid them if possible. The general strategy of avoiding context switching is to batch up work. Rather than calling “read” 1,000 times asking for one byte, it’s better to call “read” once and ask for 1k. Systems like Kafka use these techniques in a smart way.
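For example, using the unbuffered os.read so the syscalls are visible (the file name is made up):

    import os

    fd = os.open('/var/data/records.bin', os.O_RDONLY)

    # Unbatched: one read() syscall (and a potential context switch) per byte.
    slow = [os.read(fd, 1) for _ in range(1000)]

    # Batched: the same 1,000 bytes in a single syscall.
    os.lseek(fd, 0, os.SEEK_SET)
    fast = os.read(fd, 1000)

    os.close(fd)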

As a side note, cache locality is a really big deal, so much so that much of what you learn about basic algorithms can turn out to be wrong. For instance, inserting into the front of an array is O(n), whereas inserting into the front of a linked list is O(1). But if the entire array fits inside of L1 cache, then it could be faster than a linked list, which disperses memory around RAM and probably won’t sit entirely inside L1. We throw away constant factors on paper, but in reality they do matter.

 

Load Metrics!

Now we’ll get to some actual metrics! By now, you’ve no doubt logged into a machine and run top and seen something like:
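(Representative output; the numbers here are invented for illustration.)

    top - 14:02:11 up 83 days,  3:41,  1 user,  load average: 1.15, 0.97, 0.89
    Tasks: 212 total,   2 running, 210 sleeping,   0 stopped,   0 zombie
    Cpu(s): 23.1%us,  4.2%sy,  0.0%ni, 70.3%id,  1.9%wa,  0.0%hi,  0.4%si,  0.1%st
    Mem:  16433892k total, 15912340k used,   521552k free,   183220k buffers
    Swap:        0k total,        0k used,        0k free,  7854216k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
     4231 mongodb   20   0 48.3g 7.8g 6.1g S   12 49.8 812:33.04 mongod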

What does this all mean?

The load average, on Linux, is the number of processes in the running, runnable, and disk wait state (on some Unixes, disk wait doesn’t count).

  1. Page faults count as waiting for disk.
  2. The 3 load numbers you see are 1, 5, and 15 minute averages.
  3. The letters in the S column are state: R for runnable, S for sleep, D for disk wait.
  4. A process that’s churning away computing stuff with no syscalls generates a load of 1.
  5. A process that is not running, but is getting its memory paged in from disk generates a load of 1. Similarly, a process that’s spending all its time writing to a log file, or unable to write to a log file and blocking (disk full and does not detect error) generates a load of 1.
  6. A process that is waiting on a network response is not in a runnable state and is not waiting for disk, thus is counted as interruptible and generates a load of 0.

The columns of the CPU row:

  • us: user time, the amount of time that processes are spending on CPU, not counting work the kernel is doing on behalf of a process.
  • sy: system time, the amount of time on CPU the kernel is spending on behalf of servicing process syscalls.
  • ni: nice time, amount of CPU time spent on low priority processes. I’ve never used this.
  • id: idle time, 100% minus the sum of the rest of the CPU percentages.
  • wa: wait time, amount of time spent waiting for disk IO
  • hi: hardware interrupt time, amount of time spent servicing hardware interrupts
  • si: software interrupt time, what it sounds like
  • st: CPU steal. This one is funny. It doesn’t exist on bare metal, but exists in a virtualized environment. The host OS is allowed to take CPU time away from the guest machine, so the guest machine reports this as stolen. You should usually see low st numbers, like < 1%, as long as you’re using adequately sized instances in EC2. High numbers are a problem. (This is a bit tricky with EC2: they do some things that may look like steal is an issue, but it’s how they throttle your CPU cycles down to the instance’s assigned amount.)

So let’s think about a worker application. If a worker is spending the majority of its time computing something, then it will generate a load of 1. If a worker is spending most of its time writing to disk, and none of its time computing anything, it will still generate a load of 1. This is problematic: how do you tell the two situations apart? That’s where the %CPU column comes in. A CPU bound worker will show near 100% CPU (or conversely, near 0% id on the Cpu line). A disk bound worker will show near 0% CPU. A network bound worker will show neither: CPU will be less than 100%, and load will be less than 1, assuming no other processes are running.

Codified, then:

  1. A CPU bound process will generate load 1 and show ~100% CPU
  2. A disk bound process will generate load 1 and show ~0% CPU
  3. A network bound process will generate load < 1 and show CPU < 100%.

Thus, a worker that does no disk IO will be either #1 or #3. A database server would look like #2 if it’s serving requests from disk (a bad sign for performance), but #3 if it’s serving requests from memory. We saw Riak do #1, which is super unusual for a database. Usually the network or the disk is the limiting factor, not CPU.

Also note when MongoDB is getting a bunch of requests that are going to disk, its load will skyrocket. This is because it spawns multiple threads, so each one can potentially contribute a load of 1.

 

Tools & Tricks

top: The most generally useful tool. It gives you CPU and memory usage (hit ‘M’ to sort by resident memory), and load averages. The one thing it lacks is network usage. When you start top up, you should immediately hit ‘1’, which will show all CPU cores, and then hit ‘H’, which will show all threads. Otherwise, top will collapse CPU numbers, as well as threads into their parent process. Collapsed numbers can be misleading. For example, if a process consists of two threads that are doing 60% of CPU each on separate cores, you’ll see that the process is taking 120% of CPU unless you hit ‘H’. Consider too a process that’s two threads, one of which is CPU bound, and the other is 20%. It will also show 120% CPU in the default collapsed view. CPU boundness is really important to know about, because that’s where you should focus your efforts.

bwm-ng (or iptraf): Network monitoring. I prefer bwm-ng because it’s fewer characters to type (vs. iptraf -d eth0), and it also works in small terminals. iptraf requires something like a 40-character-wide terminal to run, and if you’re like me, running tmux with several panes, then that’s not an option. The stupid thing about bwm-ng is that it defaults to updating its numbers every half second, which is not useful since traffic is oftentimes bursty. So I immediately hit ‘+’ a bunch of times; each press adds 0.1 seconds to the period.

vmstat: Shows useful memory stats. Run as “vmstat 1” to get 1 second updates. vmstat will show you overall memory usage and paging behavior. It also tells you how many blocks are being read in/out of disk, and CPU use. The cs column is interesting. That’s the number of context switches per second, and elevated numbers are bad.

strace: Traces process syscalls. Run as “strace -p PID”, and you get a flurry of realtime data about what syscalls a process is making. If a process appears to be hung, it will sometimes be hung on some particular syscall, so this is how you’d find out. It’s rare that this provides any insight, but when it does, it’s really useful.

telnet: Sometimes you aren’t able to connect to a database or API service. The first thing to do is to check connectivity to that machine and port, and the easiest way is with telnet. Run as “telnet HOST PORT” and if there’s a clear connection then you’ll see lines like:
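(Host and addresses invented for illustration.)

    $ telnet db01.example.com 5432
    Trying 10.0.0.12...
    Connected to db01.example.com.
    Escape character is '^]'.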

If there is no network connectivity then the command will hang for some period of time. This usually means either the program you’re connecting to is not up, or the security group isn’t configured to allow access from the machine you’re on.

dig: DNS lookup. Important in our environment because we don’t manage DNS the traditional way, and it’s possible for a machine (usually your VM) to not have the correct IP address for a host. You can look up a host, then check its IP address against nodelist.yaml to make sure it’s correct. If not, puppet should fix it. dig and telnet are your go-to tools for when it appears you can’t connect to a remote database or server. Make sure its IP is correct, the remote is up, and telnet works.

mongostat: Tells you what’s going on with mongo on the server. Sample output (apologies for the horrible formatting):

There’s a lot to mongostat that’s more appropriate for another doc, but the one column I want to focus on is “faults”, which in this dump is 12. This is page faults, i.e. the number of times per second that the OS had to go to disk to lift data into memory. In the case of Mongo, this is an indicator that your working set, i.e. the amount of data applications need quick access to, has outgrown RAM. Values in the hundreds or thousands are bad, and likely require either resizing the database or redesigning the data model to be more efficient. Just to be clear, this is an artifact of Mongo’s implementation. Page faults are not a good indicator of anything for other databases.

What exactly is a bad page fault number? Since a major fault by definition requires pulling data off disk, you can expect each page fault takes at least one disk seek. Rotational disks seek on the order of 3ms, so a single disk could sustain around 300 seeks per second. Striping a disk doesn’t necessarily improve this number since a disk seek takes the same amount of time, and in fact, it may be slower since you have to seek on every disk in a stripe and so you’re bounded by the slowest disk. SSDs have seek times around 0.1ms.

 

Mechanical Sympathy

Mechanical Sympathy is the ability to intuitively know what’s going on with a machine that you develop after years of debugging problems. It’s a term popularized by Martin Thompson, but coined by race car driver Jackie Stewart to describe how drivers become one with their cars. I’ll try to summarize a few things I’ve learned.

  • If you try to log into a machine or connect to an API and it hangs, either DNS is pointing to the wrong place, DNS isn’t working (connectivity issue), the remote process isn’t up, the port is blocked, or your route is blocked. If you can telnet to the ssh port (22) then you know the route is clear and DNS is working. If you run dig and it doesn’t work then DNS is broken, and that’s likely a connectivity issue.
  • If you log into a machine and typing at the command line is slow or spastic before you hit the enter key, then you’re having network connectivity problems.
  • If you log into a machine and typing shows no delay, but then you run a program like “ls” and it hangs, then it’s disk contention. You have two possibilities. Either the disk is messed up, or programs are doing a lot of disk io. If it’s a Mongo server then it’s likely Mongo since it’s hard on the disk. Check load average (hopefully top works!) — it should be high. Similarly, if your database queries are slow and running “ls” on the database server is slow, then you know your working set is too big for RAM and there’s disk contention.
  • A machine running at 100% CPU with no disk or network problems generally does not exhibit any behavior you might detect, because most Unix tools or shell commands are I/O bound, not CPU bound.
  • If your program is slow but it’s not using a lot of CPU and it’s not doing a lot of disk IO, then you’re getting blocked on external data, most likely a database.
  • If your program is hung and you attach strace to it, and you see that it’s stuck on a syscall, then it’s probably waiting for network or disk IO (it may also be waiting on a lock if it’s a multithreaded program).
  • If your program runs fine with input size N, but when you try input size 10 * N it never completes and your system feels sluggish, your OS has likely hit swap and there is now disk contention. It will never complete. Try to redesign the program to use less memory. You can check this with top.
  • If your program isn’t outputting a lot to disk, but is waiting on disk, either the disk is full or you’re reading / writing to a ton of files simultaneously (the disk is spending all its time seeking and not actually returning much data).

 

Additional Resources

WADE, Our New Best Friend

March 11th, 2016 by Wes

We’re pleased to announce WADE, a little database project of ours. You can get all of the interesting detail from the git repo, but I’ll go into a bit here first, because what’s an engineering blog if there’s no engineering?

At Chartbeat, we’ve found ourselves oftentimes wrestling with database scale problems when something we wanted to do with high throughput didn’t quite match the database’s data model. Usually in these cases, we fall back on a read-write-update cycle, which can kill performance and suffer from race conditions.

One annoying-as-hell example has to do with us maintaining a map of URLs to what we call canonical paths. You’ll oftentimes see in your browser’s URL bar a path that has a bunch of query parameters that have no relation to the content of the page, such as UTM tags or other tracking codes. When we send our data pings, we include all of those parameters, but on the server side we have to make some attempt to strip out the parts of the URL that aren’t relevant; otherwise we run the risk of not counting up all the metrics for a page.

A super simple example might be these two URLs:
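    /squirrels-are-barking-mad?referrer=twitter
    /squirrels-are-barking-mad?referrer=facebook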

We can’t just take the URLs at their face value and assign them each one page view. Rather, we want to strip off the “?referrer=twitter” and “?referrer=facebook” part and count two page views for /squirrels-are-barking-mad. The problem is compounded by the lack of a standard set of tracking codes used by publishers, so we can’t assume we’ll always know how to sanitize the data.

One solution is for our client to customize the ping so that it includes a canonical path of /squirrels-are-barking-mad, or sets an og:url meta tag within the page. That’s all well and good, but it assumes that the client actually implements this correctly, that the browser doesn’t mess something up in transit and that nobody is trying to spoof our client’s data. The internet is a messy place, so we regularly see multiple conflicting canonical paths for the same source URL.

What we do then is keep a tally of what mappings we’ve seen so far, and our system in essence votes for what the canonical path should be for any given URL.

Simple enough, but how do we do this at scale? We handle tens of thousands of such pings per second, and need to be able to vote with very low latency and high throughput. A simple way to achieve this at reasonable cost would be to use memcache or Redis as the backing store, and simply set the key to the source URL, and the value to a map of canonical path to counts, where counts are the number of times a particular source URL mapped to that canonical path. So one entry might be something like:
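    "/squirrels-are-barking-mad?referrer=twitter" -> {
        "/squirrels-are-barking-mad": 12382,
        "/squirrels-are-cute": 1
    }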

Meaning, we’ve seen /squirrels-are-barking-mad?referrer=twitter map to /squirrels-are-barking-mad 12,382 times, and to /squirrels-are-cute once.

With memcache, adding a count follows a read-write-update cycle:

  1. Read the opaque value at /squirrels-are-barking-mad?referrer=twitter.
  2. Deserialize it to a dict, call it “counts”.
  3. Add 1 to the value at counts[‘/squirrels-are-barking-mad’].
  4. Serialize and write the values back to /squirrels-are-barking-mad?referrer=twitter.

To do this, the client has to pull down data from the database, do the deserialization mojo, do some serialization mojo, then write it back, so there’s a full roundtrip. In addition, there’s a race between steps #1 and #4 if two clients are simultaneously trying to update the same URL on a backing store that doesn’t support transactions.
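In code, the cycle looks roughly like this (using python-memcached here; any client with get and set has the same problem):

    import json
    import memcache   # python-memcached

    mc = memcache.Client(['127.0.0.1:11211'])
    key = '/squirrels-are-barking-mad?referrer=twitter'
    path = '/squirrels-are-barking-mad'

    raw = mc.get(key)                          # 1. read the opaque value
    counts = json.loads(raw) if raw else {}    # 2. deserialize it into a dict
    counts[path] = counts.get(path, 0) + 1     # 3. add 1 to the candidate path
    mc.set(key, json.dumps(counts))            # 4. serialize and write it back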

While we’re applying a continuous stream of counts, we may get asked what the canonical path is for some source URL. To figure this out, the client has to pull down and deserialize the data for a source URL, then look at the tallies and vote on which canonical path has the most weight. In this example, a voting query would return /squirrels-are-barking-mad, since it’s clearly the correct one.

How does WADE improve on this? A database is essentially a chunk of data that transitions from state to state, where the transitions are insert/update commands. WADE is a replicated state machine which has no opinions on the structure of its state. This is entirely programmer defined, as are the state transition functions, which we call mutating operations. A WADE cluster keeps track of object state and pushes mutating operations to all of the replicas in a consistent way using the chain replication algorithm. Because operations are fully programmable, we’re able to eliminate read-write-update cycles by defining application specific transitions.

Let’s first set aside all the nice replication benefits WADE gives us for free and look just at the update operation and how the programmer defines mutating operations. We’d call the /squirrels-are-barking-mad?referrer=twitter entry an object, and define state transitions on this object. One such state transition would be an update command that adds one to a candidate canonical path. I’ve simplified the code for the purposes of this post, but it’s only marginally more complex than:
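Conceptually it’s something like this (a hypothetical sketch of the operation; the real function and its signature differ):

    def add_canonical_vote(state, canonical_path):
        # state is the tally dict for one source URL object; None if the object is new.
        counts = dict(state or {})
        counts[canonical_path] = counts.get(canonical_path, 0) + 1
        return counts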

This function would get deployed to the WADE nodes, much like custom written views are deployed with a Django (or any web framework) app to a web server. Then a client connecting to the WADE cluster would issue a single update command:
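Roughly like this (‘client’ stands in for a connected WADE client, and the method and argument names are illustrative, not WADE’s exact API):

    client.update('/squirrels-are-barking-mad?referrer=twitter',   # object key
                  'add_canonical_vote',                            # mutating operation
                  '/squirrels-are-barking-mad')                    # candidate path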

We don’t have to do a full round trip, and we only send the data necessary to increment the counter.

The vote would simply be another customized WADE operation, but this time a non-mutating operation, i.e. a query.
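A hypothetical sketch of that query operation (again, not WADE’s real code):

    def vote(state):
        # Non-mutating: return only the winning canonical path, not the whole tally.
        counts = state or {}
        return max(counts, key=counts.get) if counts else None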

The vote function would ship with WADE to the nodes in the same way as the mutating operations and run on the nodes themselves. The client would issue a vote operation, and the cluster would return only the top path, eliminating the need to transfer the entire serialized object and run the logic in the client.

There are consistency benefits as well. WADE serializes update operations and has strong consistency guarantees (at the risk of partial unavailability), so race conditions don’t exist.

The URL tracking example is one simple application that motivated the design of the system, but another is a probabilistic set cardinality (HyperLogLog) service that powers our Engaged Headline Testing product. At the moment our HLL service is built on top of Riak but suffers from high network overhead and chatter between nodes. Each HLL structure is around 10k in size, and even a small bit flipping operation requires downloading the entire serialized data. When you throw in the possibility of siblings, what we’re left with is a system that scales beautifully horizontally, but is not as resource efficient as it could be. We have yet to move our HLL service to WADE, but initial performance tests show significant improvement over our current Riak cluster. We’ll be open sourcing a HLL service on WADE in the near future.

Another nice property of WADE: the core code is simple. While I can’t really argue that thinking about distributed systems is easy, we’ve reduced the surface area of misunderstanding by enforcing a rule that the core code will always be fewer than 1,000 lines. The effect of this is that WADE doesn’t optimally handle a lot of edge cases, though (bugs notwithstanding) it maintains correctness. A node failure might cause the cluster to be unavailable for longer than it would be with an industrial-strength database, but it’s still within acceptable limits for production use.

Also, we leave a lot of details up to the programmer. While WADE will handle replication happily, it needs to be told what nodes to replicate to. There are nuanced reasons why we might want to leave this up to the programmer that I may go into in another blog post.

There is some missing functionality (fast syncing, as described in the repo’s README, as well as a useful generic overlord), but we’ll fill those in over time as we improve our understanding of operational issues.

In the meantime, please take a look at WADE and play with the in-memory kv store that ships with it. This is still alpha quality software, but we’re optimistic it’ll see varied use within our systems in important and high scale areas. We would love to hear feedback, and of course would be happy to accept pull requests. Just don’t bust our 1,000 LOC quota.

We’re back from a too-long hiatus! I know you missed us.

One of our engineers has been nursing a casual interest in ionizing radiation over the past couple of hackweeks and decided to build a Geiger counter from scratch. At the heart of this device is a component called a Geiger-Muller tube that facilitates the particle detection. This site is an incredible resource for choosing the tube for the project. It details a cost-benefit analysis of easily procurable tubes along with links to spec sheets and eBay listings; helpful, especially since most of these resources are in Russian elsewhere.

A nice case-study to bookmark for when you’re trying to optimize some python code in the future.

We closed our Series C! Which means we’re looking to hire and move into a new space. Since Chartbeat strives for openness and planiness, our Creative Energy Oracle is designing our new workplace following this advice.

Happy Friday!

Everything You Always Wanted To Know About GAME BOY (but were too afraid to ask). This nearly-15-year-old text file contains everything (and I mean everything) you ever wanted to know about the 1989 Nintendo Game Boy. The author gushes over its scrappy little Z80-style 8-bit processor, the way it stores tiles in a single byte, and the input register changes that happen when playing Ms. Pac-Man. If you like hardware and video games, this is a fascinating read.

Which is the more efficient solution for loading scripts in an HTML document: simulating async loading with script injection, or a blocking synchronous script tag? It’s frequently considered a performance best practice to use script injection for loading JavaScript. However, there are some often-overlooked performance gotchas with this common pattern. Ilya Grigorik gives us the rundown.

An interesting look at whether you should always jump to the distributed solution when facing large data sets. From the intro: “Rather than making your computation go faster, the systems introduce substantial overheads which can require large compute clusters just to bring under control. In many cases, you’d be better off running the same computation on your laptop.” And the follow up.

Pianist Kimiko Ishizaka open sourced her rendition of Bach.

Ralf Rottman is a scrum curmudgeon.

What’s more fun than writing Lisp? Writing games in Lisp, of course!

Happy Friday!

Back in 2014 I gave a presentation at PyGotham on a neat PostgreSQL feature called Foreign Data Wrappers. Because of a variety of factors, the presentation was a little unhinged, and I declined to make it public in favor of a more detailed written post, which I swore would appear just a couple weeks after PyGotham finished. Well, typical of a software developer, I shipped right on time plus 7 months. You can find the post over at GitHub along with code and a Vagrantfile for your enjoyment.