Swailing: helps you deal with your ridiculous log output

May 2nd, 2016 by Wes

Swailing is a handy little Python logging library we created in the process of developing WADE. It’s usually a handy thing to put in lots of logging info while developing a system (say, an API service), and what generally ends up happening is that there’s some piece of info that gets output on every request. In your dev environment, where you’re sending just a few requests per second perhaps, that’s not an issue, but then when you push this thing out and it starts handling thousands of requests per second in prod, well it’s like you just opened up the floodgates and all of a sudden you’re drowning in supposedly useful messages.

The general “best practice” is to use different priority levels, such as WARNING, ERROR, INFO, etc. Critical errors get written out with high priorities (ERROR), while development debugging info gets written out at low priorities (INFO). Then when you deploy, you set a high threshold and let only errors through. There are several issues with this approach.

The first is that sometimes, you get the priority levels wrong. If you accidentally emit a log at INFO when it should have been at ERROR, then when the system breaks in production, you’re missing important info for analysis. Conversely, you might emit a log at ERROR when in fact it should be INFO, and then get flooded by all sorts of useless information. The worst thing, though, is that it’s often times simply not clear while developing an application which level you should be using for some particular error condition. The internet is a noisy and chaotic input source, so we have to deal with and sometimes accept all sorts of errors.

The second issue is that even if you get the level exactly right, you still might deal with a flood of errors. Sometimes services exhibit a manageable trickle of errors, but then sometimes when you hit a condition (database down for example) that causes every single operation to fail. In that situation, all you really care is that you get some of the log output; most of it is redundant.

What we really want is a system that gives you complete log information under a manageable stream of errors, but then throttles when the gates of logging hell open up.

The Swailing library deals with this using the Token Bucket algorithm. It goes like this:

  1. You’ve got a bucket full of tokens. It has a fixed capacity of C tokens. In order to output a log line, you have to spend a token. If you have no tokens, no log outputting for you!
  2. Your bucket fills at a constant rate of R tokens per second. But the bucket can never fill above C tokens.

This is simple and elegant, both great properties of an algorithm. A useful reframing of the parameters is this: you can burst log up to C times over any period of time; if you stay under R lines per second, you get to output everything you want.

The reverse of Token Bucket is interesting as well. It’s called the Leaky Bucket: suppose you start with an empty bucket of size C which leaks at a fixed rate of R per second. Every time something occurs, you fill it, and if it ever reaches capacity C, you trigger a special condition. (Note: Swailing doesn’t use this algorithm.)

A concrete example is error detection. Suppose that you expect there will naturally be some small number of timeouts per second to your reverse proxy, because again the internet is the wild wild west, and sometimes browsers are slow or lose connectivity. You might set a threshold of, say, 100 errors in one second to alert on. But what if the errors are bursty and you usually see 1000 occur, but only once every minute? If you rejigger your threshold to 1000, then it’s possible that a constant 999 errors per second goes undetected.

With the Leaky Bucket, you’d set C to 1000 and R to 100. A burst to 1000 would immediately fill the bucket and trigger the condition. A burst of less than that would not, so long as there aren’t too many consecutive bursts. And a constant error rate of less than 100 goes undetected, since the bucket leaks at 100 per second.

The second thing that the Swailing library does is provide a handy facility for emitting PostgreSQL style debug lines, which adds some nuance to standard severity levels. In addition to how *bad* something is, we now have a way to control how much *detail* is associated with the log output.

There are three detail levels: primary, detail, and hint. Primary log lines always get emitted if higher than the severity threshold, but they’re meant to be very basic information. The detail line provides specifics as to what caused the error. The hint line is a speculative corrective action. If this were a web proxy, for instance, a complete log message might be:

With Swailing, the detail field is usually JSON so that we can programmatically parse the log file. The primary line is meant for quick visual scanning (or graphing in a system like ELK), so sometimes that’s all you really need. Usually the hint is only shown during development, but it may be useful for production as well.

Swailing has been pretty useful in WADE, and is starting to see more adoption with internal services at Chartbeat. We have a disgusting amount of logs coming out of our system, on the order of terabytes per day. Providing a tool that automatically cuts this down will go a long way towards allowing us to better manage and debug our systems.