Archive for the ‘EC2’ Category

Centralized logging had been on our backlog for quite some time at Chartbeat.  After shaving a few yaks, we got around to implementing Logstash about a year ago.  We by no means have the largest cluster, averaging around 25k events/sec and peaking at 50k during backfills.  We’ll share some tips from what we learned in deploying and scaling the ELK stack.

Architecture

We’re running the following software versions as part of our ELK stack:

  • Elasticsearch 1.5.2
  • Logstash 1.4.2
  • Kibana 3

Diagram of our deployment

ELK architecture

Logstash Forwarder (LSF)

There are many different ways you can pump data into Logstash given its number of input plugins.  Ultimately we decided to go with logstash-forwarder rather than running the Logstash agent itself on each server, for the reasons highlighted below:

  • Lightweight in both CPU and memory usage
  • No need to run a JVM everywhere; it’s written in Go
  • Pauses forwarding if errors are detected upstream and resumes once they clear
  • Data is encrypted
  • No queueing system required

Unfortunately we ran into a lot of issues during the deployment where LSF would lose its connection to the indexers or lose track of its position in a file.  There hasn’t been a lot of major work done on the LSF project in some time.  Etsy forked LSF (https://github.com/etsy/logstash-forwarder) and has fixed a lot of the outstanding issues and added things like multi-line support.  While we’d love to stay on the mainline version, we’ll most likely be moving to Etsy’s fork in the near future.

Logstash

You should specify the --filterworkers argument when starting Logstash and give it more than the default of 1 filter worker.  Logstash also uses a worker each for input and output, so set the number of filter workers with that in mind to avoid oversubscribing your CPUs: the filter worker count should be 2 less than the total number of CPUs on the machine.
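For example, on a 4-core indexer that works out to something like this (the install path is illustrative):

/opt/logstash/bin/logstash agent -f /etc/logstash/conf.d/ --filterworkers 2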

We gained a large performance boost by converting our logging (where we could) to JSON to avoid having to write complex Grok patterns.  For our Python code we used a wrapper that utilized python-logstash to output the logstash JSON format.  For Nginx logging, since it unfortunately doesn’t natively support JSON encoding of its logs, we took a hackish approach and specified a JSON format in the access_log format string:

    log_format logstash_json '{ "@timestamp": "$time_iso8601", '
                         '"@version": "1", '
                         '"remote_addr": "$remote_addr", '
                         '"body_bytes_sent": "$body_bytes_sent", '
                         '"request_time": "$request_time", '
                         '"upstream_addr": "$upstream_addr", '
                         '"upstream_status": "$upstream_status", '
                         '"upstream_response_time": "$upstream_response_time", '
                         '"status": "$status", '
                         '"uri": "$uri", '
                         '"query_strings": "$query_string", '
                         '"request_method": "$request_method", '
                         '"http_referrer": "$http_referer", '
                         '"http_host": "$http_host", '
                         '"scheme": "$scheme", '
                         '"http_user_agent": "$http_user_agent" }';

Just beware: using this method won’t guarantee your data is properly JSON encoded, and it will lead to occasional dropped log lines when a field contains characters that break the encoding.
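On the Python side, the wrapper boils down to a custom logging formatter.  Here’s a rough, standard-library-only sketch of the idea (the real wrapper uses python-logstash; the field names follow the Logstash v1 event format and the log path is illustrative):

import json
import logging
import socket
from datetime import datetime

class JsonLogFormatter(logging.Formatter):
    """Format log records as Logstash v1-style JSON events, one per line."""
    def format(self, record):
        event = {
            '@timestamp': datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.%fZ'),
            '@version': '1',
            'host': socket.gethostname(),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
        }
        return json.dumps(event)

# Illustrative setup: write JSON lines to a file that logstash-forwarder ships.
handler = logging.FileHandler('/var/log/myapp/myapp.json.log')
handler.setFormatter(JsonLogFormatter())
logging.getLogger('myapp').addHandler(handler)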

In order to monitor that logstash was sending data from all our servers, we set up a passive check on Nagios for all our hosts.  We then configured a basic cronjob on all our systems that does the following:

*/30 * * * * ubuntu echo PING  | logger -t logstash_ping

In logstash we set up a simple conditional to look for this data and send an OK to the passive check:

if [program] == "logstash_ping" {
  nagios_nsca {
    host             => "nagios.example.com"
    nagios_status    => "0"
    send_nsca_config => "/etc/send_nsca.cfg"
  }
}

If we haven’t received any data in an hour for this passive check, it will go critical and let us know of an issue.

Writing any grok filters? Grok debug is your best friend.  We also found it useful to take complex filters and define a custom grok pattern to simplify the configuration.

Using tags can help you with your Kibana dashboards and searches later on.  We have a few places where we have conditionals in place and then use the mutate filter to add a tag to make it easy to search for those events in Kibana.  For example:

grok {
  match   => [ "message", "Accepted (publickey|password) for %{USER:user} from %{IP:ssh_client_ip}" ]
  add_tag => [ "auth_login" ]
}
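When a match like this gets longer, the same pattern can be pulled out into a custom pattern file and referenced by name, for example (the path and pattern name are illustrative):

# /etc/logstash/patterns/chartbeat
SSH_LOGIN Accepted (publickey|password) for %{USER:user} from %{IP:ssh_client_ip}

grok {
  patterns_dir => ["/etc/logstash/patterns"]
  match        => [ "message", "%{SSH_LOGIN}" ]
  add_tag      => [ "auth_login" ]
}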

Using the KV filter on data that you don’t control?  Consider using a whitelist of keys that it should accept.  We recently had an issue where we were running the KV filter on Nginx error logs and the delimiter we were using ended up in some unexpected places, resulting in an explosion of unique keys (hundreds of thousands).  That caused the index’s mapping to grow to a very large size, which caused the master nodes to OOM and crash, bringing the cluster down.
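The kv filter’s include_keys option is that whitelist; something along these lines would have saved us (the key names are illustrative):

kv {
  source       => "message"
  include_keys => [ "client", "server", "request", "upstream", "host" ]
}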

Elasticsearch

Purchase Marvel.  Seriously, it’s worth it and gives a lot of insight into what’s going on within the cluster.  It’s free for development use.

Don’t give the JVM more than 32GB of RAM due to compressed pointers, see “Why 35GB Heap is Less Than 32GB – Java JVM Memory Oddities” for a nice explanation on why this is an issue.

G1GC is currently not supported by Elasticsearch, so be aware that you may see unexpected behavior like crashes and index corruption if you attempt to use it.

Ensure minimum_master_nodes is set to N/2+1, where N is the number of master-eligible nodes, to avoid potential split brain scenarios.

Using AWS?  You definitely want to be using the AWS plugin.  This allows Elasticsearch to use the EC2 APIs for host discovery instead of relying on a hardcoded list of servers, and you can filter based on tags and security groups.  The plugin also gives you the ability to set up S3 as a repository to snapshot index backups to.
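A minimal example of the relevant elasticsearch.yml settings (the region, security group, and tag values are illustrative):

cloud.aws.region: us-east-1
discovery.type: ec2
discovery.ec2.groups: elasticsearch
discovery.ec2.tag.env: production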

We utilized client nodes in a lot of places to avoid having to maintain a list of the Elasticsearch servers.  You’ll notice in our diagram we have a client node on each indexer and the Kibana server as well.  That way the Logstash indexers can just point to localhost for their Elasticsearch output and Kibana can point to the server it’s running on and they don’t need a service restart or anything if we add or remove any nodes.

Separate out your master nodes from your data nodes; you don’t want one large query to send the data nodes into a garbage collection frenzy and bring the whole cluster down with it.  The master nodes can be small since their only job is to route queries and store index mappings.  For more info on the different node types, see the docs on node types.
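The roles are just a couple of booleans in elasticsearch.yml, for example:

# dedicated master node
node.master: true
node.data: false

# data node
node.master: false
node.data: true

# client node (what we run on the indexers and the Kibana box)
node.master: false
node.data: false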

Use Curator to perform maintenance on your indices, such as optimizing and disabling bloom filters on indices that are no longer being actively indexed to.  You can also use it to enforce retention policies and delete indices that are older than a certain number of days.  We’re currently using version 2.0.1.
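With Curator 2.x, that nightly maintenance looks roughly like this (the prefix and retention windows are illustrative):

curator --master-only bloom --prefix logstash- --older-than 2
curator --master-only optimize --prefix logstash- --older-than 2 --max_num_segments 2
curator --master-only delete --prefix logstash- --older-than 60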

Elasticsearch can be made rack or DC aware by using node attributes and setting the cluster.routing.allocation.awareness.attributes property to use the attribute.  This will ensure that a shard’s primary and its replica will not be located on nodes with the same attribute value.  We set the following:

node.rack_id: $ec2_availability_zone
cluster.routing.allocation.awareness.attributes: rack_id

Node attributes can be custom and set to whatever you want.  You’ll notice we actually have 2 groups of servers in our diagram, our “fast” and “slow” nodes.  These nodes are all a part of one cluster but store different indices depending on their age.  As we started adding more data to the cluster, we decided to go with the approach of having new data get indexed onto beefier machines with SSDs and a lot of CPU, and after 48 hours, having that data migrated off to servers with a larger amount of spinning disk storage and a ton of RAM for querying.  We can scale these 2 node types independently and add more “slow” nodes if we want longer retention periods.  Elasticsearch makes it easy to do this through the use of routing allocation tags.  Using Curator, we set up a nightly cronjob for each index that does the following:

curator --master-only allocation --prefix logstash-nginx- --older-than 2 --rule node_type=slow

This will tag the index to only live on nodes with the node attribute node_type set to slow.  ES will then begin to migrate these shards to those nodes.
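Under the hood this is just the index-level routing allocation setting; the equivalent API call would be something like (the index name is illustrative):

curl -XPUT localhost:9200/logstash-nginx-2015.05.01/_settings -d '{"index.routing.allocation.require.node_type": "slow"}'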

You cannot just restart an Elasticsearch node without causing a re-shuffling of shards, which can be very IO intensive for larger amounts of data.  Elasticsearch will think the node is down; it has no concept of a grace period and will immediately start re-routing shards even if the node comes right back up.  In order to restart a node in a faster, less painful fashion, you can toggle routing allocation off and on around your restart as follows:

curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.enable": "none"}}'

restart node, wait for it to come up

curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.enable": "all"}}'

Take a look at indices.recovery.concurrent_streams and indices.recovery.max_bytes_per_sec to speed up shard recovery time.  The default values are pretty low in order to avoid interfering with searches or indexes but if you have fast machines, there’s no reason not to raise them from their defaults.
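Both settings can be changed on a live cluster, for example (the values are illustrative; tune them for your hardware):

curl -XPUT localhost:9200/_cluster/settings -d '{"persistent":{"indices.recovery.concurrent_streams": 6, "indices.recovery.max_bytes_per_sec": "100mb"}}'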

Looking ahead

In the next couple of months we’re going to look at upgrading to the Logstash 1.5 release, which fixes a few bugs we’re currently experiencing.  The puppet module has been pretty great for managing Logstash but currently does not support the new plugin system in the 1.5 release.  Moving to Kibana 4 will also come soon, but unfortunately we spent a lot of time building dashboards in Kibana 3 that will need to be completely redone in the new version, and we’ll need to spend some time re-training everyone on the new UI.

EDIT 2/20/14: Updated to reflect correct response time metric

In part 1 of our post, one of the items we discussed was our issues with using DNS as a load balancing solution.  To recap, at the end of our last post we were still set up with Dyn’s load balancing solution and our servers were receiving a disproportionate amount of traffic.  Server failures were still not as seamless as we wanted, due to DNS TTLs not always being obeyed, and our response times were a lot higher than we wanted them to be, hovering around 200-250ms.

In part 2 of this post, I’ll cover the following:

  • How we improved server failure handling and response time by using Amazon’s ELB service
  • The performance gains we saw from enabling HTTP keepalives
  • Future steps for performance improvements

But before I dive into the ELB, there’s one topic I left out of my last post that I wanted to mention.

TCP Congestion Window (cwnd)

In TCP congestion control, there exists a variable called the “congestion window”, commonly referred to as cwnd.  The initial value of cwnd is referred to as “initcwnd”.  After the initial TCP handshake is done and we begin to send data, the cwnd determines how many bytes we can send before the client needs to respond with an ACK.  Let’s look at a graphic, from a paper Google released, of how different initcwnd values affect TCP latency.

Latency vs initcwnd size

At Chartbeat, we’re currently running Ubuntu 10.04 LTS (I know, I know, we’re in the process of upgrading to 12.04 as this is being written), which ships with kernel 2.6.32.  Starting in kernel 2.6.39, thanks to some research from Google, the default initcwnd was changed from 3 to 10.  If you are serving up content greater than 4380 bytes (3 * 1460), you will benefit from increasing your initcwnd due to the ability to have more data in flight (BDP, or bandwidth delay product) before the client has to reply with an ACK.  The average response size from ping.chartbeat.net is way under that, at around 43 bytes, so this change had no benefit to us at the time, when the servers were not behind the ELB.  We’ll see why increasing the initcwnd helped us later in the post when we discuss HTTP keepalives.
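Raising the initcwnd doesn’t require a kernel upgrade; it can be set per route with iproute2.  A sketch of the commands (the gateway and device will differ per host):

# show the current default route
ip route show | grep default
# raise the initial congestion window to 10 segments on that route
ip route change default via 10.0.0.1 dev eth0 initcwnd 10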

ELB (Elastic Load Balancer)

The options for load balancing traffic on AWS are fairly limited.  Your choices are:

  • An ELB
  • A DNS load balancing service such as Dyn
  • A homegrown solution using HAProxy, nginx or <insert favorite load balancing software here>

Each of these solutions has its limitations, and depending on your requirements, some may not be suitable for you at all.  I won’t go into all of the pros and cons of each solution here since there are plenty of articles on the web discussing them already.  I’ll just go over a few that directly affected our choice.

With a homegrown solution, supporting high availability and scalability is difficult.  Currently AWS has no support for gratuitous ARP, which is traditionally used to handle failover in both software and hardware load balancers.  To work around this, you can utilize Elastic IPs and homegrown scripts to move the Elastic IP between instances when a failure is detected.  In our experience we’ve seen lag times from 30 seconds to a few minutes when moving an Elastic IP.  During this time, you would be down hard and not serving up any traffic.  The above solution also only works when all your traffic can be handled by one host and you can accept the small period of downtime during failover.

But how would you handle a situation where your traffic was too high for one host?  You could launch multiple instances of your homegrown solution, but you would then need to handle balancing the traffic between those instances.  We already discussed in part 1 the issues we had with using DNS to balance traffic.  The only other solution would be to actually put an ELB in front of these instances.  If we went with this solution, it meant adding another layer of latency to the request.  Did we really need to do something like this?

The reason why most people end up going with a solution like HAProxy is because they have more advanced load balancing requirements.  ELB only supports round robin request balancing and sticky sessions.  Some folks require the ability to do request routing based on URI, weight based routing or any of the other various algorithms that HAProxy supports.  Our requirements for a load balancing solution were fairly straightforward:

  • Evenly distribute traffic (better than our current DNS solution)

  • Highly available

  • Handle our current traffic peak (200k req/sec) and scale beyond that

  • End-to-End SSL support

ELB best met all these requirements for us.  A homegrown solution would have been overkill for our needs: we didn’t need any of the advanced load balancing features, SSL is currently only supported in HAProxy’s development branch (1.5.x) and requires stunnel or nginx for support in the stable branch (1.4.x), and we didn’t want to add any additional layers that would increase our latency even further.

Moving to ELB

The move to using an ELB was fairly straightforward.  We contacted Amazon support and our technical account manager to coordinate pre-warming the ELB.  According to the ELB best practices guide, ELBs will scale gradually as your traffic grows (they should handle a 50% traffic increase every 5 minutes), but if we suddenly switched 100% of our traffic to the ELB, it would not be able to scale quickly enough and would start throwing errors.  We weren’t planning on doing a cutover in that fashion anyway, but to be safe we wanted to ensure the ELB was pre-warmed ahead of time even as we slowly moved over traffic.  We added all the servers into the ELB and then did a slow rollout utilizing Dyn’s traffic director solution, which allowed us to weight DNS records.  We were able to raise the weight of the ELB record and slowly remove the individual servers’ IPs from ping.chartbeat.net to control the amount of traffic flowing through the ELB.

Performance gains

We saw large, immediate improvements in our performance with the cutover to the ELB: fewer TCP timeouts and a decrease in our average response time.

We went from roughly 200 ms average response times to 30 ms response times.  That’s an 85% decrease in response time!  (EDIT 2/20/2014) Thanks to Disqus commenter Mxx for pointing out that we incorrectly measured the response time here.  Moving behind the ELB changed the metric from being a measure of response time between our servers and clients to a measurement of response time between the ELB and our servers.  Comparing external data from Pingdom, we still saw a decrease in response time of about 20% at peak traffic times, going from 270ms to 212ms.  Apologies for the earlier incorrect statement.

Our traffic was now more evenly distributed than with our previous DNS based solution.  We were able to further distribute our traffic shortly after, when Amazon released “Cross-Zone Load Balancing”.

Enabling cross-zone load balancing got our request count distribution extremely well balanced; the max difference in requests between hosts currently sits at around 13k requests over a minute.

Request Count Distribution ELB cutover

KeepAlives

With our servers now behind the ELB, we had one last performance tweak we wanted to enable: HTTP keepalives between our servers and the ELB.  Keepalives work by allowing multiple requests over a single connection.  In cases where users are loading many objects off your site, this can greatly reduce latency by removing the overhead of having to re-establish a connection for each object being loaded.  CPU savings are seen on the server side since less time is spent opening and closing connections.  All this sounds pretty great, so why didn’t we have it enabled beforehand?

There are a few cases where you may not want keepalives enabled on your web server.  If you’re only serving up one object from your domain, it doesn’t make much sense to keep a connection hanging around for more requests.  Each connection uses up a small amount of RAM.  If your web servers don’t have a large amount of RAM and you have a lot of traffic, enabling keepalives could get you in a situation where you will consume all RAM on the server, especially with a high default timeout for the keepalive connection.  For Chartbeat, our data comes from clients every 15 seconds, holding a connection open just to get a small amount of data every 15 seconds would be a waste of resources for us.  Fortunately we were able to offload that to the ELB which enables keepalive connections by default for any HTTP 1.1 client.

With our servers no longer being directly exposed to the clients, we could revisit enabling keepalives.  We are doing a high volume of requests between the ELB and our servers, with the connections coming from a limited set of servers on Amazon’s end.  We want the ELBs to be able to proxy as much information as possible to us over one connection and keep that connection open for as long as possible.  This is where having a larger initcwnd comes into play: a larger initcwnd lowers our latency and gets our bandwidth up to full speed between the servers and the ELB.  We expected to see a drop off in the amount of traffic going through the servers as well as some CPU savings.  To ensure there were no issues, we did a “canary” test with one server enabled with keepalive and put it into production.  The results were not at all what we expected.  Traffic to the server became extremely spiky and average response time increased a bit when keepalives were enabled on the canary server.  After talking to Amazon about the issue, we learned that the ELB was favoring the host with keepalive enabled.  More traffic was being sent to that host, causing its latency to increase.  When the latency increased, the ELB would then send less traffic through the host and the cycle would start over again.  Once we confirmed what the issue was, we proceeded with the keepalive rollout and the traffic went back to being evenly distributed.  The number of sockets we had sitting in TIME_WAIT went from around 200k to 15k after enabling keepalives, and CPU utilization dropped by about 20%.

Keepalives and Timeouts

There are a few important things to be aware of when configuring keepalives with your ELB with regard to timeouts.  Unfortunately there’s a lack of official documentation on ELB keepalive configuration and behavior, so the information below could only be found through various posts on the official AWS forums.

  • The default keepalive idle connection timeout is 60 seconds
  • The keepalive idle connection timeout can be changed to values as low as 1 second and as high as 17 minutes with a support ticket
  • The keepalive timeout value on your backend server must be higher than that of your ELB connection timeout.  If it is lower, the ELB will re-use an idle connection after your server has already dropped it, resulting in the client being served a blank response.  The default nginx keepalive_timeout of 75 seconds is safe with the default ELB timeout of 60 seconds (see the example below).
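On the nginx side, that just means keeping keepalive_timeout above the ELB’s idle timeout; the keepalive_requests value here is only an illustration of letting many requests reuse one connection:

keepalive_timeout  75;
keepalive_requests 10000;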

Downsides

While the ELB has worked out great for us and we’ve seen huge performance improvements from switching to using one in front of our servers, there are a few issues we’d love to see addressed in future roll-outs of the ELB:

  1. Lack of bandwidth graphs in CloudWatch.  I’m surprised the ELB has been around for this long without this CloudWatch metric.  You get charged per GB processed through the ELB, yet there’s no way to see from Amazon’s view, how much bandwidth is going through your ELB.  This could also help identify DoS attacks that don’t involve making actual requests to the ELB.

  2. No ability to pre-warm an ELB without going through support.  Right now it’s a process of having to contact Amazon support to get an ELB pre-warmed, and answering a bunch of questions related to your traffic.  Even if this process was moved to a web form, like how requests for service limit increases are done, it would be better than the current method.

  3. No ability to clone an ELB.  Why would you want that?  If you have an ELB that is handling a large amount of traffic and you are experiencing issues with it, you cannot easily replace the faulty ELB in a hurry due to the need for new ELBs to scale up slowly.  It would be extremely useful to clone an existing one, capturing its fully warmed configuration, and then be able to flip traffic over to it.  Right now if there’s an issue, AWS support needs to get involved, and unless you are paying for higher end support, you may not get a fast enough response.

  4. No access to the raw logs.  A feature to send the ELB logs to an S3 bucket would be very valuable.  This would open up a bunch of doors with the ability to setup AWS Data Pipeline to fire off an EMR job or move data into Redshift.  Currently all that must be done on the servers behind the ELB.

  5. No official documentation on keepalive configuration or behavior.
  6. Ability to change the default  keepalive timeout value is not exposed through the API and requires a support ticket.

Conclusions

We learned an important lesson by not monitoring some key metrics on our servers that were having an effect on our performance and reliability.  With increasing traffic it’s important to re-evaluate your settings periodically to see if they still make sense for the level of traffic you are receiving.  The default TCP sysctl settings will work just fine for a majority of workloads, but when you begin to push your server resources to their limits, you can see big performance increases by making some adjustments to variables in sysctl.  Through TCP tuning and utilizing AWS Elastic Load Balancer we were able to:

  • Decrease our traffic response time by 20%
  • Decrease our server footprint by 20% on our front end servers
  • Have failed servers removed from service within seconds
  • Eliminate dropped packets due to listen queue socket overflows

Next Steps

Since the writing of this article, we’ve done some testing with Amazon’s new C3 instance types and are planning to move from the m1.large instance type to the c3.large.  The c3.large is almost 50% cheaper and gives us more compute units which in turn yields slightly better response times.

Our traffic is very cyclical, which lends itself perfectly to taking advantage of Amazon’s auto scaling feature.  Take a look at a graph of a week’s worth of traffic.

week_total_concurrents

In the middle of the night (EDT), we see half of what our peak traffic was earlier in the day and on weekends we see about 1/3 less traffic than a weekday.

In the coming months we’ll be looking to implement auto scaling to achieve additional cost savings and better handle large, unexpected spikes of traffic.


Our average traffic at Chartbeat has grown about 33% over the last year, and depending on news events, we can see our traffic jump 33% or more in a single day.  Recently we began investigating ways we can improve performance in handling this traffic through our systems.  We set out to collect additional metrics from our systems, and we were able to reduce TCP retry timeouts, reduce CPU usage across our front end machines by about 20%, and improve our average response time from 175ms to 30ms.

History

First, a brief overview of our architecture.  Currently we are hosted on Amazon Web Services.  Every 15 seconds, we receive data from our javascript that is embedded on clients’ pages.  This data is sent to ping.chartbeat.net.  Behind ping.chartbeat.net sit a number of m1.large servers running Nginx that handle the proxying of these “pings” to our realtime infrastructure.  For this article I’ll be focusing on how we improved the performance of receiving and proxying the data sent to ping.chartbeat.net.

In 2009 when the company was first starting out, we used round robin DNS to load balance the traffic for ping.chartbeat.net.  While this scaled fine for the first year, DNS can only serve up a maximum of 13 records due to UDP’s packet length limit of 512 bytes.  This obviously wasn’t going to be a long term solution for us if we wanted to grow the company.  In 2010 we switched our DNS provider to the folks at Dyn.  Dyn offers a load balancing service that handles monitoring the IPs to ensure they are reachable and automatically pulls them from being served up in the DNS response.  Each server was assigned an Elastic IP from Amazon so that we could handle failures by just moving the IP to a new server, rather than having to update the service with a new IP when a server failure occurred.  We switched ping.chartbeat.net over to Dyn’s DNS load balancing service and have used it over the last 3 years.

Cons of the setup

While Dyn’s service has been great over the last 3 years, they unfortunately have no control over the problems that exist within DNS itself.  When we experienced a server failure on our front end servers, Dyn would pull the IP address from being served up for ping.chartbeat.net.  With a low enough TTL on the record, the IP would stop being served up from DNS caches.  Unfortunately, there are a large number of (misbehaving) DNS servers out there that don’t properly obey TTLs on records and will still serve up stale records for an indefinite amount of time.  This meant that we were not getting data from clients who were still being served the stale record while that server was being replaced.  This also presented problems when we scaled our traffic to handle large events.  When we went to scale our infrastructure back down, clients were still being served the old IPs of the servers we had removed from rotation.  We would see traffic flowing in days after we removed the records, which had TTLs set for only a few minutes.  Eventually we had to have a cutoff point so we could shut down the servers, but this meant dropped pings from clients, which we want to avoid as much as we can.

DNS requests being distributed evenly does not mean we will see traffic get evenly distributed across our servers.  Many users are behind proxies that would send thousands of users to one server.  We regularly saw variances of up to 1000 req/sec across our front end servers.  Not being able to efficiently balance the traffic meant we were running with more servers than necessary.

Why didn’t you just use an ELB?

At the time, ELB was still a very young service from Amazon.  ELB was launched without support for SSL termination, a feature we needed.  SSL support was launched in 2010, but without the ability to do full end-to-end SSL encryption.  Full end-to-end encryption was launched later, in the summer of 2011.  By the time the features we needed were fully supported, AWS had experienced some major outages involving the ELB service, which made us hesitant to move towards using it right away.

Improving Performance and Reliability

The first step to improving our performance was to graph and measure the response times on our front end servers.  We’re utilizing Etsy’s Logster to parse the access logs every minute and output the data to statsd which then makes its way into graphite.  We run logster under ionice and nice in order to minimize the impact the log parsing has on the performance of the server each minute.  We also write our logs to a separate ephemeral partition formatted with ext4.  Using ext4 instead of ext3, we saw better performance when rotating out log files.
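The cron entry looks something like this; the parser name and output flags are illustrative and assume a logster build with a statsd output, but the important part is the ionice/nice wrapping:

# run at idle I/O priority and lowest CPU priority (flags and parser are illustrative)
* * * * * www-data ionice -c 3 nice -n 19 /usr/local/bin/logster --output=statsd --statsd-host=127.0.0.1:8125 NginxAccessParser /var/log/nginx/access.log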

In addition to graphite, we are using Ganglia to graph system level metrics.  We are graphing numerous metrics from the output of netstat -s such as number of TCP timeouts, and Socket listen queue overflows.  We also graph the counts for each state a socket can be in.  Due to the number of sockets we have on each server, it’s not possible to use netstat to view the states of all the sockets without it affecting performance and taking 5 minutes or longer to return.  Thanks to the folks in #ganglia on Freenode, we learned about using the command ss -s as an alternate way to get socket information that returns in a fraction of the time.

Some initial numbers from a front end server:

  • 80k packets/sec at peak
  • 200-250ms request times
  • 2k-4k req/sec
  • 80% avg CPU; unfortunately we were on an old version of the Ganglia CPU plugin and initially didn’t realize we were not graphing steal % time, so we were not getting a true picture of our CPU utilization and believed this number was lower than it actually was
  • 350k connections in TIME_WAIT (yikes)
  • 14k interrupts/sec

Graph all the things! (then actually look at the graphs)

Now that we had all these metrics being graphed, it was important to actually understand how these metrics were affecting our reliability and performance.  Two graphs that stood out right away were our graphs of “SYNs to LISTEN sockets dropped” and “times the listen queue of a socket overflowed”.

syns_to_socket_listen_drops

Any metric that contains the words “dropped” or “overflow” with values that keep increasing can’t be a good thing.
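Both counters come straight out of netstat -s, so a quick spot check on any box is just:

netstat -s | grep -i listen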

What do these values actually mean, though?  The listen function in BSD and POSIX sockets accepts an argument called the backlog.  From the listen man page:

The backlog argument defines the maximum length to which the queue of pending connections for sockfd  may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later re-attempt at connection succeeds.

and a very important note

If  the  backlog  argument is greater than the value in /proc/sys/net/core/somaxconn, then it is silently truncated to that value; the default value in this file is 128.  In kernels before 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128.

We were dropping packets since the backlog queue was filling up.  Worse, clients will wait 3 seconds before re-sending the SYN, and then 9 seconds if that SYN doesn’t get through again.

Another symptom we saw when looking at /var/log/messages was this message showing up

[84440.731929] possible SYN flooding on port 80. Sending cookies.

Were we being SYN flooded?  Not an unreasonable thing to expect with the servers exposed to the internet, but it turns out this message can send you looking in the wrong direction.  Couldn’t we just turn off SYN cookies to make this message stop, cover our eyes, and pretend it’s not happening?  Digging into this message further, we learned that SYN cookies are only triggered when the SYN backlog overflows.  Why were we overflowing the SYN backlog?  We had set net.core.somaxconn and net.core.netdev_max_backlog both to 16384, and the number of sockets in the SYN_RECV state was nowhere near those numbers.  Looking at some packet captures, it did not appear we were actually under a SYN flood attack of any kind.  It seemed like our settings for the max SYN backlog were being ignored, but why?

While researching this, we came across some folks with a similar issue and discovered that Nginx defaults to a backlog value of 511 if one is not set; it does not default to using the value of SOMAXCONN.  We raised the backlog value in the listen statement (e.g. listen 80 backlog=16384) to match our sysctl settings and the SYN cookie messages disappeared from /var/log/messages.  The number of TCP listen queue overflows went to 0 in our Ganglia graphs and we were no longer dropping packets.

listen_queue_overflows

 

More sysctl TCP tuning

After fixing the backlog issue, it was time to review our existing sysctl settings.  We’ve had some tunings in place for a while, but it had been some time since they were reviewed to ensure they still made sense for us.  There’s a lot of bad information out on the web on tuning TCP settings under sysctl that people just blindly apply to their servers.  Often these resources don’t bother explaining why they are setting a certain sysctl parameter; they just give you a file to put in place and tell you it will give you the best performance.  You should be sure you fully understand any value you are changing under sysctl.  You can seriously affect the performance of your server with the wrong values, or with certain options enabled in the wrong environment.  The TCP man page and TCP/IP Illustrated: The Implementation, Vol 2 were great resources in helping to understand these parameters.

Our current sysctl modifications as they stand today are as follows (included with comments).  Disclaimer: please don’t just apply these settings to your servers without understanding them first.

# Max receive buffer size (8 Mb)
net.core.rmem_max=8388608
# Max send buffer size (8 Mb)
net.core.wmem_max=8388608

# Default receive buffer size
net.core.rmem_default=65536
# Default send buffer size
net.core.wmem_default=65536

# The first value tells the kernel the minimum receive/send buffer for each TCP connection,
# and this buffer is always allocated to a TCP socket,
# even under high pressure on the system. …
# The second value specified tells the kernel the default receive/send buffer
# allocated for each TCP socket. This value overrides the /proc/sys/net/core/rmem_default
# value used by other protocols. … The third and last value specified
# in this variable specifies the maximum receive/send buffer that can be allocated for a TCP socket.
# Note: The kernel will auto tune these values between the min-max range
# If for some reason you wanted to change this behavior, disable net.ipv4.tcp_moderate_rcvbuf
net.ipv4.tcp_rmem=8192 873800 8388608
net.ipv4.tcp_wmem=4096 655360 8388608

# Units are in page size (default page size is 4 kb)
# These are global variables affecting total pages for TCP
# sockets
# 8388608 * 4 = 32 GB
#  low pressure high
#  When mem allocated by TCP exceeds “pressure”, kernel will put pressure on TCP memory
#  We set all these values high to basically prevent any mem pressure from ever occurring
#  on our TCP sockets
net.ipv4.tcp_mem=8388608 8388608 8388608

# Increase max number of sockets allowed in TIME_WAIT
net.ipv4.tcp_max_tw_buckets=6000000

# Increase max half-open connections.
net.ipv4.tcp_max_syn_backlog=65536

# Increase max TCP orphans
# These are sockets which have been closed and no longer have a file handle attached to them
net.ipv4.tcp_max_orphans=262144

# Max listen queue backlog
# make sure to increase nginx backlog as well if changed
net.core.somaxconn = 16384

# Max number of packets that can be queued on interface input
# If kernel is receiving packets faster than can be processed
# this queue increases
net.core.netdev_max_backlog = 16384

# Only retry creating TCP connections twice
# Minimize the time it takes for a connection attempt to fail
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2

# Timeout closing of TCP connections after 7 seconds
net.ipv4.tcp_fin_timeout = 7

# Avoid falling back to slow start after a connection goes idle
# keeps our cwnd large with the keep alive connections
net.ipv4.tcp_slow_start_after_idle = 0

A couple of additional settings we looked at were tcp_tw_reuse, tcp_tw_recycle and tcp_no_metrics_save.

After reading about tcp_tw_recycle, we decided right away that we did not want it enabled.  You should really never enable this, as it will affect any connections involving NAT and result in dropped connections.  See http://stackoverflow.com/questions/8893888/dropping-of-connections-with-tcp-tw-recycle for more information.  This is an example of a setting I often see in TCP performance blog posts where the option is enabled with no explanation of why, or of the dangers involved in having it enabled.

tcp_tw_reuse is a safer way to reduce the number of sockets you have in a TIME_WAIT state.  The kernel will allow re-using a socket if it’s deemed safe from a protocol standpoint when a socket is in a TIME_WAIT state. In our testing we really didn’t see a discernible difference with this option enabled so we left it disabled.

By default, the kernel keeps metrics on each TCP connection established to it.  Values like rto (retransmission timeout), ssthresh (slow start threshold) and cwnd (congestion window size) are kept for each connection so that when the kernel sees that connection again, it can re-use those values in hopes of optimizing the connection.  You can view these values with the command ip route show cache.  There are cases where you may not want this caching enabled.  A slow client coming from behind a NAT can cause non-optimal values to be cached and re-used for the next person connecting from that IP.  If your clients are mostly on mobile, where connection quality is erratic, you’ll want to consider enabling tcp_no_metrics_save, since each session will most likely have different optimal connection settings.
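If you decide your client mix fits that profile, disabling the cache is a one-line addition to the sysctl settings above:

# Don't cache per-destination TCP metrics (rto, ssthresh, cwnd) between connections
net.ipv4.tcp_no_metrics_save = 1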

Some important takeaways from this first part are:

  • DNS is not a great means of load balancing traffic
  • Modifying sysctl values from their defaults can be important to ensure reliability
  • Graphing metrics is your friend

In part 2 of this blog post, we’ll explore our move to using Amazon’s ELB, enabling of HTTP keep alive and look at some graphs showing the impact of the changes.

Last Thursday (October 18, 2011), we had a MongoDB meetup at Chartbeat HQ.  I gave one of the presentations at this meetup and talked about our experience running MongoDB on EC2.  Check out the talk if you are interested in learning more about how Chartbeat evolved from a single/simple cluster in a master-slave configuration to leveraging multiple clusters, benefiting from EBS snapshotting, loving replica sets, dealing with disappearing instances, focusing on disk locality and much more.

The video is here.

Here are the slides:


Here at Chartbeat, we have been running MongoDB on EC2 for more than a year.  Recently we transitioned from our master/slave configuration to replica sets.  The benefits of replica sets are clear, especially in an environment like EC2 where node failures are part of the routine; they lower the oversight cost of dealing with failures almost to zero.  This post briefly describes the replica set setup we ultimately settled on, and some of the hurdles along the way.

Our Final Setup
The requirements for our architecture were ensuring seamless failover, allowing periodic snapshotting of the data, and enabling distributed reads from secondaries.  We also had to constrain downtime during the transition to less than a minute.  Our final setup is a 4-node replica set: one primary, two secondaries, and a final node that runs both a hidden secondary used for EBS snapshotting and an arbiter.  We chose high-memory instance types (m2.xlarge) for the three non-hidden nodes, and a standard-memory type for the snapshot/arbiter box (m1.large).  Marking the snapshot box as hidden ensures that it never appears in the list of members a client gets when it asks a replica-set member for the list of nodes; hence, the box will never be queried.  This being the case, we can take periodic EBS snapshots (we do hourly) of its data without worrying about performance.
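A hidden member is just a regular member added with priority 0 and the hidden flag set, roughly like this from the mongo shell on the primary (the _id and hostname are illustrative):

rs.add({ _id: 3, host: "snapshot1.example.com:27017", priority: 0, hidden: true })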

In order to reach this final setup, we had to upgrade our servers from MongoDB 1.6.5 to 1.8.1 (for the ‘hidden’ node capability), which was a drop-in replacement, and then 1) change each server’s configuration to a replica set, 2) extend the PyMongo driver to allow reading from secondaries, and 3) unfortunately, deal with some minor but unexpected MongoDB issues, such as configuration failing to propagate to newly added nodes.

1) Changing The Configuration To Replica-Set
To change our master/slave setup to a replica set, we could have just added ‘replSet=foo’ to our mongod configuration files.  However, this would have meant incurring some downtime while we shut down each server, started them with the new configuration, and waited for them to elect a master.  Furthermore, if anything unforeseen happened with the new setup, we would not have had any way of rolling back.

Instead we decided to run the master/slave servers side-by-side with the replica set servers we would be bringing up, and do dual writes until we ironed out any issues.  To achieve this, we first brought up a slave from an existing slave’s EBS snapshot.  Once we ensured that this slave was caught up, we converted it into a single-node replica set and restarted our applications to start doing dual writes to the master/slave setup and the one-node replica set.

Adding new nodes to this replica set turned out to be challenging due to the size of the data we are dealing with.  In many configurations, one could just add a node and have it do a full sync with the primary.  However, due to our data size (multiple TBs) and the relatively slow IO on Amazon, this operation would have taken 2+ days for every new node.  Instead, we used our master/slave EBS snapshots to bring up nodes with almost complete data, got these nodes to slave off of our master until they were fully caught up, and then made them replica set members with the --fastsync option.  The --fastsync option is critical there, since adding a node from master/slave into the replica set would otherwise trigger a full sync.  With these steps, we brought up the secondaries and had functional master/slave and replica set systems working side-by-side.  The next step was starting to read from the replica set.
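In concrete terms, each new secondary was started from a snapshot-seeded data directory with something along these lines (the set name and paths are illustrative):

mongod --replSet chartbeat --dbpath /var/lib/mongodb --fastsync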

2) Evenly Distributing Reads To Secondaries with PyMongo
Unfortunately, PyMongo 1.9 (and all versions up until 1.11) does not support directing reads to the secondaries.  Although the driver handles replica sets and connecting to the primary seamlessly, all reads and writes always go to the primary, which is less than ideal for both read and write performance.  Further development on this front is slated for PyMongo 2.0, which we could not afford to wait for.  Interestingly, PyMongo’s master_slave_connection.py module supports dealing with multiple connections (one to each server) and intelligently sending writes to the master and reads to the slaves.  Inspired by Aaron Westendorf’s blog post, we extended this module to support a) balancing reads across the secondaries and b) abstracting the handling of server failures away from the application.  Now that we had the server and application setup in place, we were ready to start using replica sets in production.
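The core idea behind the extension is simple: keep one connection per member, send writes to the primary, and spread reads across the secondaries.  A rough sketch of just the read-balancing half, written against the PyMongo 1.x-era API (the hostnames are illustrative, and the real module also handles failures and retries):

import random
from pymongo import Connection  # PyMongo 1.x-era API

# Hostnames are illustrative.
primary = Connection('db-primary.example.com', 27017)
secondaries = [
    Connection('db-secondary1.example.com', 27017, slave_okay=True),
    Connection('db-secondary2.example.com', 27017, slave_okay=True),
]

def collection_for_read(db_name, coll_name):
    """Route reads to a randomly chosen secondary, falling back to the primary."""
    conn = random.choice(secondaries) if secondaries else primary
    return conn[db_name][coll_name]

def collection_for_write(db_name, coll_name):
    """All writes go to the primary."""
    return primary[db_name][coll_name]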

3) Some Miscellaneous MongoDB Issues We Worked With
During these steps we had to work through some MongoDB issues that made setup and maintenance more challenging than expected.  Normally, adding a new node to the replica set is easily handled with the ‘rs.add‘ command.  This command changes the configuration on the primary, and the persisted configuration replicates to the secondaries and the new node.  However, we observed multiple times that the configuration propagated to all the nodes except the newly added node, as highlighted in the mongodb-user group, leaving the new node with a ‘BADCONFIG‘ error.  The only remedy we could find was manually overriding the configuration on the new node as described here.  To Mongo’s credit, this issue will be fixed in MongoDB 2.0.

Although it took some effort, we completed our transition to replica sets without any major problems and have been running them in production for a month or so.  And, thanks to the joy of running on EC2 🙂, we have already started seeing the payoff of this transition, with our servers and application seamlessly taking care of node failures and directing reads/writes appropriately.

UPDATE (12-26-2011): PyMongo 2.1 now supports distributing the reads to secondaries and periodically checking if the secondary connections are healthy. We are planning to do the transition to this version soon.