Here at chartbeat, we have been running MongoDB on EC2 for more than a year. Recently we transitioned from our master/slave configuration to replica sets. The benefits of replica sets are clear, especially in an environment like EC2 where node failures are a part of the routine; it lowers the oversight cost of dealing with failures almost to zero. This post briefly describes the replica set setup we ultimately settled on, and some of the hurdles along the way.
Our Final Setup
The requirements for our architecture were ensuring seamless failover, allowing periodically snapshotting the data, and enabling distributing reads on secondaries. We also had to constrain downtime during the transition to less than a minute. Our final setup is a 4 node replica set; where there is one primary, two secondaries, and a final node that runs both a hidden secondary used for EBS snapshotting and an arbiter. We chose high-memory instance types (m2.xlarge) for the three non-hidden nodes, and a standard-memory type for the snapshot/arbiter box (m1.large). Marking the snapshot box as hidden, ensures that it never appears in the replica set, when a client asks a replica-set member for the list of nodes; hence, ensures that the box will never be queried. This being the case, we can take periodic EBS snapshots (we do hourly) of its data without worrying about performance.
In order to reach this final setup, we had to upgrade our servers from mongodb 1.6.5 to 1.8.1 (for the ‘hidden’ node capability), which was a drop-in replacement, and then 1) change individual server’s configuration to replica-set 2) extend the pymongo drivers to ensure reading from secondaries 3) and, unfortunately deal with some minor but unexpected MongoDB issues, such as failing to propagate configuration to the newly added nodes.
1) Changing The Configuration To Replica-Set
To change our master/slave setup to a replica set, we could have just added ‘–repl=foo‘, to our mongo configuration files. However, for this we would have had to incur some down-time, where we would shutdown each server, start them each with the new configuration, and wait for them to elect a master. Furthermore, in case of any unforeseen events happening with the new setup, we would not have any way of rolling back.
Instead we decided to run master/slave servers side-by-side with the replica set servers we would be bringing up, and do dual writes until we would iron out the issues. To achieve this, we first brought up a slave from an existing slave’s EBS snapshot. Once we ensured that this slave was caught-up; we converted it into a single node replica set, and restarted our applications to start doing dual-writes, to the master/slave and the one-node replica set.
Adding new nodes into this replica-set turned out to be challenging due to the size of data we are dealing with. In many configurations, one could just add a node, and get it to do a full-sync with the primary. However, due to our data-size (multiple TBs) and the relatively slow IO on Amazon, this operation would have taken 2+ days for every new node. Instead, we used our master/slave EBS snapshots to bring up nodes with almost complete data, get these nodes to slave off of our master until they were fully caught up, and then make them replica-set members, with –fastsync option. The –fastsync option there is critical, since trying to add a replica node from master/slave into replica-set would try to do a fullsync if –fastsync is not provided. With these steps, we brought up the secondaries, and had functional master/slave and replica set systems working side-by-side. Next step was starting to read from replica sets.
2) Evenly Distributing Reads To Secondaries with PyMongo
Unfortunately, PyMongo 1.9 (and all the versions up until 1.11) do not support directing reads to the secondaries. Although the drivers handle replica sets and and connecting to the primary seamlessly, all the reads and writes always go to the primary; which, is less than ideal for both the read and write performance. Further development on this front is slated for pymongo 2.0, which we could not afford waiting for. Interestingly, pymongo’s master_slave_connection.py module supports dealing with multiple connections (one to each server) and intelligently sending the writes and reads to master and slave respectively. Inspired by Aaron Westendorf‘s blog-post, we extended this module to support a) balanced reads to secondaries b) abstracting away the handling of failures in servers from the application. Now that we had the server and application setup created, we were ready to start using replica sets in production.
3) Some Miscellaneous Mongodb Issues We Worked With
During these steps we had to work with some MongoDB issues that made setup and maintenance more challenging than expected. Normally, adding a new node to the replica-set is easily handled with the ‘rs.add‘ command. This command changes the configuration on the primary; and the persisted configuration replicates to the secondaries and the new node. However, we observed multiple times that the configuration propagated to all the nodes except for the newly added node, as highlighted in the mongodb-user group; leaving the new node with a ‘BADCONFIG‘ error. The only remedy we could find for this, was manually overriding the configuration on the new node as described here. To Mongo’s credit, this issue will be fixed in MongoDB 2.0.
Although with some effort, we completed our transition to replica sets without any major problems; and have been running it in production for a month or so. And, thanks to the joy of running on EC2 🙂 , we have already started seeing the payoff of this transition, by letting our servers and the application seamlessly take care of node failures, and directing reads/writes appropriately.
UPDATE (12-26-2011): PyMongo 2.1 now supports distributing the reads to secondaries and periodically checking if the secondary connections are healthy. We are planning to do the transition to this version soon.