By Andrew Bogott, Senior Site Reliability Engineer and Brooke Storm, Staff Site Reliability Engineer
We’re currently running Ceph version ‘Nautilus,’ which is the stock version packaged with Debian Buster.
General hardware overview
The details of how Ceph actually works are well beyond the scope of this blog post (more detail can be found here). In brief (and for our present use case), it splits all data into arbitrary chunks, maintains three copies of each chunk, keeps track of where those copies are, and makes sure that all three eggs are in different baskets. The software that keeps track of the health of all this runs on ‘monitor’ hosts. The hosts that actually store and replicate the data are called ‘OSD hosts’.
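To make the "three eggs in different baskets" idea concrete, here is a minimal sketch of replica placement. It uses rendezvous (highest-random-weight) hashing as a simplified stand-in and is not how Ceph itself places data. The host names and the `place_chunk` helper are hypothetical, chosen only for illustration.

```python
import hashlib

REPLICAS = 3  # the post describes three copies of each chunk


def place_chunk(chunk_id: str, hosts: list[str], replicas: int = REPLICAS) -> list[str]:
    """Pick `replicas` distinct hosts for a chunk using rendezvous hashing.

    Simplified illustration only; Ceph's real placement logic is different.
    """
    if len(hosts) < replicas:
        raise ValueError("need at least as many hosts as replicas")

    def score(host: str) -> int:
        # Hash the (chunk, host) pair; the highest scores win.
        digest = hashlib.sha256(f"{chunk_id}:{host}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Sorting by score and taking the top entries guarantees distinct hosts.
    return sorted(hosts, key=score, reverse=True)[:replicas]


# Hypothetical host names; the post only says there are 15 OSD hosts.
hosts = [f"osd-host-{n:02d}" for n in range(1, 16)]
placement = place_chunk("chunk-0001", hosts)
print(placement)
```

Because the choice depends only on the chunk ID and the host list, any party can recompute where the copies live without consulting a central lookup table, and the three copies always land on three different hosts.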
Having three copies of everything is great — it means that there’s never a tie about state, and it means that a lot of hardware would have to die at once in order for any data to be lost. Having three copies also means that the total hardware needs are immense.
Our current cluster contains 15 OSD hosts; each host contains eight 1.8 terabyte SSDs. Altogether that’s about 216 terabytes of raw storage, which is enough to handle all the current VMs plus just a little bit of space for growth. The good news is that since Ceph worries about redundancy and striping we don’t lose anything to local RAIDs, and expanding the cluster when needed is operationally trivial.
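The capacity arithmetic can be checked in a few lines. This is a rough sketch: the host, drive, and replica counts come from the post, while the "usable" figure simply divides raw capacity by the number of copies and ignores any filesystem overhead or headroom the cluster needs in practice.

```python
OSD_HOSTS = 15       # from the post
SSDS_PER_HOST = 8    # from the post
SSD_TB = 1.8         # terabytes per SSD, from the post
REPLICAS = 3         # three copies of every chunk

raw_tb = OSD_HOSTS * SSDS_PER_HOST * SSD_TB
# Assumption: usable space is raw space divided by the replica count,
# with no allowance for overhead or free-space headroom.
usable_tb = raw_tb / REPLICAS

print(f"raw: {raw_tb:.0f} TB, usable with {REPLICAS} copies: {usable_tb:.0f} TB")
# prints "raw: 216 TB, usable with 3 copies: 72 TB"
```

In other words, the 216 terabytes of raw storage hold roughly a third of that in actual VM data, which is why the hardware needs are so large.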
There are also three monitor hosts. Three is, again, the lucky number that means there’s redundancy but no chance of a tied vote in case of a disagreement.
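The reasoning behind an odd number of monitors can be sketched with majority-quorum arithmetic. This assumes that a strict majority of monitors must agree for the cluster to make decisions; the helper names are hypothetical.

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can fail while a majority can still form."""
    return monitors - quorum_size(monitors)


for n in (1, 2, 3, 4, 5):
    print(f"{n} monitors: quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under this assumption, three monitors tolerate one failure, and adding a fourth does not tolerate any more: it only introduces the possibility of a two-against-two split.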
To safely maintain redundant data copies, blocks are constantly getting copied, deleted, and rebalanced among the OSD nodes, with placement determined pseudo-randomly by the CRUSH algorithm. That’s a fair amount of network chatter even when things are at rest; if a given node loses power or suffers some other kind of failure, everyone will get to work replacing the lost copies and the network will get extremely busy. We don’t expect our system to be anywhere near the scale needed to cause such a problem. Still, to prevent Ceph from even being able to launch an accidental denial-of-service attack on the datacenter, we’ve installed everything in such a way that the backend traffic cannot flood uplinks for other services. The cluster network for backend traffic isn’t even on a routed VLAN.
Traffic to hypervisors is on separate interfaces to keep things isolated and performant even during minor outages of individual OSD servers. This traffic will be rate-limited at the VM level via the VM’s flavor definition in OpenStack in order to prevent individual VMs from flooding the hypervisor’s network capacity.
Planning for the worst
The WMCS team is new to Ceph, which means we’re largely unfamiliar with possible failure scenarios. Ceph has a good reputation for stability, but as we rely on it for ever more storage use cases it also becomes an ever more intimidating disaster risk. To mitigate that, we’re pursuing several approaches:
- We are hiring a short-term consultant, both to provide training and also to be on call in case problems arise that are beyond our understanding.
- We’re trying to hire a Ceph expert to join our team. If this is you, please apply!
- We’re investigating backup solutions for Ceph-hosted volumes. We will probably never have the resources to comprehensively back up every piece of data, but we do hope to have a selective backup process whereby more project-critical VMs can have short-term backups outside of Ceph in case of collapse. More research is needed before we’ll know what level of durability we can commit to.
About this Post
This is Part 2 of a two-part series on Ceph and Wikimedia Cloud Services. Read Part 1.