June 2021 Datacenter Switchover

In June 2021, most user traffic was switched from our primary Virginia datacenter to our secondary one in Texas. This post covers how the switchover went and the issues that came up.

By Kunal Mehta, Site Reliability Engineer, Service Operations

In June 2021, the Wikimedia Foundation’s Site Reliability Engineering team switched most user traffic from our primary datacenter in Virginia (“eqiad”) to our secondary one in Texas (“codfw”). We’ve done this exercise multiple times over the past five years, and this was the smoothest and fastest one yet.

The main reason we perform a datacenter switchover is to verify that in an emergency, we can switch to a different datacenter with minimal interruptions for users. All of our services and datacenters have redundant networking, power, disks, and more. Even then, freak accidents can happen, and we need to be prepared.

We also used this time to perform maintenance in Virginia that’s cumbersome to do when we’re actively serving user traffic. For example, we’re currently swapping out about 45 MediaWiki application servers for brand new hardware, giving users a slight performance boost. There’s also a large list of pending database maintenance that was waiting for the switchover to happen.

The switchover itself was divided into three primary sections: Services, Traffic (caches), and MediaWiki.

Services

MediaWiki used to be a single large PHP application, but years ago we started breaking it up into a set of smaller services. Today, we have MediaWiki, which is still a large PHP application, and many services that each provide some independent function to MediaWiki, such as maps, math syntax, or even the WikiText parsing itself. For each switchover, we try to expand the list of services being switched. This time we added two more services to the list, notably Swift, which handles all of our media storage.

Most of these are active-active, in that they run out of both datacenters at the same time. Under normal circumstances, we choose to use these in the same datacenter as MediaWiki. During the switchover, we moved usage to Texas to ensure we have enough capacity there to handle the load. Here’s an example of traffic shifting from Virginia to Texas for the Citoid service, which fetches and generates reference templates and metadata.

Citoid in eqiad after June 2021 DC switchover — Virginia graph by Legoktm, CC BY-SA 4.0

During this process we identified a few issues:

T285707: Our helm-charts service doesn’t have a service IP, causing it to fail verification that it switched over properly. This also interrupted the verification for the rest of the services, so we had to check them by hand.
T285710: Monitoring for the Wikidata Query Service required manually switching the datacenter being monitored, causing lag to be misreported. Most Wikidata bots do check the amount of lag before editing, so they were stalled until it was manually switched.
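To illustrate why a misreported lag value stalled bots, here is a minimal sketch of a bot-side check. The function name, the 5-second threshold, and the retry count are assumptions made for illustration, not the behavior of any particular bot framework.

```python
from typing import Callable, Optional


def wait_until_lag_ok(
    read_lag: Callable[[], float],
    max_lag_seconds: float = 5.0,  # hypothetical threshold
    max_attempts: int = 3,         # hypothetical retry budget
) -> Optional[int]:
    """Return the attempt number on which lag was acceptable, or None.

    `read_lag` stands in for whatever the bot uses to ask the site how far
    behind the query service is. A real bot would sleep between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        if read_lag() <= max_lag_seconds:
            return attempt
    return None


# Monitoring still points at the wrong datacenter: lag looks huge forever,
# so the bot never edits.
print(wait_until_lag_ok(lambda: 3600.0))

# Monitoring has been repointed: the first reading is fine, the bot proceeds.
print(wait_until_lag_ok(lambda: 1.0))
```

The point of the sketch is that the bot trusts the reported number. If monitoring watches the wrong datacenter, the bot backs off even though the service it would actually hit is healthy.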

Traffic

Most requests for articles never hit MediaWiki itself. They’re served from our edge caches, typically from whichever site is closest to you: Virginia, Texas, California, Amsterdam, or Singapore. We disconnected Virginia by excluding it from our geographic DNS, which maps every country to a datacenter, and within a few minutes nearly all of that traffic was going to Texas instead.
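The idea behind excluding a site from geographic DNS can be sketched as a lookup with fallbacks. The country codes, the preference orders, and the `resolve` helper below are hypothetical; they only illustrate the mechanism and are not our actual DNS configuration.

```python
from typing import Dict, List, Set

# Hypothetical mapping: each country lists edge sites in order of preference.
GEO_MAP: Dict[str, List[str]] = {
    "US": ["virginia", "texas", "california"],
    "BR": ["virginia", "texas"],
    "DE": ["amsterdam", "virginia", "texas"],
    "JP": ["singapore", "california", "texas"],
}


def resolve(country: str, depooled: Set[str]) -> str:
    """Return the first preferred site for `country` that is not depooled."""
    for site in GEO_MAP[country]:
        if site not in depooled:
            return site
    raise LookupError(f"no available site for {country}")


# Before the switchover, US traffic lands in Virginia.
print(resolve("US", depooled=set()))

# With Virginia excluded, the same lookup falls through to Texas.
print(resolve("US", depooled={"virginia"}))
```

Countries whose first choice was never Virginia are unaffected by the exclusion, which is why only part of the total traffic moves.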

Varnish during June 2021 DC switchover — Varnish traffic graph by Legoktm, CC BY-SA 4.0

We didn’t run into any issues during this step.

MediaWiki

MediaWiki is the application that powers all of our wikis. Work is ongoing to make it possible to run it in multiple datacenters at the same time, but for now it can only be active in one at a time. The process for switching datacenters for MediaWiki is complex, but in brief it entails setting the primary databases to read-only, waiting for replication to the other datacenter to catch up, and then lifting read-only mode in the new datacenter.
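The three steps can be sketched as a small simulation. The `Database` class, the integer replication position, and the `switch_over` function are all invented for illustration; the real procedure involves many more moving parts and its own automation.

```python
from dataclasses import dataclass


@dataclass
class Database:
    name: str
    read_only: bool
    position: int = 0  # index of the last write applied (hypothetical)


def replicate_step(primary: Database, replica: Database, batch: int = 1) -> None:
    """Apply up to `batch` pending writes from the primary to the replica."""
    replica.position = min(primary.position, replica.position + batch)


def switch_over(old_primary: Database, new_primary: Database) -> int:
    """Move the writable role to `new_primary`; return replication steps waited."""
    # 1. Stop accepting writes so the old primary's position stops moving.
    old_primary.read_only = True

    # 2. Wait until the other datacenter has applied every write.
    steps = 0
    while new_primary.position < old_primary.position:
        replicate_step(old_primary, new_primary)
        steps += 1

    # 3. Only now is it safe to accept writes in the new datacenter.
    new_primary.read_only = False
    return steps


eqiad = Database("eqiad", read_only=False, position=10)
codfw = Database("codfw", read_only=True, position=7)
print(switch_over(eqiad, codfw), eqiad, codfw)
```

The ordering matters: lifting read-only mode before the replica has caught up would let new writes land on top of missing ones. The time spent in step 2 is what users experience as the read-only period.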

Because stopping edits is so disruptive for wikis, we’ve been shortening this read-only period with each switchover. This time, it lasted only 1 minute and 57 seconds, the fastest yet!

After the switch, the Turkish Wikivoyage was unavailable for a few minutes because of a typo in the configuration. An incident report was written for this, and a patch is pending review to prevent it from happening again.

Various other improvements to the automation around switching have been filed in Phabricator as well.

Next steps

We will switch back to our primary Virginia datacenter sometime in August once most maintenance has finished, allowing us to test the procedure once again. We also have Datacenter-Switchover and MediaWiki-MultiDC Phabricator projects tracking our work in this area to make Wikimedia wikis more resilient and available on a technical level.

About this post

Featured image credit: Wikimedia Servers by Victor Grigas, CC BY-SA 3.0