The Wikimedia Analytics Engineering team manages multiple systems, all gravitating around a big (for our standards) Hadoop cluster. This post describes our path to changing our Hadoop distribution in a single day, together with the lessons learned while doing it.

By Luca Toscano, Senior Site Reliability Engineer, and Joseph Allemandou, Staff Data Engineer

Introduction

The Wikimedia Analytics Engineering team is composed of data engineers, software developers, and site reliability engineers whose primary responsibility is to “empower and support data-informed decision making across the Foundation and the community.”

Some of the team’s main responsibilities are:

Maintain and curate several datasets useful for the community and the Foundation.
For example, HTTP requests landing on any of our wiki project websites are ingested and refined as Webrequest; several derivative datasets are made, for example, Pageviews, MediaRequests, and Unique Devices.
Provide compute power and storage to Wikimedia’s data scientists and analysts, via tools like Jupyter notebooks, Hive, Spark, Presto, and Druid.
Provide data exploration and visualization tools to the Foundation with Superset and Turnilo running on top of our Druid and Presto clusters.
Support the Wikistats website and its open API, providing up-to-date metrics to the broader community, together with periodical public data dumps of various data sources.

The infrastructure that makes all the above possible powers the Analytics Data Lake, and Hadoop is at its center, with other tools gravitating around it.

Image credit: Tools gravitating around Hadoop, Luca Toscano, CC BY-SA 4.0

Hadoop is “an open-source software for reliable, scalable, distributed computing.” It is composed of three main components:

HDFS: A distributed file system that provides high-throughput access to application data.
YARN: A framework for job scheduling and cluster resource management.
Map Reduce: A YARN-based distributed processing framework, for parallel processing of large data sets.

The cluster our team maintains runs around 80 nodes, adding up to over 10TB of RAM and 4300 virtual CPU cores, and storing up to 1PB of data (data is replicated 3 times across the cluster, so the actual storage available is actually more than 3PB).

In terms of software, we have used the Cloudera CDH 5 Hadoop distribution over the past years as a convenient way to avoid maintaining a lot of custom Debian packages, since all our production infrastructure runs on the Debian OS. When Cloudera released the CDH 6 distribution, they dropped the support for the Debian OS, to concentrate on Ubuntu and introduced a paywall for all the software released from their repository. This change represented a big roadblock for us, so we started to look for alternatives. After a review of the available Hadoop distributions, we chose Apache Bigtop for our use case.

Why Apache Bigtop

Among the several features we were looking for in a new Hadoop distribution, the following were the most important ones:

An active open source project, open to community contributions and suggestions.
A project committed to supporting the Debian OS, or at least considering it if we were to offer help in maintaining packages.
A way to control and rebuild packages if needed, to apply custom patches independently from upstream releases.
A project with a clear development path to Hadoop 3.
A project releasing software with a license compatible with Wikimedia’s Guiding Principles.

Apache Bigtop was the only distribution we considered able to satisfy all the above aspects. It is a top-level Apache project, and as such, it follows the mantra of community over code:

“A healthy community is a higher priority than good code. Strong communities can always rectify problems with their code, whereas an unhealthy community will likely struggle to maintain a codebase in a sustainable manner.”

While the sentences may sound obvious and straightforward, they are very powerful and imply a lot of work for the upstream development community.

All of the Apache projects need to maintain several mailing lists; users@ and dev@ are the most famous ones, respectively tracking questions from the user community and development decisions and progresses from the committers and Project Management Committees of the project. In addition to those, the Apache Bigtop project provides other mailing lists to track package builds from the Jenkins CI and Jira work logged by the community, so it is really easy for newcomers to get a quick look around and figure out how active the project is.

We sent a couple of emails at the beginning of our review to ask questions about the project, and we got a lot of feedback and suggestions from the community. For example, the support for Debian 10 (Buster) was not finalized at the time, and talking with the community made it so that it would be included in the next release, together with support for Openssl 1.1.

The Bigtop community also offers Docker images on Docker hub to be able to easily spin up a cluster¹, or to mimic the work of one of the builder nodes that periodically rebuild packages for various operating systems and architectures. This was a key discovery for us since, with one command, it was possible to rebuild any package offered by Bigtop, and therefore, we could build our own version of the distribution including our own set of patches. Last but not the least, the development community highlighted the fact that the next release was targeting Hadoop 2.10.1, but the subsequent one would be Hadoop 3 based.

Preparation steps

The test setup

Changing a Hadoop distribution is a big task. It requires planning and careful testing, especially when multiple teams and public APIs rely on your systems for their daily jobs. To facilitate testing we decided to create a testing cluster to mimic the production one on a smaller scale, to be free to test configuration and software without impacting users. The idea was the following:

Check if all the packages currently in use (from the CDH distribution) were covered in Bigtop, and if not what alternatives were available.
Import the Bigtop packages in our Debian APT repository, making them available internally.
Bootstrap a bare minimum test cluster with Cloudera CDH and plan an upgrade strategy that worked reliably. The upgrade procedure needed to also include a rollback plan that was repeatable and working every time.
Iteratively upgrade and rollback to and from Apache Bigtop, focusing on finding a procedure that could be automated and that worked without leading to inconsistencies.
Write down the procedure into an automation framework, and test again upgrade/rollback.
Test the upgrade from the point of view of satellite systems, like clients and services that need Hadoop to work. The goal was to find corner cases or inconsistencies as early as possible.
Scale up the upgrade plan to work on the production cluster.

The majority of the time was spent in points 3 and 4 since we encountered a lot of small and big roadblocks along the way. We then automated the procedure using the Wikimedia cookbook framework (an abstraction layer on top of the more generic Spicerack library).

Bugs and issues

We created automation scripts as early as possible in the process, to translate the procedure outlined above in code. While testing, several interesting bugs came up:

Due to https://issues.apache.org/jira/browse/YARN-8310, all Node Managers need to get their tmp directory wiped on every worker before the upgrade.
Due to https://issues.apache.org/jira/browse/YARN-8310, the ZKRMStateRoot znode in Zookeeper needs to be removed manually to allow the Resource Managers to start after the upgrade. This also implies that the history of the last YARN application metadata is wiped out.
Due to https://issues.apache.org/jira/browse/YARN-5774, every YARN job/app got stuck into ACCEPTED state without progressing if a specific setting for the Fair Scheduler was not set.
Due to https://issues.apache.org/jira/browse/BIGTOP-3308, on Debian 9 (our initial target OS) the presence of two versions of libssl (1.0.2 and 1.1) led to unexpected symlinks for Openssl’s libcrypto, impairing the ability of Spark to create a session and run a simple job.
Due to https://issues.apache.org/jira/browse/BIGTOP-3317, the scripts packaged for Apache Oozie to upload shared libs to HDFS didn’t work, and needed a fix and re-packaging.
Due to https://issues.apache.org/jira/browse/BIGTOP-3330, the Bigtop Oozie packages (client and server) led to conflicts when deployed together.

The above list tells a great story, namely that upgrades need to be tested multiple times in order to nail down all corner cases. The price to pay is spending more time in testing and delaying the upgrade, but not having to deal with these problems in production is very much worth the price.

Backing up 400 Terabytes

One of the big questions that our team asked during testing was: “What happens if for any unforeseen reason HDFS ends up in an unrecoverable corrupted state after the upgrade?” Even though nothing we read on the internet suggested it could happen, the worst-case scenario of losing data was unacceptable for us.

The Wikimedia Foundation has a strict data retention policy: we drop all raw data containing PII data after 90 days, keeping only a sanitized version of it. While it is challenging to not be able to recompute data after 90 days, it allows us to better guarantee the users’ privacy. That being said, all our sanitized data is stored only on HDFS², so we needed a way to mitigate a possible HDFS corruption during the upgrade.

We worked on two things:

Identify unrecoverable datasets, namely historical data that we would not be able to re-create from other datasets if lost. This was a good exercise to have a sense of the amount of data that we implicitly considered critical but never really formally categorized (size, stakeholders, etc..).
Calculate how much space a one-off backup would need, and find a solution to store it somewhere (respecting the aforementioned Data Retention guidelines).

The final count ended up being about 450 terabytes of unreplicated data to backup from HDFS. Thanks to some lucky coincidence, we had 24 new Hadoop worker nodes available. We had provisioned them to replace out-of-warranty nodes and expand the cluster, and we ended up using them to form a temporary backup cluster. Every node was equipped with 72 virtual cores (after hyperthreading), 192G of RAM, 12 4TB HDDs for the HDFS DataNode partitions/volumes, and 2 SSDs for the OS root partition. This was enough to host the backup data with a replication factor of two.

We used DistCp as the tool of choice to copy the data from the original cluster to the backup one. The incremental capability of the tool has proven tremendously useful in our situation. The first run over a folder was long, as it needed to actually copy all the data, but following runs over the same folder only copied files had changed. So we started the backup phase a week before the planned upgrade day, and ran jobs every day to synchronize folders with their new data. Some fine-tuning was needed for the first few jobs to find a good balance between folder-size³ to copy and job duration; by the day of the upgrade, the backup was storing the expected data.

The Deployment Day

Upgrade procedure

The Hadoop upstream project introduced over the years more and more functionalities to improve the overall reliability and security of YARN and HDFS.

At Wikimedia we chose to use the following:

Automatic HDFS leader failover via Zookeeper and Query Journal Manager cluster. HDFS is composed of two leaders⁴ and several worker nodes, respectively called NameNodes and DataNodes. There must be only one leader daemon working at any given time, and Zookeeper is used as a way to reach consensus about which leader is active and which one is in standby. The Query Journal Manager cluster is, in our setup, a subset of worker nodes running another daemon called JournaNode, responsible for keeping a log of every change happening to the status of the NameNode (from the last known checkpoint, namely the FSImage). When the active NameNode fails or goes down, the standby one needs a reliable way of finding all the transactions committed by the old active NameNode after the last common checkpoint. The Query Journal Manager cluster helps in offering a distributed/consistent edit log, that the new NameNode can consult to update its last FSImage/checkpoint with the missing transactions.
Kerberos authentication for all services and clients. By default Hadoop doesn’t offer any form of authentication, trusting the client to tell the truth about the user submitting a command. We introduced a Kerberos infrastructure in the past couple of years, adapting the Hadoop cluster to use it. We have also turned on all the security features to encrypt/authenticate Hadoop RPCs and related API calls flowing in the cluster (for example, the YARN Shufflers use TLS to allow reducers to pull data from mappers, etc..).

Using a Query Journal Manager cluster prevented us from doing a rolling upgrade since the upstream documentation (in this case, the Hadoop Apache project one) suggested another procedure. It was crucial for us to evaluate the extra features enabled on our cluster before upgrading since most of the time upgrade procedures assume a very basic and standard setup. As upstream recommendations were very generic, we adapted them to our internal procedures that have been battle-tested over the years for roll restarts and minor upgrades for

Cloudera’s CDH, and we came up with the following:

Force the Hadoop HDFS NameNodes to safe mode (only read operations allowed) and create a new FSImage with the latest updated status of metadata. The files are then backed up in a safe place as they represent the status of the file system’s metadata, and they are our safety net in case something goes wrong with the upgrade procedure.
Stop the Hadoop cluster (both YARN and HDFS) gracefully, starting with the YARN NodeManagers and ResourceManagers, then proceeding with a very slow stop of all the HDFS DataNodes, then with the HDFS NameNodes, and finally with all the components of the Query Journal Manager clusters, the JournalNodes (they are required by the NameNodes to work so they need to be stopped last).
Upgrade the Debian packages on the worker nodes. The upgrade includes a restart of the daemons (YARN NodeManagers, HDFS Journalnodes, and HDFS DataNodes). The HDFS DataNodes should automatically upgrade their blocks directories with a previous and current directory, sharing blocks via hard links to facilitate a rollback if needed.
Upgrade the Debian packages on one Hadoop primary leader, starting the HDFS NameNode with the upgrade flag. This should trigger an upgrade of the NameNode and Query Journal Manager’s internal states. As it happened for the DataNodes, the NameNode will also create a previous and current directory for its file system metadata and state on disk to ease the rollback (if needed).
Complete the procedure for the remaining Hadoop leader acting as standby, starting the HDFS NameNode with the bootstrapStandby option as indicated by the upstream guide (basically a clean start).

Finalize the HDFS upgrade with a specific command executed on the active HDFS NameNode.

For the rollback, we came up with the following:
Stop the Hadoop cluster as highlighted above (a good opportunity to use a common shared automation procedure).
Rollback all the Debian packages to their previous versions and slowly bring up the cluster again.
All the HDFS DataNodes will need to be started with the rollback option since they will need to restore the previous status of their blocks.
Once it is the turn of the first HDFS NameNode, start it with the “rollback” option. This leverages the aforementioned previous/current hard linked directories and restores the status of HDFS as it was right before the upgrade, including the Query Journal Manager cluster.
The backup of the FSImage should not be needed as there are the current/previous directories, but it is safer to have it, for instance, if the “rollback” command for the NameNode fails.

The above procedure only covers the Hadoop cluster part of the infrastructure. All the satellite systems like client nodes (where users run queries and dashboards), Druid nodes (that uses HDFS as deep storage), etc. need to be upgraded in a reliable way as well. We have around 30 satellite nodes in our infrastructure, and we created another automated procedure to roll out the correct packages everywhere.

The scripts used to automate the upgrade are:

While they have some Wikimedia-specific syntax, they should be easy to read and adapt for others to reuse. Last but not least, one thing that helped us a lot has been using hostname labels, facilitating running the same code in the test and production environments without special cases in the code. This helped to better prove the reliability of the upgrade procedure.

Lessons learned

The day of the upgrade was a great learning experience for the team. We had planned a downtime of four hours, and we ended up working for twelve! Our testing environment was little in comparison to our production one (3 workers vs 80), and hence, the amount of data stored was very different as well—around 500K HDFS blocks in test, versus more than 50M blocks in production. None of the documentation and blog posts we had read mentioned it; we didn’t give it the necessary importance, and we had to deal with the fallout of this decision during the upgrade day.

Let’s review what happened in detail. We stopped all the jobs in the production cluster and stopped its nodes without any issue. Then we ran the script to upgrade all the hosts (workers then leaders), which worked flawlessly over the course of an hour. We noticed a problem after we started the active NameNode: half of the DataNodes failed to register to it after the package upgrade, leading to a failure in bootstrapping the HDFS file system. The NameNodes have a protection mechanism called Safe Mode when starting that prevents them from accepting anything other than “read” commands (more precisely, RPCs) from clients until a certain threshold of DataNodes successfully reports their blocks. Obviously, the NameNode being stuck in such a state was a bad scenario for the upgrade. The whole Analytics team jumped on a call to brainstorm about the problem, and we eventually found a way to bootstrap HDFS successfully. After quite some time spent in reviewing DataNodes’ logs, we ended up finding traces of Java OutOfMemory and GC Overhead exceptions, so we bumped all DataNodes’ max heap size from 4G to 16G. It took some time as we tried with a subset of the cluster nodes first, and then slowly rolled out the same change to the whole cluster.

This is what we believe happened:

Any upgrade from 2.6 to something more recent needs to go through a restructuring of the DataNode volumes/directories, as described in HDFS-3290 and HDFS-6482.
From HDFS-8782 it seems that the procedure requires time, and until the volumes are upgraded the DataNodes don’t register to the NameNode. This is what we observed during the upgrade, some of the DataNodes took several minutes to register, but eventually, they succeeded without hitting OOMs.
HDFS-8578 was created to process the DataNode volumes/directories in parallel during upgrades, independently from the DataNode dir structure upgrade mentioned above, but this would cause OOMs with small heap sizes as described in HDFS-9536 (this seems to still be an open problem).

We knew that rollback was an option. However, we did not consider it given that we continued to make slow but steady progress. The more blocks stored on each DataNode the more likely it was that the upgrade required a bigger heap size for those DataNodes, and we had not foreseen that as we hadn’t tested on worker nodes more than 200K HDFS blocks for which a 4G max heap size was more than enough. It is worth noting that the DataNode’s heap size of the production cluster has not caused any problem before this upgrade.

After the heap size bump, we got HDFS into a healthy state, and we were able to proceed with the upgrade using the other script to upgrade clients that completed without raising any other problem. Again, we wish to raise the point that automating any step as much as possible is very important, especially to ease the pressure on whoever runs the procedure. Upgrading manually one host at a time after hours of debugging would have been a nightmare; meanwhile, only checking how automation proceeded was surely easier, particularly with all the stress involved.

Conclusion

In this blog post, we hope to have given you a good overview of what needs to be done to prepare an upgrade for Hadoop and its satellite systems, together with traps and pitfalls to avoid when testing and upgrading. We hope to have provided a good overview for people interested in migrating to Apache Bigtop, for further questions don’t hesitate to contact us!

Footnotes

Managed via Puppet, with base code provided. ↩
We are relying on HDFS data replication as a safety net to prevent data loss from hardware/network failures. We have recently started working on a permanent backup solution in T277015. ↩
It is interesting to note that not only data size matters here, but also number of files. ↩
This is the most common and simple use case, but HDFS offers also more advanced configurations like Federation that we’ll not discuss.↩

About this post

Featured image credit: African Bush Elephant, Muhammad Mahdi Karim, GNU Free Documentation License Version 1.2

1 thought on “Upgrading Hadoop in just one day”

Robin Khokhar says:

June 26, 2021 at 6:54 am

Hi Luca and Joseph,
I have used Hadoop and it’s amazing. And Thanks for sharing such a good search analysis. It is amazing. I love it.
Thanks for the great share.
have a good weekend.