
Saying no to proprietary code in production is hard work: the GPU chapter

By Luca Toscano, Miriam Redi and Nuria Ruiz

The vast majority of the code that runs Wikipedia is Open Source—released under Free Software licenses. This means that the infrastructure that delivers the site’s free knowledge runs software that is not owned by any company. It is publicly available to anyone; you can read the code, and if you want, you can use it on your own server. Maintaining and improving one of the largest websites in the world using Open Source software requires a continuous commitment. The site is always evolving, so for every new component we want (or need!) to deploy, we need to evaluate the Open Source solutions available.

Our latest challenge was setting up an internal environment for machine learning model training. The fundamental difference between machine learning and traditional programming is that in the latter, you write code that, given an input, generates an output. In the former, you write a program that learns mappings between data inputs and outputs.

This data is called training data.

We use large-scale training data to derive a formula that captures properties of the data. We can then supply new data to this formula to be categorized or analyzed, and it can tell us things like “this article is about politics” or “this image contains a cat.”

This formula is called a model.
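As a minimal sketch of the idea, the snippet below learns a one-line formula from a handful of made-up data points and then applies it to a new input. The data, the function names `train` and `predict`, and the choice of a least-squares line fit are all assumptions made for illustration; real models have far more parameters and far more data.

```python
# Toy training data: inputs x and the outputs y we want the model to learn.
# The values are invented for illustration and follow y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def train(xs, ys):
    """Fit y = w * x + b by ordinary least squares and return (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w = covariance / variance
    b = mean_y - w * mean_x
    return w, b


def predict(model, x):
    """Apply the learned formula to an input the model has not seen."""
    w, b = model
    return w * x + b


model = train(xs, ys)           # the "formula" derived from training data
print(predict(model, 10.0))     # new data supplied to the model
```

Here the model is just the pair `(w, b)`. The program never contains the rule "multiply by two and add one"; it recovers that mapping from the examples.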

Models are computationally expensive to build; they require many operations over large sets of data. This process is called training or building a model. As datasets get larger, training a model might take days or weeks. How long it takes depends on the amount of data, the type of computations required for training, and the hardware on which those computations run. While these calculations are expensive in aggregate (there are many of them), they are not complicated (simple arithmetic operations), and running them in parallel can speed up training dramatically.

GPUs (Graphics Processing Units), originally built to accelerate the memory-intensive geometric operations of 3D environments like games, are used today in many areas that require parallel matrix/vector multiplications, including machine learning.
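To see why this kind of work parallelizes so well, consider a matrix/vector multiplication: each output value depends on a single row of the matrix, so no row has to wait for any other. The sketch below shows that structure in plain Python. The function names and the tiny matrix are assumptions for illustration, and Python threads will not actually speed up this arithmetic; the point is only that the rows are independent, which is what a GPU exploits with thousands of hardware lanes.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(row, vector):
    """Multiply-and-add over one row: simple arithmetic, repeated many times."""
    return sum(a * b for a, b in zip(row, vector))


def matvec_sequential(matrix, vector):
    """Compute the product one row after another."""
    return [dot(row, vector) for row in matrix]


def matvec_parallel(matrix, vector, workers=4):
    """Hand each row to a worker; rows never need each other's results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the rows in its results.
        return list(pool.map(lambda row: dot(row, vector), matrix))


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
vector = [1, 0, -1]

# Both strategies give the same answer; only the scheduling differs.
print(matvec_sequential(matrix, vector) == matvec_parallel(matrix, vector))
```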

Nvidia was one of the first manufacturers of commercial GPUs and has produced highly performant drivers and tools since the early days, but it has historically overlooked collaboration with Open Source developers interested in providing alternatives to its proprietary, closed software. This is why independent efforts to improve open-source Nvidia drivers emerged from the community.

However, another manufacturer seems to be leading the way: AMD. AMD provides a full Open Source suite for its GPUs called ROCm. While the GPU firmware is still a non-free binary (as is also the case with some CPU firmware), almost all of the driver and software stack that manages the GPU is Open Source. This is why, when the Wikimedia Analytics and SRE teams evaluated which brand to invest time and effort in, AMD was picked as the preferred solution.

The SRE team deploys only Debian on every host in the infrastructure. We chose Debian 10 (Buster) as the baseline for our tests since the Linux kernel version it includes directly supports the GPU. AMD develops the kernel side of the driver directly in the mainline Linux kernel and also ships the driver as part of its own set of Debian packages. Although the ROCm Debian repository supports only Ubuntu, we were able to use its packages on Debian 10 without rebuilding them. Then we discovered tensorflow-rocm: a port of the TensorFlow PyPI Python package that met our needs.
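Once a TensorFlow build is installed, a quick sanity check is to ask it which GPUs it can see. The sketch below assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available; older releases expose different calls. The helper name `visible_gpus` is our own, and the function simply returns an empty list when TensorFlow is not installed.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see, or [] if there are none."""
    try:
        import tensorflow as tf  # with tensorflow-rocm the import name is unchanged
    except ImportError:
        # TensorFlow is not installed on this host, so no GPU is usable from it.
        return []
    # Assumption: TensorFlow >= 2.1, which provides this call.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus:
    print("TensorFlow can see:", ", ".join(gpus))
else:
    print("TensorFlow cannot see any GPU on this host")
```

An empty result does not say why the GPU is missing; kernel logs and the ROCm tools remain the place to look for that.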

It seemed the perfect solution, but we quickly discovered the first issues. With the first GPU that we tested, a Hawaii FirePro W9100, TensorFlow hung and caused kernel stalls in most of our simple tests. Even basic testing tools provided by the ROCm suite failed with obscure error messages. The GPU was “enabled but not supported,” meaning that there was little chance of making it work. We tried hard to fix the problem together with upstream, but eventually we decided to buy a new AMD GPU card.

This was in itself an interesting task. The two server vendors that we buy hosts from offer a wide selection of Nvidia cards (certified to fit and work in their chassis) but only old AMD ones. We manage our own hardware, so we had to be creative and measure the space inside a server’s chassis before being sure which card available on the market would fit into it (pictures in https://phabricator.wikimedia.org/T216528). Size was not the only problem; power consumption and ventilation were also concerns. Eventually, we decided to buy an AMD Radeon Pro WX 9100 16GB card, which ended up fitting very well in the server’s chassis.

Working outside the specifications of server vendors is not an easy task for a foundation with limited resources.

The next step was importing the Debian packages into our own APT repository and automating their deployment and configuration via Puppet. Managing our own repository for Debian packages has a lot of advantages. One of them is the ability to build your own version of a package if it doesn’t respect your free software policy. One example is hsa-ext-rocr-dev, which contains non-free binary libraries for processing images with OpenCL. Since upstream has so far shown little interest in open-sourcing it, we are temporarily bypassing the problem by creating a “fake” Debian package via equivs, to avoid deploying those libraries. Last but not least, Puppet code was added to our repository to automate the configuration of an AMD GPU with ROCm and its related monitoring.
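For readers who have not used equivs, the tool builds an empty package from a short control file, so that anything depending on the package name is satisfied without installing its contents. The sketch below renders such a control file in Python. Only the package name comes from this post; the version string, the description and the helper name `equivs_control` are hypothetical, and this is not necessarily the exact file we deployed.

```python
def equivs_control(package, version, description):
    """Render a minimal control file of the kind equivs-build reads."""
    return "\n".join([
        "Section: misc",
        "Priority: optional",
        "Standards-Version: 3.9.2",
        "",
        f"Package: {package}",
        f"Version: {version}",
        f"Description: {description}",
        "",
    ])


control = equivs_control(
    package="hsa-ext-rocr-dev",
    # Hypothetical version: it must satisfy whatever other packages depend on.
    version="1.0+placeholder",
    # Hypothetical description text.
    description="empty placeholder that ships no non-free libraries",
)
print(control)
```

Saving the output to a file and passing that file to `equivs-build` produces a `.deb` that can then be imported into an APT repository like any other package.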

This latest addition to Wikimedia’s infrastructure will enable developers, researchers and analysts at the Foundation to build machine learning models for image classification, language modeling, large-scale data processing, and more, relying on fast, reliable, and, more importantly, open technology.

In a follow-up to this blog post, we will talk about our latest machine learning projects. Stay tuned!

About this post

To learn more, please visit: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Use_the_Debian_packages

Featured image credit: BalticServers data center, CC BY-SA 3.0, GNU Free Documentation License 1.2 or any later version
