From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps
For over 15 years, the Wikimedia Foundation has provided public dumps of the content of all wikis. They are not only useful for archiving or offline reader projects, but can also power tools for semi-automated (or bot) editing such as AutoWikiBrowser. For example, these tools comb through the dumps to generate lists of potential spelling mistakes in articles for editors to fix. For researchers, the dumps have become an indispensable data resource (footnote: Google Scholar lists more than 16,000 papers mentioning the phrase “Wikipedia dumps”). Especially in the area of natural language processing, the use of Wikipedia dumps has become almost ubiquitous with the advancement of large language models such as GPT-3 (and thus by extension also the recently published ChatGPT) or BERT. Virtually all language models are trained on Wikipedia content, especially multilingual models, which rely heavily on Wikipedia for many lower-resourced languages.
Over time, the research community has developed many tools to help folks who want to use the dumps. For instance, the mwxml Python library helps researchers work with the large XML files and iterate through the articles within them. Before analyzing the content of the individual articles, researchers must usually further preprocess them, since they come in wikitext format. Wikitext is the markup language used to format the content of a Wikipedia article in order to, for example, highlight text in bold or add links. In order to parse wikitext, the community has built libraries such as mwparserfromhell, developed over 10 years and comprising almost 10,000 lines of code. This library provides an easy interface to identify different elements of an article, such as links, templates, or just the plain text. This ecosystem of tooling lowers the technical barriers to working with the dumps because users do not need to know the details of XML or wikitext.
While convenient, working with the XML dumps containing articles in wikitext has severe drawbacks. MediaWiki translates wikitext into HTML, and it is this HTML that readers actually see. Some elements contained in the HTML version of an article are not readily available in the wikitext version, for example because they are generated by templates. As a result, researchers who parse only wikitext may miss important content that is displayed to readers. For example, a study by Mitrevski et al. found for English Wikipedia that of the 475M internal links in the HTML versions of the articles, only 171M (36%) were present in the wikitext version.
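To make the gap concrete, here is a small sketch that counts links in a wikitext snippet and in a hand-written HTML rendering of it. Both snippets, the template name, and the function names are made up for illustration. The only convention we rely on is that Parsoid-style HTML marks internal links with rel="mw:WikiLink". The last line recomputes the share reported in the study from the two link counts.

```python
import re
from html.parser import HTMLParser

# Hypothetical wikitext: one explicit link plus a (made-up) navigation template.
WIKITEXT = "'''Mercury''' is a [[chemical element]]. {{Example navbox}}"

# Hypothetical rendered HTML for the same text: the template expands into
# two further links that never appear as [[...]] in the wikitext.
HTML = (
    '<p><b>Mercury</b> is a '
    '<a rel="mw:WikiLink" href="./Chemical_element">chemical element</a>.</p>'
    '<div typeof="mw:Transclusion" about="#mwt1">'
    '<a rel="mw:WikiLink" href="./Gold">Gold</a> '
    '<a rel="mw:WikiLink" href="./Thallium">Thallium</a>'
    '</div>'
)


def count_wikitext_links(wikitext: str) -> int:
    """Count [[...]] links written directly in the wikitext."""
    return len(re.findall(r"\[\[[^\[\]]+\]\]", wikitext))


class WikiLinkCounter(HTMLParser):
    """Count <a> tags whose rel attribute contains mw:WikiLink."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        rel = dict(attrs).get("rel") or ""
        if tag == "a" and "mw:WikiLink" in rel.split():
            self.count += 1


def count_html_links(html: str) -> int:
    """Count internal links in a rendered HTML snippet."""
    counter = WikiLinkCounter()
    counter.feed(html)
    return counter.count


wikitext_links = count_wikitext_links(WIKITEXT)
html_links = count_html_links(HTML)

# The same kind of ratio for the figures reported above (millions of links).
reported_share = 171 / 475
```

In this toy example only one of the three rendered links is visible in the wikitext, because the other two come from the template.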
Therefore, it is often desirable to work with the HTML versions of the articles instead of the wikitext versions. In practice, however, this has remained largely impossible for researchers. Using the MediaWiki APIs or scraping Wikipedia directly for the HTML is computationally expensive at scale and discouraged for large projects. Only recently have the Wikimedia Enterprise HTML dumps been introduced and made publicly available with regular monthly updates, so that researchers or anyone else may use them in their work.
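For readers who want a feel for the data, below is a minimal sketch of reading articles one by one from such a dump using only the Python standard library. It rests on assumptions that should be checked against the dump documentation: that a dump is a gzip-compressed tar archive of newline-delimited JSON files, and that each record stores the title under "name" and the HTML under "article_body" and "html". The function names are hypothetical, and this is not the mwparserfromhtml API.

```python
import json
import tarfile
from typing import Iterator, Tuple


def iter_articles(dump_path: str) -> Iterator[dict]:
    """Yield one article record at a time from a .tar.gz of NDJSON files."""
    # Members are read in order and decompressed on the fly,
    # so nothing is extracted to disk.
    with tarfile.open(dump_path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fileobj = tar.extractfile(member)
            if fileobj is None:
                continue
            for line in fileobj:
                line = line.strip()
                if line:
                    # Assumption: one JSON object per line.
                    yield json.loads(line)


def title_and_html(record: dict) -> Tuple[str, str]:
    """Pull out title and HTML, assuming the field names described above."""
    return record["name"], record["article_body"]["html"]


# Usage (the path is a placeholder):
# for record in iter_articles("path/to/dump.json.tar.gz"):
#     title, html = title_and_html(record)
```

Because the function is a generator, only one record is held in memory at a time, which matters for dumps that are many gigabytes in size.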
However, while the data is available, using it still requires considerable technical expertise, such as knowing how different wikitext elements are rendered as HTML elements. In order to lower the technical barriers and improve the accessibility of this incredible resource, we released the first version of mwparserfromhtml, a library that makes it easy to parse the HTML content of Wikipedia articles, inspired by the wikitext-oriented mwparserfromhell.
The tool is written in Python and available as a pip-installable package. It provides two main functionalities. First, it allows the user to iterate over all articles in the dump files one by one. Second, it contains a parser for the HTML of each individual article. Using the Python library beautifulsoup, we can parse the content of the HTML and extract individual elements (see Figure 1 for examples):
- Wikilinks (or internal links). These are annotated with additional information about the namespace of the link target and whether it is a disambiguation page, redirect, red link, or interwiki link.
- External links. We distinguish whether each link is named, numbered, or autolinked.
- Media. We capture the type of media (image, audio, or video) as well as the caption and alt text (if applicable).
- Plain text of the articles
We also extract some properties of the elements that end users might care about, such as whether each element was originally included in the wikitext version or was transcluded from another page.
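The following sketch illustrates the kind of distinctions listed above. It is not the library's implementation: mwparserfromhtml builds on beautifulsoup, while this example uses the standard-library html.parser to stay self-contained. It assumes Parsoid-style markup in which internal links carry rel="mw:WikiLink", external links carry rel="mw:ExtLink" with a class of "text", "autonumber", or "free", red links and redirects carry the classes "new" and "mw-redirect", and template output is wrapped in an element with typeof="mw:Transclusion". The class name and the sample HTML are made up, and the transclusion check only covers links nested inside the marked element.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkExtractor(HTMLParser):
    """Collect wikilinks and external links with a few of their properties."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._transcluded = []  # one flag per currently open element

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        inside = bool(self._transcluded) and self._transcluded[-1]
        typeof = (attr.get("typeof") or "").split()
        transcluded = inside or "mw:Transclusion" in typeof
        if tag not in VOID_TAGS:
            self._transcluded.append(transcluded)
        if tag == "a":
            self._record_link(attr, transcluded)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._transcluded:
            self._transcluded.pop()

    def _record_link(self, attr, transcluded):
        rel = (attr.get("rel") or "").split()
        classes = (attr.get("class") or "").split()
        link = {"target": attr.get("href") or "", "transcluded": transcluded}
        if "mw:WikiLink" in rel:
            link.update(kind="wikilink",
                        redlink="new" in classes,
                        redirect="mw-redirect" in classes)
        elif "mw:ExtLink" in rel:
            if "autonumber" in classes:
                style = "numbered"
            elif "free" in classes:
                style = "autolinked"
            else:
                style = "named"
            link.update(kind="external", style=style)
        else:
            return  # not a link type this sketch knows about
        self.links.append(link)


# Hypothetical article fragment.
SAMPLE = (
    '<p><a rel="mw:WikiLink" href="./Mercury_(element)">mercury</a> '
    '<a rel="mw:WikiLink" class="new" href="./Missing_page">missing</a> '
    '<a rel="mw:ExtLink" class="external text" href="https://example.org">Example</a> '
    '<a rel="mw:ExtLink" class="external autonumber" href="https://example.org/1"></a></p>'
    '<span typeof="mw:Transclusion" about="#mwt1">'
    '<a rel="mw:WikiLink" class="mw-redirect" href="./Hg">Hg</a></span>'
)

extractor = LinkExtractor()
extractor.feed(SAMPLE)
links = extractor.links
```

Each entry in links is a plain dictionary, so the links that readers see but that are absent from the wikitext can be selected by filtering on the "transcluded" flag.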
Building the tool posed several challenges. First, it remains difficult to systematically test the output of the tool. While we can verify that we are correctly extracting the total number of links in an article, there is no “right” answer for what the plain text of an article should include. For example, should image captions or lists be included? We manually annotated a handful of example articles in English to evaluate the tool’s output, but it is almost certain that we have not captured all possible edge cases. In addition, other language versions of Wikipedia might provide other elements or patterns in the HTML than the tool currently expects. Second, while much of how an article is parsed is handled by the core of MediaWiki and well documented by the Wikimedia Foundation Content Transform Team and the editor community on English Wikipedia, article content can also be altered by wiki-specific Extensions. This includes important features such as citations, and documentation about some of these aspects can be scarce or difficult to track down.
The current version of mwparserfromhtml is only a starting point. There are still many functionalities that we would like to add in the future, such as extracting tables, splitting the plain text into sections and paragraphs, or handling in-line templates used for unit conversion (for example, displaying both lbs and kg). If you have suggestions for improvements or would like to contribute, please reach out to us on the repository by filing an issue or submitting a merge request.
Finally, we want to acknowledge that the project was started as part of an Outreachy internship with the Wikimedia Foundation. We encourage folks to consider mentoring or applying to the Outreachy program as appropriate.
About this post
Featured image credit: Очистка ртути перегонкой в токе газа.png in the public domain
Figure 1 image credit: Mwparserfromhtml functionality.gif by Isaac (WMF) licensed under the Creative Commons Attribution-Share Alike 4.0 International license