Announcing mwparserfromhtml, a new library that makes it easy to parse the HTML content of Wikipedia articles
Learn about a new Python library for automatically detecting and summarizing what content is changed by edits on Wikipedia.
We recently developed WikiNav, an interactive tool for analyzing and visualizing reader navigation, as part of an Outreachy internship.
Wikipedia articles are missing images, and Wikipedia images are missing captions. A scientific competition organized by the Research team at the Wikimedia Foundation could help bridge this gap. The WMF is also releasing a large image dataset to help researchers and practitioners build systems for automatic image-text retrieval in the context of Wikipedia.
Researchers often ask how people use search engines to access Wikipedia. Until now, however, there has been little data available about this relationship. To help address these questions, the Wikimedia Foundation is releasing a new, faceted dataset on search engine traffic to Wikipedia so you can ask questions like “What is the most common search engine in my country?” or “Which search engine is most-used by Android users?”
The Wikimedia Analytics Engineering team manages multiple systems, all revolving around a big (by our standards) Hadoop cluster. This post describes how we changed our Hadoop distribution in a single day, together with the lessons learned while doing it.
This article describes the methodology used by the Wikimedia Foundation to monitor outages on Wikipedia around the world. These events are called anomalies and could be due to various causes, among them censorship.
We have been working this past year to better identify and tag the “bot spam” traffic so we can produce top pageview lists that (mostly) do not require manual curation.
Learn about using the MediaWiki History Dataset to explore the everyday experience of editors on Wikipedia.
Part 3 of 3 posts on Wikimedia’s event data platform.