Research & Analytics – [[WM:TECHBLOG]]

From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps

Announcing mwparserfromhtml, a new library that makes it easy to parse the HTML content of Wikipedia articles

What is in an edit? Automated detection of edit types on Wikipedia

Learn about a new Python library for automatically detecting and summarizing what content is changed by edits on Wikipedia.

https://commons.wikimedia.org/wiki/File:Sextante,_Acervo_do_Museu_Paulista_da_USP_(6).jpg

Analyzing the Wikipedia clickstream just got easier with WikiNav

We recently developed WikiNav, an interactive tool to analyze and visualize reader navigation, as part of an Outreachy internship.

https://commons.wikimedia.org/wiki/File:Wikipedia20_Knowledge.svg

The Wikipedia image/caption matching challenge and a huge release of image data for research!

Wikipedia articles are missing images, and Wikipedia images are missing captions. A scientific competition organized by the Research team at the Wikimedia Foundation could help bridge this gap. The WMF is also releasing a large image dataset to help researchers and practitioners build systems for automatic image-text retrieval in the context of Wikipedia.

https://commons.wikimedia.org/wiki/File:Magnifying_glass_with_focus_on_paper.png

Searching for Wikipedia

How people use search engines to reach Wikipedia is a common question among researchers. Until now, however, little data has been available about this relationship. To help answer such questions, the Wikimedia Foundation is releasing a new, faceted dataset on search engine traffic to Wikipedia so you can ask questions like “What is the most common search engine in my country?” or “Which search engine is most used by Android users?”

https://commons.wikimedia.org/wiki/File:African_Bush_Elephant.jpg

Upgrading Hadoop in just one day

The Wikimedia Analytics Engineering team manages multiple systems, all gravitating around a big (by our standards) Hadoop cluster. This post describes how we changed our Hadoop distribution in a single day, together with the lessons we learned along the way.