[[WM:TECHBLOG]]

Censorship, outages and Internet shutdowns: monitoring Wikipedia’s accessibility around the world

This article describes the methodology used by the Wikimedia Foundation to monitor outages on Wikipedia around the world. These events are called anomalies and could be due to various causes, among them censorship.

By: Nuria Ruiz, Marcel R. Forns, Diego Saez and Sukhbir Singh

Censored Wikipedia, Marcel R. Forns, Derivative of https://commons.wikimedia.org/wiki/File:Wikipedia-logo-v2.svg, CC BY-SA 3.0

About four years ago, the Wikimedia Foundation embarked on a collaboration with researchers at the Berkman Center for Internet & Society at Harvard University to analyze the scope of government-sponsored censorship of Wikimedia sites around the world. The goal was to identify possible instances of past censorship as well as to automatically identify instances of potential censorship (*while ongoing*). This study delivered many valuable findings that were published in a widely distributed paper.

The idea used by the paper to detect censorship was simple: if pageview data for Wikipedia follows a predictable pattern, significant changes from this pattern might be indicative of an issue. These “significant changes” are what we call anomalies. The act of detecting anomalous events in a series of events (in this case a time series of Wikipedia pageviews) is called anomaly detection. The anomalies we are looking for are sudden drops in pageviews on a per-country basis. Now, a drop in pageview traffic could indicate an outage, a censorship event, an internet shutdown, or it could simply be that it is Christmas and a significant portion of the population is away from the computer. This last case is what we call a “false positive,” an event that presents as a deviation from the normal pattern but that does not indicate an issue. 

While the Christmas example may provide some perspective on what false positives are, in the real world there are many false positives that are not immediately explainable.

The initial implementation of the algorithm to detect anomalies worked well for a first attempt, but its false positive rate was too high. Also, it was designed to run daily, which did not allow us to detect events like the Iran shutdown of the internet in November 2019 as early as we would have liked. We decided to devote some research to see whether we could improve our detection by doing two things: 1) run the anomaly detection algorithms more frequently (every hour) and 2) reduce the false positive rate. A false-positive event in this case can be any event that results in irregular or anomalous traffic but is not necessarily a result of a technical issue, an internet shutdown, or censorship. Such events decrease the signal-to-noise ratio, and since investigating them may involve manual correlations between traffic from different sources, it’s easy to miss an actual event that we potentially care about in the noise. Similarly, from the point of view of a service, it’s important to make a distinction between an internet shutdown and a censorship event: while we care about both of these, a censorship event affects us more directly than an internet shutdown.

First, we needed to fix a performance issue, as going through the whole Wikipedia pageview time series every hour requires quite a bit of computing power. Wikipedia’s request rate can go as high as 200,000 requests per second. This translates to tens of thousands of pageviews per second, as a pageview is made of many requests: JS, CSS, image files, etc. We chose Spark running on a Hadoop cluster to manipulate this data at scale. We needed to migrate the data manipulation from Go to Spark so the computation could run over the data in a distributed manner. Every dataset hosted in Wikimedia’s Hadoop cluster is documented publicly, so if you are curious about how this data is shaped, that documentation describes it.

Running the computation in a cluster with 50+ machines made it possible to run the anomaly detection more frequently, so with this change, we were able to run it every hour. What needed more thought was how to improve our false positive rate. The algorithms we used were pretty traditional when it came to time series manipulation and anomaly detection on time series data (RPCA and Criteo’s RSVD library). Being very generic, these algorithms will identify a drop in traffic on Christmas Eve as an anomaly, because during a few days in December the pageviews of Wikipedia are much lower than they normally are. An “alarm” would be raised for what is a false positive.
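To make the false-positive problem concrete, here is a minimal sketch of a purely volume-based detector. It is not RPCA or the RSVD library; it is a much simpler stand-in (a trailing-median threshold), and the function name, the window, the 50% threshold, and the daily totals are all made up for illustration.

```python
from statistics import median


def volume_drop_alerts(series, window=7, max_drop=0.5):
    """Return indices whose value fell more than `max_drop` below the
    median of the preceding `window` points (hypothetical rule)."""
    alerts = []
    for i in range(window, len(series)):
        baseline = median(series[i - window:i])
        if series[i] < (1 - max_drop) * baseline:
            alerts.append(i)
    return alerts


# Made-up daily pageview totals for one country; the last value
# imitates a holiday dip, not an outage.
daily_pageviews = [12100, 12400, 11900, 12250, 12000, 11800, 12300, 2250]

print(volume_drop_alerts(daily_pageviews))  # → [7]
```

A rule like this only sees that the total fell, so the holiday dip at index 7 is reported just as an outage would be.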

Censorship, outages, and “the Christmas problem”

However, a person looking at a drop in traffic on Wikipedia around Christmas or, the opposite, an increase in traffic around the soccer World Cup (yes, those are a real thing!) would notice a key piece of information that the anomaly detection algorithms missed: the “shape” of traffic does not change that much, it is the volume that does. That is, in the Christmas example, the same proportion of pageviews from, say, Madrid (Spain) is directed towards es.wikipedia; there are just fewer pageviews in es.wikipedia overall. The “geographical distribution” of pageviews per city does not change much; it is the overall volume that does. Rather than raising a “censorship alarm” when there are significant changes in the volume of pageviews, we could raise an alarm when the shape of the distribution changes.

Easy to say, but how can we measure the “shape” of Wikipedia’s traffic? 

Well, the “shape” of a distribution of pageviews can be thought of as the information the time series carries. In our case, we are mostly concerned with where pageviews come from. For example, see the time series below:

| timestamp | pageviews from Madrid | pageviews from Barcelona | pageviews from Seville | pageviews from Malaga |
| --- | --- | --- | --- | --- |
| 2019-12-02:01:01 | 3000 | 4500 | 1200 | 3400 |
| 2019-12-02:02:01 | 3020 | 4540 | 1300 | 4400 |
| 2019-12-02:03:01 | 3000 | 4500 | 1200 | 3400 |
| 2019-12-20:01:01 | 2500 | 3500 | 1100 | 3800 |
| 2019-12-24:01:01 | 600 | 800 | 200 | 650 |

Pageviews for es.wikipedia. Christmas

The volume of pageviews on 2019-12-24 changes significantly, but proportionally the relationship of pageviews between Malaga and Madrid has not changed. Malaga and Madrid have about the same number of pageviews on 2019-12-24, and about three times as many as Seville. The distribution of information when it comes to the location of pageviews for es.wikipedia has not changed.

However, in our second example, the distribution does change on 2019-12-02 at 07:01 (the last row). The number of pageviews in Seville and Malaga is far lower than it should be when compared with pageviews coming from Madrid. It could be indicative of an outage in those two cities, which are located close to each other in the south of Spain.

| timestamp | pageviews from Madrid | pageviews from Barcelona | pageviews from Seville | pageviews from Malaga |
| --- | --- | --- | --- | --- |
| 2019-12-02:01:01 | 3000 | 4500 | 1200 | 3400 |
| 2019-12-02:02:01 | 3020 | 4540 | 1300 | 4400 |
| 2019-12-02:03:01 | 3000 | 4500 | 1200 | 3400 |
| 2019-12-02:04:01 | 2500 | 3500 | 1100 | 3800 |
| 2019-12-02:07:01 | 3000 | 4800 | 200 | 650 |

Pageviews for es.wikipedia. Outage

It is not the drop in volume that indicates an outage but rather the relationship of the volume of pageviews among the cities. That is, given the total number of pageviews for a period, what matters is the probability that a pageview belongs to (in our example) Madrid, Barcelona, Seville, or Malaga. In the second example (when there is an actual outage), the probability of a pageview for es.wikipedia coming from Madrid (0.34) is much larger on 2019-12-02 at 07:01 than it was in the earlier rows (hovering around 0.24).
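The comparison between the two tables above can be reproduced with a few lines of Python. This is a sketch rather than the production code: the helper name `city_shares` is hypothetical, and the counts are copied from the example tables.

```python
CITIES = ["Madrid", "Barcelona", "Seville", "Malaga"]


def city_shares(counts):
    """Turn raw pageview counts into the share of pageviews per city."""
    total = sum(counts)
    return [c / total for c in counts]


# Rows copied from the example tables above.
baseline = [3000, 4500, 1200, 3400]   # 2019-12-02:01:01
christmas = [600, 800, 200, 650]      # 2019-12-24:01:01
outage = [3000, 4800, 200, 650]       # 2019-12-02:07:01

for label, counts in [("baseline", baseline),
                      ("christmas", christmas),
                      ("outage", outage)]:
    shares = ", ".join(
        f"{city}={share:.2f}"
        for city, share in zip(CITIES, city_shares(counts))
    )
    print(f"{label:<9} total={sum(counts):>5}  {shares}")
```

In the Christmas row the total is about five times smaller than the baseline, yet Madrid's share stays close to a quarter. In the outage row Madrid's share rises to roughly a third while Seville and Malaga shrink sharply.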

We can quantify that mathematically by measuring the entropy of the distribution of pageviews across cities. In information theory, entropy is the mathematical measure of the “density of information.” A constant entropy indicates that there is little variability in the underlying measures (in this case, pageviews per city). Conversely, a sudden drop (or increase) in entropy indicates “more variability.” And “variability” from a standard is what is considered an anomaly.

In our example, we can calculate the entropy of the distribution of traffic among cities using the probability P(x) that a pageview belongs to each city x. Entropy is defined as:

$$H = -\sum_{x} P(x) \log P(x)$$

In our Christmas example, although the overall volume is lower, the entropy does not change much. This event would not be detected as anomalous and, as such, an alarm would not be raised.

| timestamp | Probability a pageview is from Madrid | Probability a pageview is from Barcelona | Probability a pageview is from Seville | Probability a pageview is from Malaga | Entropy |
| --- | --- | --- | --- | --- | --- |
| 2019-12-02:01:01 | 0.24 | 0.37 | 0.10 | 0.28 | 1.713 |
| 2019-12-02:02:01 | 0.22 | 0.34 | 0.10 | 0.33 | 1.722 |
| 2019-12-02:03:01 | 0.24 | 0.37 | 0.10 | 0.28 | 1.713 |
| 2019-12-20:01:01 | 0.22 | 0.32 | 0.10 | 0.34 | 1.720 |
| 2019-12-24:01:01 | 0.26 | 0.35 | 0.09 | 0.28 | 1.704 |

Pageviews for es.wikipedia. Christmas
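As a sketch, the entropy of a row can be computed directly from its raw counts. The function name `shannon_entropy` is an assumption, as is the choice of a base-2 logarithm (the text does not say which base was used), so the numbers printed here are not expected to reproduce the Entropy column exactly. What matters is that the two rows come out nearly equal.

```python
import math


def shannon_entropy(counts, base=2.0):
    """Shannon entropy of the distribution implied by raw counts."""
    total = sum(counts)
    entropy = 0.0
    for count in counts:
        if count > 0:  # a zero count contributes nothing to the sum
            p = count / total
            entropy -= p * math.log(p, base)
    return entropy


# Pageviews for Madrid, Barcelona, Seville, Malaga (from the first table).
normal_day = [3000, 4500, 1200, 3400]   # 2019-12-02:01:01
christmas = [600, 800, 200, 650]        # 2019-12-24:01:01

h_normal = shannon_entropy(normal_day)
h_christmas = shannon_entropy(christmas)
print(f"normal={h_normal:.3f} christmas={h_christmas:.3f} "
      f"difference={abs(h_normal - h_christmas):.3f}")
```

Because the function normalizes by the row total, dividing every count by the same factor leaves the result unchanged, which is exactly why a uniform holiday dip does not move the entropy.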

Now, for our “outage” example see below how the value of entropy is significantly different.

| timestamp | Probability a pageview is from Madrid | Probability a pageview is from Barcelona | Probability a pageview is from Seville | Probability a pageview is from Malaga | Entropy |
| --- | --- | --- | --- | --- | --- |
| 2019-12-02:01:01 | 0.24 | 0.37 | 0.10 | 0.28 | 1.713 |
| 2019-12-02:02:01 | 0.22 | 0.34 | 0.10 | 0.33 | 1.722 |
| 2019-12-02:03:01 | 0.24 | 0.37 | 0.10 | 0.28 | 1.713 |
| 2019-12-02:04:01 | 0.22 | 0.32 | 0.10 | 0.34 | 1.720 |
| 2019-12-02:07:01 | 0.34 | 0.55 | 0.03 | 0.08 | 1.357 |

Pageviews for es.wikipedia. Outage

The entropy change on 2019-12-02 at 07:01 is abrupt, and this change should be considered an anomalous event.
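One simple way to turn “abrupt” into a rule is to compare each new entropy value with the median of the values that came before it. The sketch below does that with the Entropy columns from the two tables above; the function name and the 5% tolerance are hypothetical choices for illustration, not the thresholds used in production.

```python
from statistics import median


def is_entropy_anomaly(history, latest, tolerance=0.05):
    """Flag `latest` if it deviates from the median of `history`
    by more than `tolerance` (as a fraction of that median)."""
    baseline = median(history)
    return abs(latest - baseline) / baseline > tolerance


# Entropy values copied from the example tables above.
earlier_values = [1.713, 1.722, 1.713, 1.720]
christmas_entropy = 1.704
outage_entropy = 1.357

print(is_entropy_anomaly(earlier_values, christmas_entropy))  # False
print(is_entropy_anomaly(earlier_values, outage_entropy))     # True
```

The Christmas value sits within 1% of the earlier median, while the outage value is about 20% below it, so only the outage crosses the tolerance.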

This idea of using an entropy-based time series (rather than raw pageview values) seemed like it would actually decrease the number of false positives, as the anomalies detected are more meaningful. Now, that had to be proven empirically, and we did just that. After our experiments were successful, we reworked our detection code and set up the new alarms on the entropy-based time series, on Spark, running hourly.

The methodology described here enabled the Wikimedia Foundation to put in place a pretty robust system to detect data quality issues and to monitor Wikipedia’s accessibility around the world. When an entropy anomaly for a given country is found, the system raises an alarm that notifies a team of developers. These developers check other, external sources for similar reports to establish whether the event is already known or whether we are the first to see it. Now, the system is by no means perfect: we are still balancing the number of false positives we tolerate against the need to see all true positives. In plain terms: we need some “false” alarms to make sure we catch all the events we are interested in. We are also considering more targeted alarms for types of censorship that are very narrow in their scope, for example, censorship of a specific language edition of Wikipedia in a country, or censorship of a particular access method, like censoring the mobile version of Wikipedia while allowing access to its desktop version. Still, there is no “magic” methodology that would allow us to detect every event that happens. The best way to think about this work is that, given its nature, it is never finished, nor can it be fully automated.

About this post

Featured image credit: Piata Romana – Iarna, Mihai Petre, CC BY-SA 3.0
