Performance perception: Correlation to RUM metrics

When we set out to ask Wikipedia visitors their opinion of page load performance, our main hope was to answer an age-old question: which RUM metric matters the most to users? And, more interestingly, which ones matter the most to our users on our content?

By Gilles Dubuc, Senior Software Engineer, Wikimedia Performance Team

Now that our micro survey has been running for over a year and has gathered a lot of user input, we can look at which classic RUM metrics correlate best with users’ perception of page load performance.


We collect user responses from an in-page micro survey asking them if the page load was fast enough. We map their responses to 1 for positive answers, -1 for negative answers and we discard neutral “I don’t know” answers. We only look at records where a given RUM metric is present, and for time-based metrics, only if the value is lower than 30 seconds. Beyond that point we know for certain that the experience was terrible or that there was an issue with metric collection.
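As a rough illustration of the filtering and scoring steps described above, here is a minimal Python sketch. It is not our actual analysis pipeline: the record layout, the answer labels ("yes", "no") and the function names are hypothetical and chosen only for this example. It assumes metric values are expressed in milliseconds.

```python
from math import sqrt

# 30-second cutoff for time-based metrics, assuming values in milliseconds.
MAX_METRIC_MS = 30_000

# Hypothetical answer labels; any other answer is treated as neutral and discarded.
ANSWER_SCORES = {"yes": 1, "no": -1}


def prepare_pairs(records, metric, time_based=True):
    """Return (metric value, score) pairs for the records we keep."""
    pairs = []
    for record in records:
        score = ANSWER_SCORES.get(record.get("answer"))
        value = record.get(metric)
        if score is None or value is None:
            continue  # neutral answer, or the metric is missing from this record
        if time_based and value >= MAX_METRIC_MS:
            continue  # terrible experience or faulty metric collection
        pairs.append((value, score))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)
```

Calling `pearson(prepare_pairs(records, "firstPaint"))` on a list of survey records would then give one coefficient per metric. A negative value means that higher metric values tend to come with more negative answers.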


Metric | Pearson coefficient | Sample size
top thumbnail (Element Timing for Images origin trial) | -0.138 | 28,070
domainLookupEnd – domainLookupStart | -0.0965 | 670,932
unloadEventEnd – unloadEventStart | -0.0308 | 929,854
cpu benchmark score | -0.00615 | 1,696,239

Pearson correlation coefficients range from -1 to 1, with 0 meaning no linear correlation, so even our “best” correlations are really just the least terrible ones. Overall, the correlation between RUM metrics and perceived performance is quite poor, which indicates that these metrics only capture a small part of what constitutes the perceived performance of a page load.
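For reference, this is the standard definition of the sample Pearson coefficient, written here with symbols chosen for this post: x_i is the value of a RUM metric for survey response i, y_i is the response mapped to 1 or -1, and the barred symbols are the respective means over the n retained responses.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\qquad
r^2 \approx (-0.14)^2 \approx 0.02
```

Read as a rough rule of thumb, a coefficient of about -0.14, in line with the strongest value in the table above, corresponds to a linear relationship that accounts for only around 2% of the variance in the answers.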


There is a clear pattern of environmental properties having the worst correlation: effective connection type, device memory, available CPU and page transfer size. This might suggest that users are aware of their device, network quality and page size (small vs. big article in Wikipedia’s case) and adjust their expectations to those factors.

As for actual RUM metrics, it’s interesting to see that the top ones are not just the paint metrics, but also domInteractive. The reason their correlations are so close to each other is probably that in Wikipedia’s case these metrics are very close to each other in general, due to the absence of third-party assets on our pages.


Thanks to this real-world opinion data, we can make a better educated guess about which RUM metric(s) matter the most to us. It also shows how sub-par existing RUM metrics are in general. We encourage the development of new metrics that capture aspects of performance other than the initial page load/render. That part seems well covered already, with seemingly very little difference between the existing metrics in terms of correlation to perceived performance, at least in our case.

The performance perception micro survey will keep running and will allow us to benchmark future APIs. We intend to do this with our ongoing Layout Instability API origin trial, for example, once the fixes for the bugs we discovered during the trial have been rolled out.

About this post

This post was originally published on the Wikimedia Performance Team Phame blog.

Featured image credit: Researcher at work in her laboratory, Axventura, CC BY 4.0