Debugging production with X-Wikimedia-Debug
By Kosta Harlan, Senior Software Engineer, Wikimedia Growth Team
In February 2018, a user reported that some topics created by users on Flow discussion boards were not appearing in the Recent Changes feeds, including EventStreams and the IRC-RC feed. Various automated patrol systems rely on EventStreams, so the bug meant a number of edits bypassed those systems on Flow-enabled wikis.
When approaching a bug like this, there are typically three things I do:
- Determine the steps to reproduce the bug. That was already done by the task author (thank you @Rxy!) and then confirmed by other contributors to the task (h/t @Krinkle, @Etonkovidova)
- Attempt to reproduce the issue locally and set breakpoints in code to understand why the problem occurs
- Check the production logs to look for any messages related to the bug report
Unfortunately the problem was not reproducible in the MediaWiki Vagrant development environment. Nor were there any relevant messages in the logs. Since reproducing the issue locally wasn’t possible, we merged some diagnostic code but still had nothing. Early on, @SBisson suggested a hypothesis about the code path involved in emitting the event:
if ( user is trusted ) return true else let's load the revision from replica, return true based on the the status of the revision oh it doesn't exist (yet), return false
But we could not reproduce this, nor could we identify exactly where this might occur since the code paths for this functionality had many points where execution could stop silently.
Enter X-Wikimedia-Debug
One of the useful tools in our stack is the X-Wikimedia-Debug header. I knew about this header (and its browser extensions) from verifying changes that were being SWAT’ed into production but I had not thought to use it for tracking down a production bug.
I was using the browser extension with the “Log” checkbox ticked (and still not finding anything useful in Logstash to help isolate this bug) when I realized that I could also profile the problematic request. When you check the box to profile a request, XHProf will profile the code that’s executed and make the result available for viewing via XHGui.
Typically you do this to understand performance bottlenecks in your code, as get a complete list of all functions executed during the request, along with the time and memory usage associated with each function.
I followed the steps to reproduce and then switched on the “Profile” option before posting a new topic on an empty Flow board. Now, I had a profiled request which provided me with information on all the methods called, including which method called another (click on a method call to see its parent and children method calls). From here I could follow the path traversed by Flow’s event emitting code, and see exactly where the code execution halted.
Reproducing the bug locally
With this knowledge, I went back to my local environment, this time using MediaWiki-Docker-Dev, which has database replication set up as part of its stack (MediaWiki Vagrant does not). I set some breakpoints in the code I suspected was causing the problem, and then found that in RevisionActionPermissions.php#isBoardAllowed(), we had this code:
$allowed = $this->user->isAllowedAny( ...(array)$permissions ); if ( $allowed ) { return true; } return !$workflow->isDeleted();
For a new topic on a blank flow board, $permissions is deletedtext, which would return true for privileged users. But for unprivileged users, Flow would check !$workflow->isDeleted();, and this evaluated as false because the code was querying the database replica, and the title did not exist there yet.
The submitted solution was to patch isDeleted() to query the master DB when in the context of a POST request, since we know the title would exist in the master DB. With this patch in place, events were once again emitted properly and the bug was fixed.
Conclusion
A few of my conclusions from this experience:
- If you’re having difficulty tracking down the code path, consider using the profiler in the X-Wikimedia-Debug browser extension
- Diagnostic code is helpful (even if it didn’t pinpoint the problem here) and debug level logging should be considered instead of silent returns
- Having database replication in your local development environment can help catch issues while developing and when attempting to reproduce a production issue. One can use the MediaWiki-Docker-Dev environment for this, and see also how to adjust its database replication lag.
About this post
This post was originally published on the Wikimedia Performance Team Phame blog.