Parsoid in PHP, or there and back again

In December 2019, we replaced the original version of Parsoid, written in JavaScript, with a version written in PHP, the primary programming language of MediaWiki. This new version, called Parsoid/PHP, is roughly twice as fast as the original JavaScript version. Parsoid/PHP brings us one step closer to integrating Parsoid and other MediaWiki wikitext-handling code into a single system.

With apologies to Tolkien’s full title for *The Hobbit*

By S. Subramanya Sastry, C. Scott Ananian; Parsing Team; Wikimedia Foundation

Summary

If you have edited a Wikipedia page, you may have noticed that you don’t write wiki articles in HTML, the standard language of web pages. You write them in Wikitext, a markup language specific to MediaWiki. However, you might have used VisualEditor, a friendlier editing tool that lets editors edit pages without needing to know Wikitext. Parsoid is the software that enables VisualEditor and tools like Content Translation to operate on web-standard HTML (without any specific knowledge of wikitext); Parsoid handles the translation back-and-forth between HTML and wikitext.

In December 2019, we replaced the original version of Parsoid, written in the JavaScript programming language, with a version written in PHP, which is the primary programming language of MediaWiki. This new version, which we call Parsoid/PHP, is roughly twice as fast on most requests as the original version (retroactively renamed Parsoid/JS). Parsoid/PHP also brings us one step closer to integrating Parsoid and other MediaWiki wikitext-handling code into a single system, which will be easier to maintain and extend.

The rest of this post explains why and how this effort got started, how we chose to organize the work, and how we benefited from close coordination among different teams. Future blog posts will cover in-depth technical details of the project and challenges faced and will conclude by revisiting the start of the Parsoid project in 2011 and speculating with the benefit of hindsight about the paths we didn’t take.

Background

A wikitext parser has to extract structure from markup — identifying sections, links, categories, images, and such — and then to use that structure to format the markup. As documented in the long list of alternative parsers on mediawiki.org, the quest to best convert Wikitext markup into HTML (or other formats, and sometimes back to wikitext) is an old one.

The VisualEditor project required new types of information to be extracted from wikitext, and it required (for the first time) *reversible* formatting — the wikitext was turned into HTML (augmented with these new types of structural information), manipulated by VisualEditor, and then converted back to wikitext in an extremely faithful way. This was beyond the capabilities of MediaWiki’s original parser, and so a new one was written to power VisualEditor: Parsoid.

As a result, since around 2013, MediaWiki has had two different wikitext engines in production: the default one-way parser written in PHP that is part of core MediaWiki, and the separate round-trip parser, Parsoid. Parsoid was implemented in JavaScript, and it operated as a Node.js microservice that interfaced with MediaWiki via the MediaWiki action API. We will (re)examine some of the reasons behind this architectural choice in a future blog post, but the loose coupling of Parsoid and the MediaWiki core allowed rapid iteration in what was originally a highly experimental effort to support visual editing. However, by 2015, as VisualEditor and Parsoid matured and became established, maintaining two parallel wikitext engines in perpetuity was untenable.

The two wikitext engines were different in terms of implementation language, fundamental architecture, and modeling of wikitext semantics (how they represented the “meaning” of wikitext). These differences impacted the development of new features as well as the conversation around the evolution of wikitext and templating in our projects. While the differences in implementation language and architecture were the most obvious and talked-about issues, this last concern — platform evolution — is no less important and has motivated the careful and deliberate way we have approached the integration of the two engines.

Parsoid represents the structure and meaning of Wikitext as a Web standard Document Object Model (DOM). The DOM can be efficiently queried after construction and formatted and restyled as needed for our different readers. It retains information which was elided in the legacy parser output, such as template boundaries. The DOM representation and DOM-based handling is the future of wikitext and templates, and the basis of our support for Visual Editing, Content Translation, and future tools that need to manipulate Wikimedia content without dealing directly with wikitext.

In the process, the use of WHATWG DOM moved the core platform forward to embrace modern web standards. The replacement of HTML4-based Tidy with HTML5-compliant Remex in 2018 resolved HTML-centric differences between the default wikitext engine and Parsoid. But these efforts to bring MediaWiki into compliance with modern web standards were complicated by the need to work on two different parsing engines at once, and the implementation language and architectural differences between the two. Resolving these was the crucial next step. Check out the February 2019 tech talk slides (or video) for more discussion of the parser integration project.

Constraints and challenges

Going into the port, we had a bunch of technical and non-technical constraints and challenges.

Language differences: Any porting project across programming languages is fraught with problems arising from differences in syntax, semantics, available libraries and gaps in functionality.
Unfamiliarity with PHP and MediaWiki internals: Only one of the four members of the parsing team (the main team undertaking this project) had any significant familiarity with PHP. The Parsoid team, in general, had little experience working with MediaWiki core internals.
Performance concerns: Going into the port, this issue loomed large since we were going from a JIT-compiled language with significant performance optimizations to a bytecode-interpreted language.
Deployment challenges: Given that the web request infrastructure surrounding Parsoid was significantly different in PHP and JS, we had to create a new deployment plan for Parsoid/PHP and determine whether performance concerns required an expansion of our server cluster. We were concerned that bringing up Parsoid/PHP would impact the resources needed to keep Parsoid/JS in production.
Feature and bug-fix freeze during porting: We had to impose a code freeze during porting to avoid trying to catch a moving target. This would impact downstream clients and products depending on the Parsoid service.
Testing challenges: Parsoid/JS had lots of tests (parser tests in several modes, mocha tests, and mass WT → HTML → WT testing on production wiki pages), but almost all of them were integration tests, exercising the codebase as a whole. Parsoid/JS had very few unit tests focused on specific subsections of code. With only integration tests, we would find it difficult to test anything but a complete and finished port.

Project outcome goals

In order to navigate these challenges, we established specific requirements for the porting process and clear project outcome goals. We wanted the final product to be minimally disruptive to Parsoid clients and users, and to preserve much of the development, testing, and deployment flexibility we had come to depend on when Parsoid was a Node.js service. More specifically, we wanted:

Minimal to no changes to Parsoid clients and zero disruption to wiki users. We would reproduce Parsoid’s API endpoints exactly in the ported code so that existing clients can be migrated by just reconfiguring the base URI used to find Parsoid. Further, we would aim for no changes to Parsoid HTML — we could gain confidence in the quality of the port by achieving byte-identical output for byte-identical API requests.
Minimal performance penalty. This was hard to quantify at the outset, but we hoped we could have no more than a 25% slowdown on a production workload.
Minimal disruption to testing methodologies. We wanted to preserve our existing ability to run tests independent of a MediaWiki installation and to preserve the ability of our command-line tools to parse content from remote wikis, which we’ve found invaluable when attempting to locally reproduce behavior seen in production.
Parsoid/JS and Parsoid/PHP simultaneously live in production. We wanted to be able to bring up Parsoid/PHP without taking Parsoid/JS offline. Having both versions live together for testing would allow us to gradually shift clients to the new code, reducing the risk of a sudden all-or-nothing switchover.
Porting period as short as possible. In order to minimize the impact of our code freeze on downstream clients and products, we front-loaded pathfinder work and sharply limited the port scope at the expense of future code debt.
Independent deployments of Parsoid from the core platform. This preserved the rapid iteration cycle necessary for experimental work of this sort and suggested that Parsoid should remain an independent codebase (a library or extension) instead of trying to do the work directly in the MediaWiki core repository.
Continuation of git history. We often do deep dives into the history of our codebase (and of the legacy parser) when trying to discover exactly why and when some particular behavior was added. We didn’t want to reset our history and lose that link.

In the end, we largely achieved all of these goals. We did allow some small byte-level differences in Parsoid HTML, which had no effect on clients and which we were able to normalize away during testing. There were also some temporary disruptions to users:

The wikitext linting tool that we built during the HTML4 Tidy → HTML5 RemexHtml transition had some hiccups. We delayed porting over some functionality needed to fully support this tool, and when we had to prematurely switch off Parsoid/JS during API cluster load shedding, linter updates were lost for about a week. We also broke the LintHint gadget that some editors relied on. Both of these issues have since been addressed.
We disabled Visual Editing on Wikitech because of network-configuration peculiarities of the cluster serving Wikitech. This is yet to be resolved.
We did have one bug report from the transition related to dirty diffs (changed content unrelated to the user’s edit) which we eventually addressed after some initial confusion.

We were overall pleased with the smoothness of the transition.

Consensus requirements

In order to achieve our development goals, we established some ground rules for the process. Some of these were firm rules decided up front, and others are presented as post-facto observations of guidelines that developed during the process. Although we expected that surprises would be inevitable, the port went smoothly enough that we didn’t have to stop and revisit these during the process.

Code-level equivalence between PHP and JS. We decided early that the PHP code we wrote should be a faithful representation of the JS code as far as possible. We started with an automatically-generated draft that even transferred code style and whitespace to reduce the noise of meaningless differences and allow focus on “the hard parts”. All divergences (“bug fixes”, “code refactoring”, “code cleanup”) were backported to JS, tested, and deployed to production. This ensured that any “new code” arising during the port had been fully tested in production, and wasn’t going to have surprising effects in the new codebase. As a result, the JS codebase constantly evolved and we deployed new JavaScript code regularly during the entire porting phase.
Narrow porting scope. This was a challenging goal. Limiting our scope of work required deferring features and anything other than critical bug fixes, although the need to maintain PHP/JS code-level equivalence meant that we never actually had a full “code freeze” on Parsoid/JS. We constantly revisited the scope to balance the desire to “do things right from the start” with maintaining the pace, consciously accumulating future code debt (duly recorded in Phabricator) in order to ensure we could complete the port in a timely manner.
Prioritize critical path work. This meant that we ported key code early on and in parallel in order to unlock porting and testing of other code downstream. This occasionally required persistent reminders to prioritize code reviews on critical path work to unblock others.
Test early and often. While it was clear we could not wait until the port was complete to begin testing, early testing was challenging due to the lack of preexisting unit tests for Parsoid/JS. We improvised by breaking up the parsing pipeline of Parsoid/JS into chunks that could be swapped out with partial PHP ports to allow us to use our whole-system integration tests to in effect test individual pipeline stages.
Regular coordination meetings. We had core porting team meetings for an hour once a week. Attendance included the parsing team as well as anyone else actively working on the port at the time. In the final months, we added a second short (15 min) weekly meeting to work through blockers. There were also coordination meetings between various project and product managers (REST API, project resourcing, deployment planning, and rollout) as needed: once a month or less. Agendas for the meetings were kept in an etherpad, which became a key means of tracking core team blockers (reviews, phab tasks, issues needing discussion).
Low-overhead process. Aside from the etherpad, we relied on a single phab board for critical tasks or deliberately deferred work. We did not file phab tasks for work that was obvious and/or ongoing. In the early stages, the mechanical porting of individual files was tracked in a spreadsheet which gave us a good bird’s-eye view of porting status. The phab board was more useful in the later stages of the porting process as we started discovering non-obvious issues, bugs, or gaps in functionality or infrastructure.
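
The “test early and often” rule above, in which a partially ported stage is swapped into an otherwise unchanged pipeline so that whole-system tests exercise it, can be illustrated with a small sketch. This is not Parsoid code: the stage functions, `swap_stage`, and the test cases are hypothetical stand-ins, and the real stages crossed a JS/PHP boundary rather than living in one process.

```python
from typing import Callable, Dict, List

Stage = Callable[[str], str]


# Hypothetical stand-ins for two stages of the reference (original) pipeline.
def tokenize_reference(text: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def build_dom_reference(tokens: str) -> str:
    return f"<p>{tokens}</p>"


# Hypothetical "ported" version of the second stage, the one under test.
def build_dom_ported(tokens: str) -> str:
    return "<p>" + tokens + "</p>"


def run_pipeline(stages: List[Stage], source: str) -> str:
    """Feed the source through every stage in order."""
    result = source
    for stage in stages:
        result = stage(result)
    return result


def swap_stage(stages: List[Stage], index: int, replacement: Stage) -> List[Stage]:
    """Return a copy of the pipeline with one stage replaced."""
    swapped = list(stages)
    swapped[index] = replacement
    return swapped


def failing_cases(stages: List[Stage], cases: Dict[str, str]) -> List[str]:
    """Return the inputs whose end-to-end output differs from the expected output."""
    return [src for src, expected in cases.items()
            if run_pipeline(stages, src) != expected]


REFERENCE_PIPELINE: List[Stage] = [tokenize_reference, build_dom_reference]

# Whole-system integration cases: input text mapped to expected final output.
INTEGRATION_CASES = {
    "hello   world": "<p>hello world</p>",
    "a\nb": "<p>a b</p>",
}

# Only stage 1 is replaced, so any new failure points at the ported stage.
hybrid_pipeline = swap_stage(REFERENCE_PIPELINE, 1, build_dom_ported)
print(failing_cases(hybrid_pipeline, INTEGRATION_CASES))
```

Because every other stage stays on the reference implementation, an integration-test failure in the hybrid pipeline can be attributed to the single stage that was swapped in.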

Following these ground rules enabled us to discover problem areas early, not get bogged down in getting everything technically perfect or explode our scope, and get this work done in a reasonable timeframe. While we originally anticipated a timeframe of 9 months to final rollout, it ended up taking 11 months.

Timeline

There were four main phases of the work: pre-porting, porting and QA, deployment and rollout, and post-port cleanup.

1. Pre-porting phase

We made the firm decision to undertake the porting project at the end of 2017. We undertook a low-key prototyping phase starting early in 2018 to buy down risk. Our goals were to gain familiarity with PHP, get some sense of the ease/difficulty of porting and quantify possible performance issues. We extracted and ported a few individual stages from Parsoid’s parsing pipeline, then tested and benchmarked them. This gave us an early hint that perhaps performance might not be as much of a concern as we had feared.

During the Parsing Team offsite in September 2018, we discussed the porting project in more depth and arrived at a rough set of requirements for a successful port. One of the critical steps we identified was a syntactic refactoring of the Parsoid/JS codebase to reduce the impedance mismatch between JS & PHP. For example, we updated our JS code to use modern JS class syntax and a “one class per file” rule borrowed from the PHP autoloader requirements. This ensured our code would more easily translate to PHP. We also identified functional refactorings to minimize potential performance issues in the port. All of the refactored JS code was deployed to production during 2018 to ensure the refactorings didn’t introduce regressions.

2. Porting and QA phase

After the preparatory work was done and we had wrapped up existing commitments, we kicked off the porting project at the end of January 2019 with a half-day inter-team working meeting, which identified some key technical challenges we would have to address early on. We started porting in earnest in February, and the port was more or less done in October 2019. By the end of October, we were passing all our parser integration tests and a number of other test modes and had done some performance tuning.

One of our project goals required Parsoid/PHP and Parsoid/JS HTML to be byte-for-byte identical. To establish this, we built an HTML comparison script that sampled about 150K pages from production, compared the output, and dumped HTML differences between the two. We ended up accepting some minor meaningless differences between pages (for example, in the property order and formatting of JSON embedded in attributes) which we normalized away. We used this methodology to expose any meaningful output differences and by mid-November 2019 we’d eliminated them and were satisfied that the remaining HTML differences were insignificant.
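
To make the normalization idea concrete, here is a minimal sketch of how differences in the property order and formatting of JSON embedded in attributes could be normalized away before comparing two HTML strings. It is not the comparison script described above; the function names are hypothetical, the `data-mw` attribute in the usage example is only an illustrative choice, and it uses nothing beyond Python's standard `html.parser` and `json` modules.

```python
import html
import json
from html.parser import HTMLParser
from typing import List, Optional


def canonical_attr(value: Optional[str]) -> str:
    """Re-serialize JSON objects/arrays canonically; leave other values alone."""
    if value is None:
        return ""
    try:
        parsed = json.loads(value)
    except ValueError:
        return value  # not JSON at all
    if not isinstance(parsed, (dict, list)):
        return value  # plain numbers, strings, booleans stay untouched
    # Sorted keys and fixed separators remove order and whitespace differences.
    return json.dumps(parsed, sort_keys=True, separators=(",", ":"))


class _Normalizer(HTMLParser):
    """Re-emit markup with every attribute value in canonical form."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f' {name}="{html.escape(canonical_attr(value), quote=True)}"'
            for name, value in attrs
        )
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(html.escape(data, quote=False))


def normalize(markup: str) -> str:
    parser = _Normalizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def has_meaningful_diff(old_html: str, new_html: str) -> bool:
    """True only if the two documents differ after normalization."""
    return normalize(old_html) != normalize(new_html)


old = "<p data-mw='{\"a\":1,\"b\":[1,2]}'>Hi</p>"
new = "<p data-mw='{\"b\": [1, 2], \"a\": 1}'>Hi</p>"
print(has_meaningful_diff(old, new))  # the two differ only in JSON formatting
```

The sketch ignores comments, doctypes, and attribute ordering; a production comparison would need to decide explicitly which of those count as meaningful.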

We also did preliminary performance benchmarking in this phase which indicated that Parsoid/PHP would likely be faster than Parsoid/JS in production. That was a relief!

3. Deployment and rollout phase

The Parsing, Core Platform, and ServiceOps teams brainstormed early in 2019 about deployment and rollout strategy for Parsoid/PHP. Although the exported API wasn’t going to change, architecture differences between Parsoid/PHP and Parsoid/JS necessitated changes to the deployment process.

By the end of October 2019, we did a shadow deployment of Parsoid/PHP to the same cluster that was running Parsoid/JS. Each host was running both versions at once. We began to mirror a portion of the Parsoid/JS requests over to Parsoid/PHP although only the response from Parsoid/JS was used; the response from Parsoid/PHP was simply discarded. We eventually increased the mirrored fraction to include the entire non-user-initiated reparse traffic to Parsoid, which was over 90% of the total requests. Exposing Parsoid/PHP to a production-equivalent traffic load allowed us to discover and fix a number of crashers and bugs not encountered in our previous testing as well as quantify the performance of the port. In the end, we were able to handle the mirrored traffic through both versions of Parsoid simultaneously without expanding the server cluster at all.
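
The mirroring pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not the production setup: `MirroringRouter`, its parameters, and the handler callables are hypothetical, and a real deployment would mirror requests at the infrastructure level and asynchronously rather than inline in one process.

```python
import random
from typing import Callable, List, Optional, Tuple

Handler = Callable[[str], str]


class MirroringRouter:
    """Serve every request from the primary; copy a fraction to the shadow."""

    def __init__(self, primary: Handler, shadow: Handler,
                 mirror_fraction: float,
                 rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= mirror_fraction <= 1.0:
            raise ValueError("mirror_fraction must be between 0 and 1")
        self.primary = primary
        self.shadow = shadow
        self.mirror_fraction = mirror_fraction
        self.rng = rng if rng is not None else random.Random()
        # Shadow crashes are recorded for later debugging, never shown to users.
        self.shadow_failures: List[Tuple[str, str]] = []

    def handle(self, request: str) -> str:
        response = self.primary(request)
        if self.rng.random() < self.mirror_fraction:
            try:
                self.shadow(request)  # shadow response is deliberately discarded
            except Exception as exc:
                self.shadow_failures.append((request, repr(exc)))
        return response  # only the primary's response is ever used


# Hypothetical handlers standing in for the two services.
def primary_service(request: str) -> str:
    return "primary:" + request


def shadow_service(request: str) -> str:
    if request == "bad":
        raise RuntimeError("crash in shadow")
    return "shadow:" + request


router = MirroringRouter(primary_service, shadow_service, mirror_fraction=1.0)
print(router.handle("page"))
print(router.handle("bad"))
print(router.shadow_failures)
```

Raising `mirror_fraction` gradually exposes the shadow service to more of the real workload, while a failure in the shadow path can never change what the caller receives.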

By mid-November 2019, we were ready to start switching wikis over. We started with test wikis, then moved to mediawiki.org and private wikis, in the process discovering and resolving a few more issues. After a break for Thanksgiving, we resumed deployment to larger wikis. We discovered some problems with the language variant implementation in Parsoid/PHP and switched those endpoints back to Parsoid/JS temporarily. On December 2nd and 3rd, we switched VisualEditor, Flow, Content Translation, and the Mobile Content Service on all wikis over to Parsoid/PHP. Over the next couple of weeks, we resolved the issues with the language variant and wikitext linting services and switched them to Parsoid/PHP as well. By December 20, we had switched all products on all wikis to Parsoid/PHP and stopped all traffic to Parsoid/JS.

Overall, this rollout was very pleasantly uneventful. The vast majority of wiki users didn’t notice anything at all.

Multi-team collaboration was one of the key pieces of the successful outcome to this point and will be key to the next steps. The bulk of the porting work was undertaken by the Parsing Team, Core Platform Team, Product Infrastructure Team, and a contractor. This collaboration helped bridge early knowledge gaps in the Parsing Team, short-circuit potential gotchas while porting JS code to PHP, and bootstrap the codebase quickly with CI, linting tools, phan, and documentation tools.

Next steps

We are going to spend some time addressing technical debt incurred during porting, then shift our focus to the unification of the two wikitext engines. We have gone from two parsers to three, and now back to two, but we won’t be done until we have just one. We anticipate this will probably take another 18 months of focused and disciplined work, and we look forward to the point when we have a single wikitext engine in MediaWiki for all users.

About this post

This post was originally published on the Wikimedia Wikitext and Parsing Phame blog.

Featured Image Credit: Árboles Bosque Amanecer Atardecer, Larisa Koshkina, CC BY-SA 4.0