Wikimedia’s Event Data Platform, or JSON is ok too

The Wikimeda Foundation has been working with event data since 2012. Over time, our event collection systems have transitioned from being used only to collect analytics data to being used to build important user facing features. This 3 part series will focus on how Wikimedia has adapted these ideas for our own unique technical environment.

By Andrew Otto, Staff Site Reliability Engineer

In the past few years, event-driven architectures (EDAs) have been getting a lot of attention. When done right, they help fulfill some of the promises of service-oriented architectures (SOAs) while mitigating many of their headaches. Event sourcing, coupled with complex event processing in particular, allows data to be materialized into services in the format they need when they need it. This recent data architecture trend can mostly be attributed to the success of Apache Kafka, which enables these kinds of architectures at scale. Confluent, founded by the creators of Kafka, have done an amazing job writing about and evangelizing EDAs on their blog, so we won’t try to expound on them any further. Instead, this 3-part series will focus on how Wikimedia has adapted these ideas for our own unique technical environment.

The Wikimedia Foundation has been working with event data since 2012. Over time, our event collection systems have transitioned from being used only to collect analytics data to being used to build important user-facing features. A few years ago, we began an effort to refactor our special-purpose analytics events system to one that closely resembles what Confluent calls a ‘stream data platform.’ However, even though most of Confluent’s components are free to use, we chose to implement some of our own. This article will first explain this decision and also briefly describe the Open Source components we developed to build Wikimedia’s Event Data Platform.

Confluent Community License

When Wikimedia began designing our event platform, all of Confluent’s free components were Open Sourced under the Apache 2.0 license. At the end of 2018, Confluent switched all its free software to a new custom Confluent Community License, which is not strictly a Free Open Source software license. One of the Wikimedia Foundation’s guiding principles is to use Free Open Source software in the software and systems we develop. Our decisions to avoid Confluent components were not originally influenced by the CCL—we began our system design process before they changed licenses—but in hindsight they proved to be good choices. (Aside: we did want to use Confluent’s Kafka Connect HDFS and now can’t due to the CCL, but I’ll save that story for another day.)

Event schemas

The main differences between Wikimedia’s and Confluent’s event stream platforms are all related to event schemas. In an event-driven architecture, events must conform to a schema in order for the various disparate producers and consumers to be able to communicate reliably. Schemas define a contract for the data used by many services, just like an API specification would in typical SOAs. They are a required component of building dependable and extensible EDAs.

What about Avro?

Confluent’s stream data platform uses Apache Avro for event schemas and for in-flight data serialization. Avro is a natural choice for organizations that are already heavily JVM-based. Avro is language-agnostic, but the software ecosystem around it is oriented towards the Java virtual machine. In addition to needing an Avro implementation in a language to use Avro data, an Avro library also needs an Avro schema in order to serialize and deserialize Avro binary data records. Outside of a streaming context, Avro data is usually serialized in binary files. These files contain a schema header which instructs Avro readers how to read the data records in the rest of the file.

However, in a streaming context, where every message in Kafka is a single data record, how can a consumer get the Avro schema it needs to deserialize the binary record? Confluent solves this problem with its Schema Registry. The Schema Registry is a centralized service that stores Avro (and as of April 2020, JSONSchema and Protobuf) schemas. To produce an Avro message, you must use a custom Confluent serializer that wraps the binary Avro message with an envelope containing an opaque numeric schema id registered with your running installation of Schema Registry. When consuming the message, you must also use a custom Confluent deserializer that knows how to unwrap the message payload to read the schema id, and then look up that schema id from the Schema Registry.

That’s a lot of service coupling just to produce or consume data. Any time you want to produce or consume any data from Kafka, you must have on hand a Confluent client implementation that is configured properly to talk to the running Schema Registry service. This coupling might be acceptable in a closed JVM shop, but it is untenable for the Wikimedia Foundation.

Making choices

Wikimedia’s software development model is decentralized. Volunteer developers must be able to contribute to our software just as Wikimedia staff does. We try to build forkable software, data, and systems. For example, our production helm charts and puppet manifests are Open Source. If someone tried hard enough, they could reproduce the entirety of the Wikimedia technical infrastructure. Using Avro and the Confluent Schema Registry would make this difficult. Yes, we could find ways of transforming data and formats for public consumption, but why make our lives harder if we don’t have to?

JSON is arguably more ubiquitous than Avro. But it is only loosely schemaed. The name of every field is stored in each record, but types of those fields are not. JSONSchema is commonly used to validate that JSON records conform to a schema, but it can also be used to solve data integration and conversion problems (AKA ETL) as long as the schema maps well to a strongly typed data model.

In 2018, we went through a Wikimedia Technical RFC process to decide if we should commit to the Avro route or stick with JSON, and ended up choosing JSON mostly due to the fact that most consumers of event data would not need any special tooling to read the data. However, by choosing JSONSchema instead of Avro, we walked away from Confluent’s Schema Registry, as at the time it only supported Avro schemas. We also missed out on a really important feature of Avro: schema evolution.

We were left needing to implement for JSON and JSONSchema two features that are built into Confluent’s default stream data platform components: Schema evolution and schema distribution. We noticed that we weren’t the only ones that needed tools for using JSONSchemas in EDAs, so we decided to solve this problem in a decentralized and open sourced way. Schemas are programming language-agnostic data types. We felt that they should be treated just like we treat code, so we chose to distribute schemas using git. We also chose to enforce compatible schema evolution just like we’d check code changes: using code tests. The outcome was jsonschema-tools, a CLI and library to manage a git repository of semantically versioned JSONSchemas.

We also needed an easy way for developers to produce valid event data into Kafka. Confluent has a Kafka REST Proxy which integrates with their Schema Registry to do this. We wanted a generic event validating and producing service that would work with decentralized JSONSchemas. EventGate was born to do just that.

Next up

The next post in this series will explain how we manage JSONSchemas for streaming event data using jsonschema-tools. After that, expect a post on how EventGate interacts with a schema repository to validate and produce events, as well as Wikimedia’s specific EventGate customizations and features.

About this post

This post is the 1st of a 3 part series. Read the 2nd post here.

Featured image credit: Dobson Stream by Wharfedale hut with the moon, Mt Oxford area, Canterbury, New Zealand, Michal Klajban, CC BY-SA 4.0