Wikimedia’s Event Data Platform – JSON & Event Schemas

In the previous post, we talked about why Wikimedia chose JSONSchema instead of Avro for our Event Data Platform. This post will discuss the conventions we adopted and the tooling we built to support an Event Data Platform using JSON and JSONSchema.

By Andrew Otto, Staff Site Reliability Engineer

For Avro serialized streaming event data, the event’s Avro schema is needed during both production and consumption of the event. For JSON events, the schema is not strictly needed to produce or consume the data. JSON records include minimal ‘schema’ information, including the structure of the record and the field names. Field type information is what is missing.

If we are using JSON, why do we need the schemas at all? Producers and consumers don’t need it. What does?

There are two answers to this question.

1. We want to validate that all incoming events of a specific type conform to a schema at produce time, so that consumers can rely on it being consistent.
2. Schemas are useful for data integration, e.g. ingesting event data into an RDBMS or elsewhere.

To do either of these, we need to be able to find the schema of an event.

Confluent does this with opaque schema IDs which are used to look up schemas in their remote Schema Registry service. This means that each schema gets assigned its own unique integer ID inside of the Schema Registry (just like a unique ID in a database). This ID carries no meaning outside of a particular Schema Registry installation.

But we’d like the ability to look up a schema without access to a centralized registry. Wikimedia does this for JSONSchemas with versioned URIs.

Schema URIs

JSONSchema documents themselves have their own ‘meta schemas’. The meta schema of a JSONSchema document is identified using the $schema keyword, which usually is set to the official URL of a JSONSchema version, e.g. ‘http://json-schema.org/draft-07/schema#’. JSONSchema validator implementations use this to know which version of JSONSchema to use when interpreting a JSONSchema document.

So $schema is used to identify and look up the meta schema of a JSONSchema document.

This kind of lookup is exactly what we need for our JSON event data: we want to look up the schema of the JSON event document. We add the $schema keyword in our event data itself and set it to the URI of the event’s JSONSchema. Here’s an example event:

{
  "$schema": "http://schema.example.org/coolsoftware/user_create.json",
  "datafield1": "value1"
}

With the $schema present in every event and set to a URI pointing at the event’s schema, any user of the event can get its schema at will. Wikimedia’s EventGate HTTP -> Kafka proxy uses $schema to validate events before they are produced to Kafka. We use the same field to look up the JSONSchema and convert it to a Hive or Spark schema during ingestion into Hadoop. We’ll come back to use cases like this later.

Decentralized Schema Repositories

In the first post in this series, I described how Wikimedia code and data should be decentralized. But the example event above had a fully qualified URL for its $schema, which would require anyone who wanted to look up that event’s schema to do so from a centralized address. This also couples the schema lookup to that remote website. If the schema website goes down, schema lookup will fail.

Git is great at decentralization, so we keep all schemas in git repositories. This means that users can find schemas wherever the schema repository is cloned: in your local file system, somewhere in a distributed file system, or somewhere at a remote web address. Since these locations will all clone the same git repository, we can assume that they have the same relative directory hierarchy and choose to only use relative $schema URIs in event data.

Our example event would instead look like this:

{
  "$schema": "/coolsoftware/user_create.json", 
  "datafield1": "value1" 
}

We’ve removed the reference to a centralized web address…but what now? This surely isn’t enough to look up a schema.

It isn’t! We will need to prefix the $schema URI with a base address or path. If we want to read the schema from a local clone of our schema repository, we can prefix the relative $schema URI with a file:// base path, giving file:///path/to/schema_repo/coolsoftware/user_create.json. If the schema is hosted at a remote HTTP address, we can prefix it with that domain, giving http://schema.example.org/coolsoftware/user_create.json.
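As a rough illustration of that prefixing step, here is a minimal Python sketch. The function name resolve_schema_uri is made up for this example and is not part of any Wikimedia library; the base URIs are the placeholder ones used in this post.

```python
def resolve_schema_uri(base_uri: str, schema_uri: str) -> str:
    """Prefix a relative $schema URI with a schema repository base URI."""
    # Plain concatenation is deliberate: urllib.parse.urljoin would treat the
    # leading "/" as "replace the whole path" and drop the repository path.
    return base_uri.rstrip("/") + "/" + schema_uri.lstrip("/")


event = {"$schema": "/coolsoftware/user_create.json", "datafield1": "value1"}

# The same event resolves against a local clone or a remote host.
local = resolve_schema_uri("file:///path/to/schema_repo", event["$schema"])
remote = resolve_schema_uri("http://schema.example.org", event["$schema"])
print(local)
print(remote)
```

The event itself never changes; only the base URI configured by whoever is reading it does.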

This is especially useful in development environments while developers create and change schemas and producer code. Their environments can configure a local schema base path to their clone of the schema repository. This also helps to decouple services in production. A critical service that needs to look up schemas can have a local copy of the schema repository, whereas a less critical service that might want to do more dynamic schema lookups can use a remote web address to the schema repository.
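One way to picture this kind of configuration is a loader that tries a list of base URIs in order. The sketch below is only an illustration using the Python standard library: load_schema is a hypothetical helper, and a temporary directory stands in for a local clone of the schema repository. In a real deployment the list might hold a local path first and a remote address such as http://schema.example.org second.

```python
import json
import tempfile
import urllib.request
from pathlib import Path


def load_schema(schema_uri: str, base_uris: list[str]) -> dict:
    """Return the first schema found by trying each base URI in order."""
    errors = []
    for base_uri in base_uris:
        url = base_uri.rstrip("/") + "/" + schema_uri.lstrip("/")
        try:
            # urlopen handles file:// as well as http(s):// URLs.
            with urllib.request.urlopen(url, timeout=5) as response:
                return json.load(response)
        except (OSError, ValueError) as err:
            # Missing file, unreachable host, or invalid JSON: try the next base.
            errors.append(f"{url}: {err}")
    raise LookupError(f"schema {schema_uri!r} not found:\n" + "\n".join(errors))


# Demo: a throwaway directory stands in for a local clone of the repository.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "coolsoftware").mkdir()
    (repo / "coolsoftware" / "user_create.json").write_text(
        '{"title": "coolsoftware/user_create"}'
    )
    base_uris = [
        (repo / "missing_clone").as_uri(),  # lookup fails here, so we fall through
        repo.as_uri(),
    ]
    schema = load_schema("/coolsoftware/user_create.json", base_uris)

print(schema["title"])
```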

Versioned Schemas

Code changes over time and so does data. As the data changes, so must the schemas. In an event driven architecture, event data is immutable and captures the historical record of all changes. We need a way to modify event schemas so that all historical event data of the same schema lineage will validate with the latest version of that schema. We also want to be able to look up the specific schema version an event was originally written with.

Wikimedia distributes schema files in git repositories, but we intentionally don’t use git for versioning of those files. We want every version of a schema to be readily available for on demand use by code. Instead of naming files after the schema title, we name them with their semantic version. The JSONSchema title keyword is used to name the full schema lineage, and we keep immutable versioned files in a directory hierarchy that matches the schema title.

For example, an event schema that is modeling user account creations might be titled ‘coolsoftware/user/create’. The various versions of this schema would live in our schema repository in the coolsoftware/user/create/ directory with semantically versioned filenames like ‘1.0.0.json’, etc. An example schema repository directory tree might look like:

schemas
└── coolsoftware
  ├── user
  │  └── create
  │    ├── 1.0.0.json
  │    └── 1.0.0 -> 1.0.0.json
  └── product
    └── order
      ├── 1.0.0.json
      ├── 1.0.0 -> 1.0.0.json
      ├── 1.1.0.json
      └── 1.1.0 -> 1.1.0.json

These schemas will be addressed by URIs. To hide the file format extension in those URIs, we create extensionless symlinks to each version, e.g. 1.0.0 -> 1.0.0.json.

Each schema version will identify itself using the JSONSchema $id keyword set to its specific versioned relative base URI inside the schema repository. The coolsoftware/user/create/1.0.0.json file will have "$id": "/coolsoftware/user/create/1.0.0". (Using $id in this way also has advantages for reusing schema fragments using $ref pointers.)
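To make the convention concrete, here is a small sketch that checks a versioned schema against its location in the repository. The helper names and the schema body (beyond title and $id) are invented for illustration; this is not how jsonschema-tools implements its checks.

```python
from pathlib import PurePosixPath


def expected_id(version_file: str) -> str:
    """Derive the $id a versioned schema file should declare from its repo-relative path."""
    # Drop the ".json" extension and make the URI root-relative.
    return "/" + str(PurePosixPath(version_file).with_suffix(""))


def follows_conventions(schema: dict, version_file: str) -> bool:
    """True if title matches the directory and $id matches the versioned path."""
    directory = str(PurePosixPath(version_file).parent)
    return schema.get("title") == directory and schema.get("$id") == expected_id(version_file)


# Hypothetical contents of coolsoftware/user/create/1.0.0.json
schema = {
    "title": "coolsoftware/user/create",
    "$id": "/coolsoftware/user/create/1.0.0",
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {"datafield1": {"type": "string"}},
}

ok = follows_conventions(schema, "coolsoftware/user/create/1.0.0.json")
print(ok)
```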

Events will have their $schema field set to a relative schema URI that now includes the schema title and the schema version, matching exactly the $id field of the versioned schema. A coolsoftware/user/create event now might look like:

{ 
  "$schema": "/coolsoftware/user/create/1.0.0",
  "datafield1": "value1"
}

Code can use the $schema URI to look up the schema for an event from a local or remotely hosted schema repository.
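Because the URI carries both the schema title and the version, code can also pull those two pieces apart, for example to group events by lineage or to compare versions. A minimal sketch, with split_schema_uri being a name made up for this example:

```python
def split_schema_uri(schema_uri: str) -> tuple[str, tuple[int, int, int]]:
    """Split '/some/title/1.0.0' into ('some/title', (1, 0, 0))."""
    title, _, version = schema_uri.strip("/").rpartition("/")
    major, minor, patch = (int(part) for part in version.split("."))
    return title, (major, minor, patch)


event = {"$schema": "/coolsoftware/user/create/1.0.0", "datafield1": "value1"}
title, version = split_schema_uri(event["$schema"])
print(title, version)
```

Parsing the version into a tuple of integers rather than comparing strings means that 1.10.0 correctly sorts after 1.9.0.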

This versioned schema hierarchy inside of a git repository allows us to decentralize the versioned schemas and refer to them with base location agnostic relative URIs.

jsonschema-tools

Each versioned file in our repository contains a full JSONSchema. But this means that whenever a developer wants to make a new version, they have to copy and paste the full previous version into their new version file and make the changes they want. This could result in subtle copy/paste bugs or undetected schema differences.

Also, all versions of a schema lineage that share the same major version must be fully compatible with each other. The only type of change that is allowed for full compatibility is adding optional fields. This means no field deletion and no field renaming (which is essentially a deletion).
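To give a feel for what enforcing this rule involves, here is a deliberately simplified sketch of a compatibility check. It is not the check that jsonschema-tools runs: it only looks at the properties, type and required keywords, and ignores everything else in JSONSchema. The function name and the example schemas are invented for illustration.

```python
def compatibility_errors(old: dict, new: dict, path: str = "") -> list[str]:
    """List the ways `new` breaks the 'only add optional fields' rule relative to `old`."""
    errors = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})

    for name, old_prop in old_props.items():
        field = f"{path}{name}"
        if name not in new_props:
            errors.append(f"{field}: removed (or renamed)")
            continue
        new_prop = new_props[name]
        if old_prop.get("type") != new_prop.get("type"):
            errors.append(f"{field}: type changed")
        elif old_prop.get("type") == "object":
            # Apply the same rule to nested objects.
            errors.extend(compatibility_errors(old_prop, new_prop, f"{field}."))

    # A newly required field would reject events written with the old version.
    for name in sorted(set(new.get("required", [])) - set(old.get("required", []))):
        errors.append(f"{path}{name}: newly required")
    return errors


v1_0_0 = {
    "type": "object",
    "properties": {"datafield1": {"type": "string"}},
    "required": ["datafield1"],
}
# Adds an optional field: allowed.
v1_1_0 = {
    "type": "object",
    "properties": {"datafield1": {"type": "string"}, "datafield2": {"type": "integer"}},
    "required": ["datafield1"],
}
# Renames a field: not allowed within the same major version.
renamed = {
    "type": "object",
    "properties": {"data_field_1": {"type": "string"}},
    "required": ["data_field_1"],
}

print(compatibility_errors(v1_0_0, v1_1_0))
print(compatibility_errors(v1_0_0, renamed))
```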

Rather than relying on developers to enforce these rules, Wikimedia has developed jsonschema-tools: a library, command line interface, and set of tests to manage the schema development lifecycle.

Instead of manually keeping copies of each schema version, jsonschema-tools generates schema version files from a single ‘current’ schema file. A developer can modify the current schema file, update the version in the $id field, and still keep the previous version files around. jsonschema-tools will also (by default) attempt to dereference any JSON $ref pointers so that the generated version schema files require no dereferencing or lookups at runtime. It also exports a set of tests that can be run in your schema repository to enforce compatibility between the versions, as well as rules and conventions.

Continuing with the coolsoftware/user/create schema example, if a developer wants to add a new field, they would edit the coolsoftware/user/create/current.json file, add the field, and bump the version in the $id field to "$id": "/coolsoftware/user/create/1.1.0". jsonschema-tools will handle generating a static coolsoftware/user/create/1.1.0.json file (as well as some handy symlinks). If your schema repository has been configured properly, running npm test will recurse through your schema repository and ensure that schema versions with the same title are compatible with each other.
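For readers who want a mental model of that generation step, the sketch below mimics its visible outcome in Python: a versioned file named after the version in $id, plus an extensionless symlink. It is an illustration only and not how jsonschema-tools is implemented (the real tool is a Node.js package and also dereferences $ref pointers, among other things). The materialize_version helper is hypothetical, and a temporary directory stands in for the schema repository.

```python
import json
import os
import tempfile
from pathlib import Path


def materialize_version(current_file: Path) -> Path:
    """Write <version>.json next to current.json, named after the version in $id."""
    schema = json.loads(current_file.read_text())
    version = schema["$id"].rsplit("/", 1)[-1]  # e.g. "1.1.0"
    version_file = current_file.with_name(f"{version}.json")
    if version_file.exists():
        raise FileExistsError(f"{version_file} exists; released versions are immutable")
    version_file.write_text(json.dumps(schema, indent=2) + "\n")
    # Extensionless symlink so the $id URI maps onto a file: 1.1.0 -> 1.1.0.json
    os.symlink(version_file.name, current_file.with_name(version))
    return version_file


# Demo in a throwaway directory standing in for the schema repository.
with tempfile.TemporaryDirectory() as tmp:
    schema_dir = Path(tmp) / "coolsoftware" / "user" / "create"
    schema_dir.mkdir(parents=True)
    current = schema_dir / "current.json"
    current.write_text(json.dumps({
        "title": "coolsoftware/user/create",
        "$id": "/coolsoftware/user/create/1.1.0",
    }))
    materialize_version(current)
    created = sorted(p.name for p in schema_dir.iterdir())
    link_target = os.readlink(schema_dir / "1.1.0")

print(created, link_target)
```

Refusing to overwrite an existing version file reflects the idea that released schema versions are immutable.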

jsonschema-tools does a little more than what is described in this post. Check out the documentation for more information.

This all might seem like a lot just to manage schemas, but a big advantage here is that schema versions are decentralized and immutable. jsonschema-tools does move some of the complexity of developing schemas to the developer environment. However, once a schema change is merged, all that is needed for other developers or code to access that schema is to have a URI and a clone of the schema repository (either locally or accessible via HTTP).

Now that we’ve made versioned and compatible schema files accessible, how do we use them to ensure that the events produced into Kafka are valid? Wikimedia has developed an HTTP event intake and validation service and library based on JSONSchema URIs: EventGate. The next post will go into how EventGate works, how Wikimedia uses it, and how you can use it together with your schema repository to build your own custom event intake service.

About this post

This is part 2 of a 3 part series. Read part 1. Read part 3.

Featured image credit: Vue aérienne raprochée du grand récif de Gatope et son trou bleu, Kévin Thenaisie, CC BY-SA 4.0