JSONoid Discovery: Simplify Schema Discovery

On a recent client project, my software development team faced a mismanaged legacy NoSQL database. It contained hundreds of thousands of records spread across a couple of dozen containers in each of four environments. We got ourselves out of trouble with JSONoid Discovery.

The Problem

The biggest issue was the lack of schema enforcement or documentation throughout the life of the database. This meant every query result needed to be painstakingly validated at runtime.

Step one of migrating to a more reasonable data storage solution was understanding the current state of the data. For the first container that we tackled, we tried a manual process. We researched data access patterns, interviewed developers, and sampled the existing data. This was a difficult process that didn’t yield satisfactory results, and we still regularly found ourselves chasing edge cases.

The Tool

If we were going to solve this problem in a timely way, we needed an automated solution that removed the guesswork. Enter: JSONoid Discovery. JSONoid is a tool created by Rochester Institute of Technology’s Data Unity Lab, directed by Dr. Michael Mior. It uses some clever monoids to perform schema discovery on a collection of JSON documents. The code is open source under an MIT license and helpfully distributed as a Docker container.
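To get a feel for the monoid idea, here is a loose Python illustration (not JSONoid’s actual Scala implementation): each property carries a small summary, and two summaries combine associatively, so discovery can stream over any number of documents and merge partial results in any order.

from dataclasses import dataclass

# Toy summary of string lengths for a single property. Combining two
# summaries is associative, which is what makes this (together with an
# identity element, omitted for brevity) a monoid.
@dataclass
class LengthSummary:
    min_length: int
    max_length: int

    @staticmethod
    def of(value: str) -> "LengthSummary":
        return LengthSummary(len(value), len(value))

    def combine(self, other: "LengthSummary") -> "LengthSummary":
        return LengthSummary(
            min(self.min_length, other.min_length),
            max(self.max_length, other.max_length),
        )

# Fold the summaries of three observed values into one.
summaries = [LengthSummary.of(v) for v in ["", "FOO BAR", "EXAMPLE"]]
merged = summaries[0]
for s in summaries[1:]:
    merged = merged.combine(s)
print(merged)  # LengthSummary(min_length=0, max_length=7)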

Usage

We exported the data of an entire container from Azure using the Azure Cosmos DB developer community’s Data Migration Tool; any NoSQL database will have some way of extracting a group of JSON documents. Whatever the source, it’s important to include all of the data of a single container, not a sampling, to get the most accurate output possible.

JSONoid expects newline-delimited JSON as input. Converting a standard JSON array to newline-delimited JSON is easy with jq.

jq -c '.[]' exported-data.json > newline-delimited-data.json

Before you can execute the tool, you’ll need to pull the Docker image.

docker pull michaelmior/jsonoid-discovery

Then executing the tool is as simple as feeding the input file to the Docker container and capturing the output.

docker run --rm -i michaelmior/jsonoid-discovery < newline-delimited-data.json > output-schema.json

Understanding the Output

The output of JSONoid Discovery is a JSON schema that describes the structure and properties of the input data. The repository offers some example outputs, but these appear to come from fairly normalized data. We found that our messy input data produced some output that required further manual analysis. An anonymized and abbreviated example of our output is below.

{
  "type" : "object",
  "additionalProperties" : false,
  "fieldPresence" : {
    "id" : 1.0,
    "property1" : 0.0082625099441552,
    "property2" : 0.9872163454343528
  },
  "dependentRequired" : {
    "property1" : [ "property2" ]
  },
  "required" : [ "id" ],
  "properties" : {
    "id" : {
      "type" : "string",
      "minLength" : 36,
      "bloomFilter" : "BLOOM_FILTER",
      "distinctValues" : 131551,
      "hll" : "HLL",
      "lengthHistogram" : {
        "bins" : [ [ 35.69989444930044, 122185 ] ],
        "hasExtremeValues" : false
      },
      "format" : "uuid",
      "examples" : [ "04a4426b-7387-4fc4-81e5-c6d45adf54ad", "08160b45-4d7a-4fd6-8492-26c8f6054872", ... ],
      "maxLength" : 36
    },
    "property1" : {
      "enum" : [ "value1", "value2" ]
    },
    "property2" : {
      "type" : "string",
      "minLength" : 0,
      "bloomFilter" : "BLOOM_FILTER",
      "distinctValues" : 193,
      "hll" : "HLL",
      "lengthHistogram" : {
        "bins" : [ [ 0.0, 82 ], [ 1.9813812853544068, 1 ], ..., [ 23.06811574166162, 1 ], [ 24.002636914918774, 1 ] ],
        "hasExtremeValues" : false
      },
      "examples" : [ "", "FOO BAR", "EXAMPLE" ],
      "maxLength" : 24
    }
  },
  "$schema" : "https://json-schema.org/draft/2020-12/schema",
  "description" : "Generated by JSONoid 0.0.0-1-afeb9aee-SNAPSHOT. Not to be used for validation purposes."
}

Properties

The properties field is a map of all the properties found in the input data. For each property, JSONoid infers a type (e.g. string, number, boolean). It also provides some other interesting data like minimum and maximum length, number of distinct values, and some examples. I was impressed that it was even able to infer enums and timestamp formats for strings.
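As a quick illustration of consuming this output, here’s a small Python sketch (using the file name from the docker run command above) that summarizes the inferred type and basic statistics for each property:

import json

# Load the schema produced by JSONoid Discovery.
with open("output-schema.json") as f:
    schema = json.load(f)

# Print the inferred type and a few basic statistics for each property.
for name, prop in schema.get("properties", {}).items():
    inferred = prop.get("type", "enum" if "enum" in prop else "unknown")
    line = f"{name}: type={inferred}"
    if "minLength" in prop and "maxLength" in prop:
        line += f", length {prop['minLength']}-{prop['maxLength']}"
    if "distinctValues" in prop:
        line += f", {prop['distinctValues']} distinct values"
    print(line)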

Advanced Metadata

In addition to these basic statistics, JSONoid also provides some more advanced metadata. The bloomFilter and hll (HyperLogLog) fields are probabilistic data structures that can be used for efficient membership testing and cardinality estimation, respectively. The lengthHistogram field provides a histogram of the lengths of string values, which can be useful for understanding the distribution of the data.
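The serialized values themselves are meant to be consumed by tooling, but the underlying ideas are simple. Here’s a toy Bloom filter in Python to illustrate probabilistic membership testing; this sketch only shows the concept and does not match JSONoid’s serialization format.

import hashlib

# A toy Bloom filter: membership tests can yield false positives but
# never false negatives. HyperLogLog plays a similar probabilistic role
# for estimating the number of distinct values (reported above as
# distinctValues).
class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, value):
        # Derive several bit positions from salted hashes of the value.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        return all(self.bits[pos] for pos in self._positions(value))

bf = BloomFilter()
bf.add("04a4426b-7387-4fc4-81e5-c6d45adf54ad")
print(bf.might_contain("04a4426b-7387-4fc4-81e5-c6d45adf54ad"))  # True
print(bf.might_contain("ffffffff-0000-0000-0000-000000000000"))  # almost certainly False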

Field Frequency

Another important field in the output schema is fieldPresence. This is a map that indicates the frequency with which each property appeared in the input data. A value of 1.0 means the property was present in every record, while a value of 0.5 would mean it was present in half of the records. We found this data extremely valuable when deciding whether a field was an outlier from a bad write operation or a necessary part of the schema. I should note that a small number in fieldPresence does not necessarily indicate an outlier; it could just be that a less common subclass of the schema uses that property.
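The fieldPresence numbers are also straightforward to sanity-check yourself. Here’s a Python sketch that recomputes them from the newline-delimited input (file name from the jq example above):

import json
from collections import Counter

counts = Counter()
total = 0
with open("newline-delimited-data.json") as f:
    for line in f:
        record = json.loads(line)
        counts.update(record.keys())  # count each top-level field once per record
        total += 1

# Fraction of records in which each field appears.
for field, count in counts.most_common():
    print(f"{field}: {count / total:.4f}")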

Dependency Mapping

Finally, the dependentRequired field specifies dependencies between properties. In the example output, it indicates that property2 is required whenever property1 is present. This type of relationship can be important to model in the schema to ensure data integrity. We used these mappings to help determine where we had discriminated unions in our schema; by graphing the relationships, we were able to see two independent subgraphs emerge.
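Here’s a Python sketch of that analysis: treat dependentRequired as an undirected graph and find its connected components. (The abbreviated example above yields only one component; our real schema produced two, which is what pointed us at a discriminated union.)

import json

with open("output-schema.json") as f:
    schema = json.load(f)

# Build an undirected adjacency map from the dependency pairs.
adjacency = {}
for prop, required in schema.get("dependentRequired", {}).items():
    for dep in required:
        adjacency.setdefault(prop, set()).add(dep)
        adjacency.setdefault(dep, set()).add(prop)

# Depth-first search to collect connected components.
seen = set()
components = []
for node in adjacency:
    if node in seen:
        continue
    component, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        component.add(current)
        stack.extend(adjacency[current])
    components.append(sorted(component))

for i, component in enumerate(components, 1):
    print(f"Subgraph {i}: {component}")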

JSONoid Discovery

JSONoid greatly increased our confidence moving forward with this risky data migration. It is an incredibly powerful tool. I also found it very interesting to work with; it’s not often that I find myself reading academic research in my day-to-day software development tasks.

A big thank you to Dr. Mior and everybody at Rochester Institute of Technology’s Data Unity Lab for their research in this field. I look forward to looking into more of their work. In particular, NoSE looks like it could be useful as we continue through this legacy data migration.
