Flattening Nested JSON

Motivation

Once you start functional programming, you become alert to problems that lend themselves to a recursive solution. In my case the programming language is Elixir and the problem in question was posed to my friend at his job interview. He was asked to flatten a heavily nested JSON object. In Python, his language of choice, heavily nested dictionary.

The Problem

Flatten a nested JSON object. But first, a paragraph on JSON.

JSON is a data interchange format, popularized by the proliferation of APIs (thank you REST). Part of its popularity is also due to the fact that it’s text-based, human-readable, and with an incredibly simple specification which could be read and understood at the water cooler.

Here’s an example valid JSON:

'{"versions":[1],"name":"json","text":true,"creator":{"name":"Doug", "address":{"country":"USA"}}}'

Pretty printed:

{
  "creator": {
    "name": "Doug",
    "address": {
      "country": "USA"
    }
  }
  "name": "json",
  "versions": [1],
  "text": true,
}

Our goal is to get rid of the nesting. For the above example, if we do a good job, we should have this result:

{
  "creator.name": "Doug"
  "creator.address.country": "USA",
  "name": "json",
  "versions": [1],
  "text": true,
}

To be fair, this is not an exciting problem. It’s not one of those designed to make you look stupid at the white board. It feels like something you’d actually do on the job.

The Recursive Approach

This is the function we’re going to fill up as we tease apart the problem using well-crafted input.

def flatten(json):
  pass

Input Type One (Empty or One-Level Deep)

{"a": "A"}

The simplest input we could be given is a JSON object that’s either empty or only one-level deep. In that case we just return the input as the output. Or we could build a new object and return that instead. Not the most optimal solution, as we’re unnecessarily duplicating the object. But it helps keep the implementation consistent as we improve it so let’s do that.

1def flatten(json):
2  acc = {}
3  for k, v in json.items():
4    acc[k] = v
5
6  return acc

We initialize an empty object as the accumulator, which is populated as we iterate over the key-value pairs of the JSON object, generated by the items method. At the end we return our newly-built object. It’s a duplicate of the original input, but it’s different. It’s not affected by changes to the original.

Input Type Two (Our First Encounter With Nesting)

{"a": {"b": "c"}, "b": "d"}

The end result of flattening the above input should be as below. I chose . as the separator but it can be anything that makes sense, really.

{"a.b": "c", "b": "d"}

Code solution:

1def flatten(json, acc, prefix):
2  for k, v in json.items():
3    prefixed_k = (prefix + '.' + k) if prefix else k
4    if type(v) is dict:
5      flatten(v, acc, prefixed_k)
6    else:
7      acc[prefixed_k] = v
8
9  return acc

It’s quite a jump from handling the simple input to the slightly more sophisticated one. In the process our function has gained 2 more arguments (one of which we’ve met before) and 3 more lines. A couple of things, though, should remain familiar. For example, we still iterate over the items of the object to build our new object, the accumulator.

The changes which introduce the new behavior have been highlighted.

First the prefix. Since we’re flattening an object, it’s no longer safe to use their original key in the accumulator as it could overwrite an existing pair or be overwritten by a later member. Prefixing the current key with all their ancestor keys prevents this data loss.

On line 4 we do a type test of the value under consideration, if it’s a dict, we flatten it. The arguments to this call are the value under consideration, the original accumulator, and a prefix that contains its key (and all ancestor objects). It is this call to flatten that makes the solution recursive. As it tries to flatten the nested JSON, if it encounters further objects on its way, it pauses to resolve them. And then continues.

The code above is all that was needed at my friend’s interview. Well, not quite. The interviewer wanted a function that flattens a JSON object but now we demand of them to pass 2 other arguments to the function. It wasn’t a disgraceful challenge so we should be nice to them. Also, as the author of a post complaining bitterly about N-parameter functions I owe it to you to do the right thing. So we’ll bring it back to a single-parameter function by taking advantage of one of the many niceties of Python.

In Python, a function can be defined in another function. Just like a variable. Even better, it can be called (of course what’s the point of defining a function if it can’t be used), just like an initialized variable can be used. Let’s use this technique to bring our function’s arity down to one:

def flatten(json)
  def _flatten(json, acc, prefix):
    for k, v in json.items():
      prefixed_k = (prefix + '.' + k) if prefix
      if type(v) is dict:
        _flatten(v, acc, prefixed_k)
      else:
        acc[prefixed_k] = v
    return acc

  return _flatten(json, {}, '')

This should please the interviewer. This new implementation also allows us to accept and return real JSON strings (the single-quoted stuff). To do that we do some pre-processing before calling _flatten and post-process its return value. Here’s the updated and final code:

from json import loads, dumps

def flatten(json):
  def _flatten(json, acc, prefix):
    for k, v in json.items():
      prefixed_k = (prefix + '.' + k) if prefix else k
      if type(v) is dict:
        _flatten(v, acc, prefixed_k)
      else:
        acc[prefixed_k] = v
    return acc

  json_obj = loads(json)
  return dumps(_flatten(json_obj, {}, ''))

Caveat

Imperative languages don’t do recursion well. And Python is no exception. In fact it has a hard limit on how much you can recurse. By default it’s a 1000. (See $ python -c "import sys;print(sys.getrecursionlimit())"). It can be moved, of course, but the limit is there for a reason: to not crash Python. So, if you’re going to move it around, please do so gently.

You know what has no limits on it? The iterative solution. Its implementation is left as exercise to the reader.

Got comments or corrections for factual errors? There’s a Hacker News thread for that.