How to transform unstructured address data using Python & Google’s Maps API

Thilo Hüllmann
codeburst
Published in
4 min readJun 12, 2018

--

Aggregating addresses from different web sources often leads to inconsistent data sets. Structuring this data is one of the tasks nobody really wants to do because it is mundane, tedious and time-consuming. Such tasks will often be outsourced to MTurks or underpaid interns but many times it’s possible to solve those problems with code which is definitely the better option.

Problem

Let’s say you have some kind of data table with addresses that looks like this:

Missing post codes, inconsistent order

but you want it in a format like this:

Seems like a pretty painful task, right? There are plenty of unpleasant edge cases and nightmarish inconsistencies. The example above only includes a few of them.

If you’re an Excel pro and familiar with regular expressions you might find a way to tackle the problem in an automated way but this approach will most likely require a lot of nested if-else statements, still takes a lot of time and will not be readable by anyone else.

Solution

Data manipulation is way easier with Python so let’s take a step back and look at the problem. We are trying to identify parts of an address in a single string, decouple the components and save each in a new column.

Since we are smart and do not try to reinvent the wheel we look for some kind of open source project or free API that can help us with what we‘re trying to achieve. In this case a good approach is the Google Maps Geocoding API. Whatever you type into the search bar Google Maps will most certainly interpret your input correctly and so does its API.

Within certain usage limits the Geocoding API is completely free and you can very quickly set up a project and claim your API key here:

Step by step guide

1. Dependencies

First, we import some modules that help to interact with the API (requests), load data (json, csv) and keep track of the progress (tqdm)

2. Fetching addresses from CSV columns

We build a function to get addresses from a CSV and store them in an array. The only inputs are path and column.

3. Initializing

In the next step, we use the addresses_from_csv function, set the API key and initialize a new array that will be used to save the transformed addresses.

4. Calling the API

We start to loop through the addresses array with every address as query in a single API call. We use the requests library to execute our API calls.

…and this is what the response (JSON) will look like:

5. Addressing the right elements in the JSON/Dict

After we explored our output file we can identify the elements we need and append them to our new transformed array in any format we like. We just have to be careful with edge cases like missing data or inconsistently tagged data. Sometimes city appears as locality and sometimes as postal_town.

6. Writing final result to CSV

In the last step we use our transformed array and write its content to a CSV.

AAAND DONE!

Now you have a CSV with structured addresses. You can view the full code on GitHub and feel free to fork, star or open a pull request.

You can find me on LinkedIn, Twitter and GitHub.

✉️ Subscribe to CodeBurst’s once-weekly Email Blast, 🐦 Follow CodeBurst on Twitter, view 🗺️ The 2018 Web Developer Roadmap, and 🕸️ Learn Full Stack Web Development.

--

--