BigQuery Public Datasets

Yufeng G
Towards Data Science
3 min read · Jun 12, 2018

Kauai, HI

The only thing better than data is big data! But getting your hands on large datasets is no easy feat. From unwieldy storage options to difficulty getting analytics tools to run over the dataset properly, large datasets can lead to all sorts of struggles when it comes to actually doing something useful with them. What’s a data scientist to do?

On this episode of AI Adventures, we’re going to check out the BigQuery Public Datasets and explore the amazing world of open data!

BigQuery public datasets

We all love data, and the more the merrier! But as file sizes grow and complexity increases, it becomes challenging to make practical use of that data.
BigQuery Public Datasets are datasets that Google BigQuery hosts for you and that you can access and integrate into your applications.

This means Google pays for the storage of these datasets and provides public access to the data via your cloud project. You pay only for the queries that you perform on the data. Moreover, there’s a 1 TB per month free tier, which makes getting started super easy.
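To get a feel for how far that free tier stretches, here is a minimal Python sketch. It assumes the google-cloud-bigquery client library is installed and your cloud project is authenticated, and it reads "1 TB" as 10^12 bytes, which is an assumption on my part. The helper names are made up for illustration. A dry run asks BigQuery how many bytes a query would scan without actually running it.

```python
# Assumption: the "1 TB" free tier is treated as 10**12 bytes.
FREE_TIER_BYTES = 10**12


def free_tier_fraction(bytes_processed, free_tier_bytes=FREE_TIER_BYTES):
    """Return the share of the monthly free tier that a query would use."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / free_tier_bytes


def estimate_bytes(sql):
    """Dry-run a query and return the number of bytes it would scan.

    Hypothetical helper: needs google-cloud-bigquery and valid credentials.
    """
    from google.cloud import bigquery  # imported lazily so the math above works offline

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


# Example usage (requires credentials, so it is left commented out):
# scanned = estimate_bytes("SELECT ... FROM `some-project.some_dataset.some_table`")
# print(f"This query would use {free_tier_fraction(scanned):.2%} of the free tier")
```

A query that scans 250 GB, for example, would use a quarter of the monthly allowance under this assumption.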

So…how do I access all this data?

Looking at the BigQuery public datasets page, we can see there are nearly 40 public datasets. Each dataset in turn has many tables. Thousands of queries from hundreds of projects from all over the world are making use of these vast public datasets.

What’s really neat is that each of these datasets comes with a bit of explanatory text that helps you get started with querying the data and understanding its structure.

Tree Census, Open Images & Trolley Buses

For example, here’s the New York City Tree Census. The page shows us how we can easily find answers to questions like “What are the most common tree species in New York City?” and “How have tree species changed since 1995 in New York City?” Each of these queries is literally one click away from the docs page, which opens right into the BigQuery interface!
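If you would rather ask the "most common tree species" question from code, a rough Python sketch might look like the following. The table path `bigquery-public-data.new_york.tree_census_2015` and the `spc_common` column are my assumptions about the dataset's layout, so check them against the dataset page before relying on them. The function names are hypothetical too.

```python
# Assumption: table path and column name may differ from the live dataset.
TREE_TABLE = "bigquery-public-data.new_york.tree_census_2015"


def most_common_species_sql(limit=10, table=TREE_TABLE):
    """Build a query that counts trees per common species name."""
    limit = int(limit)
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return (
        "SELECT spc_common AS species, COUNT(*) AS trees\n"
        f"FROM `{table}`\n"
        "WHERE spc_common IS NOT NULL AND spc_common != ''\n"
        "GROUP BY species\n"
        "ORDER BY trees DESC\n"
        f"LIMIT {limit}"
    )


def most_common_species(limit=10):
    """Run the query and return a list of (species, trees) tuples.

    Needs google-cloud-bigquery and valid credentials.
    """
    from google.cloud import bigquery  # lazy import keeps the SQL builder usable offline

    client = bigquery.Client()
    rows = client.query(most_common_species_sql(limit)).result()
    return [(row["species"], row["trees"]) for row in rows]
```

Keeping the SQL builder separate from the client call makes it easy to paste the same query into the BigQuery interface.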

Another dataset that is quite amazing is the Open Images Dataset. It contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories!

You can find answers to your most pressing questions about images on the web, like “How many images of a trolleybus are in the dataset?” (Spoiler alert: It’s over 3000!)
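Here is one way the trolleybus count could be sketched in Python, using a query parameter so you can swap in any label you like. The table names (`open_images.labels`, `open_images.dict`) and column names are assumptions about the dataset's schema, and `count_images_with_label` and `bigquery_runner` are hypothetical helpers, not part of any library.

```python
# Assumption: table and column names may not match the live Open Images schema.
LABEL_COUNT_SQL = """
SELECT COUNT(DISTINCT l.image_id) AS images
FROM `bigquery-public-data.open_images.labels` AS l
JOIN `bigquery-public-data.open_images.dict` AS d
  ON l.label_name = d.label_name
WHERE LOWER(d.label_display_name) = LOWER(@label)
"""


def count_images_with_label(label, run_query):
    """Count images carrying `label`, using any callable that runs SQL.

    `run_query(sql, params)` must return an iterable of rows that
    support lookup by column name.
    """
    rows = list(run_query(LABEL_COUNT_SQL, {"label": label}))
    return rows[0]["images"]


def bigquery_runner(sql, params):
    """Run a parameterized query on BigQuery (needs credentials)."""
    from google.cloud import bigquery  # lazy import: only needed for live queries

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(name, "STRING", value)
            for name, value in params.items()
        ]
    )
    return client.query(sql, job_config=config).result()


# Example usage (requires credentials, so it is left commented out):
# print(count_images_with_label("Trolleybus", bigquery_runner))
```

Passing the runner in as an argument means you can try the function with a stand-in before spending any of your query quota.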

But I digress. BigQuery Public Datasets are a great way to explore public data and practice your data analysis skills. Combined with tools like Cloud Datalab, Facets, and TensorFlow, you could do some really awesome data science. So what are you waiting for? Head on over to the public datasets page and let your analysis run wild!

For more details and examples, check out BigQuery’s public datasets documentation page and start querying away!

Thanks for reading this episode of Cloud AI Adventures. If you’re enjoying the series, please let me know by clapping for the article. If you want more machine learning action, be sure to follow me on Medium or subscribe to the YouTube channel to catch future episodes as they come out. More episodes coming at you soon!

Applying machine learning to the world. Developer and Advocate for @googlecloud. Runner, chef, musician. Opinions are solely my own.