How to use the Flickr API to collect data for deep learning experiments

Gunnvant Saini
Towards Data Science
3 min read · Jun 27, 2018


Once you embark on the journey into deep learning and get past basic toy examples such as classifying hand-written digits on the MNIST data set, the next logical step is to start working on your own custom projects. The first road block people stumble upon is that data for the custom project they want to do may not be available. If you know your way around web scraping you can easily collect the images you want, but scraping websites has two pitfalls:

1. It is not legal or ethical to scrape some websites.

2. Even if you were not concerned about the law or ethics, scraping at scale can be challenging, particularly if safeguards are in place to discourage it.

There are alternatives to web scraping. Below is a list of resources from which you can get image data for your own projects:

1. Yelp has made a vast collection of data publicly available, including a data set of 200,000 images; you can check it out here: https://www.yelp.com/dataset

2. There is also Amazon reviews data hosted here: http://jmcauley.ucsd.edu/data/amazon/ which also includes pre-processed image features.

3. Then there are data sets such as MS COCO, ImageNet and CIFAR-10 that are very commonly used in academia.

If these data sets don't serve your purpose either, you can use Flickr as your data collection source. Flickr has an API that can be used for image collection: you can search for images by tag. For example, you can search for images of everyday items, places and buildings such as lamp-posts, pizzas, churches, etc. I was able to use this API to easily fetch around 600+ images for a given tag. I created two scripts: one fetches the URLs for a given tag and stores them in a CSV file; the other reads the links from that file, fetches the images and stores them on disk in a folder named after the tag being searched for.

Below is the code for the script that fetches the URLs by tag, flickrGetUrl.py (you can register to get an API key and app secret token here: https://www.flickr.com/services/apps/create/):
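The original script isn't reproduced here, but a minimal sketch of what flickrGetUrl.py could look like is below. It uses only the standard library and Flickr's `flickr.photos.search` REST method; the `API_KEY` placeholder, the helper names, and the `live.staticflickr.com` URL pattern are my assumptions, not the author's exact code.

```python
# flickrGetUrl.py -- sketch: fetch image URLs for a tag and write them to <tag>.csv
import csv
import json
import sys
from urllib.parse import urlencode
from urllib.request import urlopen

API_KEY = "YOUR_API_KEY"  # placeholder: register at https://www.flickr.com/services/apps/create/
REST = "https://api.flickr.com/services/rest/"

def photo_url(photo):
    """Build a static-image URL from one flickr.photos.search result entry."""
    return "https://live.staticflickr.com/{server}/{id}_{secret}.jpg".format(**photo)

def fetch_urls(tag, n):
    """Yield up to n image URLs for the given tag, paging through search results."""
    page = 0
    seen = 0
    while seen < n:
        page += 1
        query = urlencode({
            "method": "flickr.photos.search",
            "api_key": API_KEY,
            "tags": tag,
            "per_page": 100,
            "page": page,
            "format": "json",
            "nojsoncallback": 1,
        })
        with urlopen(REST + "?" + query) as resp:
            photos = json.load(resp)["photos"]["photo"]
        if not photos:
            break  # no more results for this tag
        for p in photos:
            yield photo_url(p)
            seen += 1
            if seen >= n:
                break

if __name__ == "__main__":
    tag, n = sys.argv[1], int(sys.argv[2])
    with open(tag + ".csv", "w", newline="") as f:
        writer = csv.writer(f)
        for url in fetch_urls(tag, n):
            writer.writerow([url])
```

Each search result comes back with `id`, `server` and `secret` fields, which is all that's needed to assemble a direct image URL.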

To use the script above, run the following in the terminal:

python flickrGetUrl.py pizza 500

This will create a new file in the working directory containing links to pizza images; it will be a .csv file named pizza.csv. The first argument is the tag whose images you want, and the second is the number of URL fetches to attempt.

The second file, get_images.py, fetches the images from the URLs and stores them in a folder. Below is the code for it:
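Again, the original code isn't shown here; a minimal sketch of what get_images.py could look like follows, assuming one URL per row in the CSV file and naming the output folder after the file (so pizza.csv fills a folder called pizza). The helper names are illustrative.

```python
# get_images.py -- sketch: download every URL listed in <tag>.csv into a folder named <tag>
import csv
import os
import sys
from urllib.request import urlretrieve

def folder_name(csv_path):
    """Derive the target folder from the CSV file name, e.g. 'pizza.csv' -> 'pizza'."""
    return os.path.splitext(os.path.basename(csv_path))[0]

def download_all(csv_path):
    """Read URLs from csv_path and save each image as <folder>/<row index>.jpg."""
    folder = folder_name(csv_path)
    os.makedirs(folder, exist_ok=True)
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if not row:
                continue
            try:
                urlretrieve(row[0], os.path.join(folder, "{}.jpg".format(i)))
            except (OSError, ValueError):
                pass  # skip URLs that fail to download

if __name__ == "__main__":
    download_all(sys.argv[1])
```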

To use the second file, type the following in the terminal:

python get_images.py pizza.csv

This will fetch images from the URLs stored in pizza.csv and store them in a folder called pizza.

These are some of the common ways to fetch image data for running deep learning experiments. Bear in mind that the scripts above can take some time to download images; you can speed them up by adding multiprocessing capability and running them on multicore machines.
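One way to add that parallelism is sketched below. Since downloading is I/O-bound, a thread pool (the `multiprocessing.dummy` drop-in for `multiprocessing.Pool`) is usually enough; the function names and the `(url, destination)` pair format are assumptions for illustration.

```python
# Sketch: parallel downloads with a thread pool (I/O-bound work, so threads suffice).
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API
from urllib.request import urlretrieve

def download_one(pair):
    """Fetch a single (url, destination) pair; return destination, or None on failure."""
    url, dest = pair
    try:
        urlretrieve(url, dest)
        return dest
    except (OSError, ValueError):
        return None  # bad or unreachable URL

def download_parallel(pairs, workers=8):
    """Download (url, destination) pairs concurrently; return the destinations that succeeded."""
    with Pool(workers) as pool:
        results = pool.map(download_one, pairs)
    return [r for r in results if r is not None]
```

For example, `download_parallel([(url, "pizza/0.jpg"), ...])` would fetch up to eight images at a time instead of one after another.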
