Can Python be used for Facebook chat backup? Of course, it can!

Simple Walkthrough on Web Scraping and Scrapy Framework

Vijay athithya
Towards Data Science


Photo by Glen Carrie on Unsplash

Hello world!! Most of us have been in a place where we had to scroll through our Facebook conversations to recall things, relive moments, or have a good laugh at our old chats. If you do that often and want a backup, this post is for you. By the end, you'll have your conversations backed up in a structured way, in CSV or JSON format.

Disclaimer

This post is not meant for people like ‘Bob’…. Sorry Bob, if this offends you in any way!!

WHY?? HOW??

Before going further, I wish to answer a few questions that might hit you while reading this.

  1. Why do this? Because we can. 😎
  2. Why not just click “load more” / scroll up for hours to reach the conversation we want to read? Because it’s super boring to scroll through conversations every time.
  3. Why can’t we use Facebook’s default data backup feature? Because we can’t get all of the content; there is a cap on it.
  4. Why Python? It is a widely used language for web scraping and holds a lot of useful libraries like Scrapy and Requests. # or it’s the one I’m familiar with!! :-)
  5. Why Scrapy? It is a fairly comprehensive, easy-to-use scraping framework, and it has an active community to help out when you’re stuck!

Installation/Setup

Python installation: on Windows, the setup file can be found here; on Ubuntu, run the command

sudo apt-get install python3

Scrapy installation: on both Windows and Ubuntu, run the command below

pip install Scrapy

How things are done!

If you are not into how scraping is done, you can skip to the “How to use” part.

Before diving into the coding part: there were technical challenges, like getting access to Facebook data and dealing with JavaScript and AJAX calls. Trying to re-create the same AJAX calls through the Requests library was quite a headache, and I failed miserably.

Then I tried to scrape the data using Selenium. If you are not familiar with it, Selenium can be used to automate your activities in the browser, letting you control it as if a human were using it (clicking the search box, entering a keyword, and clicking the search button).

What my Selenium script did was: go to Facebook.com -> log in -> go to the chat list and select a conversation -> start scraping while scrolling up. Facebook denied access to more data from their server within a few pages!! Yeah, they are good at detecting bots.

After hours of googling, I found suggestions to try the ‘mobile-optimized website’, which is actually the old-school mbasic.facebook.com that doesn’t use any AJAX. Finally!! I was able to fetch data without interruption using Scrapy.

Getting started

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run

scrapy startproject facebook

This will create a facebook directory with the following contents:

facebook/
    scrapy.cfg            # deploy configuration file
    facebook/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

Don’t worry if you don’t get what’s going on here; we won’t be touching most of these files. The only changes we make are writing a spider to scrape the content and setting ROBOTSTXT_OBEY = False in the settings.py file, since Facebook’s robots.txt won’t allow bots to log in. You can learn more about robots.txt here.
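That change is a one-liner. A minimal sketch of the relevant line in settings.py:

# facebook/settings.py -- tell Scrapy to ignore Facebook's robots.txt,
# otherwise the crawl stops before the login page is even fetched
ROBOTSTXT_OBEY = False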

Let’s build our spider

Create a Python file under the spiders directory, then import scrapy, pandas, and FormRequest, which we’ll use to feed the credentials for logging in.

Here fb_text is the name of our spider. We can write any number of spiders under the spiders directory, each serving a different purpose; in our case, we could write another for scraping posts and comments, etc. Each spider should have a unique name.

Then we pass the credentials for logging in through the terminal where we’ll be running our spider:

scrapy crawl fb_text -a email="FB USER EMAIL" -a password="FB USER PASSWORD"

Once we have the credentials, we feed them to FormRequest; it fills in the login form (user email and password) on the start URL ‘https://mbasic.facebook.com' and returns the home page.
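The original gist isn’t embedded in this text, so here is a minimal sketch of that login step. The class name, form XPath, and field names (email, pass) are my assumptions about mbasic’s markup, not necessarily the exact code from the repo:

import scrapy
from scrapy.http import FormRequest
import pandas as pd

class FacebookTextSpider(scrapy.Spider):
    name = 'fb_text'
    start_urls = ['https://mbasic.facebook.com']

    def __init__(self, email='', password='', **kwargs):
        super().__init__(**kwargs)
        self.email = email        # passed with -a email=...
        self.password = password  # passed with -a password=...
        self.rows = []            # scraped messages accumulate here

    def parse(self, response):
        # fill the login form on mbasic with the supplied credentials
        return FormRequest.from_response(
            response,
            formxpath='//form[contains(@action, "login")]',
            formdata={'email': self.email, 'pass': self.password},
            callback=self.parse_home,  # continued below
        )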

Let’s add some SuperPowers to our spider

Now that we’re done defining the structure, it’s time to give our spider some superpowers. One is that it should be able to crawl through pages to fetch content; the other is to actually scrape that content/data.

The Request function sends the response to the callback function. In our case, we reach the messages page, fetch the people we had conversations with along with their links, and from that list pick one to scrape.
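Continuing the sketch, these callbacks hop from the home page to the messages page and list the recent conversations. The XPaths are illustrative assumptions about mbasic’s layout, and the input() prompt is a simplification of how the conversation gets picked:

    def parse_home(self, response):
        # follow the messages link from the logged-in home page
        return scrapy.Request(response.urljoin('/messages'),
                              callback=self.parse_messages)

    def parse_messages(self, response):
        # fetch the people we had conversations with and their links
        names = response.xpath('//table//h3/a/text()').getall()
        links = response.xpath('//table//h3/a/@href').getall()
        for i, name in enumerate(names):
            self.logger.info('%d: %s', i, name)
        # pick one conversation from the list and follow its link
        choice = int(input('Number of the conversation to back up: '))
        yield scrapy.Request(response.urljoin(links[choice]),
                             callback=self.parse_conversation)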

The core part of the spider fetches the conversation between the two parties along with timestamps and writes it out to a CSV file. The full spider file can be found here.
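A hedged sketch of what that core callback could look like, keeping the page-by-page crawl and the final pandas dump. The message selectors and the “See Older Messages” link text are assumptions, so check them against the full spider in the repo:

    def parse_conversation(self, response):
        # pull name, text, and timestamp out of each message block
        for msg in response.xpath('//div[@class="msg"]'):
            self.rows.append({
                'Name': msg.xpath('.//strong/text()').get(),
                'Text': ' '.join(msg.xpath('.//span//text()').getall()),
                'Date': msg.xpath('.//abbr/text()').get(),
            })
        # keep following the link to older messages until it disappears
        older = response.xpath(
            '//a[contains(text(), "See Older Messages")]/@href').get()
        if older:
            yield scrapy.Request(response.urljoin(older),
                                 callback=self.parse_conversation)
        else:
            # reached the very first message -- write the backup
            pd.DataFrame(self.rows).to_csv('conversation.csv', index=False)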

For simplicity and easier understanding, items.py is not used for storing data.

How to use

Make sure to clone this repository if you skipped the previous part. Navigate to the project’s top-level directory and launch Scrapy with:

scrapy crawl fb_text -a email="EMAILTOLOGIN" -a password="PASSWORDTOLOGIN"

This will list the 10 most recent conversations; from those, select the conversation to be scraped. The bot/spider will scrape the conversation all the way back to the very first message and return a CSV file with the columns Name, Text, Date. Check out the sample below.
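The original sample image isn’t reproduced here; a purely hypothetical illustration of the output format:

Name,Text,Date
Alice,Happy birthday!!,"Jan 1, 2019 12:01 AM"
Bob,Thanks a lot :),"Jan 1, 2019 12:05 AM"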

Road ends here.

Github repo

In the Pipeline

Data is the source for solving any ML/AI problem, but we don’t always end up with well-structured data. That’s where web scraping comes in handy: we can scrape/fetch data straight from websites. Tutorials on web scraping from the basics will be posted in the future, so make sure to follow and support.

We end this here; hope I’ve given you a decent introduction to web scraping with Scrapy. Check out my other work here.

Lol, if you think so, we are on the same page. Let’s connect on Medium, LinkedIn, Facebook.
