Chatterbox Part 12 – Scraping a page’s model with JavaScript or extensions

This is the twelfth part of the Chatterbox series. For your convenience, you can find the other parts in the table of contents in Part 1 – Origins.

Scraping the rendered HTML is hard, so we should avoid it as much as possible. Some pages let us get our hands on the logical model behind the UI with a few tricks. Let’s go through a couple of examples. Keep in mind I’m writing this in early 2022, so if you read it later, the actual class names or handles may no longer be valid.

WhatsApp Web

WhatsApp uses React under the hood, so first we need a helper method for getting React’s props:
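A minimal sketch of such a helper. It relies on React internals rather than a public API: React attaches its data to DOM nodes under keys with a random suffix, and the key prefixes below (`__reactProps$`, `__reactFiber$` and their older counterparts) are the ones I know from React 16 and 17. Which React version the page runs is an assumption, and `getReactProps` is my own name.

```javascript
// Returns the React props attached to a DOM element, or null if none are found.
// These keys are undocumented React internals and may change between versions.
function getReactProps(element) {
  const keys = Object.keys(element);

  // React 17+ stores props under __reactProps$<random>,
  // React 16 under __reactEventHandlers$<random>.
  const propsKey = keys.find(
    key => key.startsWith('__reactProps$') || key.startsWith('__reactEventHandlers$')
  );
  if (propsKey) {
    return element[propsKey];
  }

  // Fall back to the fiber node and read its memoized props.
  const fiberKey = keys.find(
    key => key.startsWith('__reactFiber$') || key.startsWith('__reactInternalInstance$')
  );
  return fiberKey ? element[fiberKey].memoizedProps : null;
}
```

In the console you would pass it whatever element you selected in the Elements tab, for example `getReactProps($0)`.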

We can use this method to get the model from the page easily. Open some chat window and then do this:

And there you go. Now, the method to parse messages (to give you an idea of what it looks like):

You can also extract everything from IndexedDB, as shown in Whatsapp Backup.

You can also get it from a memory dump, obviously.

Skype

You can get messages from IndexedDB using web.skype.com. However, you can also go to outlook.live.com and open the Skype pane, which is implemented with Knockout. To get the model, just run:
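A minimal sketch of how this can look, assuming the page exposes Knockout as the global `ko`. `ko.dataFor` and `ko.toJS` are regular Knockout functions; `getKnockoutModel` is my own name, and which element to pass is something you need to find in devtools.

```javascript
// Returns a plain-object copy of the Knockout view model bound to an element,
// or null when the element has no binding context.
function getKnockoutModel(ko, element) {
  // ko.dataFor returns the view model ($data) the element is bound to.
  const viewModel = ko.dataFor(element);
  if (viewModel === undefined || viewModel === null) {
    return null;
  }
  // ko.toJS unwraps observables so we get plain values instead of functions.
  return ko.toJS(viewModel);
}
```

In the console, select an element inside the Skype pane and run `getKnockoutModel(ko, $0)`.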

Teams

Similar trick with IndexedDB:
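A minimal sketch of reading a whole object store with the standard IndexedDB API. The database and store names are not given here, so you need to look them up in devtools under Application → IndexedDB. `dumpObjectStore` is my own name, and the factory is passed in as a parameter only to keep the function easy to test.

```javascript
// Resolves with all records of one object store.
// idbFactory is the global `indexedDB` when run in the browser console.
function dumpObjectStore(idbFactory, dbName, storeName) {
  return new Promise((resolve, reject) => {
    // Note: opening a name that does not exist creates an empty database.
    const openRequest = idbFactory.open(dbName);
    openRequest.onerror = () => reject(openRequest.error);
    openRequest.onsuccess = () => {
      const db = openRequest.result;
      try {
        const getAll = db
          .transaction(storeName, 'readonly')
          .objectStore(storeName)
          .getAll();
        getAll.onerror = () => { db.close(); reject(getAll.error); };
        getAll.onsuccess = () => { db.close(); resolve(getAll.result); };
      } catch (error) {
        // transaction() throws synchronously when the store does not exist.
        db.close();
        reject(error);
      }
    };
  });
}
```

Usage from the console: `dumpObjectStore(indexedDB, '<database name>', '<store name>').then(console.log)`.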

Messenger

This one we can scrape with a Chrome extension that captures the network traffic. Before moving on, a couple of ground rules.

First, we’re going to implement an extension for the devtools pane. This means that you need to open devtools before browsing to the Messenger page. You can do that automatically with Puppeteer.

Start with the manifest:

Now, devtools.html:

And now the actual devtools code:
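A minimal sketch of what the devtools script can look like, assuming the manifest points `devtools_page` at devtools.html and that page loads this script. `chrome.devtools.network.onRequestFinished`, `request.getContent` and `chrome.devtools.inspectedWindow.eval` are regular Chrome extension APIs. The function names are mine, and `extractMessages` is a hypothetical placeholder: it treats the body as newline-delimited JSON, which may not match the real response format.

```javascript
// Hypothetical extractor: keeps every line of the body that parses as JSON.
function extractMessages(body) {
  return String(body || '')
    .split('\n')
    .map(line => line.trim())
    .filter(Boolean)
    .flatMap(line => {
      try { return [JSON.parse(line)]; } catch (error) { return []; }
    });
}

// Builds the JS expression that stores the messages on the inspected page.
function buildStoreExpression(messages) {
  // Inner stringify: the attribute value. Outer stringify: a safe JS string literal.
  const literal = JSON.stringify(JSON.stringify(messages));
  return `document.body.setAttribute('current-messages', ${literal})`;
}

// Listens for finished requests and pushes matching responses to the page.
function attachCapture(chromeApi, urlFragment = 'graphqlbatch') {
  chromeApi.devtools.network.onRequestFinished.addListener(request => {
    if (!request.request.url.includes(urlFragment)) return;
    request.getContent((body, encoding) => {
      if (encoding === 'base64') return; // binary response, nothing to parse
      const messages = extractMessages(body);
      if (messages.length === 0) return;
      chromeApi.devtools.inspectedWindow.eval(buildStoreExpression(messages));
    });
  });
}

// Only runs inside the extension, where the devtools API is available.
if (typeof chrome !== 'undefined' && chrome.devtools) {
  attachCapture(chrome);
}
```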

This code captures requests to graphqlbatch and then extracts messages from the response. It sends them to the Facebook tab and stores them in the current-messages attribute of the body tag. Now, you need to grab those messages with JS:
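A minimal sketch of the reading side, assuming the attribute holds JSON text. `readCapturedMessages` is my own name.

```javascript
// Reads the messages stored in the current-messages attribute of the body tag.
// Returns an empty array when the attribute is missing or is not valid JSON.
function readCapturedMessages(doc) {
  const raw = doc.body.getAttribute('current-messages');
  if (!raw) {
    return [];
  }
  try {
    const parsed = JSON.parse(raw);
    return Array.isArray(parsed) ? parsed : [parsed];
  } catch (error) {
    return [];
  }
}
```

On the page itself you would call `readCapturedMessages(document)`, for example from Puppeteer’s `page.evaluate`.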

And that’s it.

Twitter

Just like in Messenger, but we’re looking for requests with inbox_initial_state in the response body. Then we can parse the response.

Discord

Same idea: capture requests to /messages.

Slack

Same again: capture requests to conversations.history.
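The filters for Twitter, Discord and Slack differ only in what they look for, so they can be sketched as one small lookup. The markers come straight from the sections above; `matchers` and `shouldCapture` are my own names, and a real filter will probably need to be stricter than a substring check.

```javascript
// One predicate per service, deciding whether a finished request is worth parsing.
const matchers = {
  // Twitter: the marker is in the response body, not in the URL.
  twitter: (url, body) => body.includes('inbox_initial_state'),
  // Discord: requests to /messages.
  discord: url => url.includes('/messages'),
  // Slack: requests to conversations.history.
  slack: url => url.includes('conversations.history'),
};

function shouldCapture(service, url, body = '') {
  const matcher = matchers[service];
  if (!matcher) {
    throw new Error(`Unknown service: ${service}`);
  }
  return matcher(url, body);
}
```

The result can gate the response handling in the devtools script, in the same place where the Messenger version checks for graphqlbatch.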

Summary

There are multiple ways of getting models from the page. We can read them from popular JS frameworks, extract them from IndexedDB, or parse network calls. Next time we’ll see how to trace them on the fly.