If you want to know how to extract data from a website, you're in the right place! Web scraping is a powerful tool for collecting information from the internet. Python, with its rich ecosystem of libraries, makes this task easy for us.

Web scraping with Python tutorial

In this blog post, we'll cover:

  • List of tools we can use for web scraping with Python.
  • Simple web scraping for static websites.
  • Using Selenium for dynamic content or JavaScript-heavy sites.
  • MechanicalSoup to automate some tasks in the browser.

List of tools in Python for Web scraping

We have a lot of libraries in Python that we can use for scraping data from a website. Here are some of them:

| Category | Tool/Library | Description |
|---|---|---|
| HTTP Libraries | Requests | Simple HTTP library for Python, built for human beings. |
| HTTP Libraries | urllib | A module for fetching URLs included with Python. |
| HTTP Libraries | urllib3 | A powerful, user-friendly HTTP client for Python. |
| HTTP Libraries | httpx | A fully featured HTTP client for Python 3, which provides sync and async APIs, and support for both HTTP/1.1 and HTTP/2. |
| Parsing Libraries | Beautiful Soup | A library for pulling data out of HTML and XML files. |
| Parsing Libraries | lxml | Processes XML and HTML in Python, supporting XPath and XSLT. |
| Parsing Libraries | pyquery | A jQuery-like library for parsing HTML. |
| Web Drivers | Selenium | An automated web browser, useful for complex scraping tasks. |
| Web Drivers | Splinter | Open-source tool for testing web applications. |
| Automation Tools | Scrapy | An open-source web crawling and scraping framework. |
| Automation Tools | MechanicalSoup | A Python library for automating interaction with websites. |
| Data Processing | pandas | A fast, powerful, flexible and easy-to-use data analysis tool. |
| JavaScript Support | Pyppeteer (Python port of Puppeteer) | A tool for browser automation and web scraping. |

Feel free to suggest if you know any other tools out there!

Step by Step basic web scraping tutorial in Python

Here's a basic tutorial on web scraping in Python. For this example, we will use two popular libraries: requests for making HTTP requests and Beautiful Soup for parsing HTML.

Prerequisites:

  • Basic understanding of Python.
  • Python is installed on your machine.
  • PIP for installing Python packages.

Step 1: Install Necessary Libraries
First, you need to install the requests and BeautifulSoup libraries. You can do this using pip:

pip install requests beautifulsoup4

Step 2: Import Libraries
In your Python script or Jupyter Notebook, import the necessary modules:

import requests
from bs4 import BeautifulSoup

Step 3: Make an HTTP Request
Choose a website you want to scrape and send a GET request to it. For this example, let's scrape Google's homepage.

url = 'https://google.com'
response = requests.get(url)

Step 4: Parse the HTML Content
Once you have the HTML content, you can use Beautiful Soup to parse it:

soup = BeautifulSoup(response.text, 'html.parser')

Step 5: Extract Data
Now, you can extract data from the HTML. Let's say you want to extract all the headings:

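# Headings live in <h1> to <h6> tags, not <div>; a list matches several levels at once
headings = soup.find_all(['h1', 'h2', 'h3'])
for heading in headings:
    print(heading.get_text(strip=True))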
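To experiment with extraction without hitting a live site, here is a minimal sketch that parses an inline HTML string instead of response.text. The SAMPLE_HTML markup, the "post" class, and the link paths are made up for illustration; only find_all, select, and get_text come from Beautiful Soup itself.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.text
SAMPLE_HTML = """
<html><body>
  <h1>Main title</h1>
  <div class="post"><h2>First post</h2><a href="/posts/1">Read more</a></div>
  <div class="post"><h2>Second post</h2><a href="/posts/2">Read more</a></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all accepts a list of tag names, so one call covers several heading levels
headings = [tag.get_text(strip=True) for tag in soup.find_all(["h1", "h2"])]

# select takes a CSS selector; here it matches links inside elements with class "post"
links = [a["href"] for a in soup.select("div.post a")]

print(headings)
print(links)
```

find_all is handy when you know the tag name, while select is usually shorter once you need to combine tags, classes, and nesting.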

Step 6: Handle Errors
Always make sure to handle errors like bad requests or connection problems:

if response.status_code == 200:
    # Proceed with scraping
    # ...
else:
    print("Failed to retrieve the web page")
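Checking status_code only covers responses that actually arrive. Connection failures and timeouts raise exceptions instead. As a rough sketch, a small helper can cover both cases. The name fetch_html and the 10-second default timeout are assumptions chosen for this example, not part of requests.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.RequestException as error:
        # RequestException is the base class for connection errors,
        # timeouts, invalid URLs, and HTTP errors
        print(f"Failed to retrieve {url}: {error}")
        return None
    return response.text
```

You could then call html = fetch_html('https://google.com') and only build the BeautifulSoup object when html is not None.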

Notes

We need two primary tools to perform web scraping in Python: HTTP Client and HTML Parser.

  • An HTTP client to fetch web pages.
    e.g. requests, urllib, pycurl or httpx
  • An HTML parser to extract data from the fetched pages.
    e.g. Beautiful Soup, lxml, or pyquery
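
To make the two roles concrete, here is a sketch that uses only the Python standard library: urllib plays the HTTP client and html.parser plays the HTML parser. The names LinkCollector, fetch, and extract_links are hypothetical helpers written for this example. In practice, requests and Beautiful Soup do the same jobs with less code.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """HTML parser role: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url, timeout=10):
    """HTTP client role: download a page and decode it as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Run the parser over an HTML string and return the collected links."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

Keeping fetching and parsing in separate functions means you can swap either tool later, or test the parsing step on saved HTML without any network access.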

Here is a concrete example of how to use these tools in a real-world use case:

How to scrape Google search results with Python
Learn how to quickly and effortlessly scrape Google search results using the SerpApi Python library. Bonus: export the data to a CSV file or a Database.

I also cover more detail about using Beautiful Soup in this post:

Beautiful Soup: Web Scraping with Python
Learn how to extract information from a web page using Beautiful Soup and Python, which makes it easy for your web scraping task!

Step by Step scraping dynamic content in Python

What if the content you want to scrape is not loaded initially? Sometimes, the data hides behind a user interaction. To scrape dynamic content in Python, which often involves interacting with JavaScript, you'll typically use Selenium.

Unlike the requests and BeautifulSoup combination, which works well for static content, Selenium can handle dynamic websites by automating a web browser.

Prerequisites:

  • Basic knowledge of Python and web scraping (as covered in the previous section).
  • Python is installed on your machine.
  • Selenium package and a WebDriver installed.

Step 1: Install Selenium
First, install Selenium using pip:

pip install selenium

Step 2: Download WebDriver
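You'll need a WebDriver for the browser you want to automate (e.g., Chrome or Firefox). For Chrome, that is ChromeDriver. Recent Selenium releases (4.6 and later) include Selenium Manager, which tries to download a matching driver automatically, so a manual download is often unnecessary. If you do install the driver yourself, make sure its version matches your browser version, and place it in a known directory or add it to the system path.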

Step 3: Import Selenium and Initialize WebDriver
Import Selenium and initialize the WebDriver in your script.

from selenium import webdriver

driver = webdriver.Chrome()

Step 4: Fetch Dynamic Content
Open a website and fetch its dynamic content. Let's use https://google.com as an example.

url = 'https://google.com'
driver.get(url)

Step 5: Print title
Once the page has loaded, you can read information from it. For example, to print the page title:

print(driver.title)

Try to run this script. You'll see a new browser pop up and open the page.

Launch a browser with Selenium

Step 6: Interact with the Page (if necessary)
If you need to interact with the page (like clicking buttons or filling forms), you can do so:

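from selenium.webdriver.common.by import By

# Locate the text field and type into it before submitting
text_box = driver.find_element(by=By.NAME, value="my-text")
text_box.send_keys("Selenium")
submit_button = driver.find_element(by=By.CSS_SELECTOR, value="button")
submit_button.click()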

Step 7: Scrape Content
Now, you can scrape the content. For example, to get all paragraphs:

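from selenium.webdriver.common.by import By

# find_elements_by_tag_name was removed in Selenium 4; use find_elements with By
paragraphs = driver.find_elements(By.TAG_NAME, 'p')
for paragraph in paragraphs:
    print(paragraph.text)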

Step 8: Close the Browser
Once done, don't forget to close the browser:

driver.quit()

Additional Tips:

  • Selenium can perform almost all actions that you can do manually in a browser.
  • For complex web pages, consider using explicit waits to wait for elements to load.
  • Remember to handle exceptions and errors.

Here is a blog post that covers more about Selenium

Selenium web scraping: Scrape dynamic site in Python
Learn how to scrape data from dynamic websites using Selenium in Python. Simulate browsing a website like a real person programmatically!

Here is a video tutorial on using Selenium for automation in Python by NeuralNine on YouTube.

A basic example of web scraping using MechanicalSoup

MechanicalSoup is a Python library for web scraping that combines the simplicity of Requests with the convenience of BeautifulSoup. It's particularly useful for interacting with web forms, like login pages. Here's a basic example to illustrate how you can use MechanicalSoup for web scraping:

Please note that MechanicalSoup doesn't handle JavaScript-loaded content. That's a task for Selenium 😉

Prerequisites:

  • Python is installed on your machine.
  • Basic understanding of Python and HTML.

Step 1: Install MechanicalSoup
You can install MechanicalSoup via pip:

pip install mechanicalsoup

Step 2: Import MechanicalSoup
In your Python script, import MechanicalSoup:

import mechanicalsoup

Step 3: Create a Browser Object
MechanicalSoup provides a StatefulBrowser class, which you'll use to interact with web pages:

browser = mechanicalsoup.StatefulBrowser()

Step 4: Make a Request
Let's say you want to scrape data from a simple example page. You can use the browser object to open the URL:

url = 'https://google.com'
# open() fetches the page and stores it as the browser's current state
response = browser.open(url)
print(response)

Step 5: Parse the HTML Content
After a page has been opened, browser.page holds its parsed HTML as a BeautifulSoup object:

page = browser.page
print(page)

Step 6: Extract Data
Now, you can extract data using BeautifulSoup methods. For example, to get all paragraphs:

page = browser.page
p_tags = page.find_all('p')
print(p_tags)

Step 7: Handling Forms (Optional)
If you need to interact with forms, you can do so easily.

Given this HTML content on a page:

<form action="/pages/forms/" class="form form-inline" method="GET">
<label for="q">Search for Teams:  </label>
<input class="form-control" id="q" name="q" placeholder="Search for Teams" type="text"/>
<input class="btn btn-primary" type="submit" value="Search"/>
</form>

To submit a search query on a form:

# Select the form
browser.select_form('form')

# Fill the form with your query
browser['q'] = 'red'

# Submit the form
response = browser.submit_selected()

# Print the URL (assuming the form is correctly submitted and a new page is loaded)
print("Form submitted to:", response.url)

What if you have multiple forms on the page?

select_form, like several other MechanicalSoup methods, accepts a CSS selector. So you can target one specific form by its id, class, or any other attribute.

When to use MechanicalSoup (From their documentation)
MechanicalSoup is designed to simulate the behavior of a human using a web browser. Possible use-cases include:
- Interacting with a website that doesn’t provide a webservice API, out of a browser.
- Testing a website you’re developing

Why use Python for web scraping?

Python is a popular choice for web scraping for several reasons. Here are the top three:

  1. Seamless Integration with Data Science Tools: After scraping data from the web, you often need to clean, analyze, and visualize this data, which is where Python's data science capabilities come in handy. Tools like Pandas, NumPy, and Matplotlib integrate seamlessly with web scraping libraries, allowing for an efficient end-to-end process.
  2. Rich Ecosystem of Libraries: Python has a vast selection of libraries specifically designed for web scraping, such as Beautiful Soup, Scrapy, Selenium, Requests, and MechanicalSoup. These libraries simplify the process of extracting data from websites, parsing HTML and XML, handling HTTP requests, and even interacting with JavaScript-heavy sites. This rich ecosystem means that Python offers a tool for almost every web scraping need, from simple static pages to complex, dynamic web applications.
  3. Ease of Learning and Use: Python is known for its simplicity and readability, making it an excellent choice for beginners and experienced programmers alike. Its straightforward syntax allows developers to write less code compared to many other programming languages, making the process of writing and understanding web scraping scripts easier and faster. This ease of use is particularly beneficial in web scraping, where scripts can often become complex and difficult to manage.

That's it! I hope you enjoy this tutorial!