Easy web scraping with Scrapy

13 December 2022 (updated) | 17 min read

In the previous post about Web Scraping with Python we talked a bit about Scrapy. In this post we are going to dig a little bit deeper into it.

Scrapy is a wonderful open source Python web scraping framework. It handles the most common use cases when doing web scraping at scale:

  • Concurrency (requests are sent asynchronously, no manual threading needed)
  • Crawling (going from link to link)
  • Extracting the data
  • Validating
  • Saving to different formats / databases
  • Many more

The main difference between Scrapy and other commonly used libraries, such as Requests / BeautifulSoup, is that it is opinionated, meaning it comes with a set of rules and conventions, which allow you to solve the usual web scraping problems in an elegant way.

The downside of Scrapy is its steep learning curve: there is a lot to cover, but that is what we are here for :)

In this tutorial we will create two different web scrapers, a simple one that will extract data from an E-commerce product page, and a more "complex" one that will scrape an entire E-commerce catalog!


Basic overview

You can install Scrapy using pip. Be careful though, the Scrapy documentation strongly suggests installing it in a dedicated virtual environment to avoid conflicts with your system packages.

Hence, I'm using Virtualenv and Virtualenvwrapper:

mkvirtualenv scrapy_env

Now we can simply install Scrapy...

pip install Scrapy

...and bootstrap our Scrapy project with startproject:

scrapy startproject product_scraper

This will create all the necessary boilerplate files for the project.

├── product_scraper
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
└── scrapy.cfg

Here is a brief overview of these files and folders:

  • items.py is a model for the extracted data. You can define your own custom models (like a product) that inherit from Scrapy's Item class.
  • middlewares.py is used to hook into the request / response lifecycle. For example, you could create a middleware to rotate user-agents, or to use an API like ScrapingBee instead of doing the requests yourself (a quick sketch follows this list).
  • pipelines.py is used to process the extracted data, clean the HTML, validate the data, and export it to a custom format or save it to a database.
  • /spiders is a folder containing Spider classes. With Scrapy, Spiders are classes that define how a website should be scraped, including which links to follow and how to extract the data from those pages.
  • scrapy.cfg is the configuration file for the project's main settings.
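
As promised, here is a minimal sketch of what could go into middlewares.py: a downloader middleware that rotates user-agents. The class name and the user-agent strings are purely illustrative, not part of the generated boilerplate:

import random

class RotateUserAgentMiddleware:
    # A short example list; in practice you would maintain a longer, up-to-date one
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
        'Mozilla/5.0 (X11; Linux x86_64)',
    ]

    def process_request(self, request, spider):
        # Called for every outgoing request; returning None lets Scrapy keep processing it
        request.headers['User-Agent'] = random.choice(self.user_agents)

To activate such a middleware, you would also need to reference it in the DOWNLOADER_MIDDLEWARES setting in settings.py.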

Scraping a single product

For our example, we will try to scrape a single product page from the following dummy e-commerce site

https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/

Product screenshot on an E-commerce website

And we will want to extract the product name, the picture, the price, and the description.

Scrapy Shell

Scrapy comes with a built-in shell that helps you try and debug your scraping code in real time. You can quickly test your XPath expressions / CSS selectors with it. It's a very cool tool to write your web scrapers and I always use it!

You can configure Scrapy Shell to use a different console, such as IPython, instead of the default Python console. You will get autocompletion and other nice perks like colorized output. (IPython needs to be installed in your virtual environment: pip install ipython.)

In order to use it in your Scrapy Shell, you need to add this line to the [settings] section of your scrapy.cfg file:

shell = ipython
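
For reference, the relevant part of the file would then look roughly like this (the default entry is generated by startproject):

[settings]
default = product_scraper.settings
shell = ipython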

Once it's configured, you can start using Scrapy Shell:

$ scrapy shell --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x108147eb8>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x108d10978>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]:

We can start fetching a URL by simply running the following command:

fetch('https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/')

This will start by fetching the /robots.txt file.

[scrapy.core.engine] DEBUG: Crawled (404) <GET https://clever-lichterman-044f16.netlify.app/robots.txt> (referer: None)

In this case, there isn't any robots.txt, which is why we got a 404 HTTP code. If there were a robots.txt file, Scrapy would by default follow its rule set. You can disable this behavior by changing ROBOTSTXT_OBEY in product_scraper/settings.py:

ROBOTSTXT_OBEY = False

Running our fetch call again, you should now have a log like this:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/> (referer: None)

Scrapy will save the response straight into the response variable, which you can directly evaluate in Scrapy Shell.

For example, play with the following commands to get more insight into the response's status code and its headers.

>>> response.status
200

>>> response.headers
{b'Content-Length': [b'3952'], b'Age': [b'100'], b'Cache-Control': [b'public, max-age=0, must-revalidate'], b'Content-Type': [b'text/html; charset=UTF-8'], b'Etag': [b'"24912a02a23e2dcc54a7013e6cd7f48b-ssl-df"'], b'Server': [b'Netlify'], b'Strict-Transport-Security': [b'max-age=31536000; includeSubDomains; preload'], b'Vary': [b'Accept-Encoding']}

Additionally, Scrapy has saved the response body straight to your temporary system directory, from where you can view it directly in your browser with

view(response)

Note, this will probably not render ideally, as your browser will only load the HTML, without its external resource dependencies or taking CORS issues into account.

The Scrapy Shell is like a regular Python shell, so don't hesitate to run your favorite scripts/functions in it.
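
For example, you could define a small helper directly in the shell and try it out (a throwaway sketch, nothing Scrapy-specific):

import re

def clean_price(raw):
    # Keep only digits and the decimal point, then convert to a float
    return float(re.sub(r"[^0-9.]", "", raw))

clean_price('20.00$')  # 20.0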

Extracting Data

Now let's try some XPath expressions to extract the product title and price:

HTML from an E-commerce website

In this case, we'll be looking for the first <span> within the <div class="my-4">:

In [16]: response.xpath("//div[@class='my-4']/span/text()").get()
Out[16]: '20.00$'

I could also use a CSS selector:

In [21]: response.css('.my-4 > span::text').get()
Out[21]: '20.00$'
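
While we are in the shell, it's also worth checking the expressions we will need later for the title and the image URL (these selectors come from inspecting the same page and reappear in the spider below):

response.xpath('//section[1]//h2/text()').get()                    # product title
response.xpath("//div[@class='product-slider']//img/@src").get()   # product image URL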

Scrapy doesn't execute any JavaScript by default, so if the website you are trying to scrape is using a frontend framework like Angular / React.js, you could have trouble accessing the data you want.

Creating a Scrapy Spider

With Scrapy, Spiders are classes where you define your crawling (what links / URLs need to be scraped) and scraping (what to extract) behavior.

Here are the different steps used by a Spider to scrape a website:

  • It starts by using the URLs in the class' start_urls array as start URLs and passes them to start_requests() to initialize the request objects. You can override start_requests() to customize this step (e.g. change the HTTP method/verb and use POST instead of GET, or add authentication credentials)
  • It will then fetch the content of each URL and save the response in a Response object, which it will pass to parse()
  • The parse() method will then extract the data (in our case, the product price, image, description, title) and return either a dictionary, an Item or Request object, or an Iterable

You may wonder why the parse method can return so many different objects. It's for flexibility. Let's say you want to scrape an E-commerce website that doesn't have any sitemap. You could start by scraping the product categories, so this would be a first parse method.

This method would then yield a Request object for each product category, pointing to a second callback method. In that callback you would need to handle pagination, and finally, for each product page, a third parse method would do the actual scraping and generate an Item. A minimal sketch of this structure follows.
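
To make the flow more concrete, here is what such a multi-level spider could look like. The entry URL and the selectors are hypothetical placeholders, only the structure matters:

import scrapy

from product_scraper.items import Product

class CatalogSpider(scrapy.Spider):
    name = 'catalog_spider'
    start_urls = ['https://example.com/categories/']  # hypothetical entry point

    def parse(self, response):
        # First level: follow each product category
        for href in response.xpath('//a[@class="category"]/@href').getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Second level: follow each product, then handle pagination
        for href in response.xpath('//a[@class="product"]/@href').getall():
            yield response.follow(href, callback=self.parse_product)
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_category)

    def parse_product(self, response):
        # Third level: extract the data and yield an Item
        item = Product()
        item['product_url'] = response.url
        yield item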

With Scrapy you can return the scraped data as a simple Python dictionary, but it is a good idea to use the built-in Scrapy Item class. It's a simple container for our scraped data, and Scrapy will use this item's fields for many things, like exporting the data to different formats (JSON / CSV, ...), the item pipelines, etc.

So here is a basic product class:

import scrapy

class Product(scrapy.Item):
    product_url = scrapy.Field()
    price = scrapy.Field()
    title = scrapy.Field()
    img_url = scrapy.Field()

Now we can generate a spider, either with the command line helper:

scrapy genspider myspider mydomain.com

Or you can do it manually and put your Spider's code inside the /spiders directory.

Spider types

There are quite a number of pre-defined spider classes in Scrapy:

  • Spider, fetches the content of each URL, defined in start_urls, and passes its content to parse for data extraction
  • CrawlSpider, follows links defined by a set of rules
  • CSVFeedSpider, extracts tabular data from CSV URLs
  • SitemapSpider, extracts URLs defined in a sitemap
  • XMLFeedSpider, similar to the CSV spider, but handles XML URLs (e.g. RSS or Atom)

Let's start off with an example using the base Spider class.

Scrapy Spider Example

# -*- coding: utf-8 -*-
import scrapy

from product_scraper.items import Product

class EcomSpider(scrapy.Spider):
    name = 'ecom_spider'
    allowed_domains = ['clever-lichterman-044f16.netlify.app']
    start_urls = ['https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/']

    def parse(self, response):
        item = Product()
        item['product_url'] = response.url
        item['price'] = response.xpath("//div[@class='my-4']/span/text()").get()
        item['title'] = response.xpath('//section[1]//h2/text()').get()
        item['img_url'] = response.xpath("//div[@class='product-slider']//img/@src").get()
        return item

Here, we created our own EcomSpider class, based on scrapy.Spider, and added three fields:

  • name, which is our Spider's name (within the project, you can run it with scrapy crawl ecom_spider)
  • start_urls, defines an array of the URLs you'd like to scrape
  • allowed_domains, optional but important when you use a CrawlSpider instance that could follow links on different domains

Finally, we added our own parse() method, where we instantiate a Product object, populate it with the data (using XPath again, ain't XPath great? 🥳), and return our product object.

Let's run our code in the following way, to export the data as JSON (but you could also choose CSV)

scrapy runspider ecom_spider.py -o product.json

You should now have a nice JSON file:

[
  {
    "product_url": "https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/",
    "price": "20.00$",
    "title": "Taba Cream",
    "img_url": "https://clever-lichterman-044f16.netlify.app/images/products/product-2.png"
  }
]

Item loaders

There are two common problems that you can face while extracting data from the Web:

  • For the same website, the page layout and underlying HTML can be different. If you scrape an E-commerce website, you will often have a regular price and a discounted price, with different XPath / CSS selectors.
  • The data can be dirty and you may need to normalize it; again, for an E-commerce website, it could be, for example, the way the prices are displayed ($1.00, $1, $1,00)

Scrapy comes with a built-in solution for this, ItemLoaders. It's an interesting way to populate our product object.

You can add several XPath expressions to the same Item field, and Scrapy will try them sequentially. By default, if more than one XPath expression matches, Scrapy will load all the results into a list. You can find many examples of input and output processors in the Scrapy documentation.

It's really useful when you need to transform/clean the data you extract. For example, extracting the currency from a price, transforming a unit into another one (centimeters to meters, Celsius to Fahrenheit), ...

In our webpage we can find the product title with different XPath expressions: //title and //section[1]//h2/text()

Here is how you could use an ItemLoader in this case:

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('price', "//div[@class='my-4']/span/text()")
    l.add_xpath('title', '//section[1]//h2/text()')
    l.add_xpath('title', '//title')
    l.add_value('product_url', response.url)
    return l.load_item()

Generally, you only want the first matching XPath, so you will need to add output_processor=TakeFirst() to your Item's field constructor.
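
One way to do that is to declare the processor in each Item field's metadata. Here is a sketch of what our Product item would look like with field-level output processors:

import scrapy
from scrapy.loader.processors import TakeFirst

class Product(scrapy.Item):
    product_url = scrapy.Field(output_processor=TakeFirst())
    price = scrapy.Field(output_processor=TakeFirst())
    title = scrapy.Field(output_processor=TakeFirst())
    img_url = scrapy.Field(output_processor=TakeFirst())

This works, but it gets repetitive when every field needs the same processor.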

And that's exactly what we are going to do in our case, so a better approach would be to create our own ItemLoader and declare a default output processor to take the first matching XPath:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

def remove_dollar_sign(value):
    return value.replace('$', '')

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    price_in = MapCompose(remove_dollar_sign)

I also added a price_in attribute, which declares an input processor to delete the dollar sign from the price. I'm using MapCompose, which is a built-in processor that takes one or several functions to be executed sequentially. You can attach as many functions as you like. The convention is to add _in or _out to your Item field's name to add an input or output processor to it.

There are many more processors, you can learn more about this in the documentation.
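
Since processors are plain callables that take an iterable of values, you can also get a feel for them directly in a Python shell (reusing the remove_dollar_sign helper from above):

from scrapy.loader.processors import Join, MapCompose, TakeFirst

Join(' ')(['First paragraph.', 'Second paragraph.'])     # 'First paragraph. Second paragraph.'
MapCompose(str.strip, remove_dollar_sign)([' 20.00$ '])  # ['20.00']
TakeFirst()(['20.00', '25.00'])                          # '20.00'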

Item pipelines

Another extremely useful feature of Scrapy is item pipelines.

Pipelines are plain classes which implement a process_item method. When your spider runs, each configured pipeline instance will have its process_item method called for every item the spider yields, so it can perform its validation and post-processing steps on that item.

Let's try that with a pipeline instance, which validates and normalizes our price field. The class should be declared in our product_scraper/pipelines.py file.

import re

from scrapy.exceptions import DropItem

class PriceValidatorPipeline:
    def process_item(self, item, spider):

        price = item.get('price')

        if price is not None:
            # Keep only digits and the decimal point
            price = re.sub(r"[^0-9.]", "", price)

            if price:
                # Convert string to float
                price = float(price)

                # Validate price
                if price < 10 or price > 100:
                    price = None
            else:
                price = None

        if price is not None:
            # Set normalised price
            item['price'] = price

            return item
        else:
            raise DropItem("Missing or invalid price")

What we did here is declare a plain class, PriceValidatorPipeline, with the aforementioned process_item method.

That method first checks if the item has a price at all. If it doesn't, it immediately skips it with a DropItem exception.

Should there be a price, it will first try to sanitize the price string and convert it to a float value. Finally, it will check if the value is within our (arbitrary) limits of USD 10 and USD 100. If all that checks out, it will store the sanitized price value in our item and return it for the next pipeline step; otherwise it will throw a DropItem exception to signal that this specific item should be dropped.

Last, but not least, we also need to make our spider aware of our pipeline class. We do this by editing our ITEM_PIPELINES entry in product_scraper/settings.py.

ITEM_PIPELINES = {
	'product_scraper.pipelines.PriceValidatorPipeline': 100
}

The numeric value specifies the pipeline's priority. The lower the value, the earlier it will run.

Voilà, we have configured our pipeline class and the spider should now call it for each item and validate the price according to our business logic. All right, let's take that a step further and add a second pipeline to persist the data in a database, shall we?

Database pipeline

For our example, we pick MySQL as our database. As always, Python has got us covered, of course, and there's a lovely Python MySQL package, which will provide us with seamless access to our database. Let's install it first with pip

pip install mysql-connector-python

Perfect, let's code! 😎

Hang on, what about the database?

Database setup

For brevity we assume you have already set up MySQL. If not, please follow MySQL's excellent installation guide.

Once we have our database running, we just need to create a database

CREATE DATABASE scraper;

and within our database a table

CREATE TABLE products (
	url VARCHAR(200) NULL DEFAULT NULL,
	price FLOAT NULL DEFAULT NULL,
	title VARCHAR(200) NULL DEFAULT NULL,
	img_url VARCHAR(200) NULL DEFAULT NULL
);

All set!

Database configuration

Next, we only need to configure our database access credentials. Just add the following four entries (with their actual values) to our configuration file product_scraper/settings.py.

MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'username'
MYSQL_PASS = 'password'
MYSQL_DB = 'scraper'

Database class

And here we go. Just add that code to product_scraper/pipelines.py.

import mysql.connector

class MySQLPipeline:
    def __init__(self, mysql_host, mysql_user, mysql_pass, mysql_db):
        self.mysql_host = mysql_host
        self.mysql_user = mysql_user
        self.mysql_pass = mysql_pass
        self.mysql_db = mysql_db

    @classmethod
    def from_crawler(cls, crawler):
        # Build the pipeline from the connection details in settings.py
        return cls(
            mysql_host = crawler.settings.get('MYSQL_HOST'),
            mysql_user = crawler.settings.get('MYSQL_USER'),
            mysql_pass = crawler.settings.get('MYSQL_PASS'),
            mysql_db = crawler.settings.get('MYSQL_DB')
        )

    def open_spider(self, spider):
        # Called when the spider starts: open the database connection
        self.con = mysql.connector.connect(
            host = self.mysql_host,
            user = self.mysql_user,
            password = self.mysql_pass,
            database = self.mysql_db
        )

    def close_spider(self, spider):
        # Called when the spider finishes: close the connection
        self.con.close()

    def process_item(self, item, spider):
        # Insert each scraped item into the products table
        cursor = self.con.cursor()

        cursor.execute(
            'INSERT INTO products (url, price, title, img_url) VALUES (%s, %s, %s, %s)',
            (item['product_url'], item['price'], item['title'], item['img_url'])
        )

        self.con.commit()

        cursor.close()

        return item

What exactly did we do here?

We first implemented the from_crawler() method, which allows us to initialize our pipeline from a crawler context and access the crawler settings. In our example, we need it to get access to the database connection information, which we have just specified in product_scraper/settings.py.

All right, from_crawler() initializes our pipeline and returns the instance to the crawler. Next, open_spider will be called, where we initialize our MySQL instance and create the database connection. Lovely!

Now, we just sit and wait for our spider to call process_item() for each item, just as it did earlier with PriceValidatorPipeline. This time, we take the item and run an INSERT statement to store it in our products table. Pretty straightforward, isn't it?

And because we are good netizens, we clean up when we are done in close_spider().

At this point we are pretty much done with our database pipeline and just need to declare it again in product_scraper/settings.py, so that our crawler is aware of it.

ITEM_PIPELINES = {
    'product_scraper.pipelines.PriceValidatorPipeline': 100,
    'product_scraper.pipelines.MySQLPipeline': 200
}

We specified "200" as priority here, to make sure it runs after our price validator pipeline.

When we now run our crawler, it should scrape the specified pages, pass the data to our price validator pipeline, where the price gets validated and properly converted to a number, and eventually pass everything to our MySQL pipeline, which will persist the product in our database. 🎉

Scraping multiple pages

Now that we know how to scrape a single page, it's time to learn how to scrape multiple pages, like the entire product catalog. As we saw earlier there are different kinds of Spiders.

When you want to scrape an entire product catalog, the first thing you should look at is a sitemap. Sitemaps are built exactly for this: to show web crawlers how the website is structured.

Most of the time you can find one at base_url/sitemap.xml. Parsing a sitemap can be tricky, and again, Scrapy is here to help you with this.

In our case, you can find the sitemap here: https://clever-lichterman-044f16.netlify.app/sitemap.xml

If we look inside the sitemap, there are many URLs that we are not interested in, like the home page, blog posts, etc.:

<url>
  <loc>
  https://clever-lichterman-044f16.netlify.com/blog/post-1/
  </loc>
  <lastmod>2019-10-17T11:22:16+06:00</lastmod>
</url>
<url>
  <loc>
  https://clever-lichterman-044f16.netlify.com/products/
  </loc>
  <lastmod>2019-10-17T11:22:16+06:00</lastmod>
</url>
<url>
  <loc>
  https://clever-lichterman-044f16.netlify.com/products/taba-cream.1/
  </loc>
  <lastmod>2019-10-17T11:22:16+06:00</lastmod>
</url>

Fortunately, we can filter the URLs to parse only those that match a certain pattern. It's really easy: here, we only want the URLs that have /products/ in them:

from scrapy.spiders import SitemapSpider

class ProductSitemapSpider(SitemapSpider):
    name = "sitemap_spider"
    sitemap_urls = ['https://clever-lichterman-044f16.netlify.app/sitemap.xml']
    sitemap_rules = [
        ('/products/', 'parse_product')
    ]

    def parse_product(self, response):
        # ... scrape product (see below) ...
        pass

Here, we simply used the default sitemap_rules field to specify one entry, where we instructed Scrapy to call our parse_product method for each URL which matched our /products/ regular expression.
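
For completeness, the parse_product callback could reuse the ProductLoader from earlier, for example like this (assuming Product and ProductLoader are imported from the project's modules, and that every product page shares the structure we inspected before):

def parse_product(self, response):
    # Same selectors as in the single-product example
    l = ProductLoader(item=Product(), response=response)
    l.add_xpath('price', "//div[@class='my-4']/span/text()")
    l.add_xpath('title', '//section[1]//h2/text()')
    l.add_xpath('img_url', "//div[@class='product-slider']//img/@src")
    l.add_value('product_url', response.url)
    return l.load_item()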

You can run this spider with the following command to scrape all the products and export the result to a CSV file:

scrapy runspider sitemap_spider.py -o output.csv

Now what if the website didn't have any sitemap? Once again, Scrapy has a solution for this!

Let me introduce you to the... CrawlSpider.

Just like our original spider, the CrawlSpider will crawl the target website, starting with the URLs in its start_urls list. Then, for each URL, it will extract all the links that match a list of Rules. In our case it's easy: all products share the same URL pattern, /products/product_title, so we only need to filter on these URLs.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from product_scraper.productloader import ProductLoader
from product_scraper.items import Product

class MySpider(CrawlSpider):
    name = 'crawl_spider'
    allowed_domains = ['clever-lichterman-044f16.netlify.app']
    start_urls = ['https://clever-lichterman-044f16.netlify.app/products/']

    rules = (
        Rule(LinkExtractor(allow=('products', )), callback='parse_product'),
    )

    def parse_product(self, response):
        # ... parse product ...
        pass

As you can see, all these built-in Spiders are really easy to use. It would have been much more complex to do it from scratch.

With Scrapy you don't have to think about the crawling logic, like adding new URLs to a queue, keeping track of already parsed URLs, multi-threading...

💡 One thing to still keep in mind is rate limits and bot detection. Many sites use such measures to actively stop scrapers from accessing their data. At ScrapingBee, we took all these issues into account and provide a platform to handle them in an easy, elegant, and scalable fashion.

Please check out our no-code scraping solution for more details on how ScrapingBee can help you with your scraping projects. And the first one thousand API calls are entirely free.

Conclusion

In this post we saw a general overview of how to scrape the web with Scrapy and how it can solve your most common web scraping challenges. Of course we only touched the surface and there are many more interesting things to explore, like middlewares, exporters, extensions, pipelines!

If you've been doing web scraping more "manually" with tools like BeautifulSoup / Requests, it's easy to understand how Scrapy can help save time and build more maintainable scrapers. I hope you liked this Scrapy tutorial and that it will motivate you to experiment with it.

For further reading, don't hesitate to look at the great Scrapy documentation. We have also published our custom integration with Scrapy; it allows you to execute JavaScript with Scrapy, so please feel free to check it out and provide us with any feedback you may have.

You can also check out our web scraping with Python tutorial to learn more about web scraping.

Happy Scraping!

Kevin Sahin

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.