
Scraping Webpages in Python With Beautiful Soup: Search and DOM Modification

This post is part of a series called Scraping Webpages in Python With Beautiful Soup.
Scraping Webpages in Python With Beautiful Soup: The Basics

In the last tutorial, you learned the basics of the Beautiful Soup library. Besides navigating the DOM tree, you can also search for elements with a given class or id. You can also modify the DOM tree using this library.

In this tutorial, you will learn about the different methods that will help you search and modify the DOM tree. We will be scraping the same Wikipedia page about Python as in the last tutorial.

Filters for Searching the Tree

Beautiful Soup has a lot of methods for searching the DOM tree. These methods are very similar and take the same kinds of filters as arguments. Therefore, it makes sense to properly understand the different filters before reading about the methods. I will be using the same find_all() method to explain the differences between the filters.

The simplest filter that you can pass to any search method is a string. Beautiful Soup will then search through the document to find a tag that exactly matches the string.

for heading in soup.find_all('h2'):
    print(heading.text)

# Contents
# History[edit]
# Features and philosophy[edit]
# Syntax and semantics[edit]
# Libraries[edit]
# Development environments[edit]
# ... and so on.

You can also pass a regular expression object to the find_all() method. This time, Beautiful Soup will filter the tree by matching all the tags against a given regular expression.

import re

for heading in soup.find_all(re.compile("^h[1-6]")):
    print(heading.name + ' ' + heading.text.strip())

# h1 Python (programming language)
# h2 Contents
# h2 History[edit]
# h2 Features and philosophy[edit]
# h2 Syntax and semantics[edit]
# h3 Indentation[edit]
# h3 Statements and control flow[edit]
# ... and so on.

The code will look for all the tags that begin with "h" and are followed by a digit from 1 to 6. In other words, it will be looking for all the heading tags in the document.

Instead of using regex, you could achieve the same result by passing a list of all the tags that you want Beautiful Soup to match against the document.

for heading in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
    print(heading.name + ' ' + heading.text.strip())

You can also pass True as a parameter to the find_all() method. The code will then return all the tags in the document. The output below means that there are currently 4,339 tags in the Wikipedia page that we are parsing.

len(soup.find_all(True))
# 4339

If you are still not able to find what you are looking for with any of the above filters, you can define your own function that takes an element as its only argument. The function needs to return True if there is a match and False otherwise. You can make the function as complicated as it needs to be to do the job. Here is a very simple example:

def big_lists(tag):
    return len(tag.contents) > 20 and tag.name == 'ul'

len(soup.find_all(big_lists))
# 13

The above function is going through the same Wikipedia Python page and looking for unordered lists that have more than 20 children.
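The same idea works on any document. Here is a minimal, self-contained sketch along those lines, using hypothetical markup and a made-up short_lists() filter that matches unordered lists with fewer than two children (the built-in html.parser is used so no extra parser is needed):

```python
from bs4 import BeautifulSoup

html = "<ul><li>a</li><li>b</li></ul><ul><li>c</li></ul>"
soup = BeautifulSoup(html, "html.parser")

def short_lists(tag):
    # Match only <ul> tags that have fewer than two children.
    return tag.name == 'ul' and len(tag.contents) < 2

# Only the second <ul> qualifies.
print(len(soup.find_all(short_lists)))  # 1
```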

Searching the DOM Tree Using Built-In Functions

Searching With find_all()

One of the most popular methods for searching through the DOM is find_all(). It will go through all the tag's descendants and return a list of all the descendants that match your search criteria. This method has the following signature:

find_all(name, attrs, recursive, string, limit, **kwargs)

The name argument is the name of the tag that you want this function to search for while going through the tree. You are free to provide a string, a list, a regular expression, a function, or the value True as a name.

Filtering by Attribute

You can also filter the elements in the DOM tree on the basis of different attributes like id, href, etc. You can also get all the elements with a specific attribute regardless of its value using attribute=True. Searching for elements with a specific class is different from searching for regular attributes. Since class is a reserved keyword in Python, you will have to use the class_ keyword argument when looking for elements with a specific class.

import re

len(soup.find_all(id=True))
# 425

len(soup.find_all(class_=True))
# 1734

len(soup.find_all(class_="mw-headline"))
# 20

len(soup.find_all(href=True))
# 1410

len(soup.find_all(href=re.compile("python")))
# 102

You can see that the document has 1,734 tags with a class attribute and 425 tags with an id attribute.

Limiting the Number of Results

If you only need the first few of these results, you can pass a number to the method as the value of limit. Passing this value will instruct Beautiful Soup to stop looking for more elements once it has reached a certain number. Here is an example:

soup.find_all(class_="mw-headline", limit=4)

# <span class="mw-headline" id="History">History</span>
# <span class="mw-headline" id="Features_and_philosophy">Features and philosophy</span>
# <span class="mw-headline" id="Syntax_and_semantics">Syntax and semantics</span>
# <span class="mw-headline" id="Indentation">Indentation</span>

Non-Recursive Search

When you use the find_all() method, you are telling Beautiful Soup to go through all the descendants of a given tag to find what you are looking for. Sometimes, you want to look for an element only in the direct children of a tag. This can be achieved by passing recursive=False to the find_all() method.

len(soup.html.find_all("meta"))
# 6

len(soup.html.find_all("meta", recursive=False))
# 0

len(soup.head.find_all("meta", recursive=False))
# 6
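The same behavior is easy to see on a tiny hypothetical document, where the parse tree is small enough to inspect by hand:

```python
from bs4 import BeautifulSoup

html = "<div><span>direct child</span><p><span>nested deeper</span></p></div>"
soup = BeautifulSoup(html, "html.parser")

# The default search visits all descendants of the <div>.
print(len(soup.div.find_all("span")))                   # 2

# recursive=False restricts the search to direct children only,
# so the <span> inside the <p> is not counted.
print(len(soup.div.find_all("span", recursive=False)))  # 1
```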

Finding a Single Result

If you are interested in finding only one result for a particular search query, you can use the find() method instead of passing limit=1 to find_all(). The difference between the results returned by these two methods is that find_all() returns a list containing the single element, while find() just returns the element itself. They also differ when nothing matches: find_all() returns an empty list, while find() returns None.

soup.find_all("h2", limit=1)
# [<h2>Contents</h2>]

soup.find("h2")
# <h2>Contents</h2>
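The no-match behavior can be verified with a minimal sketch on a hypothetical one-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")

# There is no <h2> in this document.
print(soup.find_all("h2"))  # []
print(soup.find("h2"))      # None
```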

The find() and find_all() methods look through all the descendants of a given tag when searching for an element.

Parent and Sibling Searches

There are ten other very similar methods that you can use to iterate through the DOM tree in different directions.

find_parents(name, attrs, string, limit, **kwargs)
find_parent(name, attrs, string, **kwargs)

find_next_siblings(name, attrs, string, limit, **kwargs)
find_next_sibling(name, attrs, string, **kwargs)

find_previous_siblings(name, attrs, string, limit, **kwargs)
find_previous_sibling(name, attrs, string, **kwargs)

find_all_next(name, attrs, string, limit, **kwargs)
find_next(name, attrs, string, **kwargs)

find_all_previous(name, attrs, string, limit, **kwargs)
find_previous(name, attrs, string, **kwargs)

The find_parent() and find_parents() methods traverse up the DOM tree to find the given element. The find_next_sibling() and find_next_siblings() methods will iterate over all the siblings of the element that come after the current one. Similarly, the find_previous_sibling() and find_previous_siblings() methods will iterate over all the siblings of the element that come before the current one.

The find_next() and find_all_next() methods will iterate over all the tags and strings that come after the current element. Similarly, the find_previous() and find_all_previous() methods will iterate over all the tags and strings that come before the current element.
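As a rough illustration of these directional searches, here is a self-contained sketch using a small hypothetical document instead of the Wikipedia page:

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p id='mid'>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

mid = soup.find("p", id="mid")

# Traverse up the tree to the enclosing <div>.
print(mid.find_parent("div").name)          # div

# Siblings after and before the current element.
print(mid.find_next_sibling("p").text)      # three
print(mid.find_previous_sibling("p").text)  # one
```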

Search Using CSS Selectors

You can also search for elements using CSS selectors with the help of the select() method. Here are a few examples:

len(soup.select("p a"))
# 411

len(soup.select("p > a"))
# 291

soup.select("h2:nth-of-type(1)")
# [<h2>Contents</h2>]

len(soup.select("p > a:nth-of-type(2)"))
# 46

len(soup.select("p > a:nth-of-type(10)"))
# 6

len(soup.select("[class*=section]"))
# 80

len(soup.select("[class$=section]"))
# 20
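If you want to experiment with these selectors without fetching the Wikipedia page, here is a minimal sketch with hypothetical markup that shows what each kind of selector matches:

```python
from bs4 import BeautifulSoup

html = """
<p><a href="#">direct</a> <span><a href="#">nested</a></span></p>
<div class="main-section">A</div>
<div class="section-header">B</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("p a")))               # 2: anchors anywhere inside a <p>
print(len(soup.select("p > a")))             # 1: only direct children of a <p>
print(len(soup.select("[class*=section]")))  # 2: class contains "section"
print(len(soup.select("[class$=section]")))  # 1: class ends with "section"
```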

Modifying the Tree

You can not only search through the DOM tree to find an element but also modify it. It is very easy to rename a tag and modify its attributes.

heading_tag = soup.select("h2:nth-of-type(2)")[0]

heading_tag.name = "h3"
print(heading_tag)
# <h3><span class="mw-headline" id="Features_and_philosophy">Feat...

heading_tag['class'] = 'headingChanged'
print(heading_tag)
# <h3 class="headingChanged"><span class="mw-headline" id="Feat...

heading_tag['id'] = 'newHeadingId'
print(heading_tag)
# <h3 class="headingChanged" id="newHeadingId"><span class="mw....

del heading_tag['id']
print(heading_tag)
# <h3 class="headingChanged"><span class="mw-headline"...

Continuing from our last example, you can replace a tag's contents with a given string using the .string attribute. If you don't want to replace the contents but add something extra at the end of the tag, you can use the append() method.

Adding Multiple Elements to a Tag

What if you want to add multiple elements to a tag? You can do that with the extend() method. It accepts a list of elements as its parameter. These elements are added to the calling tag in the order of appearance.

import requests
from bs4 import BeautifulSoup

req = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
soup = BeautifulSoup(req.text, "lxml")

new_soup = BeautifulSoup("<ol></ol>", "lxml")

new_soup.ol.extend(['<li>' + heading.text + '</li>' for heading in soup.find_all('h2')])

print(new_soup.ol.contents)
# ['<li>Contents</li>', '<li>History[edit]</li>', ... , '<li>Navigation menu</li>']

print(new_soup.find_all('li'))
# Returns an empty list

In the above example, we created a new BeautifulSoup object to store the headings as a list. The list is generated using a list comprehension. We passed this list to the extend() method to append everything to our ol tag. It may look as if we are adding the headings inside our ol tag as individual list elements, but they are added as plain strings. This becomes evident when we use find_all() on the new_soup we created: it finds no li tags at all.

The best way to add elements as proper HTML tags is to call the new_tag() method. The only required argument in this case is the tag name, but you can also add other attributes as shown below.

import requests
from bs4 import BeautifulSoup

req = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
soup = BeautifulSoup(req.text, "lxml")

new_soup = BeautifulSoup("<ol></ol>", "lxml")

all_tags = []

counter = 0
for heading in soup.find_all('h2'):
    counter += 1
    id_string = "list-item-" + str(counter)
    tag = new_soup.new_tag('li', id=id_string, attrs={"class": "outline"})
    tag.string = heading.text
    all_tags.append(tag)

new_soup.ol.extend(all_tags)

print(new_soup.ol.contents)
# [<li class="outline" id="list-item-1">Contents</li>, <li class="outline" id="list-item-2">History[edit]</li>, ... , <li class="outline" id="list-item-19">Navigation menu</li>]

print(new_soup.find_all('li'))
# [<li class="outline" id="list-item-1">Contents</li>, <li class="outline" id="list-item-2">History[edit]</li>, ... , <li class="outline" id="list-item-19">Navigation menu</li>]

You can see from the output this time that the list elements are no longer simple strings but actual HTML elements.

Insert an Element at a Specific Location

If you want to insert something inside a tag at a specific location, you can use the insert() method. The first parameter for this method is the position or index at which you want to insert the content, and the second parameter is the content itself. You can remove all the content inside a tag using the clear() method. This will just leave you with the tag itself and its attributes.

heading_tag.string = "Features and Philosophy"
print(heading_tag)
# <h3 class="headingChanged">Features and Philosophy</h3>

heading_tag.append(" [Appended This Part].")
print(heading_tag)
# <h3 class="headingChanged">Features and Philosophy [Appended This Part].</h3>

print(heading_tag.contents)
# ['Features and Philosophy', ' [Appended This Part].']

heading_tag.insert(1, ' Inserted this part ')
print(heading_tag)
# <h3 class="headingChanged">Features and Philosophy Inserted this part  [Appended This Part].</h3>

heading_tag.clear()
print(heading_tag)
# <h3 class="headingChanged"></h3>

At the beginning of this section, you selected a level two heading from the document and changed it to a level three heading. Using the same selector again will now show you the next level two heading that came after the original. This makes sense because the original heading is no longer a level two heading.

The original heading can now be selected using h3:nth-of-type(2). If you want to completely remove an element or tag and all the content inside it from the tree, you can use the decompose() method.

soup.select("h3:nth-of-type(2)")[0]
# <h3 class="headingChanged"></h3>

soup.select("h3:nth-of-type(3)")[0]
# <h3><span class="mw-headline" id="Indentation">Indentation</span>...

soup.select("h3:nth-of-type(2)")[0].decompose()
soup.select("h3:nth-of-type(2)")[0]
# <h3><span class="mw-headline" id="Indentation">Indentation</span>...

Once you've decomposed or removed the original heading, the heading in the third spot takes its place.

If you want to remove a tag and its contents from the tree but don't want to completely destroy the tag, you can use the extract() method. This method will return the tag that it extracted. You will now have two different trees that you can parse. The root of the new tree will be the tag that you just extracted.

heading_tree = soup.select("h3:nth-of-type(2)")[0].extract()

len(heading_tree.contents)
# 2
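Here is a self-contained sketch of extract() on a small hypothetical document, showing that both the original tree and the extracted tree remain usable afterwards:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>keep</p><p>take me <b>out</b></p></div>", "html.parser")

# Pull the second paragraph out of the tree entirely.
extracted = soup.find_all("p")[1].extract()

print(soup.div)                 # <div><p>keep</p></div>
print(extracted)                # <p>take me <b>out</b></p>
print(len(extracted.contents))  # 2: a NavigableString and a <b> tag
```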

You can also replace a tag inside the tree with something else of your choice using the replace_with() method. This method will return the tag or string that it replaced. It can be helpful if you want to put the replaced content somewhere else in the document.

soup.h1
# <h1 class="firstHeading">Python (programming language)</h1>

bold_tag = soup.new_tag("b")
bold_tag.string = "Python"

soup.h1.replace_with(bold_tag)

print(soup.h1)
# None

print(soup.b)
# <b>Python</b>

In the above code, the main heading of the document has been replaced with a b tag. The document no longer has an h1 tag, and that is why print(soup.h1) now prints None.
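Because replace_with() returns the replaced element, you can keep a handle on it for later reuse. A minimal sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>old</b> text</p>", "html.parser")

italic_tag = soup.new_tag("i")
italic_tag.string = "new"

# replace_with() returns the element that was replaced.
old_tag = soup.b.replace_with(italic_tag)

print(soup.p)   # <p><i>new</i> text</p>
print(old_tag)  # <b>old</b>
```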

Wrapping and Unwrapping Tags

Two more methods that will come in handy when you are modifying the DOM are wrap() and unwrap(). The wrap() method is useful when you want to wrap a tag around some content. Similarly, the unwrap() method gets rid of the calling tag, leaving only its contents behind.

soup = BeautifulSoup("<ol><li>Overview</li><li>Main Content</li><li>Conclusion</li></ol>", "lxml")

for list_item in soup.find_all('li'):
    list_item.string.wrap(soup.new_tag("b"))

print(soup.ol.contents)
# [<li><b>Overview</b></li>, <li><b>Main Content</b></li>, <li><b>Conclusion</b></li>]

You can use the unwrap() method to strip the provided markup of specific tags. In the following example, we will use it to remove all the <b> and <i> tags from a paragraph.

soup = BeautifulSoup("<p>We will <i>try</i> to get rid of <b>tags</b> that make text <b>bold</b> or <i>italic</i>. The content <i>within</i> the <b>tags</b> should still be <b>preserved</b>.</p>", "lxml")

for unwanted_tag in soup.find_all(["b", "i"]):
    unwanted_tag.unwrap()

print(soup.p.contents)
# ['We will ', 'try', ' to get rid of ', 'tags', ... , 'preserved', '.']

soup.p.smooth()

print(soup.p.contents)
# ['We will try to get rid of tags ...  preserved.']

In the above example, we created a list of unwanted tags that we want to remove and passed it to find_all(). This method then finds all the instances of these tags and calls unwrap() on all of them. One side effect of running the above code is that all the individual bits of text are stored as NavigableString objects. NavigableStrings are like regular strings except they carry information about the parse tree. You can combine them all into a single string by calling the smooth() method.

Final Thoughts

After reading the two tutorials in this series, you should now be able to parse different webpages and extract important data from the document. You should also be able to retrieve the original webpage, modify it to suit your own needs, and save the modified version locally.
