How to Work With PDF Documents Using Python

I really admire Portable Document Format (PDF) files. They are immensely popular with people because you get the same content and layout irrespective of your operating system, reading device, or the software being used.

Anyone who has worked with plain text files in Python before might think that working with PDF files is also going to be easy, but it's a bit different here. PDF documents are binary files and are more complex than plain text files, especially since they contain elements like different font types, colors, and images.
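
To make the "binary file" point concrete, here is a minimal sketch that opens a file in binary mode and checks for the %PDF- signature that PDF files normally start with. The looks_like_pdf() helper is a name I made up for illustration; it is not part of Python or PyPDF2, and it only inspects the first five bytes.

```python
def looks_like_pdf(path):
    """Return True if the file begins with the PDF signature bytes.

    Hypothetical helper for illustration only: it checks the first
    five bytes and does not validate the rest of the file.
    """
    # 'rb' opens the file in binary mode, so we get raw bytes back
    # instead of decoded text.
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"
```

A well-formed PDF file should normally make this function return True, while a plain text file will usually return False.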

However, that doesn't mean that it is hard to work with PDF documents using Python. It is rather simple once you use an external module that handles the format for you.

In this post, I'll show you how to open a PDF file, access its pages, read its metadata, and extract text from it with the PyPDF2 module. If you want to learn how to write PDF files, check out my tutorial on creating and editing PDF documents.

Initial Setup

As I mentioned above, using an external module is the key. The module we will be using in this tutorial is PyPDF2. As it is an external module, the first step we have to take is to install it. For that, we will be using pip, which is (based on Wikipedia):

A package management system used to install and manage software packages written in Python. Many packages can be found in the Python Package Index (PyPI).

You can follow the steps mentioned in the official guide for installing pip. There is a good chance that pip was installed automatically if you downloaded Python from python.org.

PyPDF2 can now be installed by typing the following command inside your terminal:

pip install PyPDF2

Great! You now have PyPDF2 installed, and you're ready to start playing with PDF documents.
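
If you want to double-check that the installation worked, one option is to ask Python which version of the package it can see. The sketch below uses the standard importlib.metadata module (available in Python 3.8 and later); installed_version() is a helper name I chose for this example, not something PyPDF2 provides.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        return None


version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed in this environment.")
else:
    print(f"PyPDF2 version {version} is installed.")
```

The exact version string you see depends on when you installed the library.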

PyPDF2 Basics

Before we dig deeper, I would like to give you a brief overview of the PyPDF2 module. This is a completely free and open-source library that can do a lot of things with PDF documents. You can use the library not only for reading from a PDF file but also for writing, splitting, and merging.

A lot of things have changed in the library from its older versions. For this tutorial, I am going to use version 2.11.1 of the library.

The PyPDF2 library doesn't require any dependencies for its regular features. However, you will need some dependencies to work with cryptography and images in PDF files. Automatic installation of all dependencies is possible with the command:

pip install PyPDF2[full]

However, if you know that you will only need to encrypt and decrypt PDF documents with AES (Advanced Encryption Standard), you can install just the cryptography-related dependencies:

pip install PyPDF2[crypto]

I should also point out that RC4 encryption is supported with the standalone installation of PyPDF2 without any dependencies.

Reading a PDF Document

The sample file we will be working with in this tutorial is a PDF version of Beauty and the Beast hosted on Project Gutenberg. Go ahead and download the file to follow the tutorial, or you can simply use any PDF file you like.

The following code will get you set up for extracting additional information from the file:

import PyPDF2

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PyPDF2.PdfReader(book)

The first line imports the PyPDF2 module for us to use in our program. We then use the built-in open() function to open our PDF file in binary mode.

Once the file is open, we create a PdfReader object by passing our open file to the PdfReader class from the module. We are now ready to handle a variety of reading operations on our book.

More Operations on PDF Documents

After reading the PDF document, we can now carry out different operations on the document, as we will see in this section.

Number of Pages

The number of pages in a PDF document is accessible with a read-only property of the PdfReader class called pages. This property basically gives us a list of Page objects. Those page objects represent the individual pages of the PDF file.

You can easily get the number of pages by using the built-in len() function and passing the list of Page objects as a parameter.

import PyPDF2

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PyPDF2.PdfReader(book)
    number_of_pages = len(book_reader.pages)

    # Outputs: 48
    print(number_of_pages)

In this case, the returned value was 48, which is equal to the number of pages in our document.

Directly Accessing a Page Number

We have seen in the previous section that the pages property of the PdfReader class returns a list of Page objects. You can directly access any page from the list by specifying its index. Consider the following example, in which I will retrieve the second item from a list of languages.

languages = ["French", "English", "Hindi"]

# Outputs: English
print(languages[1])

Directly accessing a page from the PDF document will work similarly. Here is an example:

import PyPDF2

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PyPDF2.PdfReader(book)
    page_list = book_reader.pages

    first_page = page_list[0]
    last_page = page_list[-1]

Now that we have learned how to access a Page object based on its index, let's see how to do the reverse and get the page number from a Page object. The PdfReader class has a very handy method called get_page_number() for this. All you need to do is pass the Page object as a parameter, and the method returns the zero-based position of that page in the document.

import random
from PyPDF2 import PdfReader

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PdfReader(book)
    page_list = book_reader.pages

    last_page = page_list[-1]
    # Outputs: 47
    print(book_reader.get_page_number(last_page))

    some_page = page_list[random.randint(15, 35)]
    # Outputs: 19
    print(book_reader.get_page_number(some_page))

In the above example, we first try to get the page number for the last page in our PDF document, and it comes out to 47 since the indexing starts at 0. A value of 47 actually means page 48.

We also try the same function with a page between 15 and 35 selected at random. The output is 19 in this particular instance, but it will vary with each execution.
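
Since the off-by-one difference between indexes and page numbers is easy to trip over, here is a small sketch of a helper that takes a 1-based page number, the way a reader would say it, and returns the matching item. The name page_for_number() is my own invention, and I use a plain list as stand-in data so that the example runs without a PDF file. It should work the same way with book_reader.pages, because it only relies on len() and indexing.

```python
def page_for_number(pages, page_number):
    """Return the page for a 1-based page number.

    `pages` can be any sequence that supports len() and indexing,
    such as the list of Page objects from book_reader.pages.
    """
    if not 1 <= page_number <= len(pages):
        raise IndexError(f"page_number must be between 1 and {len(pages)}")
    # Page 1 lives at index 0, page 2 at index 1, and so on.
    return pages[page_number - 1]


# Stand-in data so the sketch runs without a PDF file.
sample_pages = ["cover", "contents", "chapter one"]

# Outputs: cover
print(page_for_number(sample_pages, 1))

# Outputs: chapter one
print(page_for_number(sample_pages, 3))
```

Raising an IndexError for out-of-range numbers is a design choice on my part. It stops a page number of 0 from silently wrapping around to the last page through negative indexing.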

Page Mode and Page Layout

The library also allows you to easily access the page mode and page layout information for your PDF document. You simply need to use the properties called page_mode and page_layout to do so.

All the valid page mode values are shown in the table below:

Value             Description
/UseNone          Do not show outlines or thumbnail panels
/UseOutlines      Show outlines (aka bookmarks) panel
/UseThumbs        Show page thumbnails panel
/FullScreen       Fullscreen view
/UseOC            Show Optional Content Group (OCG) panel
/UseAttachments   Show attachments panel

The table below shows all the valid page layout values:

Value             Description
/NoLayout         Layout explicitly not specified
/SinglePage       Show one page at a time
/OneColumn        Show one column at a time
/TwoColumnLeft    Show pages in two columns, odd-numbered pages on the left
/TwoColumnRight   Show pages in two columns, odd-numbered pages on the right
/TwoPageLeft      Show two pages at a time, odd-numbered pages on the left
/TwoPageRight     Show two pages at a time, odd-numbered pages on the right

In order to check our page mode, we can use the following script:

from PyPDF2 import PdfReader

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PdfReader(book)

    # Outputs: None
    print(book_reader.page_mode)

    # Outputs: None
    print(book_reader.page_layout)

In the case of our PDF document, the returned value is None, which means that the page mode and the page layout are not specified.
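
If you would rather print a readable description than a raw value like /UseOutlines, one possible approach is to keep the two tables above in dictionaries and look the value up. This is only a sketch: describe() and the two dictionary names are my own, and I convert the value with str() on the assumption that the library hands back a string-like object.

```python
# Descriptions copied from the page mode table above.
PAGE_MODES = {
    "/UseNone": "Do not show outlines or thumbnail panels",
    "/UseOutlines": "Show outlines (aka bookmarks) panel",
    "/UseThumbs": "Show page thumbnails panel",
    "/FullScreen": "Fullscreen view",
    "/UseOC": "Show Optional Content Group (OCG) panel",
    "/UseAttachments": "Show attachments panel",
}

# Descriptions copied from the page layout table above.
PAGE_LAYOUTS = {
    "/NoLayout": "Layout explicitly not specified",
    "/SinglePage": "Show one page at a time",
    "/OneColumn": "Show one column at a time",
    "/TwoColumnLeft": "Show pages in two columns, odd-numbered pages on the left",
    "/TwoColumnRight": "Show pages in two columns, odd-numbered pages on the right",
    "/TwoPageLeft": "Show two pages at a time, odd-numbered pages on the left",
    "/TwoPageRight": "Show two pages at a time, odd-numbered pages on the right",
}


def describe(value, descriptions):
    """Translate a page mode or page layout value into plain English."""
    if value is None:
        return "Not specified in the document"
    # str() is a precaution in case the value is a string subclass.
    return descriptions.get(str(value), f"Unknown value: {value}")


# Outputs: Not specified in the document
print(describe(None, PAGE_MODES))

# Outputs: Show outlines (aka bookmarks) panel
print(describe("/UseOutlines", PAGE_MODES))
```

Inside the with block of the script above, you could call describe(book_reader.page_mode, PAGE_MODES) or describe(book_reader.page_layout, PAGE_LAYOUTS) in the same way.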

Extract Metadata

The PdfReader class also has a property called metadata that returns the document information dictionary for the PDF file that you are reading. This metadata can contain information such as the author name, title of the document, creation date, and producer. The following example tries to extract all of this information from our PDF document.

from PyPDF2 import PdfReader

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PdfReader(book)
    book_metadata = book_reader.metadata

    # Outputs: Beauty and the Beast
    print(book_metadata.title)

    # Outputs: Anonymous
    print(book_metadata.author)

    # Outputs: 2006-11-30 01:13:00-08:00
    print(book_metadata.creation_date)

    # Outputs: pdfeTeX-1.21a
    print(book_metadata.producer)

Please keep in mind that some PDF files could have all of these values set to None.
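
One way to cope with missing values is to substitute a placeholder before printing. The sketch below is an illustration under a couple of assumptions: summarize_metadata() is a helper I made up, and I use a SimpleNamespace object as a stand-in for the real metadata so that the example runs without a PDF file. As far as I know, the metadata property itself can also be None when a file has no document information dictionary, so the helper handles that case too.

```python
from types import SimpleNamespace


def summarize_metadata(metadata, placeholder="(not set)"):
    """Collect a few metadata fields, replacing missing values."""
    fields = ("title", "author", "creation_date", "producer")
    summary = {}
    for field in fields:
        # getattr() with a default also copes with metadata being None.
        value = getattr(metadata, field, None)
        summary[field] = placeholder if value is None else value
    return summary


# Stand-in object that mimics a document with only a title set.
fake_metadata = SimpleNamespace(
    title="Beauty and the Beast",
    author=None,
    creation_date=None,
    producer=None,
)

print(summarize_metadata(fake_metadata))
print(summarize_metadata(None))
```

With a real document, you would pass book_reader.metadata to the helper while the file is still open.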

Extract Text

We have been wandering around the file so far, so let's see what's inside. The extract_text() method will be our friend in this task. The script to extract text from the PDF document is as follows:

from PyPDF2 import PdfReader

with open('beauty-and-the-beast.pdf', 'rb') as book:
    book_reader = PdfReader(book)
    page_list = book_reader.pages

    story_page = page_list[6]
    page_text = story_page.extract_text()

    print(page_text)

The output that I got after executing the above script is shown below:

[002]
BEAUTY AND THE BEAST.
Once upon a time, in a very far-off country, there lived a mer-
chant who had been so fortunate in all his undertakings that he
was enormously rich. As he had, however, six sons and six
daughters,hefoundthathismoneywasnottoomuchtoletthem
allhaveeverythingtheyfancied,astheywereaccustomedtodo.
But one day a most unexpected misfortune befell them. Their
house caught fire and was speedily burnt to the ground, with
all the splendid furniture, the books, pictures, gold, silver, and
precious goods it contained; and this was only the beginning of

I was able to extract all the text on the page. However, as you can see, the extract_text() method doesn't get the spacing between the words right in some places. The final result depends on a variety of factors, one of them being the generator used to create the PDF file. This means that you won't face this issue in all PDF files, but some of them are bound to have incorrect spacing after text extraction.
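
Missing spaces are hard to repair after the fact, but some smaller cleanup is possible. In the output above, the word "merchant" is split across two lines as "mer-" and "chant". Here is a minimal sketch of how such words could be rejoined with a regular expression. The join_hyphenated_lines() function is my own helper, not a PyPDF2 feature, and it will also merge genuinely hyphenated words that happen to break at the end of a line.

```python
import re


def join_hyphenated_lines(text):
    """Rejoin words that were split with a hyphen at the end of a line."""
    # A word character, a hyphen, a newline, and another word character
    # are collapsed so that "mer-\nchant" becomes "merchant".
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


sample = "there lived a mer-\nchant who had been so fortunate"

# Outputs: there lived a merchant who had been so fortunate
print(join_hyphenated_lines(sample))
```

You could pass the page_text value from the extraction script through this function before printing it.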

Conclusion

As we can see, Python makes it simple to work with PDF documents. This tutorial just scratched the surface on this topic, and you can find more details of different operations you can perform on PDF documents on the PyPDF2 documentation page.
