How to Create and Edit PDF Documents in Python

In our previous tutorial, we learned how to read PDF documents in Python and discussed the basics of the PyPDF2 library. While some projects will require you to extract data from PDF documents, you'll often need to create a PDF of your own for things like automatic invoice generation or reservation confirmations.

Previous tutorial: How to Work With PDF Documents Using Python, by Abder-Rahman Ali (28 Nov 2022)

One amazing library that you can use to create and edit PDF documents in Python is PyPDF2. The library has a huge feature set that lets you extract information like text, images, and metadata from a PDF document, which we covered in the previous tutorial. You can also create and edit a PDF document, perform encryption and decryption, add or remove annotations, and more.

In this tutorial, our focus will be on creating and editing PDF documents. Let's get started.

Creating PDF Documents

We use the PdfReader class to read and extract content from a PDF document, and we use the PdfWriter class to create new PDF files. One limitation of PyPDF2 is that it cannot draw new text or graphics onto a page: you can only build new PDF files out of blank pages and pages taken from existing PDF files.

We will begin by creating a blank page for our PDF file, and that requires us to instantiate an object using the PdfWriter() class. This class has a method called add_blank_page(), which will create a blank page with the specified dimensions and append it to the existing object.

The dimensions of the page are specified in default user space units, where 72 units are equivalent to 1 inch. Keeping that in mind, we can create an A4 page by multiplying 8.27 by 72 to get the page width and 11.69 by 72 to get the page height.
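The arithmetic above can be sketched as a small helper. The names inches_to_points and POINTS_PER_INCH are my own and are not part of PyPDF2; the sketch only assumes the 72 units per inch ratio described above.

```python
import math

POINTS_PER_INCH = 72  # default user space: 72 units per inch


def inches_to_points(inches):
    """Convert a length in inches to whole user space units, rounding down."""
    return math.floor(inches * POINTS_PER_INCH)


# A4 is roughly 8.27 x 11.69 inches
a4_width = inches_to_points(8.27)
a4_height = inches_to_points(11.69)
print(a4_width, a4_height)  # 595 841
```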

I used the following code to create a blank PDF document using PyPDF2:

import math
from PyPDF2 import PdfWriter

my_pdf_pages = PdfWriter()

page_width = math.floor(8.27*72)
page_height = math.floor(11.69*72)

my_pdf_pages.add_blank_page(page_width, page_height)

with open('doc.pdf', 'wb+') as file:
    my_pdf_pages.write(file)

It is important to use integer values for the width and height of the page. Otherwise, you end up with a PDF document with incorrect dimensions. I have used the open() function in Python and specified a file name along with the opening mode. The value wb+ opens the file in binary mode for both writing and reading, and it truncates the file if it already exists. Plain wb would work just as well here because we only write to the file.

After that, I use the write() method to write the contents of the my_pdf_pages object to the doc.pdf file. Granted, you will only see a blank page if you open up the file now, but we were able to create it using the library.

Remember how we read different pages from a PDF document in the previous tutorial using the pages property? The pages property stored all the pages of the document as a list of Page objects. We can extract a specific set of pages and then embed them into our newly created PDF using the add_page() method.

Here is an example in which I read the content of two different PDF books and write some of their pages to a new file sequentially:

from PyPDF2 import PdfReader, PdfWriter

my_pdf_pages = PdfWriter()

with open('secret-doctrine-01.pdf', 'rb') as book_a:
    with open('secret-doctrine-02.pdf', 'rb') as book_b:
        with open('excerpts.pdf', 'wb+') as file:
            book_a_pages = PdfReader(book_a).pages
            book_b_pages = PdfReader(book_b).pages
            for i in range(1, 10):
                book_a_page = book_a_pages[i]
                my_pdf_pages.add_page(book_a_page)
                book_b_page = book_b_pages[i]
                my_pdf_pages.add_page(book_b_page)
            my_pdf_pages.write(file)

A lot of the code here is similar to the previous example. The only difference is that instead of the add_blank_page() method, we are using the add_page() method to add a Page object to our document. We iterate over pages with index 1 to 9 (the second through tenth pages, since indexing starts at 0) and add them to our PdfWriter object called my_pdf_pages one at a time, alternating between the two books. Once all the pages have been added, we write them to our file called excerpts.pdf.

A few months back, I downloaded a book that I wanted to read. However, it could only be downloaded one chapter at a time, and I wanted to merge them all into a single document. I did it with a third-party service back then, but we can do it just as easily using a few lines of code.

Instead of reading a file one page at a time and then appending that page to our document, we can append the whole file at once using the append_pages_from_reader() method. This method also accepts an optional second parameter, which is a callback function that will be called after each page is appended.

from PyPDF2 import PdfReader, PdfWriter

my_pdf_doc = PdfWriter()

for i in range(101, 107):
    chapter_name = 'lemh' + str(i) + '.pdf'

    # PdfReader also accepts a file name and reads the file itself
    chapter_reader = PdfReader(chapter_name)
    my_pdf_doc.append_pages_from_reader(chapter_reader)

# Write the merged book once, after all chapters have been appended
with open('book.pdf', 'wb+') as file:
    my_pdf_doc.write(file)

Slicing, Insertion, and Concatenation of PDF Documents

There is another class called PdfMerger in the PyPDF2 library that you can use to create a PDF document in Python. This class offers more advanced functionality compared to the PdfWriter class. There are two important functions that we will cover here: append() and merge().

Let's begin with append(). In the previous section, we used the append_pages_from_reader() function from the PdfWriter class to append the chapters in our book one after the other. The advantage of using append() is that it offers you more options and flexibility.

from PyPDF2 import PdfMerger

my_pdf_doc = PdfMerger()

with open('book.pdf', 'wb+') as file:
    for i in range(101, 107):
        chapter_name = 'lemh' + str(i) + '.pdf'
        my_pdf_doc.append(chapter_name)
    my_pdf_doc.write(file)

As you can see, this code is much shorter than what I wrote above to accomplish the same task. The important difference is that we did not have to instantiate a PdfReader object in order to append the chapters. The append() method from the PdfMerger class just needs a file name or a file object.

The append() method accepts four different parameters. The first one is the file name as we saw above.

The second parameter is a string that identifies a bookmark to be applied at the beginning of the included file. We could use it to add the chapter count as a bookmark in our generated document.

The third parameter allows you to add a specific set of pages to the book instead of the whole chapter. It can be a (start, stop[, step]) tuple that gives the zero-based start index, the stop index (which is not included), and an optional step between consecutive pages.

from PyPDF2 import PdfMerger

my_pdf_doc = PdfMerger()

with open('bookmarked.pdf', 'wb+') as file:
    for i in range(101, 107):
        chapter_name = 'lemh' + str(i) + '.pdf'
        outline_name = 'Chapter ' + str(i - 100)
        my_pdf_doc.append(chapter_name, outline_name, (0, 10))
    my_pdf_doc.write(file)

When I executed the above code, it created a PDF document that had bookmarks for each chapter. It also had only the first ten pages from each chapter.
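To see which pages a given tuple selects, it helps to think of it as the arguments to Python's built-in range(). The sketch below is plain Python and does not call PyPDF2 at all; it assumes the tuple is interpreted the same way as range() arguments, and the selected_pages() helper is a hypothetical name of my own.

```python
def selected_pages(page_range):
    """Return the zero-based page indices a (start, stop[, step]) tuple covers."""
    return list(range(*page_range))


print(selected_pages((0, 10)))     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(selected_pages((0, 10, 2)))  # [0, 2, 4, 6, 8]
```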

Let's say you have a bunch of books, but they don't have an index or preface at the beginning. The author gives you the index as a separate PDF document. How do you prepend it to the beginning of the books? The append() method won't be of much help here, especially if you also want to add some content somewhere in the middle of the book. Luckily, a similar method called merge() would be handy here.

my_pdf_doc.merge(0, 'lemh1ps.pdf')
my_pdf_doc.write(file)

The first argument of merge() is the page position at which the file should be inserted. The first line above therefore adds the index document at the beginning of our PdfMerger object, while the second line writes all the merged data back to our PDF file.

Adding Bookmarks to a PDF Document

You might be required to add bookmarks for some specific pages to a PDF document for easy access. One handy method that you can use to add bookmarks is add_outline_item(). This method is available in both the PdfWriter class and the PdfMerger class. Two required parameters for this method specify the title and the page number for the bookmark. The title has to be a string, and the page number has to be an integer. Page numbers are zero-based, so a value of 0 refers to the first page.

You can also specify a parent outline item as the third parameter in order to create nested bookmark items. The next three parameters determine the font color, weight, and style of the bookmark. Here is an example that uses the first two parameters to create a bookmark for the summary of Chapter 1.

my_pdf_doc.add_outline_item("Chapter 1 (Summary)", 52)

Final Thoughts

In this tutorial, we learned how to create a PDF document in Python and how to add content to the document by appending individual pages or a group of pages. We also learned how to add content at particular locations in our PDF document using the PdfMerger class from the PyPDF2 library.