
Extract Text From PDF File Using Python

This Python tutorial shows how to extract text from a PDF file. We'll use PyPDF2, a widely used module for reading and manipulating PDF files in Python, and its PdfFileReader class to extract information from PDF files.


Extract Data from PDF File

Let's install PyPDF2 and extract data from a PDF file using Python 3.

Install PyPDF2 in Python

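To use the PyPDF2 library in Python, we first need to install it. The code in this tutorial uses the class and method names from PyPDF2 releases before 3.0 (PdfFileReader, getPage() and extractText()), so run the command below to install a matching version.

pip install "PyPDF2<3.0"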
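
Because class and method names differ between PyPDF2 releases, it can help to confirm which version pip actually installed. The sketch below is one way to do that with the standard library only. The helper name installed_version is hypothetical and not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package="PyPDF2"):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        return None


# Prints a version string such as "2.12.1", or None when PyPDF2 is absent.
print(installed_version())
```

If this prints None, the install step did not run in the same environment as the Python interpreter being used.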

Let's read and extract text from the PDF file

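import PyPDF2

# open the PDF file in binary read mode
pdfFileObj = open('test.pdf', 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# creating a page object for the first page (page numbers start at 0)
pageObj = pdfReader.getPage(0)

# extract the text from the page and print it
print(pageObj.extractText())

# close the pdf file object
pdfFileObj.close()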

Output:

A Simple PDF File This is a small demonstration .pdf file....
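
The listing above relies on names that PyPDF2 3.0 and later no longer accept. As a hedged sketch of the same steps on a newer release, the version below uses PdfReader, the pages list and extract_text() instead, plus a with block so the file is closed automatically. The helper name first_page_text is an assumption made for illustration.

```python
def first_page_text(path):
    """Return the text of the first page of the PDF at `path`."""
    # Imported inside the function so the helper can be defined even when
    # PyPDF2 is not installed; PdfReader is the newer name for PdfFileReader.
    from PyPDF2 import PdfReader

    # The with block closes the file even if extraction raises an error.
    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # reader.pages[0] replaces getPage(0);
        # extract_text() replaces extractText().
        return reader.pages[0].extract_text()
```

Calling first_page_text('test.pdf') returns the first page's text as a string, which can then be printed as in the original listing. The exact spacing of the extracted text may differ between releases.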

In the above code, we did the following, line by line:

Step 1: At the top of the file, we imported the PyPDF2 module.

Step 2: We opened the PDF file using the open() function. This creates a file object for the PDF. The second argument, 'rb', opens the file in binary read mode. The relative path assumes test.pdf is stored in the directory the program is run from, which is usually the same directory as the main program.

Step 3: The PdfFileReader class reads the PDF data from that file object. Its constructor also accepts a few optional arguments.
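
To illustrate why the file is opened in 'rb' mode: a PDF is binary data, and reading it in binary mode returns raw bytes. The sketch below uses only the standard library to check that a file begins with the %PDF- marker that PDF files typically start with. The helper name looks_like_pdf is hypothetical, and the check is a quick sanity test rather than full validation.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header bytes."""
    # 'rb' returns bytes, so the comparison uses a bytes literal.
    with open(path, "rb") as pdf_file:
        return pdf_file.read(5) == b"%PDF-"
```

Running a check like this before passing the file to PyPDF2 can give a clearer error when the wrong file is supplied.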

We have now read the PDF file and can call some of the reader's methods to get the data:

Step 4: The getPage() method returns a page object. It takes the page number (starting from index 0) as an argument.

Step 5: The extractText() method extracts the text from the page object.

Step 6: Finally, we closed the PDF file object.
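
The steps above read a single page. As a rough sketch of how they might extend to a whole document, the helper below collects the text of every page in order. It assumes a PyPDF2 release that provides PdfReader, pages and extract_text(), which are the newer names for PdfFileReader, getPage() and extractText(). The name all_pages_text is hypothetical.

```python
def all_pages_text(path):
    """Return a list with the extracted text of each page, in page order."""
    # Imported here so the helper can be defined without PyPDF2 installed.
    from PyPDF2 import PdfReader

    # The with block takes the place of an explicit close() call.
    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # Index 0 of the result is the first page, matching getPage(0).
        # Pages without extractable text yield an empty string.
        return [page.extract_text() or "" for page in reader.pages]
```

Joining the returned list with "\n".join(...) gives the text of the whole document as one string.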

Conclusion:

We installed the PyPDF2 module and used the PdfFileReader class to read a PDF file. We opened the file in rb mode, then used the getPage() and extractText() methods to extract text from it.
