How to Read PDF files in Python?


PDF is one of the most widely used file formats for sharing documents digitally, so being able to read a PDF file from Python is a useful skill.

Python is a high-level language that can automate almost any task. Reading a “txt” file is easy because Python has built-in functions for it. The problem is that those built-in functions do not understand the PDF file format.
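For comparison, here is a minimal sketch of reading a plain text file with the built-in open() function. The file name notes.txt and the helper read_text_file are made up for this illustration; the sample file is created in a temporary folder so the snippet runs on its own.

```python
import os
import tempfile


def read_text_file(path):
    """Return the whole content of a text file as a string."""
    # The built-in open() handles plain text directly
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Create a small sample file so the example is self-contained
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "notes.txt")  # hypothetical file name
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write("Hello from a text file")

print(read_text_file(sample_path))
```

Applying the same approach to a PDF would only give raw bytes, which is why a dedicated library is needed.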

Fortunately, there are several third-party Python libraries for working with PDF files:

  • PyPDF2
  • PDFMiner
  • Tabula-py
  • Slate

In this article, we will use PyPDF2.

Before proceeding, please note that this article uses “PyPDF2” only for text extraction; it is not designed to recover tables or charts as structured data. Other third-party libraries cover those cases, for example ‘Tabula-py’ for tables and ‘pytesseract’ for OCR (recognizing text inside images), but here our main focus is reading text as a string.

So let’s start.

Reading PDF File using PyPDF2

Start by installing PyPDF2 from the command line using the following command.

pip install PyPDF2

The module name is case-sensitive: only the letter ‘y’ is lowercase and every other letter is uppercase. Once the installation has succeeded, writing import PyPDF2 will not raise an error.
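One way to confirm the installation without triggering an ImportError is to ask Python whether it can find the module. This is only a sketch; is_installed is a helper name invented here, and it relies on the standard importlib.util.find_spec function.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once "pip install PyPDF2" has succeeded in this environment
print("PyPDF2 installed:", is_installed("PyPDF2"))
```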

Now let’s try reading the file “Btech_job.pdf”, which has the following 2 pages.

[Figure: pages 1 and 2 of the PDF file]


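# import the module
import PyPDF2

# Note: PdfFileReader, numPages, getPage() and extractText() are the older
# PyPDF2 names. PyPDF2 3.0 removed them in favor of PdfReader,
# len(reader.pages), reader.pages[i] and extract_text().

# open the PDF in 'read binary' mode; the with block closes it afterwards
with open('Btech_job.pdf', 'rb') as file:

    # create a pdf reader object
    pdfReader = PyPDF2.PdfFileReader(file)

    # get the number of pages in pdf file
    pages = pdfReader.numPages

    for i in range(pages):

        # extract the page
        page = pdfReader.getPage(i)

        # print the page text content along with the page index (starts at 0)
        print("Page no:", i)
        print(page.extractText())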

Output:

[Figure: screenshot of the program output]

Explanation:

First, we start by importing the PyPDF2 module. After this, we open the “Btech_job.pdf” in ‘read binary’ (rb) mode and store its reference in file.
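Binary mode matters because a PDF is not plain text. As a small illustration, a PDF file normally begins with the bytes %PDF-, and that can only be checked reliably when the file is read as bytes. The sketch below assumes a made-up helper called looks_like_pdf and two tiny sample files created just for the demo.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature bytes."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demo with two tiny made-up files in a temporary folder
folder = tempfile.mkdtemp()
fake_pdf = os.path.join(folder, "sample.pdf")
plain_txt = os.path.join(folder, "sample.txt")
with open(fake_pdf, "wb") as handle:
    handle.write(b"%PDF-1.4\n")  # header only, not a complete PDF
with open(plain_txt, "wb") as handle:
    handle.write(b"just text")

print(looks_like_pdf(fake_pdf), looks_like_pdf(plain_txt))
```

A check like this only looks at the first bytes; it does not prove the rest of the file is a valid PDF.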

Now, we create a PdfFileReader object using the expression PyPDF2.PdfFileReader(file) and store it in pdfReader.

The pdfReader object has an attribute named numPages that stores the number of pages in the PDF document.

In our example, the document has a total of 2 pages.

The getPage(i) method of the pdfReader object returns the page at index ‘i’, so we use it to fetch the page object into the variable page.

Now, we extract the text content of the page using the extractText() method and output it.
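The names used in this article come from older PyPDF2 releases. In PyPDF2 3.0 and later the same steps are written with PdfReader, reader.pages and extract_text(). Below is a minimal sketch under that assumption; extract_pages and read_pdf are helper names invented here, not part of the library.

```python
def extract_pages(reader):
    """Return a list of (page_number, text) pairs, numbering pages from 1."""
    results = []
    for index, page in enumerate(reader.pages):
        # Fall back to an empty string if a page yields no text
        text = page.extract_text() or ""
        results.append((index + 1, text))
    return results


def read_pdf(path):
    """Open a PDF file and return its text page by page."""
    # Imported here so extract_pages() can be reused without PyPDF2 installed.
    # Assumption: a PyPDF2 release that provides PdfReader is installed.
    from PyPDF2 import PdfReader

    with open(path, "rb") as file:
        return extract_pages(PdfReader(file))
```

With such a release installed, calling read_pdf('Btech_job.pdf') would return one (page number, text) pair per page, which can then be printed in a loop as before.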

If we need to process only particular pages of the given PDF, we repeat the above process with the corresponding page index.

Note: The numbering of PDF pages (i.e., the index) starts from 0. So page 1 is at index 0, page 2 is at index 1, and so on.
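Because of this offset, it can help to convert a human-readable page number into an index before asking the reader for a page. A small sketch, where page_number_to_index is a hypothetical helper and total_pages matches the 2-page example document:

```python
def page_number_to_index(page_number, total_pages):
    """Convert a 1-based page number to the 0-based index, checking bounds."""
    if not 1 <= page_number <= total_pages:
        raise ValueError(f"page {page_number} is outside 1..{total_pages}")
    return page_number - 1


total_pages = 2  # the example document in this article has 2 pages
for number in (1, 2):
    print("page", number, "-> index", page_number_to_index(number, total_pages))
```

The returned index is the value to pass when fetching a page, and the bounds check avoids asking for a page that does not exist.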

I hope that from now on you can easily read a PDF file in Python. If you have any questions or suggestions, leave a comment below.
