PDF is one of the most widely used file formats for sharing documents digitally, so being able to read a PDF file with Python is a useful skill.

Python is a high-level language that is well suited to automating tasks. Reading a “txt” file in Python is easy because the language has built-in functions for it. The problem is that these built-in functions do not understand the PDF file format.
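To illustrate the built-in route for plain text, here is a minimal sketch that writes a small text file and reads it back with open(). The file name notes.txt and the helper read_text_file are made up for this example.

```python
import os
import tempfile


def read_text_file(path):
    """Return the whole content of a plain-text file as one string."""
    # The built-in open() decodes the bytes of a text file for us.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Create a throwaway text file so the example does not depend on existing files.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "notes.txt")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    f.write("Hello from a text file")

content = read_text_file(path)
print(content)
```

The same call on a PDF would typically fail or return unreadable data, because a PDF stores its content in a binary structure rather than as plain characters.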

But don’t worry, there are several third-party Python libraries for working with PDF files:

  • PyPDF2
  • PDFMiner
  • Tabula-py
  • Slate

In this article, we will use PyPDF2.

Before proceeding, please note that we use “PyPDF2” here only to extract text, not images, charts, tables or other media from PDF documents. Other third-party libraries cover those cases, such as ‘Tabula-py’ for tables and ‘pyTesseract’ for recognising text inside images (OCR), but here our main focus is reading text in the form of a string.

So let’s start.

Reading PDF File using PyPDF2

Start by installing PyPDF2 from the command line with pip, using the command pip install PyPDF2.

The module name is case sensitive, so make sure only the letter ‘y’ is in lower case and every other letter is in upper case. Once the installation has succeeded, writing import PyPDF2 will not throw an error.
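One way to confirm the installation is to ask Python whether it can find the module under that exact name. This is a small sketch using the standard library's importlib. The helper name is_installed is made up here.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this exact name."""
    return importlib.util.find_spec(module_name) is not None


# The name must be spelled exactly as the module is named, including case.
print("PyPDF2 available:", is_installed("PyPDF2"))
```

If this prints False, the import will fail too, so check the spelling of the name or rerun the installation.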

Now let’s try reading the file “Btech_job.pdf”, which has 2 pages.
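The original listing is not reproduced here, so what follows is a minimal sketch consistent with the explanation further down, not necessarily the author's exact code. It assumes an older PyPDF2 release (before 3.0) that provides PdfFileReader, numPages, getPage() and extractText(), and a file named Btech_job.pdf in the working directory. The function names are made up for this sketch. PyPDF2 is imported inside read_pdf only so that the looping helper can be tried without the library; a top-level import PyPDF2 works the same way.

```python
def extract_all_pages(pdfReader):
    """Return a list with the text of every page, using the legacy PyPDF2 names."""
    texts = []
    # numPages holds the page count; page indexes start at 0.
    for i in range(pdfReader.numPages):
        page = pdfReader.getPage(i)
        texts.append(page.extractText())
    return texts


def read_pdf(path):
    """Open a PDF in read binary mode and return the text of each page."""
    import PyPDF2  # assumes a PyPDF2 release older than 3.0

    with open(path, "rb") as file:
        pdfReader = PyPDF2.PdfFileReader(file)
        return extract_all_pages(pdfReader)
```

Calling read_pdf("Btech_job.pdf") and printing each item of the returned list would print the text of the two pages in order.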

Explanation

First, import the PyPDF2 module. Then open “Btech_job.pdf” in read binary (rb) mode and store the file object in file. Next, create a PdfFileReader object by calling PyPDF2.PdfFileReader(file) and store it in pdfReader.

pdfReader has an attribute named numPages, which stores the total number of pages in the PDF document. In our example the document has 2 pages.

The getPage(i) method of pdfReader returns the page at index ‘i’, so call it and store the single page object in the variable page. Then extract the text content of the page using extractText() and print it. Repeat these steps for each page, using its index.

Note: The numbering of PDF pages (i.e. the index) starts from 0. So page 1 is the page at index 0, page 2 is at index 1 and so on.
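The names used above were deprecated in later PyPDF2 releases and, as far as we know, no longer work from PyPDF2 3.0 onwards, where PdfReader, the pages list and extract_text() take their place. The sketch below assumes such a newer release. It also applies the index rule from the note: page number n is stored at index n - 1. The helper names page_text and read_page are hypothetical.

```python
def page_text(reader, page_number):
    """Return the text of a page given its human page number (1, 2, 3, ...)."""
    if not 1 <= page_number <= len(reader.pages):
        raise ValueError("page_number is outside the document")
    # Page 1 is stored at index 0, page 2 at index 1, and so on.
    return reader.pages[page_number - 1].extract_text()


def read_page(path, page_number):
    """Open a PDF with the newer PyPDF2 names and return one page's text."""
    from PyPDF2 import PdfReader  # assumes a release that provides PdfReader

    with open(path, "rb") as file:
        return page_text(PdfReader(file), page_number)
```

With this in place, read_page("Btech_job.pdf", 2) would return the text of the second page, which is the page at index 1.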

I hope that from now on you can easily read a PDF file in Python. If you have any doubts or suggestions, comment below.
