How to Extract Text From PDF with Python 3

The correct way to extract text from PDF files in Python 3 on macOS, Windows, and Linux (updated 2020)

In this tutorial, we are going to examine the most popular libraries for extracting data from PDF with Python. PDF is great for reading but we may need to extract some details for further processing.

I tested numerous packages, each with its own strengths and weaknesses. The packages most people use for processing PDFs and extracting their text are textract, Apache Tika, pdfplumber, pdftotext, PyMuPDF, and PyPDF2.

Note: PyPDF2 is not maintained at the time of writing, so it gets no walkthrough here; it appears only in the comparison at the end.

Let's go through these libraries one by one.

pdfplumber

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer.six.

Currently tested on Python 3.6, 3.7, and 3.8, and works on macOS, Windows, and Linux.

pip install pdfminer.six

Install pdfplumber

pip install pdfplumber

Basic usage

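import pdfplumber

# Open the PDF and inspect the first page
with pdfplumber.open("pdffile.pdf") as pdf:
    page = pdf.pages[0]
    # page.chars is a list of dicts, one per character (text, font, position)
    first_char = page.chars[0]
    print(first_char)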
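
The snippet above inspects a single character. For plain text extraction, pdfplumber pages also provide extract_text(). The following is a minimal sketch, not part of the original tutorial: the helper names join_page_texts and extract_text_pdfplumber are made up for illustration, and the None handling assumes that some pdfplumber versions return None for a page with no text.

```python
def join_page_texts(page_texts):
    """Join per-page strings, treating None (no text found) as empty."""
    return "\n\n".join(text or "" for text in page_texts)


def extract_text_pdfplumber(path):
    """Return the text of every page of the PDF at `path` as one string."""
    # Imported here so join_page_texts stays usable without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return join_page_texts(page.extract_text() for page in pdf.pages)
```

Calling extract_text_pdfplumber("pdffile.pdf") should return the whole document, with pages separated by a blank line.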

To start working with a PDF, call pdfplumber.open(x), where x can be a:

path to your PDF file
file object, loaded as bytes
file-like object, loaded as bytes

The open method returns an instance of the pdfplumber.PDF class.
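
Because open accepts file-like objects, a PDF that is already in memory (for example, downloaded bytes) does not have to be written to disk first. A small sketch under that assumption; to_file_like and first_page_text are hypothetical helper names, not pdfplumber functions.

```python
import io


def to_file_like(pdf_bytes):
    """Wrap raw PDF bytes in a seekable, binary file-like object."""
    return io.BytesIO(pdf_bytes)


def first_page_text(source):
    """Return the first page's text; `source` is a path or a binary file-like object."""
    # Imported here so to_file_like stays usable without pdfplumber.
    import pdfplumber

    with pdfplumber.open(source) as pdf:
        return pdf.pages[0].extract_text() or ""
```

For example, first_page_text(to_file_like(data)) and first_page_text("pdffile.pdf") go through the same code path.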

Tika

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Install tika

Installing the Python library is simple enough, but it will not work without Java, so make sure Java is installed first.

pip install tika
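
Since Tika needs Java at run time, it can help to check for it before the first call. This is only a rough sketch: java_available is a hypothetical helper, and it only checks that a java executable is on the PATH, not which version it is.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None
```

A script could call java_available() up front and print a clear message instead of failing later inside Tika.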

tika basic usage

import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('sample.pdf')
print(parsed["metadata"])
print(parsed["content"])
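
Tika's output tends to contain many blank lines, so a cleanup step is often useful. The sketch below is one possible approach, not the author's: collapse_blank_lines and extract_text_tika are hypothetical names, and the None handling assumes that "content" can be empty when Tika finds no text.

```python
import re


def collapse_blank_lines(text):
    """Strip trailing spaces and squeeze runs of blank lines down to one."""
    lines = [line.rstrip() for line in (text or "").splitlines()]
    squeezed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return squeezed.strip()


def extract_text_tika(path):
    """Parse `path` with Tika and return its cleaned-up text."""
    # Imported here so collapse_blank_lines works without tika or Java.
    from tika import parser

    parsed = parser.from_file(path)
    return collapse_blank_lines(parsed.get("content"))
```

Calling extract_text_tika("sample.pdf") then returns text with at most one blank line between paragraphs.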

pdftotext

This is a very simple and easy-to-use PDF text extraction library. However, it depends on Poppler, so the installation steps vary by operating system.

OS Dependencies

Debian, Ubuntu, and friends

sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev

Fedora, Red Hat, and friends

sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python3-devel

macOS

brew install pkg-config poppler python

Windows

Currently tested only when using conda:

Install the Microsoft Visual C++ Build Tools
Install poppler through conda:
```
conda install -c conda-forge poppler
```

Install pdftotext

pip install pdftotext

pdftotext basic usage

import pdftotext

# Load PDF file
with open("pdffile.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("secure_pdffile.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# Iterate over all the pages
for page in pdf:
    # text content in pdf page
    print(page)

# Read all the text into one string
print("\n\n".join(pdf))
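
Since the PDF object can be iterated like a sequence of page strings, it is easy to keep track of which page each piece of text came from. A minimal sketch; number_pages and extract_pages_pdftotext are hypothetical helper names.

```python
def number_pages(pages):
    """Map 1-based page numbers to page text for any sequence of strings."""
    return {number: text for number, text in enumerate(pages, start=1)}


def extract_pages_pdftotext(path, password=None):
    """Return {page_number: text} for the PDF at `path`."""
    # Imported here so number_pages stays usable without pdftotext.
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f, password) if password else pdftotext.PDF(f)
    return number_pages(pdf)
```

For instance, extract_pages_pdftotext("pdffile.pdf")[1] would be the text of the first page.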

PyMuPDF

With PyMuPDF you can access not only PDF but also files with extensions like “.xps”, “.oxps”, “.cbz”, “.fb2” or “.epub”. In addition, about 10 popular image formats can also be opened and handled like documents.

Install PyMuPDF

Wheels for Windows, Linux, and macOS are available on PyPI, covering 64-bit Python 3.6 through 3.9. For Windows only, 32-bit wheels are available too.

pip install PyMuPDF

PyMuPDF basic usage

import fitz  # this is pymupdf

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()  # named getText() in older PyMuPDF releases

print(text)
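
PyMuPDF renamed its camelCase methods to snake_case in later releases, so the page method is getText() or get_text() depending on the installed version. The sketch below tries to work with both; page_text and extract_text_pymupdf are hypothetical helper names, not part of PyMuPDF.

```python
def page_text(page):
    """Call get_text() when available, else fall back to the older getText()."""
    reader = getattr(page, "get_text", None) or getattr(page, "getText")
    return reader()


def extract_text_pymupdf(path):
    """Return the text of every page of the document at `path`."""
    # Imported here so page_text stays usable without PyMuPDF.
    import fitz  # this is PyMuPDF

    with fitz.open(path) as doc:
        return "".join(page_text(page) for page in doc)
```

Calling extract_text_pymupdf("my.pdf") gives the same result as the loop above, without depending on the method name.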

Conclusion

The textract library was not evaluated because it uses the same algorithm as pdftotext: textract is a wrapper around Poppler's pdftotext (https://pypi.org/project/textract/).

The observations below depend on the PDF file, how it was encoded, and the variety of non-text elements it contains, such as images and tables.

Main features found:
PyMuPDF | Good conversion, even for tables. The output contains no blank lines, which simplifies post-processing. Conversion is very fast.
pdftotext | Excellent extraction quality, but it keeps the original layout, so two-column text is extracted side by side and phrases from different columns run together. For my purpose (information retrieval) it won't do.
Tika-Python | Good conversion with URL recognition and full extraction. The output includes blank lines, which adds a cleanup step. Processing takes longer than with PyMuPDF, but not enough to prevent its use. It also has the disadvantage of not being native: the .jar file is downloaded on the first call to the library, and a Java server is started to serve the requests.
PyPDF2 | Many line breaks that did not appear with the other converters. In 3 of the test files, the extraction was unacceptable because spaces between words were missing entirely.
Summary:
For this experiment, the choice falls on PyMuPDF or Tika-Python. pdftotext is a great library, but it preserves the layout of the original text, which is inappropriate in some situations.
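
The comparison above depends heavily on the input files, so it is worth repeating on your own PDFs. The following is a rough timing sketch, not the procedure used for the results above; time_extractor is a hypothetical helper that accepts any function taking a path and returning text.

```python
import time


def time_extractor(extract, path, repeats=3):
    """Call extract(path) `repeats` times; return (best_seconds, last_text)."""
    best = float("inf")
    text = ""
    for _ in range(repeats):
        start = time.perf_counter()
        text = extract(path)
        best = min(best, time.perf_counter() - start)
    return best, text
```

Pass in one wrapper function per library to compare both the speed and the extracted text on the same file.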

Last modified November 4, 2020