Pdf2image Python Example


Chara Dagres

Aug 5, 2024, 9:27:40 AM

to actahovil

Many tools are available online for converting a PDF to an image. In this article, we are going to write Python code that converts a PDF to images and turn it into a handy application. Before writing the code, we need to install the required pdf2image module and Poppler.

I was working on a Python project recently (for a DEV hackathon, of course) and I needed a library to extract the pages from a huge PDF document as images. The primary use case was to extract the content from the PDF document and doing this in a blocking way was expensive. So I was looking for options to split the document's pages into multiple images which would let me concurrently do the content extraction.

Even though the solution is pretty simple, you need to set up a utility called Poppler before you start writing your Python code. Poppler is the underlying C++ based PDF rendering library that pdf2image uses behind the scenes to render the PDF document.

If the PDF file you want to convert already sits in a directory on disk, then the job is much easier. You can directly import the convert_from_path function from the pdf2image package to extract the pages as images.
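A minimal sketch of this call, assuming pdf2image and Poppler are already installed. The helper name extract_pages is only illustrative; the file and folder names are the ones used in this post.

```python
def extract_pages(pdf_path, output_folder="extracted_images"):
    """Render every page of the PDF at pdf_path as a JPEG in output_folder."""
    # Imported inside the function so this file still loads on machines
    # where pdf2image (and Poppler) are not installed yet.
    from pdf2image import convert_from_path

    # fmt="jpeg" asks for JPEG output; output_folder is where files are written.
    return convert_from_path(pdf_path, output_folder=output_folder, fmt="jpeg")


# Usage (needs a real PDF in the current directory):
# pages = extract_pages("top_secret_document.pdf")
```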

The above code picks the PDF document named top_secret_document.pdf from the current path, extracts every page in the document as a JPEG image, and stores the images in the output folder named extracted_images.

The function expects the output folder to exist already; if it does not, an exception is raised. To avoid this, you can check for the existence of the folder and create it if it is not present.
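One way to do that check with the standard library is sketched below. The helper name ensure_folder is an assumption made for this example, not part of pdf2image.

```python
import os


def ensure_folder(path):
    """Create path (and any missing parents) if it does not exist yet."""
    # exist_ok=True makes the call a no-op when the folder is already there,
    # so no separate os.path.exists() check is needed.
    os.makedirs(path, exist_ok=True)
    return path
```

Calling ensure_folder("extracted_images") before the conversion is safe to repeat on every run.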

The good thing about this function is that it assigns a UUID to the output images and appends a page counter to the file name to keep the file names unique. If you want a common prefix instead of a UUID, you can use the output_file parameter accepted by the function.
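A small sketch of passing a prefix through output_file. The helpers conversion_options and extract_pages_with_prefix are hypothetical names for this example; only convert_from_path and its output_folder, fmt and output_file parameters come from pdf2image.

```python
def conversion_options(output_folder, prefix=None):
    """Build the keyword arguments passed on to convert_from_path."""
    options = {"output_folder": output_folder, "fmt": "jpeg"}
    if prefix is not None:
        # output_file replaces the generated UUID with a fixed prefix;
        # the page counter is still appended by pdf2image.
        options["output_file"] = prefix
    return options


def extract_pages_with_prefix(pdf_path, output_folder, prefix):
    """Extract all pages, naming the files after prefix instead of a UUID."""
    # Lazy import: requires pdf2image and Poppler at call time only.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, **conversion_options(output_folder, prefix))
```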

The users are asked to upload the document, and the uploaded file is submitted to a Go-based microservice. This service does some validations and other processing behind the scenes. Once all the processing is done, it stores the document in a MinIO bucket and publishes an event to a Kafka topic.

The Python service acts as a consumer: once the event is received, it fetches the PDF document from the bucket and starts the page extraction process. This means that the document is not available as a file on the file system; it is only available as a byte stream.

Given that we already have the convert_from_path function to abstract all this boilerplate, I highly doubt anyone is ever gonna use the convert_from_bytes function for working with a local PDF file. Still, I just wanted to show that it's possible.
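A rough sketch of both cases: the bytes may come from a bucket download or, as here, from a local file. The helper names are illustrative assumptions; convert_from_bytes is the pdf2image function mentioned above.

```python
from pathlib import Path


def read_pdf_bytes(pdf_path):
    """Load a local PDF fully into memory, mimicking a downloaded byte stream."""
    return Path(pdf_path).read_bytes()


def extract_pages_from_bytes(pdf_bytes, output_folder):
    """Render every page of an in-memory PDF as a JPEG in output_folder."""
    # Lazy import: requires pdf2image and Poppler at call time only.
    from pdf2image import convert_from_bytes

    return convert_from_bytes(pdf_bytes, output_folder=output_folder, fmt="jpeg")


# Usage with a local file (the bucket download would replace read_pdf_bytes):
# pages = extract_pages_from_bytes(
#     read_pdf_bytes("top_secret_document.pdf"), "extracted_images"
# )
```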

That's all there is to it. The pdf2image library is not limited to light PDF documents; it can work with bulk documents too. Both the convert_from_path and convert_from_bytes functions accept a variety of arguments that can be used to control how the PDF document is rendered and how the images are extracted.
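As one example of those arguments, dpi, first_page and last_page can be combined to work through a large document a few pages at a time. The sketch below assumes the page count is already known (total_pages is supplied by the caller), and the helper names and the batch size of 50 are arbitrary choices for illustration.

```python
def page_ranges(total_pages, batch_size):
    """Yield 1-based, inclusive (first_page, last_page) pairs covering the document."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def extract_in_batches(pdf_path, output_folder, total_pages, batch_size=50, dpi=150):
    """Convert a large PDF one batch of pages at a time."""
    # Lazy import: requires pdf2image and Poppler at call time only.
    from pdf2image import convert_from_path

    for first, last in page_ranges(total_pages, batch_size):
        # Only pages first..last are rendered in this call; the images are
        # written to output_folder and the returned list is discarded.
        convert_from_path(
            pdf_path,
            dpi=dpi,
            first_page=first,
            last_page=last,
            output_folder=output_folder,
            fmt="jpeg",
            output_file=f"pages_{first}_{last}",
        )
```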

One of the most typical Machine Learning tasks is reading structured text from images with OCR. We experimented with 5 sample invoices, trying to read the data using a few Python libraries. Let's see how they work.

It is as simple as that. We take every invoice from the invoices folder and save it as images in the invoices_images folder. This uses glob to get all PDF files and then pdf2image to convert them.
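A sketch of what such a loop could look like, assuming pdf2image and Poppler are installed. The function names and the file naming scheme (invoice name plus page number) are assumptions made for this example.

```python
import glob
import os


def image_path(pdf_path, page_number, output_folder="invoices_images"):
    """Build the target file name for one page of one invoice."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(output_folder, f"{stem}_page{page_number}.jpg")


def convert_invoices(input_folder="invoices", output_folder="invoices_images"):
    """Convert every PDF in input_folder to one JPEG per page."""
    # Lazy import: requires pdf2image and Poppler at call time only.
    from pdf2image import convert_from_path

    os.makedirs(output_folder, exist_ok=True)
    saved = []
    for pdf_path in sorted(glob.glob(os.path.join(input_folder, "*.pdf"))):
        # convert_from_path returns one PIL image per page.
        for number, page in enumerate(convert_from_path(pdf_path), start=1):
            target = image_path(pdf_path, number, output_folder)
            page.save(target, "JPEG")
            saved.append(target)
    return saved
```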

We have finally reached the OCR part. For this demo, we will use the Pytesseract library, a wrapper for Google's Tesseract-OCR Engine. It's a popular library, and it's easy to use, so let's get started with the installation:
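The Python wrapper is published on PyPI as pytesseract, and it also needs the Tesseract engine itself installed on the system. Once both are in place, reading the text from one of the converted images can be sketched roughly like this; the helper name read_text is an assumption for the example.

```python
def read_text(image_path):
    """Return the text Tesseract recognises in the image at image_path."""
    # Lazy imports: Pillow, pytesseract and the Tesseract binary are only
    # required when the function is actually called.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)
```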

Some things are out of place. For example, as marked in the image, our address is not in the order we expected: the From and To address information is mixed together, which causes some issues. We must remember this when working with OCR: it sometimes reads text in a different order than we do. Here are a few reasons for it:

As you can see, it is very similar to the PDFToText example, but the difference is that we read this from an image, and we do have a few mistakes. These mistakes should be solved by Regex adjustments, but we will leave that out here.

And while this is far slower than PDFToText, that is expected. OCR is much more complex, and it must read the image entirely. For example, our PDF-to-text never found this, but OCR managed to read it:

This can be more valuable than speed, depending on your use case. While we parsed invoices, you can use OCR for something else. You might even want to parse semi-obstructed photos, where OCR shines. So, it all depends on your use case.

This, of course, can be our fault because we didn't fine-tune the parameters precisely. But that process was a bit complex, and even after a few hours we could not get a better result than the defaults plus the box merging algorithm from StackOverflow.

We suspect this is because it is the most powerful at reading hard-to-read text. For example, this is one of the libraries that can deal with captchas, which is challenging. So, if you need to read hard-to-read text, this might be your library.

Also, the majority of the work in reading text from images is not about using the OCR library, but rather about writing the correct regular expressions, or parsing the text with other algorithms, to get structured data as a result. We will cover various techniques for that in future lessons.
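To give a flavour of that parsing step, here is a small sketch that pulls two fields out of OCR text with regular expressions. The field labels ("Invoice No", "Total") and the patterns are assumptions about what an invoice might contain, not the layout of the sample invoices used here.

```python
import re

# "Number" is listed before "No" so the longer label wins when both could match.
INVOICE_NUMBER = re.compile(
    r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)
# \b keeps "Subtotal" from being mistaken for "Total".
TOTAL = re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE)


def parse_invoice(text):
    """Extract a few structured fields from raw OCR text; missing ones are None."""
    number = INVOICE_NUMBER.search(text)
    total = TOTAL.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "total": float(total.group(1).replace(",", "")) if total else None,
    }
```

Real OCR output usually needs looser patterns than these, because characters such as 0 and O or 1 and l are easily confused.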

Using OCR to detect and localize text is simple in MATLAB. However, it only works if your input is in an image format (jpg, png), not PDF. Hence, we are going to convert the PDF to an image. Up to MATLAB version R2019a, however, there is no built-in function to convert a PDF to an image. For this example, I am going to use the Python package pdf2image to do the conversion. There is no conflict in using MATLAB and Python together: if something works better in Python, we can combine both platforms (MATLAB and Python) through the MATLAB API to complete our objective.

Foxit PDF SDK provides high-performance libraries to help any software developer add robust PDF functionality to their enterprise, mobile and cloud applications across all platforms (includes Windows, Mac, Linux, Web, Android, iOS, and UWP), using the most popular development languages and environments.

Foxit PDF SDK for Python API ships with simple-to-use APIs that can help Python developers seamlessly integrate powerful PDF technology into their own projects on Windows, Linux and Mac platforms. It provides rich features on PDF documents, such as PDF viewing, bookmark navigating, text selecting/copying/searching, PDF signatures, PDF forms, rights management, PDF annotations, and full text search.

Foxit PDF SDK allows users to download a trial version to evaluate the SDK. The trial version is identical to the standard version except for the 10-day trial period and the trial watermarks that are generated on the PDF pages. After the evaluation period expires, customers should contact the Foxit sales team and purchase licenses to continue using Foxit PDF SDK.

Developers should purchase licenses to use Foxit PDF SDK in their solutions. Licenses grant users permission to release their applications based on the PDF SDK libraries. However, users are prohibited from distributing any documents, sample code, or source code in the SDK release package to any third party without permission from Foxit Software Incorporated.

For Python2.7, if _fsdk.pyd in the corresponding directory matches the current system python version, you can use python to run the examples directly. For the correspondence between _fsdk.pyd and python version, please see About package directory structure. For Python3, if you have installed FoxitPDFSDKPython3 module, you can use python to run the examples directly.

In this section, we will show you how to use Foxit PDF SDK for Windows Python to create a simple project that renders the first page of a PDF to a bitmap and saves it as a JPG image. Please follow the steps below:

For Python2.7, if _fsdk.so in the corresponding directory matches the current system python version, you can use python to run the examples directly. For the correspondence between _fsdk.so and python version, please see About package directory structure. For Python3, if you have installed FoxitPDFSDKPython3 module, you can use python to run the examples directly.

In this section, we will show you how to use Foxit PDF SDK for Linux Python to create a simple project that renders the first page of a PDF to a bitmap and saves it as a JPG image. Please follow the steps below:

In this section, we will show you how to use Foxit PDF SDK for Mac x64 (Python) to create a simple project that renders the first page of a PDF to a bitmap and saves it as a JPG image. Please follow the steps below:

In this section, we will show you how to use Foxit PDF SDK for Mac arm64 (Python) to create a simple project that renders the first page of a PDF to a bitmap and saves it as a JPG image. Please follow the steps below:

In this section, we will introduce a set of major features and list some examples for each feature to show you how to integrate powerful PDF capabilities with your applications using Foxit PDF SDK Python API.

Applications must initialize Foxit PDF SDK before calling any APIs. The function Library.Initialize is provided for this. A license should be purchased for the application, and the unlock key and code should be passed in to get proper support. When Foxit PDF SDK is no longer needed, call the function Library.Release to release it.

A PDF document object can be constructed from an existing PDF file by file path, from a memory buffer, from a custom implemented ReaderCallback object, or from an input file stream. Then call the function PDFDoc.Load or PDFDoc.StartLoad to load the document content. A PDF document object is used for document-level operations, such as opening and closing files, getting pages, reading metadata, and so on.