"Tesseract --psm Impact on Scanned Books with Two Facing Pages"

Mahmoud Mohamed

Feb 20, 2025, 11:38:34 PM
to tesseract-ocr

Understanding Tesseract OCR and --psm: Why Removing It Can Improve Accuracy for Scanned Books

Introduction

Tesseract OCR is a powerful tool for extracting text from images, but selecting the right parameters is critical for accuracy. One commonly misunderstood parameter is --psm (Page Segmentation Mode). In this guide, we'll discuss a real-world issue where --psm caused incorrect text extraction for scanned books and how removing it led to better results.

The Problem: Mixed Text from Two Facing Pages

Many scanned books and documents contain two facing pages in a single image. When such an image is processed with an explicit --psm value, Tesseract can misinterpret the structure and jumble the extracted text. Instead of reading one page at a time, it mixes text from both pages, extracting:

Line 1 from right page + Line 1 from left page
Line 2 from right page + Line 2 from left page
...

This happens because some --psm values force Tesseract to assume a specific layout (for example, a single block of text), which can conflict with the actual structure of the scanned document.

The Solution: Removing --psm

With --psm removed, Tesseract processed the right page in order first, then moved to the left page. This produced a natural reading order and a significantly better OCR result:

Line 1 from right page
Line 2 from right page
...
(Line 1 from left page follows after the right page is complete)

This confirms that in some cases, manually setting --psm can do more harm than good.
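The two orderings described above can be illustrated with a toy model. This is only a sketch of the symptom, not of how Tesseract's layout analysis works internally; the page contents and the function names read_row_wise and read_block_wise are made up for illustration.

```python
# Toy model of a two-page spread: each page is a list of text lines.
# Hypothetical data; this does not call Tesseract.
right_page = ["R1", "R2", "R3"]
left_page = ["L1", "L2", "L3"]


def read_row_wise(first, second):
    """Treat the spread as one block: each output line spans both pages."""
    return [f"{a} {b}" for a, b in zip(first, second)]


def read_block_wise(first, second):
    """Treat each page as its own block: finish one page, then the next."""
    return list(first) + list(second)


mixed = read_row_wise(right_page, left_page)
ordered = read_block_wise(right_page, left_page)
print(mixed)    # ['R1 L1', 'R2 L2', 'R3 L3']
print(ordered)  # ['R1', 'R2', 'R3', 'L1', 'L2', 'L3']
```

If your OCR output looks like the first list, the spread was probably read as a single block.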

When to Avoid --psm

  • When processing scanned books or documents with two facing pages.
  • When text is misaligned or mixed in the OCR output.
  • When dealing with complex layouts where Tesseract's automatic handling works better.

When to Use --psm

There are cases where --psm is still useful, such as:

  • A single uniform block of text (--psm 6) or a single column of variable-size text (--psm 4)
  • Sparse text (--psm 11)
  • Images containing only a single word (--psm 8)

Recommended OCR Settings

For scanned books or multi-column text, a safer approach is:

pytesseract.image_to_string(image, config='--oem 1 -c preserve_interword_spaces=1')

This leaves layout analysis to Tesseract while selecting the LSTM engine (--oem 1) and preserving the spacing between words. Specify the language(s) as needed (e.g., -l eng, -l ara, or -l ara+eng; in pytesseract these can also be passed through the lang argument).
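Since results depend on the layout, it can help to run the same image through several modes and compare the output. The following is a minimal sketch, assuming pytesseract and Pillow are installed; build_config, compare_psm_modes, the ocr parameter, and the file name spread.png are hypothetical names used only for illustration.

```python
def build_config(psm=None, base="--oem 1 -c preserve_interword_spaces=1"):
    """Return a Tesseract config string, adding --psm only when one is given."""
    return base if psm is None else f"{base} --psm {psm}"


def compare_psm_modes(image, modes=(None, 4, 6), lang="eng", ocr=None):
    """Run OCR once per mode and return a dict of {mode: text}.

    A mode of None means the --psm flag is omitted entirely.
    `ocr` is any callable that accepts (image, lang=..., config=...);
    it defaults to pytesseract.image_to_string, imported lazily so the
    helper can be exercised without Tesseract installed.
    """
    if ocr is None:
        import pytesseract  # needs the Tesseract binary on PATH
        ocr = pytesseract.image_to_string
    return {m: ocr(image, lang=lang, config=build_config(m)) for m in modes}


# Example usage (hypothetical file name):
# from PIL import Image
# results = compare_psm_modes(Image.open("spread.png"), lang="ara+eng")
# for mode, text in results.items():
#     print(f"--- psm={mode} ---")
#     print(text[:200])
```

Reading the first few hundred characters of each result is usually enough to see whether lines from the two pages are being interleaved.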

Conclusion

This discovery highlights why experimentation is key when working with OCR. If your text output appears mixed or out of order, try removing --psm and letting Tesseract handle the layout automatically. Hopefully, this guide helps others facing similar issues!

Have you encountered other OCR challenges? Share your experience in the comments or discussion forums!

Tom Morris

Mar 17, 2025, 5:31:37 PM
to tesseract-ocr
On Thursday, February 20, 2025 at 11:38:34 PM UTC-5 dooha...@gmail.com wrote:

Understanding Tesseract OCR and --psm: Why Removing It Can Improve Accuracy for Scanned Books

The recommendations to understand the operation of the various page segmentation modes and to experiment with alternatives are good ones, but "removing" the --psm switch is equivalent to using --psm 3, which is the default.

You can get an overview of all the different page segmentation modes by using --help-psm:

$ tesseract --help-psm
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
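To make the default explicit, here is a small sketch that reports which mode a given config string ends up using. The helper name effective_psm is hypothetical, and it assumes the flag is written as --psm N with a space, as in the examples in this thread; if the flag appears more than once, it simply returns the first one it finds.

```python
import shlex

DEFAULT_PSM = 3  # "Fully automatic page segmentation, but no OSD. (Default)"


def effective_psm(config: str) -> int:
    """Return the page segmentation mode implied by a config string.

    Falls back to the default when no --psm flag is present.
    """
    tokens = shlex.split(config)
    for i, token in enumerate(tokens[:-1]):
        if token == "--psm":
            return int(tokens[i + 1])
    return DEFAULT_PSM


print(effective_psm("--oem 1 -c preserve_interword_spaces=1"))  # 3
print(effective_psm("--oem 1 --psm 6"))                          # 6
```

So the config recommended in the original post behaves the same as passing --psm 3 explicitly.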



