Tesseract recognition issues.

Ulan Bator

May 14, 2024, 12:52:15 AM

to tesseract-ocr

I am having problems with the trailing dots (dot leaders) in a table of contents.

The attached picture is recognized, but all the dots are interpreted as random characters.

For example, where the picture contains this text:

Cochemiea (Teil 1) .......................................... 3

I get the following text after OCR:

Cochemiea (Teil 1) .....::: 2222 see essen eennseenneeneeener nen

As you can see, from the dots onward the rest of the line is wrong; even the final number 3 is missing.

Does anyone have an idea how to fix this?

I use the following command:

tesseract --dpi 300 -l deu --oem 1 Kakt_Sukk-1986-1_02.jpg Kakt_Sukk-1986-1_02 txt

on Debian 12 (Linux).

Kakt_Sukk-1986-1_02_ocr.pdf

Kakt_Sukk-1986-1_02.jpg

Yaofu Zhou

May 21, 2024, 1:10:41 AM

to tesseract-ocr

This will be a project in its own right, but one way to achieve your goal is to fine-tune the model on a custom training set:
1. Procedurally generate a set of images (a few thousand would be a good start) with similar content and varying numbers of dots, along with the corresponding text files that hold the ground truth for each image.
2. Fine-tune the specific Tesseract OCR model you are using (deu in your case) on the generated training set. Tesseract's GitHub has a tool, "Tesstrain", that can help with the training process.
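
As a rough starting point for step 1, here is a minimal Python sketch that generates only the ground-truth text side: one synthetic table-of-contents line per file, with a random number of dots. Everything in it is an assumption for illustration: the sample titles, the `toc_NNNNN` file names, the dot and page ranges, and the function names `make_toc_line` and `write_ground_truth` are all hypothetical. Rendering each line to an image in a font close to the scanned one is a separate step and is not shown.

```python
import random
from pathlib import Path

# Hypothetical sample titles; in practice, take them from real tables of contents.
TITLES = [
    "Cochemiea (Teil 1)",
    "Mammillaria (Teil 2)",
    "Echinocereus",
    "Aus der Literatur",
]


def make_toc_line(rng: random.Random, min_dots: int = 10, max_dots: int = 60) -> str:
    """Return one synthetic TOC line: title, dot leader, page number."""
    title = rng.choice(TITLES)
    dots = "." * rng.randint(min_dots, max_dots)  # vary the leader length
    page = rng.randint(1, 400)                    # assumed page range
    return f"{title} {dots} {page}"


def write_ground_truth(out_dir: Path, count: int, seed: int = 0) -> list[Path]:
    """Write `count` one-line .gt.txt files and return their paths."""
    rng = random.Random(seed)  # fixed seed so the set is reproducible
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = out_dir / f"toc_{i:05d}.gt.txt"
        path.write_text(make_toc_line(rng) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    written = write_ground_truth(Path("toc-ground-truth"), count=5)
    for p in written:
        print(p.name, "->", p.read_text(encoding="utf-8").strip())
```

To my knowledge, Tesstrain pairs each line image with a transcription file of the same base name ending in `.gt.txt`, so every file written here would still need a matching image (for example `toc_00000.png`) before training; please check the Tesstrain README for the exact layout it expects.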
You should be able to achieve most of the project with the help of GPT or Claude.
Sorry if this is not the solution you were looking for.
