Tesseract recognition issues.


Ulan Bator

May 14, 2024, 12:52:15 AM
to tesseract-ocr
I have a problem with the trailing dots (dot leaders) in a table of contents.
The attached picture is recognized, but all the dots are interpreted as random characters:

Example: the picture contains this text:
Cochemiea (Teil 1) .......................................... 3

I get the following text after OCR:
Cochemiea (Teil 1) .....::: 2222 see essen eennseenneeneeener nen

As one can see, starting from the dots the rest of the line is wrong; even the final number 3 is missing.

Does anyone have an idea how to fix this?

I use the following command:
tesseract --dpi 300 -l deu --oem 1 Kakt_Sukk-1986-1_02.jpg Kakt_Sukk-1986-1_02 txt

on Linux Debian 12
Kakt_Sukk-1986-1_02_ocr.pdf
Kakt_Sukk-1986-1_02.jpg

Yaofu Zhou

May 21, 2024, 1:10:41 AM
to tesseract-ocr
It is going to be a project for you, but one way to achieve your goal is to fine-tune the model using a custom training set:
1. Procedurally generate a set of images (a few thousand would be a good start) with similar content and varying numbers of dots, together with the corresponding text files that hold the ground truth for each image.
2. Fine-tune the specific Tesseract OCR model you are using (deu in your case) with the generated training set. Tesseract's GitHub has a tool, "Tesstrain", that can help with the training process.
You should be able to achieve most of the project with the help of GPT or Claude.
Sorry if this is not the solution you were looking for.
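
For step 1, the text side of the training set could be generated along these lines. This is only a minimal sketch, not a tested recipe: the function names, the file-name pattern, and the dot-count range are assumptions picked for illustration. It relies on the tesstrain convention of pairing each line image with a `.gt.txt` file of the same base name. It writes the ground-truth text only; rendering each line to a matching image, in a font close to the scan, is a separate step not shown here.

```python
import random
from pathlib import Path


def make_toc_line(title, page, rng, min_dots=5, max_dots=60):
    """Build one table-of-contents line: title, a run of dots, page number."""
    dots = "." * rng.randint(min_dots, max_dots)
    return f"{title} {dots} {page}"


def write_ground_truth(out_dir, titles, count, seed=0):
    """Write `count` ground-truth files (one line each) into `out_dir`.

    The toc_NNNNN.gt.txt naming is a hypothetical choice; each file is
    meant to sit next to an image with the same base name.
    """
    rng = random.Random(seed)  # fixed seed so the set is reproducible
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        title = rng.choice(titles)
        page = rng.randint(1, 400)  # arbitrary page-number range
        line = make_toc_line(title, page, rng)
        path = out / f"toc_{i:05d}.gt.txt"
        path.write_text(line + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_ground_truth("toc-ground-truth", ["Cochemiea (Teil 1)"], 2000)` would produce 2000 label files; in practice the title list should contain many different entries in the style of the real table of contents, so that the model does not simply memorize one line.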