Tesseract ignores numbers/bullets


An Keilha

Aug 14, 2020, 6:25:30 AM
to tesseract-ocr
Hello,
I am using Tesseract 4.1.1 via the command line (input and output files are attached):

tesseract  DE000029711094U1-8.tif DE000029711094U1-8_tif-deu-best-bullets-missing -l deu --psm 3 hocr

The traineddata from https://github.com/tesseract-ocr/tessdata_best is used.

The problem is that the numbers on the left (the bullets) are missing from the result (see the attached PageViewer screenshot).

If I change the page segmentation mode from the default to "--psm 12" (sparse text), the numbers are there, but the page segmentation is poor (because it is not actually sparse text). Moreover, I cannot use "--psm 12" in general, because some of the pages I run OCR on have layouts that can only be handled properly by "--psm 3".
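
To see exactly which words the default segmentation drops, the two hOCR outputs can be compared word by word. The following is only a rough sketch, not part of my actual setup: the helper names (`build_command`, `run_psm`, `hocr_words`, `missing_words`) are made up for illustration, and it assumes that `tesseract` is on the PATH, that the output is written to `<outputbase>.hocr`, and that each word is a non-nested `ocrx_word` span.

```python
import html
import re
import subprocess
from collections import Counter


def build_command(image, out_base, psm, lang="deu"):
    """Build the same tesseract call as above, with a configurable --psm."""
    return ["tesseract", image, out_base, "-l", lang, "--psm", str(psm), "hocr"]


def run_psm(image, out_base, psm):
    """Run tesseract (assumed to be on PATH) and return the hOCR text."""
    subprocess.run(build_command(image, out_base, psm), check=True)
    with open(out_base + ".hocr", encoding="utf-8") as fh:
        return fh.read()


# Assumption: words are emitted as <span class='ocrx_word' ...>text</span>
# and do not contain nested <span> elements.
WORD_RE = re.compile(r"<span class=['\"]ocrx_word['\"][^>]*>(.*?)</span>", re.S)
TAG_RE = re.compile(r"<[^>]+>")


def hocr_words(hocr_text):
    """Return the plain text of every ocrx_word span, in document order."""
    words = []
    for inner in WORD_RE.findall(hocr_text):
        # Drop inline markup such as <strong>/<em> and decode entities.
        text = html.unescape(TAG_RE.sub("", inner)).strip()
        if text:
            words.append(text)
    return words


def missing_words(reference_words, candidate_words):
    """Words that occur more often in candidate_words than in reference_words."""
    extra = Counter(candidate_words) - Counter(reference_words)
    return sorted(extra.elements())


# Example usage (requires tesseract and the input image, so not run here):
# default = hocr_words(run_psm("page.tif", "page-psm3", 3))
# sparse = hocr_words(run_psm("page.tif", "page-psm12", 12))
# print(missing_words(default, sparse))
```

The comparison is by word text only, so it ignores position on the page; it is meant to confirm that the dropped words are the left-hand numbers, not to merge the two results.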

My configs/hocr file looks like the following:

tessedit_create_hocr 1
hocr_font_info 1

I have also tried setting parameters like:

tessedit_zero_rejection 1
tessedit_zero_kelvin_rejection 1

None of this improved the recognition of the numbers on the left.
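
For anyone trying to reproduce this, parameters like the ones above can also be passed on the command line with `-c name=value` instead of editing the config file. The sketch below only builds such a command; the function name `build_command_with_params` is hypothetical, and it assumes (as far as I understand the CLI) that the `-c` options have to come before the config file name at the end.

```python
def build_command_with_params(image, out_base, params,
                              configs=("hocr",), lang="deu", psm=3):
    """Build a tesseract call that sets each parameter via -c name=value."""
    cmd = ["tesseract", image, out_base, "-l", lang, "--psm", str(psm)]
    for name, value in params.items():
        cmd += ["-c", f"{name}={value}"]
    # Config file names go last, after all options (assumption, see above).
    cmd += list(configs)
    return cmd


# The two rejection parameters tried above, as a dict.
rejection_params = {
    "tessedit_zero_rejection": 1,
    "tessedit_zero_kelvin_rejection": 1,
}

# Print the resulting command line for inspection; nothing is executed.
print(" ".join(build_command_with_params(
    "DE000029711094U1-8.tif", "out", rejection_params)))
```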

What should I try next? Are there any parameters I should try? Is it possible to train osd?

Regards
Anne

PS: I had to zip my files because Google won't let me upload them otherwise. :-)
test-case.zip

An Keilha

Aug 14, 2020, 11:21:23 AM
to tesseract-ocr
My last question, "Is it possible to train osd?", does not really make much sense. What I meant to write was: "Is it possible to train or change the page segmentation so that it behaves differently from the built-in psm modes of Tesseract?"