Hello,
I am using Tesseract 4.1.1 via the command line (input and output files are attached):
tesseract DE000029711094U1-8.tif DE000029711094U1-8_tif-deu-best-bullets-missing -l deu --psm 3 hocr
I am using the traineddata from https://github.com/tesseract-ocr/tessdata_best.
The problem with the result is that the numbers on the left (the bullets) are missing from the output (see the attached PageViewer screenshot).
If I change the page segmentation mode from the default to "--psm 12" (sparse text), the numbers are there, but the page segmentation is poor, because the page is not actually sparse text. Moreover, I cannot use "--psm 12" in general, because some of the pages I run OCR on have layouts that can only be handled properly by "--psm 3".
My configs/hocr file looks like this:
tessedit_create_hocr 1
hocr_font_info 1
I have also tried setting parameters like:
tessedit_zero_rejection 1
tessedit_zero_kelvin_rejection 1
None of this improved the recognition of the numbers on the left.
What should I try next? Are there any other parameters I should try? Is it possible to train OSD?
Regards
Anne
PS: I had to zip my files because Google won't let me upload them otherwise. :-)