Optimal numbers for the ground truth

Филип Пешић

Aug 18, 2020, 9:31:53 AM

to tesseract-ocr

Hi,

I want to train Tesseract with tesstrain, using .tif and .gt.txt pairs. However, the source images are 231 DPI scans of old books from the 1800s, which I assume is pretty low based on what I have read on many forums. There is also a huge amount of text on the scanned images: roughly 90% of both sides is text, and pictures are really rare. I tried a lot of methods to improve the quality, including an ImageMagick script and some projects from GitHub, with little to no improvement. ImageMagick's resample seems to have the most impact. I tried 300, 400, 600, 800 and 1000 DPI; the "sweet spot" was 800 based on the results, since there is a regression at 1000, and below 800 I saw some errors along the lines of "line could not be read". I used Tesseract's hOCR output, then hocr-tools to generate the segmentation pairs.
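To put the resample values in perspective, here is a minimal sketch of the arithmetic, assuming resampling scales the pixel dimensions by target DPI divided by source DPI. The page size of 1848 x 2772 pixels (8 x 12 inches at 231 DPI) and the function name `resampled_size` are made up for illustration; they are not from my actual scans.

```python
def resampled_size(width_px, height_px, src_dpi, dst_dpi):
    """Return the pixel size after resampling from src_dpi to dst_dpi.

    Assumes the physical page size stays fixed, so each pixel dimension
    scales by dst_dpi / src_dpi.
    """
    new_width = round(width_px * dst_dpi / src_dpi)
    new_height = round(height_px * dst_dpi / src_dpi)
    return new_width, new_height


# Hypothetical page: 8 x 12 inches scanned at 231 DPI.
SRC_DPI = 231
PAGE_PX = (1848, 2772)

# The target resolutions tried in this post.
for target in (300, 400, 600, 800, 1000):
    w, h = resampled_size(*PAGE_PX, SRC_DPI, target)
    print(f"{target} DPI: {w} x {h} px, scale {target / SRC_DPI:.2f}x")
```

Going from 231 to 800 DPI is roughly a 3.5x upscale in each direction, so no new detail is added; only the pixel size of the glyphs changes.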

So here's my dilemma:

-What DPI range is best for Tesseract? Is 800 too much?

-If I upscale the initial image from which I make the hOCR and then segment it, should I also upscale all the images that I will later use my trained model on?

-Does the ground truth need to be in some order? When I create ground truth for one segmented file, the names run, for example, from 00001 to 00999, and for another they have one zero fewer, like 0001 to 0999. If I then just put them all into the same folder, is that okay?
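For the mixed-folder case, this is a small sketch of a sanity check that could be run before training, assuming the only requirement is that every .tif has a .gt.txt with the same base name. It does not say anything about whether tesstrain cares about order. The function name `check_ground_truth` is made up for illustration.

```python
from pathlib import Path


def check_ground_truth(folder):
    """Return (images without text, texts without image) as sorted base names.

    Base names are compared as plain strings, so "00001" and "0001"
    are treated as two different lines and do not collide.
    """
    folder = Path(folder)
    # Strip the suffixes by hand because ".gt.txt" is a double extension.
    images = {p.name[: -len(".tif")] for p in folder.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in folder.glob("*.gt.txt")}
    missing_text = sorted(images - texts)
    missing_image = sorted(texts - images)
    return missing_text, missing_image
```

Calling `check_ground_truth("my-ground-truth")` (a hypothetical folder name) should return two empty lists when every pair is complete.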

Hopefully that makes sense and my English is not too bad. Apologies if I sound confusing; it's kind of hard to explain. I'll add any additional info if I missed something.