"best" training data fails but "fast" data works using v5.3.1


Ray Lutz

May 13, 2023, 2:06:03 PM
to tesseract-ocr
Hello Friends:

We just spent days isolating a problem to the "best" training data; the "fast" data works much better on the same image. This is counterintuitive, so I think others may benefit from this work.

The image we were converting is attached. We use the tsv mode because it provides the most information about each word, including the bounding box and location on the page.
This is an election data processing application.
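
For anyone who wants to reproduce this, here is a minimal sketch of how the TSV output can be collected and reduced to word rows. It assumes the `tesseract` executable is on the PATH and that the TSV has the usual twelve columns; the function names and any image or tessdata paths are hypothetical placeholders, not our actual application code.

```python
import csv
import io
import subprocess

# Column layout of Tesseract's TSV output (assumed to be the standard 12 columns).
TSV_COLUMNS = [
    "level", "page_num", "block_num", "par_num", "line_num", "word_num",
    "left", "top", "width", "height", "conf", "text",
]


def run_tesseract_tsv(image_path, tessdata_dir, lang="eng"):
    """Run the tesseract CLI and return its TSV output as a string.

    image_path and tessdata_dir are placeholders supplied by the caller.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout",
         "--tessdata-dir", tessdata_dir, "-l", lang, "tsv"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_words(tsv_text):
    """Return word-level rows (level 5) with bounding box and confidence."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are individual words; skip page/block/line rows and blanks.
        if row.get("level") != "5" or not text:
            continue
        words.append({
            "text": text,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
            "conf": float(row["conf"]),
        })
    return words
```

Calling `parse_words(run_tesseract_tsv(image, tessdata_dir))` once per tessdata directory gives two word lists that can be compared directly.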

See the attached clip of the output from Tesseract using the "fast" training data. It shows the words "SHERIFF" and "MAYOR".

Then, changing only the training data to "best", those words are missing, replaced with crud. We found identical results whether we ran on a Windows or a Linux box (latest Ubuntu).
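
A comparison like this can be automated. The sketch below is only an illustration, with a hypothetical helper name: given the word lists from two runs, it reports the words present in the reference run (here, "fast") that do not appear in the other run (here, "best"). It matches on exact text, ignoring case, so it will not pair up near misses.

```python
from collections import Counter


def words_lost(reference_words, candidate_words):
    """Return words found in reference_words but missing from candidate_words.

    Both arguments are lists of recognized word strings. Counts are respected,
    so a word that appears twice in the reference and once in the candidate
    is reported once.
    """
    ref = Counter(w.upper() for w in reference_words)
    cand = Counter(w.upper() for w in candidate_words)
    # Counter subtraction keeps only positive differences.
    return sorted((ref - cand).elements())
```

Passing the "fast" word list as `reference_words` and the "best" word list as `candidate_words` would list the words that were lost when switching models.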

I hope this is useful for development and to guide others away from the "best" training data. The fast data is only about 4MB whereas the best data is 22MB. 

My suggestion is to withdraw the "best" data.

--Ray
failed result with best training data.png
160726-ev_cvt_10_bot.png
good result with fast training data.png