"best" training data fails but "fast" data works using v5.3.1

Ray Lutz

May 13, 2023, 2:06:03 PM

to tesseract-ocr

Hello Friends:

We just spent several days isolating a problem to the "best" training data; the "fast" data works much better on the same image. This is counterintuitive, so I think others may benefit from this work.

The image we were converting is attached. We use the tsv mode because it provides the most information about each word, including the bounding box and location on the page.
This is an election data processing application.
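
For readers who want to work with the TSV output themselves, here is a minimal sketch of pulling the words, confidences, and bounding boxes out of it. It assumes the standard 12-column TSV header that Tesseract writes; the function name `parse_tsv_words` and the sample rows are made up for illustration and are not taken from the attached image.

```python
import csv
import io


def parse_tsv_words(tsv_text):
    """Return one dict per recognized word (level-5 rows with non-empty text)."""
    # QUOTE_NONE keeps stray quote characters in OCR text from confusing the parser.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 is the word level; lower levels describe page, block, paragraph, line.
        if row.get("level") != "5" or not text:
            continue
        words.append(
            {
                "text": text,
                "conf": float(row["conf"]),
                # Bounding box as (left, top, width, height) in pixels.
                "box": (
                    int(row["left"]),
                    int(row["top"]),
                    int(row["width"]),
                    int(row["height"]),
                ),
            }
        )
    return words


# Hypothetical sample rows, for illustration only.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t800\t600\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t40\t50\t120\t24\t96.5\tSHERIFF\n"
    "5\t1\t1\t1\t2\t1\t40\t90\t100\t24\t95.1\tMAYOR\n"
)

if __name__ == "__main__":
    for word in parse_tsv_words(SAMPLE_TSV):
        print(word["text"], word["conf"], word["box"])
```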

The attached clip of Tesseract's output using the "fast" training data shows the words "SHERIFF" and "MAYOR".

Then, changing only the training data to "best", those words are missing, replaced with garbage. We found identical results on both Windows and Linux (latest Ubuntu).
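
A comparison like this can be scripted. The sketch below is only an illustration under stated assumptions: it builds the command line for a TSV run against a chosen tessdata directory (using Tesseract's `--tessdata-dir` and `-l` options and the `tsv` config), and lists the words one run found that the other lost. The helper names, directory paths, and word lists are hypothetical, and the commands are only printed here, not executed.

```python
from collections import Counter


def build_tsv_command(image_path, tessdata_dir, lang="eng"):
    """Command line for a TSV run written to stdout with a given tessdata directory."""
    return [
        "tesseract", image_path, "stdout",
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
        "tsv",
    ]


def words_lost(reference_words, candidate_words):
    """Words present in the reference run but missing (or rarer) in the candidate run."""
    # Counter subtraction keeps only positive counts, so repeated words are handled.
    lost = Counter(reference_words) - Counter(candidate_words)
    return sorted(lost.elements())


if __name__ == "__main__":
    # Hypothetical paths; point these at wherever the fast and best data are stored.
    print(build_tsv_command("160726-ev_cvt_10_bot.png", "tessdata_fast"))
    print(build_tsv_command("160726-ev_cvt_10_bot.png", "tessdata_best"))

    # Made-up word lists standing in for the text column of each run.
    fast_words = ["SHERIFF", "MAYOR", "VOTE"]
    best_words = ["VOTE", "5HEr"]
    print(words_lost(fast_words, best_words))
```

Each command's output could be captured with `subprocess.run(..., capture_output=True, text=True)` and the text column of the two runs passed to `words_lost`.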

I hope this is useful for development and guides others away from the "best" training data. The "fast" data is only about 4 MB, whereas the "best" data is 22 MB.

My suggestion is to withdraw the "best" data.

--Ray

failed result with best training data.png

160726-ev_cvt_10_bot.png

good result with fast training data.png
