Ray Lutz
May 13, 2023, 2:06:03 PM
to tesseract-ocr
Hello Friends:
We just spent days isolating a problem to the "best" training data; on our image, the "fast" data works much better. This is counterintuitive, so I think others may benefit from this work.
The image we were converting is attached. We use the tsv mode because it provides the most information about each word, including the bounding box and location on the page.
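For readers who have not used the tsv mode, here is a minimal sketch of pulling word-level rows out of it in Python. It assumes the standard 12-column TSV layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), where level 5 marks a word. The function name and the sample rows are made up for illustration; they are not from our actual run.

```python
import csv
import io

# Made-up sample in Tesseract's TSV layout (not real output from our image).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t800\t600\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t100\t50\t120\t30\t96\tSHERIFF\n"
    "5\t1\t1\t1\t2\t1\t100\t90\t100\t30\t95\tMAYOR\n"
)

def parse_tsv_words(tsv_text):
    """Return one dict per recognized word (rows with level == 5)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        if row["level"] != "5":
            continue  # skip page, block, paragraph, and line rows
        text = (row["text"] or "").strip()
        if not text:
            continue  # skip empty word entries
        words.append({
            "text": text,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
            "conf": float(row["conf"]),  # conf may be an int or a float string
        })
    return words

words = parse_tsv_words(SAMPLE_TSV)
for w in words:
    print(w["text"], w["left"], w["top"], w["width"], w["height"], w["conf"])
```

Each dict carries the bounding box (left, top, width, height) in image pixels, which is what lets us locate a word on the page.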
This is an election data processing application.
See a clip of the output from Tesseract using the "fast" training data. It shows the words "SHERIFF" and "MAYOR".
Then, changing only the training data to "best", those words are missing and are replaced with crud. We found identical results whether we ran on Windows or on a Linux box (latest Ubuntu).
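Anyone who wants to try a similar comparison could do it roughly as sketched below. This is not the exact script we used. It assumes the tesseract command is on the PATH and that the "fast" and "best" files sit in two separate tessdata directories (any directory paths you pass in are your own; none are implied here). The function names and the sample rows are hypothetical.

```python
import subprocess

def run_tsv(image_path, tessdata_dir, lang="eng"):
    """Run the tesseract CLI and return its TSV output as a string."""
    cmd = ["tesseract", image_path, "stdout",
           "--tessdata-dir", tessdata_dir, "-l", lang, "tsv"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def words_in(tsv_text):
    """Collect the distinct non-empty word texts (level 5 rows) from TSV output."""
    found = set()
    for line in tsv_text.splitlines()[1:]:  # skip the header row
        fields = line.split("\t")
        if len(fields) == 12 and fields[0] == "5" and fields[11].strip():
            found.add(fields[11].strip())
    return found

def lost_words(fast_tsv, best_tsv):
    """Words present in the 'fast' output but absent from the 'best' output."""
    return sorted(words_in(fast_tsv) - words_in(best_tsv))

# Made-up rows for illustration only (not our real output).
HEADER = ("level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
          "left\ttop\twidth\theight\tconf\ttext")
fast_sample = (HEADER + "\n"
               "5\t1\t1\t1\t1\t1\t100\t50\t120\t30\t96\tSHERIFF\n"
               "5\t1\t1\t1\t2\t1\t100\t90\t100\t30\t95\tMAYOR\n")
best_sample = HEADER + "\n5\t1\t1\t1\t1\t1\t100\t50\t120\t30\t12\t|||\n"

print(lost_words(fast_sample, best_sample))
```

In practice you would call run_tsv twice on the same image, once per tessdata directory, and pass the two results to lost_words. The comparison is set-based, so it ignores repeated words and word positions.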
I hope this is useful for development and to guide others away from the "best" training data. The fast data is only about 4 MB, whereas the best data is 22 MB.
My suggestion is to withdraw the "best" data.
--Ray
failed result with best training data.png
good result with fast training data.png