Does it make sense to train existing languages? How to fix repeatedly wrong letters?


JP T

Apr 2, 2018, 8:39:16 AM
to tesseract-ocr
Hi

I don't really understand the consequences of training.

My problem:
I have a large number of pages in a special format (a "one-place study" about the historic inhabitants of a town).

Tesseract repeatedly fails on a few special tokens:
"oo" (two lowercase letters o) at the start of a line, meaning "wedding", is often read as "00" (zero zero)
Roman numerals II and III in Arial are read as "ll" (two lowercase L) or "Ill" (uppercase I plus two lowercase L)
"*/~" (born around) is read as a percent sign "%"
"~" is read as "-"

My scans are of almost perfect quality (I used Fred's scripts), so there is nothing more I can do on that side.
Adding "oo" to the user words did not help.

Can I use training to solve these, or should I instead write a script that fixes the mistakes after OCR?
The problem is that the OCR would need to know some semantics: the Arial letter shapes themselves hardly give a hint as to which reading is correct.

thanks


ShreeDevi Kumar

Apr 3, 2018, 1:34:50 AM
to tesser...@googlegroups.com
My suggestion would be to do post-processing of the OCR output.
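
As a rough illustration of what such a post-processing step could look like, here is a minimal Python sketch built only from the errors described in the question. The function name fix_ocr and the rule list are hypothetical, and the rules rest on assumptions about the pages: that a line never legitimately starts with "00", that a standalone "ll" or "Ill" is always a Roman numeral, and that a literal "%" never occurs. The "~" read as "-" case is deliberately left out, since hyphens are common and fixing it would need more context (for example, the position of a date).

```python
import re

# Hypothetical rules derived from the errors listed in this thread.
# Each entry is (compiled pattern, replacement); adapt them to the real pages.
RULES = [
    # "00" at the start of a line, followed by whitespace -> "oo" (wedding marker).
    (re.compile(r"^00(?=\s)", re.MULTILINE), "oo"),
    # A standalone token of 2-3 characters made only of "I" and "l"
    # -> Roman numeral II or III (assumes no real word looks like this).
    (re.compile(r"\b[Il]{2,3}\b"), lambda m: "I" * len(m.group())),
    # "%" -> "*/~" (assumes a literal percent sign never occurs in the source).
    (re.compile(r"%"), "*/~"),
]


def fix_ocr(text: str) -> str:
    """Apply every correction rule, in order, to the OCR output."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

It could be run over the text file that tesseract writes for each page, e.g. fix_ocr(open("page.txt", encoding="utf-8").read()). Keeping the rules in one list makes it easy to add or tighten a rule when a new recurring error shows up.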
