Inconsistencies (sometimes) on similar characters. Is there a map for characters that are very similar?

Deborah

Jun 16, 2024, 2:41:27 AM
to tesseract-ocr
Hello, I am using Tesseract to extract some data from screenshots.
I've noticed that it sometimes confuses similar-looking characters, such as '0' and 'O', 'P' and 'R', or '-' and '—', in either direction. This happens within the same font, and sometimes even after preprocessing such as binarization.
Is there a comprehensive map of similar-looking characters that are commonly misrecognised as one another?
I need such a map to compute a Levenshtein string distance with a reduced substitution cost for characters that are very similar. Thanks.
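
To make the idea concrete, here is a minimal sketch of a Levenshtein distance with a cheaper substitution for look-alike characters. It does not come from Tesseract. The `CONFUSABLE` set holds only the three pairs mentioned above, and the `similar_cost` value of 0.2 is an arbitrary assumption that would need tuning on real OCR output.

```python
# Hypothetical confusion pairs, taken only from the examples in this thread.
# "\u2014" is the em dash; extend the set with pairs observed in your own data.
CONFUSABLE = {frozenset("0O"), frozenset("PR"), frozenset("-\u2014")}


def substitution_cost(a, b, similar_cost=0.2):
    """Return 0 for equal characters, a reduced cost for look-alikes, else 1."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return similar_cost  # assumed value, not derived from measurements
    return 1.0


def weighted_levenshtein(s, t, similar_cost=0.2):
    """Levenshtein distance where look-alike substitutions cost less than 1."""
    # prev[j] is the distance between the processed prefix of s and t[:j].
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,      # delete a from s
                curr[j - 1] + 1.0,  # insert b into s
                prev[j - 1] + substitution_cost(a, b, similar_cost),
            ))
        prev = curr
    return prev[-1]


print(weighted_levenshtein("R0AD", "ROAD"))  # look-alike substitution
print(weighted_levenshtein("cat", "cut"))    # ordinary substitution
```

Keeping the pairs in a set of `frozenset`s makes the lookup symmetric, so '0' read as 'O' and 'O' read as '0' are charged the same.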

John Roxton

Jun 17, 2024, 7:53:52 AM
to tesseract-ocr
Hello Deborah,
Hopefully this isn't off-topic, and I don't mean to derail your thread, but I am running into very similar difficulties. I'm chiming in hoping that the added interest helps yield an effective solution.