Inconsistencies (sometimes) on similar characters. Is there a map for characters that are very similar?

Deborah

Jun 16, 2024, 2:41:27 AM

to tesseract-ocr

Hello, I am using Tesseract to extract some data from screenshots.
I've noticed that it sometimes confuses similar-looking characters, such as '0' and 'O', 'P' and 'R', or '-' and '—', in either direction. This happens even within the same font, and sometimes even after preprocessing such as binarization.
Is there a comprehensive map of visually similar characters that are commonly misrecognised as one another?
I need such a map to compute a Levenshtein string distance with a reduced substitution cost for characters that are very similar. Thanks.
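
To make the goal concrete, here is a minimal sketch of the weighted distance described above. The pairs in CONFUSABLE are only the examples from this post, not an official Tesseract confusion table, and SIMILAR_COST is an assumed value that would need tuning on real data. The function names are hypothetical.

```python
# Hypothetical confusion pairs taken from the examples in this post;
# this is NOT a table published by Tesseract.
CONFUSABLE = {
    frozenset("0O"),
    frozenset("PR"),
    frozenset("-—"),
}
SIMILAR_COST = 0.25  # assumed value; tune on your own data


def sub_cost(a: str, b: str) -> float:
    """Substitution cost: free if equal, cheap if confusable, else 1."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return SIMILAR_COST
    return 1.0


def weighted_levenshtein(s: str, t: str) -> float:
    """Levenshtein distance with a reduced cost for confusable pairs."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]  # deleting the first i characters of s
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                # deletion
                curr[j - 1] + 1.0,            # insertion
                prev[j - 1] + sub_cost(a, b)  # substitution or match
            ))
        prev = curr
    return prev[-1]
```

With these assumed costs, weighted_levenshtein("O", "0") is 0.25, while an unrelated substitution such as "cat" vs "cut" still costs 1.0.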

John Roxton

Jun 17, 2024, 7:53:52 AM

to tesseract-ocr

Hello Deborah,
Hopefully this isn't off-topic, and I don't mean to derail your thread, but I wanted to chime in that I am running into very similar difficulties. I'm posting in the hope that it generates enough interest to yield an effective solution.
