You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
What I'm doing: as part of a longer pipeline, one step reasons over very small but highly characteristic strings such as drug dosages, e.g. "60 mg". Edit distance (Levenshtein or a variant) and n-grams, even unigrams, do only a so-so job. I'd like to calculate match probabilities based on look-alike characters, per the above. A not-unreasonable case on a poor document is to mistake "60 mg" for "6Ong", which gives a similarity ratio of only 44%, for example. But if the program knew that 0 and O, as well as m and n, are frequently mistaken for one another, matching would improve.

I've also considered mixing individual character confidences from Tesseract's API into the score, but I'm new to Tesseract, haven't gotten there yet, and I'm not convinced that would be a better solution.
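One way to sketch this is a Levenshtein variant with a discounted substitution cost for OCR look-alike pairs. The confusion classes and the 0.2 discount below are illustrative assumptions, not anything taken from Tesseract:

```python
# Confusion-aware edit distance: substitutions between characters in the
# same look-alike class cost less than ordinary substitutions.
# The classes and costs here are guesses, to be tuned on real OCR errors.
CONFUSABLE = [
    frozenset("0O"),
    frozenset("1lI"),
    frozenset("5S"),
    frozenset("mn"),
]

def sub_cost(a, b, discount=0.2):
    """Cost of substituting character a with b."""
    if a == b:
        return 0.0
    for cls in CONFUSABLE:
        if a in cls and b in cls:
            return discount  # frequent OCR confusion, cheap
    return 1.0

def weighted_levenshtein(s, t):
    """Standard dynamic-programming Levenshtein with weighted substitutions."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                           # deletion
                d[i][j - 1] + 1.0,                           # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitution
            )
    return d[m][n]

def similarity(s, t):
    """Normalize distance to a 0..1 similarity score."""
    if not s and not t:
        return 1.0
    return 1.0 - weighted_levenshtein(s, t) / max(len(s), len(t))
```

On the example above, "60 mg" vs "6Ong" costs only 1.4 (two cheap look-alike substitutions plus one deletion of the space) instead of 3.0 under plain Levenshtein, so the normalized similarity rises accordingly. Per-character confidences from the Tesseract API could later be folded into `sub_cost` in place of the fixed discount.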