Individual character variation lists

John Green

Mar 12, 2014, 7:57:38 AM
to tesser...@googlegroups.com
Bottom line up front: Has anyone compiled a list of common misperceptions on the part of tesseract? E.g.: e is often seen as o and l can be mistaken for 1, etc. 

Forgive me if this is well known, but a cursory search provided no result, though it is always possible I was too hasty or overlooked an obvious resource:

What I'm doing: As part of a longer pipeline, at one step I am reasoning over very small but highly characteristic strings like a drug dosage, "60 mg". Edit distance (Levenshtein or a variation) and n-grams, even unigrams, only do a so-so job. I'd like to calculate probabilities based on look-alikes as described above. For example, a not unreasonable case on a poor document is to mistake "60 mg" for "6Ong", which gives a similarity ratio of only 44%. But if the program knew that 0 and O, as well as m and n, are frequently mistaken for each other, it could match better. I've also considered dumping individual character probabilities into the mix from Tesseract's API, but I'm new to Tesseract, haven't gotten there yet, and I'm not even convinced that this would be a better solution.

Thanks in advance to anyone who has the time to answer,
Regards,
John
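
A minimal sketch of the look-alike idea in the post above, not anything taken from Tesseract: an edit distance in which substituting one confusable character for another costs less than an ordinary substitution. The pair list, the 0.2 discount, and the function names are assumptions made for illustration. For reference, Python's difflib ratio for "60 mg" against "6Ong" comes out at about 0.44, which agrees with the figure quoted, although the post does not say which measure was used.

```python
from difflib import SequenceMatcher

# Hypothetical look-alike pairs; a real list might be derived from OCR error data.
CONFUSABLE = {frozenset("0O"), frozenset("mn"), frozenset("l1"), frozenset("eo")}


def sub_cost(a, b, discount):
    """Cost of replacing character a with b; look-alike pairs are cheaper."""
    if a == b:
        return 0.0
    return discount if frozenset((a, b)) in CONFUSABLE else 1.0


def weighted_distance(s, t, discount=0.2):
    """Levenshtein distance with discounted substitutions for look-alikes."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1.0,                         # delete a
                cur[j - 1] + 1.0,                      # insert b
                prev[j - 1] + sub_cost(a, b, discount) # substitute (or match)
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    truth, ocr = "60 mg", "6Ong"
    print(SequenceMatcher(None, truth, ocr).ratio())   # unweighted similarity
    print(weighted_distance(truth, ocr, discount=1.0)) # plain Levenshtein
    print(weighted_distance(truth, ocr))               # look-alike aware
```

With the discount applied, the two look-alike substitutions contribute little and most of the remaining distance comes from the dropped space, so the noisy string ends up much closer to the intended one.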

Nick White

Mar 12, 2014, 8:55:24 AM
to tesser...@googlegroups.com
Hi John,

On Wed, Mar 12, 2014 at 04:57:38AM -0700, John Green wrote:
> Bottom line up front: Has anyone compiled a list of common misperceptions on
> the part of tesseract? E.g.: e is often seen as o and l can be mistaken for 1,
> etc.

Tesseract has some basic information of that sort built into its
training files, which it uses to help recognition.

You can see the list for English by unpacking the English
.traineddata file:

combine_tessdata -u /path/to/eng.traineddata eng.

Then look at the resulting eng.unicharambigs file. Its format is
documented in the unicharambigs(5) manpage, and it's pretty
straightforward.

Nick
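
A hedged sketch of how the unpacked file might be read into substitution rules. It assumes the tab-separated "v1" layout described in unicharambigs(5): source length, space-separated source characters, replacement length, space-separated replacement characters, and a 0/1 mandatory flag. The function name and the sample rows are illustrative and are not copied from eng.unicharambigs; a file in a different format version would need a different parser.

```python
def parse_unicharambigs_v1(text):
    """Parse v1-style unicharambigs text into (source, replacement, mandatory) tuples.

    Assumes five tab-separated fields per rule line; any other line
    (such as the version header) is skipped.
    """
    rules = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 5:
            continue  # header line or something this sketch does not understand
        _, src, _, dst, kind = fields
        rules.append((tuple(src.split(" ")), tuple(dst.split(" ")), kind.strip() == "1"))
    return rules


# Illustrative sample in the assumed layout.
SAMPLE = "v1\n2\t' '\t1\t\"\t1\n1\tm\t2\tr n\t0\n3\ti i i\t1\tm\t0\n"

if __name__ == "__main__":
    for source, replacement, mandatory in parse_unicharambigs_v1(SAMPLE):
        print("".join(source), "->", "".join(replacement), "(mandatory)" if mandatory else "")
```

To use it on the real file, one would pass in the contents of eng.unicharambigs read as UTF-8 text.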

John Green

Mar 13, 2014, 1:53:19 PM
to tesser...@googlegroups.com
Thank you very much Nick. I look forward to trying this out. 

JG

John Green

Mar 13, 2014, 2:27:37 PM
to tesser...@googlegroups.com
That is perfect, thank you.

I'm sure I know the answer to this: These are trained off of observations made over a specific set of data, correct? Not some a priori deductions from the algorithms? 

JG

Nick White

Mar 13, 2014, 2:34:31 PM
to tesser...@googlegroups.com
On Thu, Mar 13, 2014 at 11:27:37AM -0700, John Green wrote:
> That is perfect, thank you.

Great, glad to hear it.

> I'm sure I know the answer to this: These are trained off of observations made
> over a specific set of data, correct? Not some a priori deductions from the
> algorithms?

In general, yes. The Ancient Greek training I made does algorithmic
stuff to set up some basic rules about accentuation[0], but that's
pretty unusual ;)

Nick

0. See http://eutypon.gr/eutypon/pdf/e2012-29/e29-a01.pdf

John Green

Mar 17, 2014, 4:06:17 PM
to tesser...@googlegroups.com
Outstanding, thank you Nick. 

--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

Tom Morris

Mar 18, 2014, 1:46:56 PM
to tesser...@googlegroups.com
On Wednesday, March 12, 2014 7:57:38 AM UTC-4, John Green wrote:

> What I'm doing: As part of a longer pipeline, at one step I am reasoning over very small but highly characteristic strings like drug dosage, "60 mg". Edit distance (Levenshtein or a variation) and n-grams, even unigrams, only do a so-so job. I'd like to calculate probabilities based on look-alikes per above. That is, a not unreasonable case on a poor document is to mistake "60 mg" for "6Ong" which gives a ratio of only 44%, for example. But, if the program knew that 0 and O as well as m and n can be frequently mistaken for the same character ... better matching. I've also considered dumping individual character probabilities into the mix from Tesseract's API, but I'm new to Tesseract, haven't gotten there yet, and I'm not even convinced that this would be a better solution.

It's not clear from your description if you're already doing this, but you might want to consider modeling the target domain that you're matching to either in terms of n-gram probabilities or something even stricter.  There's going to be much less variability in something like a dosage string than there is in general text.  You could use something like a medical term ontology to create a pretty comprehensive list of things like units, frequencies, routes, etc.

Tom 
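
One way the closed-vocabulary suggestion above could look in practice, as a rough sketch only: fold look-alike characters to a canonical form, then pick the nearest entry from a small list of known dosage strings. The vocabulary, the folding table, and the function names are all made up for illustration; a real list would come from something like a medical term ontology.

```python
from difflib import get_close_matches

# Hypothetical folding of look-alikes to one canonical character; spaces are dropped.
# Note the trade-off: folding n to m also merges genuinely different units.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "n": "m", " ": None})

# Hypothetical target vocabulary of dosage strings.
VOCAB = ["60 mg", "80 mg", "600 mg", "5 ml"]


def canonical(s):
    """Map a string to its look-alike-folded form."""
    return s.translate(LOOKALIKES)


def best_match(ocr_text, vocabulary, cutoff=0.6):
    """Return the vocabulary entry closest to ocr_text, or None if nothing is close."""
    keys = {canonical(v): v for v in vocabulary}
    hits = get_close_matches(canonical(ocr_text), list(keys), n=1, cutoff=cutoff)
    return keys[hits[0]] if hits else None


if __name__ == "__main__":
    print(best_match("6Ong", VOCAB))
```

Because the target set is small and strict, an exact hit after folding is common, and the fuzzy fallback only matters for strings with additional damage.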

John Green

Apr 2, 2014, 9:11:50 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
I'm using the UMLS ontology modeled on 2-grams. I'm also doing a state space search generated from the unicharambigs, with the goals defined by that set. So I'm right there with you, Tom!

To anyone: I'm assuming Tesseract makes the replacements defined in unicharambigs if they have the mandatory flag, but what about those flagged non-mandatory? I couldn't find in the man pages the criteria for when Tesseract makes these replacements, if at all.

Thanks to anyone who considers this,
John 
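
For readers unfamiliar with the state space search mentioned in this message, here is a hedged sketch of the general technique, not John's actual implementation: a breadth-first search that applies substitution rules to the OCR string until it reaches a string in the goal set. The rules and goals are hypothetical stand-ins for entries derived from unicharambigs and from the ontology.

```python
from collections import deque

# Hypothetical (wrong, right) substitution rules and goal strings.
RULES = [("O", "0"), ("n", "m"), ("rn", "m"), ("l", "1")]
GOALS = {"60mg", "80mg"}


def neighbours(s, rules):
    """Yield every string reachable from s by applying one rule at one position."""
    for src, dst in rules:
        start = s.find(src)
        while start != -1:
            yield s[:start] + dst + s[start + len(src):]
            start = s.find(src, start + 1)


def search(start, goals, rules, max_depth=3):
    """Breadth-first search; returns (goal, substitutions used) or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        s, depth = queue.popleft()
        if s in goals:
            return s, depth
        if depth < max_depth:
            for nxt in neighbours(s, rules):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return None


if __name__ == "__main__":
    print(search("6Ong", GOALS, RULES))
```

Breadth-first order means the first goal found uses the fewest substitutions, and the depth limit keeps the search small on short strings like dosages.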

Nick White

Apr 2, 2014, 10:05:11 AM
to tesser...@googlegroups.com
On Wed, Apr 02, 2014 at 06:11:50AM -0700, John Green wrote:
> To anyone: I'm assuming Tesseract makes the replacements defined in unicharambigs
> if they have the mandatory flag, but what about those flagged non-mandatory? I
> couldn't find in the man pages the criteria for when Tesseract makes these
> replacements, if at all.

The unicharambigs rules flagged as not mandatory just make the
substitution a little more likely. I haven't looked in the code to
see how much they affect things, but my experience is "not much". I
wouldn't be at all surprised if there was a config variable that
could increase it, if you wanted to.

Nick

John Green

Apr 2, 2014, 10:15:11 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
Thanks Nick. I'll just substitute them in post-processing.

JG
