How to restrict OCR character set.

Martin Emmerson

Mar 29, 2019, 1:47:30 AM

to tesseract-ocr

Is there a way to restrict the character set that tesseract-ocr will attempt to identify? I'm scanning USA-based receipts, which use a fairly simple set of monospaced characters, but '1' often gets misidentified as '|', along with a whole host of other simple substitution errors. If I could just restrict tesseract to [-a-zA-Z0-9,.$()/], it would be an immediate boost to accuracy. (Hoping for a way that doesn't involve having to retrain from scratch on the limited set.)

Shree Devi Kumar

Mar 29, 2019, 2:03:59 AM

to tesser...@googlegroups.com

See https://github.com/tesseract-ocr/tesseract/pull/2294
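For readers who cannot follow the pull request: Tesseract has a config variable named `tessedit_char_whitelist` that can be set on the command line with `-c`. Whether the LSTM engine honors it depends on the Tesseract version you have installed, so the following is only a sketch to try against your own build, not a confirmed fix. The helper name `whitelist_command` and the file name `receipt.png` are made up for illustration.

```python
import string
import subprocess

# Character set from the question: [-a-zA-Z0-9,.$()/], spelled out in full
# because the whitelist is a plain list of characters, not a regex.
RECEIPT_CHARS = string.ascii_letters + string.digits + "-,.$()/"


def whitelist_command(image_path, whitelist=RECEIPT_CHARS):
    """Build a tesseract command line that sets tessedit_char_whitelist.

    Assumption: the installed Tesseract version applies the whitelist to
    the engine in use. Verify this on a sample image before relying on it.
    """
    return [
        "tesseract", image_path, "stdout",
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]


def ocr_with_whitelist(image_path):
    """Run tesseract and return the recognized text (needs tesseract on PATH)."""
    result = subprocess.run(
        whitelist_command(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `ocr_with_whitelist("receipt.png")` runs the command. If '|' still shows up in the output, the whitelist is most likely not being applied by your version.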

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2180d37f-50fd-47e6-9f48-c3ff73b1569e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Martin Emmerson

Mar 29, 2019, 4:17:05 PM

to tesseract-ocr

Yikes! Thanks for the reply, but I could barely follow the discussion on that pull request. It seems the answer, at least for now, is that there isn't a straightforward way to restrict the character set without being somewhat familiar with the code base and dev environment (which I'm not). Thanks anyway; I'll try to figure out some external workarounds.
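One possible external workaround is to clean up the OCR text after the fact. The sketch below is only an illustration: it maps known confusions back to the intended character and then drops anything outside [-a-zA-Z0-9,.$()/] (whitespace is kept so line and word breaks survive). Only the '|' to '1' substitution comes from this thread; the function name and the decision to drop rather than flag unknown characters are assumptions.

```python
import re

# Anything outside the receipt character set [-a-zA-Z0-9,.$()/] or whitespace.
DISALLOWED = re.compile(r"[^-a-zA-Z0-9,.$()/\s]")

# Known misreads mapped to the intended character. Only '|' -> '1' is taken
# from the thread; add your own pairs after inspecting real output.
SUBSTITUTIONS = {"|": "1"}


def restrict_to_charset(text, substitutions=SUBSTITUTIONS, placeholder=""):
    """Apply known substitutions, then replace disallowed characters."""
    for wrong, right in substitutions.items():
        text = text.replace(wrong, right)
    return DISALLOWED.sub(placeholder, text)


print(restrict_to_charset("TOTAL $|2.50"))  # prints: TOTAL $12.50
```

This cannot recover characters that were misread as another allowed character (for example '0' read as 'O'), so it complements a better model and does not replace one.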

Shree Devi Kumar

Mar 30, 2019, 2:06:34 AM

to tesser...@googlegroups.com

Try the fine-tuned traineddata from

https://github.com/Shreeshrii/tessdata_shreetest/commit/0108263ad0c4c9bd11e0c8190a81fb36e2e4e56a

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Mar 30, 2019, 2:12:44 AM

to tesser...@googlegroups.com

This was fine-tuned with 20+ monospaced fonts for 400 iterations, down to a character error rate of 0.242%.

At iteration 44/400/400, Mean rms=0.258%, delta=0.076%, char train=0.242%, word train=0.761%, skip ratio=0%, New best char error = 0.242 wrote best model:/home/ubuntu/tesstutorial/engrestrict_from_full/engrestrict_plus0.242_44.checkpoint wrote checkpoint.

Finished! Error rate = 0.242

If you know the font used and customize training text to your data, you will get better results.

Martin Emmerson

Mar 30, 2019, 1:24:35 PM

to tesseract-ocr

Thanks! This may still be a stretch for my current level of tesseract knowledge but definitely more within reach! I look forward to giving it a try.

shree

Mar 31, 2019, 4:19:07 AM

to tesseract-ocr

You can download https://github.com/Shreeshrii/tessdata_shreetest/raw/master/engrestrict_best.traineddata and use it instead of eng.traineddata. Keep it in the same location as eng.traineddata and use it with -l engrestrict_best
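If you call tesseract from a script, the `-l engrestrict_best` step might look like the sketch below. It assumes the `tesseract` executable is on your PATH and that `engrestrict_best.traineddata` has already been copied next to `eng.traineddata`. The function names and `receipt.png` are placeholders, and the optional `--tessdata-dir` argument is only needed if the file lives somewhere else.

```python
import subprocess


def build_tesseract_command(image_path, output_base="stdout",
                            lang="engrestrict_best", tessdata_dir=None):
    """Build the command line: tesseract IMAGE OUTPUTBASE -l LANG [...]."""
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if tessdata_dir is not None:
        # Only needed when the traineddata is not in the default tessdata dir.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image_path, **kwargs):
    """Run tesseract and return the recognized text from stdout."""
    result = subprocess.run(
        build_tesseract_command(image_path, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, `run_ocr("receipt.png")` is equivalent to running `tesseract receipt.png stdout -l engrestrict_best` in a shell.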
