How to restrict OCR character set.


Martin Emmerson

Mar 29, 2019, 1:47:30 AM
to tesseract-ocr
Is there a way to restrict the character set that tesseract-ocr will attempt to identify? I'm scanning USA-based receipts, which use a fairly simple set of monospaced characters, but '1' often gets misidentified as '|', for example, along with a whole host of other simple substitution errors. If I could just restrict tesseract to [-a-zA-Z0-9,.$()/], it would be an immediate boost to accuracy. (I'm hoping for a way that doesn't involve retraining from scratch on the limited set.)

Shree Devi Kumar

Mar 29, 2019, 2:03:59 AM
to tesser...@googlegroups.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2180d37f-50fd-47e6-9f48-c3ff73b1569e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Martin Emmerson

Mar 29, 2019, 4:17:05 PM
to tesseract-ocr
Yikes! Thanks for the reply, but I could barely follow the discussion on that pull request. It seems the answer, at least for now, is that there isn't a straightforward way to restrict the character set without being somewhat familiar with the code base and dev environment (which I'm not). Thanks anyway; I'll try to figure out some external workarounds.
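
For anyone looking for the kind of external workaround mentioned above, here is a minimal sketch of post-processing the OCR text in Python. It is not a Tesseract feature and not something proposed in this thread; it only cleans the text after recognition. The allowed set is the one from the original question (plus whitespace), and the only substitution taken from the thread is '|' to '1'. The names `clean_ocr_text`, `SUBSTITUTIONS`, and the `?` placeholder are made up for illustration.

```python
import re

# Allowed set from the original question, [-a-zA-Z0-9,.$()/], plus whitespace.
ALLOWED = re.compile(r"[-a-zA-Z0-9,.$()/\s]")

# Hypothetical substitution table. Only '|' -> '1' comes from the thread;
# extend it with the confusions you actually see in your own receipts.
SUBSTITUTIONS = {"|": "1"}


def clean_ocr_text(text: str, placeholder: str = "?") -> str:
    """Map known confusions into the allowed set and flag anything else."""
    out = []
    for ch in text:
        ch = SUBSTITUTIONS.get(ch, ch)
        # Characters still outside the allowed set are replaced by a visible
        # placeholder so they can be reviewed instead of silently dropped.
        out.append(ch if ALLOWED.fullmatch(ch) else placeholder)
    return "".join(out)


if __name__ == "__main__":
    print(clean_ocr_text("TOTAL $|2.50"))
```

This cannot recover characters that were recognized as a different allowed character (for example, 'O' versus '0'), so it is a partial fix at best.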


Shree Devi Kumar

Mar 30, 2019, 2:06:34 AM
to tesser...@googlegroups.com

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Mar 30, 2019, 2:12:44 AM
to tesser...@googlegroups.com
This was fine-tuned with 20+ monospaced fonts for 400 iterations, down to a character error rate of 0.242%.

At iteration 44/400/400, Mean rms=0.258%, delta=0.076%, char train=0.242%, word train=0.761%, skip ratio=0%,  New best char error = 0.242 wrote best model:/home/ubuntu/tesstutorial/engrestrict_from_full/engrestrict_plus0.242_44.checkpoint wrote checkpoint.

Finished! Error rate = 0.242

If you know the font used and customize the training text to match your data, you will get better results.
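
As one possible reading of "customize the training text", here is a small Python sketch that filters candidate text lines down to those using only the restricted character set from the original question. This is an assumption about how such a text might be prepared, not a description of how the model above was actually trained; the function name `restrict_training_lines` and the sample lines are hypothetical.

```python
# Restricted set from the original question, [-a-zA-Z0-9,.$()/], plus space.
ALLOWED_CHARS = set(
    "-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,.$()/ "
)


def restrict_training_lines(lines):
    """Keep non-empty lines that consist only of allowed characters."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if stripped and all(ch in ALLOWED_CHARS for ch in stripped):
            kept.append(stripped)
    return kept


if __name__ == "__main__":
    # Hypothetical receipt-like sample lines, for illustration only.
    sample = ["TOTAL $12.50", "Café 3.00", "", "TAX (8.5) 1.06"]
    for line in restrict_training_lines(sample):
        print(line)
```

Lines typed from your own receipts would be the most representative input for a filter like this.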

Martin Emmerson

Mar 30, 2019, 1:24:35 PM
to tesseract-ocr
Thanks!   This may still be a stretch for my current level of tesseract knowledge but definitely more within reach!   I look forward to giving it a try.



shree

Mar 31, 2019, 4:19:07 AM
to tesseract-ocr
You can download https://github.com/Shreeshrii/tessdata_shreetest/raw/master/engrestrict_best.traineddata and use it instead of eng.traineddata. Keep it in the same location as eng.traineddata, and run tesseract with -l engrestrict_best.
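
A minimal sketch of calling tesseract with that model from Python, assuming the `tesseract` binary is on your PATH and engrestrict_best.traineddata has already been placed next to eng.traineddata. The file names `receipt.png` and `receipt` are placeholders, and the helper names are made up for illustration. The optional `--tessdata-dir` argument is only needed if the traineddata file lives somewhere other than the default tessdata directory.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, tessdata_dir=None,
                            lang="engrestrict_best"):
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG [...]."""
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # Only needed when the model is not in the default tessdata directory.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_ocr(image, output_base, tessdata_dir=None):
    """Run tesseract and return the path of the text file it writes."""
    cmd = build_tesseract_command(image, output_base, tessdata_dir)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(str(output_base) + ".txt")


if __name__ == "__main__":
    # Placeholder file names; replace with your own scanned receipt.
    print(" ".join(build_tesseract_command("receipt.png", "receipt")))
```

Printing the command first, as in the usage above, is a quick way to check the arguments before running it on real images.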