Some questions about tesseract 3.0x.

smwikipedia smwikipedia

May 11, 2015, 8:49:04 AM
to tesser...@googlegroups.com


1. For tesseract 3.02, after installation I see a pre-trained eng.traineddata file in the tessdata folder. How was this file generated? What fonts does it target? Can I use it as-is for my OCR application?

2. For tesseract 3.03, I see a new option "--print-parameters" for the tesseract executable. It lists more than 600 parameters. How am I supposed to use them? If I need to tune them, how do I do that?

3. During my experiments, I see that tesseract works better for some fonts than for others. Is this true? Which font gives the best accuracy?


smwikipedia smwikipedia

May 12, 2015, 10:22:37 AM
to tesser...@googlegroups.com
Regarding question 2, I just found two sites that explain the control parameters:

https://code.google.com/p/tesseract-ocr/wiki/ControlParams

http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
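For anyone who wants to experiment with these parameters, here is a minimal Python sketch of one possible workflow. It assumes that the text printed by "--print-parameters" has one parameter per line in the form name, TAB, value, TAB, description, and that the tesseract build accepts "-c configvar=value" after the output base; both are assumptions to check against your own installation. The helper names parse_parameters and build_command are made up for this example and are not part of tesseract.

```python
def parse_parameters(text):
    """Parse captured "--print-parameters" output into a dict.

    Assumes each parameter line looks like "name<TAB>value<TAB>description".
    Lines without a tab (for example a heading line) are skipped.
    """
    params = {}
    for line in text.splitlines():
        parts = line.split("\t", 2)
        if len(parts) < 2:
            continue  # not a parameter line
        name = parts[0].strip()
        value = parts[1].strip()
        description = parts[2].strip() if len(parts) > 2 else ""
        if name:
            params[name] = (value, description)
    return params


def build_command(image, output_base, lang="eng", overrides=None):
    """Build a tesseract command line that overrides some parameters.

    Each override is passed as "-c name=value" (assumed to be supported
    by the installed version). The command is returned as a list so it
    can be handed to subprocess.call() without shell quoting issues.
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    for name, value in sorted((overrides or {}).items()):
        cmd += ["-c", "%s=%s" % (name, value)]
    return cmd
```

A typical use would be to parse the listing once, look up the description of a parameter such as tessedit_char_whitelist, and then run subprocess.call(build_command("page.png", "out", overrides={"tessedit_char_whitelist": "0123456789"})). The same name/value pairs can also be written to a config file, one "name value" pair per line, and passed as the last argument.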


On Monday, May 11, 2015 at 8:49:04 PM UTC+8, smwikipedia smwikipedia wrote:

zdenko podobny

May 13, 2015, 3:35:05 AM
to tesser...@googlegroups.com
Ad 1. This file was generated by Google on their internal system. The training tools are now open-sourced (or rather "ported", so that they use free libraries instead of Google-internal libraries); see 3Training.pdf[1], though I would suggest reading all of the presentations. As for the fonts used, I guess the file eng.cube.size[2] should give you a relevant indication.

Ad 3. I am not sure there is a simple answer. IMO, for "modern" fonts and text without graphics, using tesseract for OCR should not be a problem. For old fonts (a.k.a. Fraktur fonts) you will need training. If your text contains graphics, tables, etc., you should expect problems[3]. I have heard of companies that successfully use tesseract for OCR of invoices, but their software does the image pre-processing, page segmentation and text post-processing, and tesseract is used only for OCR of the text areas.
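To make the division of labour in such a setup concrete, here is a rough Python sketch. It is not taken from any particular product: the stage functions (preprocess, segment, recognize, postprocess) are hypothetical placeholders supplied by the caller, and the use of "-psm 6" (treat the input as a single uniform block of text) for cropped regions is an assumption that may need adjusting for your data.

```python
def region_command(region_image, output_base, lang="eng"):
    """Command line for running tesseract on one cropped text region.

    "-psm 6" asks tesseract to assume a single uniform block of text,
    which fits regions that were already segmented elsewhere.
    """
    return ["tesseract", region_image, output_base, "-l", lang, "-psm", "6"]


def run_pipeline(image, preprocess, segment, recognize, postprocess):
    """Run the four stages in order and return one text per region.

    preprocess:  image -> cleaned image (e.g. deskew, binarize)
    segment:     cleaned image -> iterable of text regions
    recognize:   region -> raw text (this is where tesseract is called)
    postprocess: raw text -> corrected text (e.g. dictionary fixes)
    """
    cleaned = preprocess(image)
    results = []
    for region in segment(cleaned):
        raw_text = recognize(region)
        results.append(postprocess(raw_text))
    return results
```

In this arrangement tesseract only appears inside recognize, for example by writing each region to a file and running region_command on it; tables and graphics are filtered out by segment before they reach the OCR engine.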


Zdenko
