Tesseract 3.0 English Trained Data


mw18888

Mar 18, 2011, 4:38:33 PM
to tesseract-ocr
While evaluating Tesseract 3.0/2.0x, I found that the trained data
eng.traineddata.gz from

http://code.google.com/p/tesseract-ocr/downloads/detail?name=eng.traineddata.gz&can=2&q=

is much better than

a. the data from the svn trunk repository.
b. the data boxtiff-2.01.eng.tar.gz from
http://code.google.com/p/tesseract-ocr/downloads/detail?name=boxtiff-2.01.eng.tar.gz&can=2&q=

The eng.traineddata.gz data performs much better at digit and
character OCR than its counterparts.

However, eng.traineddata.gz contains several unichars that I
want to eliminate (or, more generally, I want to customize the training
data). How can I get the tif/box files that eng.traineddata.gz was
trained from?

Thank you in advance.

mw18888

patrickq

Mar 18, 2011, 10:31:49 PM
to tesseract-ocr
Why not simply use a blacklist to exclude these unichars?


Saurabh Gandhi

Mar 18, 2011, 10:56:34 PM
to tesser...@googlegroups.com, patrickq
You can refer to this thread, which discusses character whitelisting and blacklisting to limit the set of characters Tesseract will recognize:
https://groups.google.com/forum/?pli=1#!topic/tesseract-ocr/0msQtTB_XrI
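
For readers who want something concrete, here is a rough sketch of the blacklist/whitelist approach. It relies on the Tesseract config variables tessedit_char_blacklist and tessedit_char_whitelist, set through a plain-text config file passed as the last command-line argument. The helper names (build_config, build_command), the file names (page.tif, out, no_symbols) and the example characters are made up for illustration, and whether a config file outside tessdata/configs is picked up may depend on your Tesseract version, so treat this as a starting point rather than a tested recipe.

```python
def build_config(blacklist="", whitelist=""):
    """Return the text of a Tesseract config file that restricts characters.

    Each line is "<variable> <value>". Both helper and argument names are
    hypothetical; only the two tessedit_char_* variables come from Tesseract.
    """
    lines = []
    if blacklist:
        # Characters Tesseract should never output.
        lines.append(f"tessedit_char_blacklist {blacklist}")
    if whitelist:
        # Characters Tesseract is restricted to.
        lines.append(f"tessedit_char_whitelist {whitelist}")
    return "\n".join(lines) + "\n"


def build_command(image, output_base, config_name, lang="eng"):
    """Build the argument list: tesseract <image> <outbase> -l <lang> <config>."""
    return ["tesseract", str(image), str(output_base), "-l", lang, str(config_name)]


if __name__ == "__main__":
    # Example: exclude a few symbols (illustrative choice of characters).
    print(build_config(blacklist="|~"), end="")
    print(" ".join(build_command("page.tif", "out", "no_symbols")))
```

Save the generated text as the config file (here assumed to be called no_symbols) and run the printed command; nothing in the sketch invokes Tesseract itself. Note that this only filters what the recognizer may output. It does not change the trained data.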

--
Regards,
Saurabh Gandhi






mw18888

Mar 20, 2011, 7:17:21 AM
to tesseract-ocr
Thank you, patrickq.

mw18888

Mar 20, 2011, 7:17:56 AM
to tesseract-ocr
Thank you, Saurabh.


