Add frequently used words to Tesseract

Nilesh Gundecha

Sep 13, 2012, 5:24:14 AM
to tesser...@googlegroups.com
Hi Friends,

I am using Tesseract 3.02 (with Tess4j java wrapper) on Windows 7.

I have a list of frequently used words that I want to add to the Tesseract vocabulary.

I have gone through the TrainingTesseract3 wiki, but I was wondering: to achieve this, do I need to create the TIFF files, box files, .tr files, and so on?

Can't I just generate eng.freq-dawg and add it to the Tesseract vocabulary?

Any help would be highly appreciated.

Thanks for this wonderful product.

Regards,
Nilesh

Nilesh Gundecha

Sep 20, 2012, 1:33:20 AM
to tesser...@googlegroups.com
I found the solution. All I had to do was:

1) Unpack eng.traineddata using the combine_tessdata tool
2) Add the frequent words as described in the wiki
3) Repack all the eng.* files
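
For readers who want to script the three steps above, here is a minimal Python sketch that only builds the command lines. It assumes the Tesseract 3.02 training tools `combine_tessdata` and `wordlist2dawg` are on the PATH and take their arguments in the order shown; the file name `frequent_words.txt` and the function `repack_commands` are hypothetical names chosen for illustration.

```python
from typing import List


def repack_commands(lang: str = "eng",
                    wordlist: str = "frequent_words.txt") -> List[List[str]]:
    """Build the command lines for unpack -> rebuild freq-dawg -> repack.

    Assumed argument order (check the usage text of your own build):
      combine_tessdata -u <lang>.traineddata <lang>.
      wordlist2dawg <wordlist> <dawg file> <unicharset>
      combine_tessdata <lang>.
    """
    prefix = f"{lang}."
    return [
        # 1) Unpack the traineddata file into eng.unicharset, eng.freq-dawg, ...
        ["combine_tessdata", "-u", f"{lang}.traineddata", prefix],
        # 2) Rebuild the frequent-words DAWG from a plain-text word list
        ["wordlist2dawg", wordlist, f"{lang}.freq-dawg", f"{lang}.unicharset"],
        # 3) Pack all <lang>.* component files back into <lang>.traineddata
        ["combine_tessdata", prefix],
    ]


if __name__ == "__main__":
    for cmd in repack_commands():
        print(" ".join(cmd))
```

Each entry can be passed to `subprocess.run(cmd, check=True)` from the directory that holds the eng.* files. Back up the original eng.traineddata first, since step 3 overwrites it.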

However, I would like to add the vocabulary to user-words rather than to the freq-dawg, so I tried that as described on the wiki page.

But even after adding the user-words file, there is no change in the OCR result. Any ideas?

Regards,
Nilesh

Ray Smith

Sep 25, 2012, 1:23:53 PM
to tesser...@googlegroups.com
User-words gained an extra level of indirection. You don't have to put it in the traineddata file, but you have to specify an INIT parameter giving the extension of the user-words dawg. The INIT parameter can only be set via a config file provided with the call to Init. See http://code.google.com/p/tesseract-ocr/wiki/ControlParams
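
A rough sketch of what that setup could look like on disk, written as a small Python helper. It rests on assumptions not stated in this thread: that the INIT parameter in question is named `user_words_suffix`, that Tesseract then looks for a plain-text word list at `tessdata/<lang>.<suffix>`, and that config files live under `tessdata/configs`. The helper `write_user_words` and the config name `userwords` are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Union


def write_user_words(tessdata_dir: Union[str, Path],
                     words: Iterable[str],
                     lang: str = "eng",
                     suffix: str = "user-words",
                     config_name: str = "userwords") -> Path:
    """Write a user word list and a config file that points Tesseract at it.

    Assumptions: the parameter is called user_words_suffix and the list is
    read from <tessdata_dir>/<lang>.<suffix>, one word per line.
    """
    tessdata = Path(tessdata_dir)
    configs = tessdata / "configs"
    configs.mkdir(parents=True, exist_ok=True)

    # Word list, one word per line, e.g. tessdata/eng.user-words
    words_path = tessdata / f"{lang}.{suffix}"
    words_path.write_text("".join(f"{w}\n" for w in words), encoding="utf-8")

    # Config file holding the INIT parameter as "<name> <value>"
    config_path = configs / config_name
    config_path.write_text(f"user_words_suffix {suffix}\n", encoding="utf-8")
    return config_path
```

The config then has to be named when Tesseract is initialised, for example as the trailing argument on the command line (`tesseract image.tif out -l eng userwords`) or through whatever config list the wrapper's Init call accepts; setting the variable after Init would be too late for an INIT parameter.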

On Fri, Sep 21, 2012 at 10:24 AM, Zdenko Podobný <zde...@gmail.com> wrote:


On Thursday, September 13, 2012 at 11:24:14 UTC+2, Nilesh Gundecha wrote:
You do not need to run full training; pay attention to "Putting it all together" in the TrainingTesseract3 wiki. By the way, version 3.02 includes a tool that creates a word list from a DAWG dictionary, so you can merge your list with the original.

--
Zdenko
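
The merge Zdenko mentions can be sketched in a few lines of Python. This assumes the original dictionary has already been dumped to a plain-text file with one word per line (the 3.02 tool for that is presumably `dawg2wordlist`); `merge_wordlists` is a hypothetical helper, not part of Tesseract.

```python
from typing import Iterable, List


def merge_wordlists(original: Iterable[str], extra: Iterable[str]) -> List[str]:
    """Append new words to the original list, dropping blanks and duplicates.

    Order is preserved: original words first, then any new words in the
    order they were given.
    """
    seen = set()
    merged = []
    for source in (original, extra):
        for word in source:
            word = word.strip()
            if word and word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


if __name__ == "__main__":
    print(merge_wordlists(["the", "of"], ["of", "tesseract"]))
```

Writing the result back out one word per line gives a file that can be fed to the word-list-to-DAWG step before repacking.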
 
