Custom Wordlist without Retraining


Max Cantor

May 8, 2011, 11:53:58 AM
to tesser...@googlegroups.com
Is there a way to set up a custom wordlist without going through the entire retraining process? Our wordlists will change a bit at runtime, so if there is an API variable to set, that would be perfect for us.

Thanks,
Max

Keep up the good work!

zdenko podobny

May 8, 2011, 1:40:51 PM
to tesser...@googlegroups.com
See [1], or user-words on the same page.

[1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Puttin...

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

Max Cantor

May 8, 2011, 8:01:12 PM
to tesser...@googlegroups.com
I was looking at that, but I can't find the other component files in the source tree. Is there somewhere to get the component files for eng.traineddata?

Sorry if I'm missing something obvious...

max

zdenko podobny

May 9, 2011, 2:01:25 AM
to tesser...@googlegroups.com
Please try reading [1] (looking is not enough ;-) ):


// Specify option -u to unpack all the components to the specified path:
//
// combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.
//
// This will create  /home/$USER/temp/eng.* files with individual tessdata
// components from tessdata/eng.traineddata.
//

Max Cantor

May 9, 2011, 2:27:36 AM
to tesser...@googlegroups.com
I feel really dumb now. Sorry for the bother. 


Thanks, max

Oleg Tikhonov

May 9, 2011, 12:18:23 AM
to tesser...@googlegroups.com
Hi Max,

Look at:

Extract all component files from a .traineddata file (note that the output prefix ends in a period):

combine_tessdata -u tessdata/ell.traineddata /home/$USER/temp/ell.

Combine them again:

combine_tessdata language_data_path_prefix (e.g. tessdata/eng.)

This combines all individual tessdata components (unicharset, DAWGs, classifier templates, ambiguities, language configs). The result is a combined tessdata file, lang_code.traineddata.

Hope it helps,

Oleg

zdenko podobny

May 9, 2011, 3:30:25 AM
to tesser...@googlegroups.com
No problem :-) I think you will like the "-o" option too.

Zdenko
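
Putting the -u and -o hints from this thread together, swapping a word list into an existing traineddata file might look roughly like the sketch below. The names swap_wordlist, run and DRY_RUN are hypothetical, invented for illustration. The wordlist2dawg argument order (word list, output DAWG, unicharset) is an assumption taken from the training wiki and should be checked against the usage message of your own build. This is not confirmed to change recognition results on 3.00.

```shell
#!/bin/sh
# Sketch only: rebuild the word DAWG of an unpacked language and put it back.
# Assumes the Tesseract 3.0x training tools are on PATH.
# DRY_RUN=1 prints the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

swap_wordlist() {
    traineddata=$1   # e.g. tessdata/eng.traineddata
    prefix=$2        # output prefix, must end in a period, e.g. /tmp/eng.
    wordlist=$3      # plain text word list (assumed: one word per line)

    # 1. Unpack every component to ${prefix}<component>.
    run combine_tessdata -u "$traineddata" "$prefix" || return 1
    # 2. Rebuild the word DAWG (argument order is an assumption, see above).
    run wordlist2dawg "$wordlist" "${prefix}word-dawg" "${prefix}unicharset" || return 1
    # 3. Overwrite only that component inside the traineddata file.
    run combine_tessdata -o "$traineddata" "${prefix}word-dawg" || return 1
}
```

Running `DRY_RUN=1 swap_wordlist tessdata/eng.traineddata /tmp/eng. words.txt` only prints the three commands, which is a cheap way to check the paths before touching the real file. The same pattern would apply to freq-dawg by changing the component suffix.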

Max Cantor

May 9, 2011, 10:07:02 PM
to tesser...@googlegroups.com
OK, I feel a bit less bad now. combine_tessdata segfaults on both Ubuntu and OS X:

182:tess max$ combine_tessdata -u eng.traineddata eng
Extracting tessdata components from eng.traineddata
tesseract::TessdataManager::TessdataTypeFromFileName( filename, &type, &text_file):Error:Assert failed:in file tessdatamanager.cpp, line 241
Segmentation fault

This is Tesseract 3.00. It seems to have some problem with the traineddata suffix.

thanks,
max

Max Cantor

May 9, 2011, 10:21:30 PM
to tesser...@googlegroups.com
OK, I found the problem. The fix is described here: http://code.google.com/p/tesseract-ocr/issues/detail?id=356

The output path prefix needs to end in a period (eng. rather than eng).

My bad.

max
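
A small guard against the mistake described above could look like this. It is a sketch only: tess_prefix is a hypothetical helper, not part of Tesseract, and all it does is append the trailing period when it is missing.

```shell
#!/bin/sh
# Hypothetical helper: make sure an output prefix ends in a period,
# as combine_tessdata -u expects in the report above.
tess_prefix() {
    case "$1" in
        *.) printf '%s\n' "$1" ;;    # already ends in a period
        *)  printf '%s.\n' "$1" ;;   # append the missing period
    esac
}

# Example use (needs combine_tessdata on PATH, so it is left commented out):
# combine_tessdata -u eng.traineddata "$(tess_prefix eng)"
```

With the prefix passed through tess_prefix, the command that segfaulted earlier would receive eng. instead of eng.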

Parmeet

May 10, 2011, 4:32:34 AM
to tesseract-ocr
Hello there,

Sorry if I sound naive, but I think the original question has not been
answered yet: how do we include our own word list? After going through
the FAQ page, I found that we can put our eng.user-words file in the
tessdata folder.

I did exactly that. To test whether it works, I put the characters a
through z as a single word in the eng.user-words file and saved it with
UTF-8 encoding. Then I made an image in Paint containing the characters
a through z as one word (in different fonts on different lines of the
same image) and ran OCR on it. Unfortunately, it did not correct the
output, even when only a single character in the a through z sequence
was misrecognized. Could you please let me know whether I am doing
something wrong, or whether I need to retrain using my user-words?

I would be grateful for an early reply.

Thanks and Kind Regards
Parmeet
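
For reference, the setup described in this post amounts to something like the sketch below. write_user_words is a hypothetical helper, and the one-word-per-line layout is an assumption. Whether Tesseract 3.00 actually picks the file up is exactly the open question here, so this only reproduces the test setup and is not a fix.

```shell
#!/bin/sh
# Sketch: write a word list to <tessdata>/<lang>.user-words.
# Assumption: the file holds one word per line. The bytes are written as
# given, so the shell's locale should be UTF-8 for non-ASCII words.
write_user_words() {
    tessdata_dir=$1
    lang=$2
    shift 2
    # Refuse to write an empty list.
    [ "$#" -gt 0 ] || return 1
    # printf repeats the format once per remaining argument (one per word).
    printf '%s\n' "$@" > "${tessdata_dir}/${lang}.user-words"
}
```

For the test described above, the call would be `write_user_words /path/to/tessdata eng abcdefghijklmnopqrstuvwxyz`, where the tessdata path is a placeholder.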


On May 10, 7:21 am, Max Cantor <mxcan...@gmail.com> wrote:
> Ok, I found the problem. the fix is described here: http://code.google.com/p/tesseract-ocr/issues/detail?id=356
>
> the output dir needs to end in a period.
>
> my bad.
>
> max
>
> >> On May 9, 2011, at 1:40 AM, zdenko podobny wrote:
>
> >> > see [1] or user-words on the same page.
>
> >> > [1]http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Puttin...
>
> >> > Zdenko

Max Cantor

May 10, 2011, 4:51:01 AM
to tesser...@googlegroups.com
Hi,

Well, it was answered well enough that I was able to make my own xxx.traineddata file. Unfortunately, even with that traineddata file, I'm running into the same problem you are: I can't seem to get Tesseract to use the freq-dawg that I included. I've been digging through the source code to find the right config but haven't succeeded yet. I'll let you and the group know when I do!

thanks,
max

Parmeet

May 10, 2011, 5:36:10 AM
to tesseract-ocr
Hi there,

@Max: Thanks, I hope you find the solution soon.

@Admin: It would be great if you could suggest something, as I think
correcting user words in the output is an important and useful
feature.

Thanks and Regards,
Parmeet