user-words

Bonny

Sep 27, 2011, 9:10:46 AM
to tesseract-ocr
Hello,

I have a question about user-words.
I use eng.traineddata and OCR works well. The problem is that the text
contains a lot of foreign names, and those are not recognized correctly. So
I created a file eng.user-words in the same directory as eng.traineddata
and put the names in it, one name per line. Then I ran the OCR
again, but there was no difference. So the question is:
is it enough to just create the file eng.user-words, or does something else
need to be done?

Thanks.

Slavko Kocjancic

Sep 27, 2011, 8:03:03 AM
to tesser...@googlegroups.com

Bonny

Sep 29, 2011, 7:44:21 AM
to tesser...@googlegroups.com
Does nobody know, or is the question too silly?

Calomer

Sep 29, 2011, 1:39:11 PM
to tesseract-ocr
I'll try my best to answer, though I'm hardly qualified.

According to the training instructions (at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3)
and general OCR knowledge, you cannot train just by listing new characters.
You need training images, and you need to create boxes for them (with any
box editor; I have only used Qt Box Editor). Once you have created boxes
around the characters in your new TIFF image and labeled them accordingly,
you should be ready for training.

Keep in mind that you'll need an x-height of at least 12 pixels
(preferably around 20 pixels). Variety in the images would also help
performance.

Follow the training instructions, train your own language file, and try
the OCR again. If it still fails, I'm sure someone with wider
knowledge than mine will be able to answer your further questions.

Sven Pedersen

Sep 29, 2011, 3:39:06 PM
to tesser...@googlegroups.com
Thanks Calomer.

Bonny, is the language you're trying to improve using a different set
of characters (alphabet)? If so, you'll need to do a lot of training
as Calomer described. Otherwise you'll just need some tweaks. The font
may be an issue.
--Sven

> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to tesser...@googlegroups.com
> To unsubscribe from this group, send email to
> tesseract-oc...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>

Calomer

Sep 30, 2011, 2:03:40 AM
to tesseract-ocr
Sven,

Now I'm curious. What kind of tweaks are you talking about?

Appending new fonts to the existing language training data?
Pre-enhancement of the image (skew transformation for italic
characters, contrast enhancement for low-contrast fonts, etc.)?

I'd love to know about any other tweaks there are.

Thanks

Slavko Kocjancic

Oct 1, 2011, 3:26:23 AM
to tesser...@googlegroups.com
It seems I was not clear enough, or my English is just not good enough.
So I'll try to explain again.
I have scans of English text, but the text contains a lot of foreign names
(using only English characters).
When I run the OCR, the text is recognized without problems, but
the names are often wrong, and the confidence (I use the command line and
hOCR output) is low for those words (names).

As I want to proofread the text, I wrote an application that shows the text
in an editor and the image in another window. I take the confidence from the
hOCR output to highlight text that Tesseract thinks may be wrong. In the
example, all the names are marked red because they are not in the dictionary.
(I use the prebuilt eng.traineddata.) The attached page is just the index,
and those names appear in the book many times. So I wonder whether I can put
those words (names) in eng.user-words to improve the confidence. I don't want
to train new characters or a new font; I just want to add new words to the
dictionary, and only for use with this particular book. Is that possible?

As far as I have discovered, just adding the text file eng.user-words has no
effect. So what steps are required to enable it?

Hopefully it's clear enough now.

Clipboard01.jpg
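
The proofreading step described above (flag the words whose hOCR confidence is low) can be sketched in C++. This is only a sketch under assumptions: it expects word spans of the form <span class='ocrx_word' ... title='bbox ...; x_wconf NN'>word</span>, which later Tesseract releases emit, and other versions may lay out the confidence differently. LowConfidenceWords and the threshold are hypothetical names, not part of Tesseract.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (word, confidence) pairs whose confidence is
// below `threshold`. Assumes each word looks like
//   <span class='ocrx_word' ... title='bbox 1 2 3 4; x_wconf 57'>word</span>
// with plain text (no nested tags) inside the span.
std::vector<std::pair<std::string, int>> LowConfidenceWords(
    const std::string& hocr, int threshold) {
  static const std::regex word_re(
      R"re(<span[^>]*class=['"]ocrx_word['"][^>]*x_wconf\s+(\d+)[^>]*>([^<]*)</span>)re");
  std::vector<std::pair<std::string, int>> result;
  for (std::sregex_iterator it(hocr.begin(), hocr.end(), word_re), end;
       it != end; ++it) {
    const int conf = std::stoi((*it)[1].str());
    if (conf < threshold) {
      result.emplace_back((*it)[2].str(), conf);
    }
  }
  return result;
}
```

A proofreading tool could call this once per page and mark the returned words in red; the threshold is a matter of taste and would need tuning per book.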

Sven Pedersen

Oct 1, 2011, 9:32:05 AM
to tesser...@googlegroups.com
Sounds like maybe a bad version of Tess -- which version do you have? Latest svn would be good for what you're doing.
Sven


Sven Pedersen

Oct 1, 2011, 9:29:25 AM
to tesser...@googlegroups.com
Yes, I think you have covered the tweaks I thought of suggesting.
Sven

B.J.

Oct 1, 2011, 7:36:29 PM
to tesseract-ocr
I ran into this problem recently. Here is the solution (I'm using
Tesseract 3.01):
To use a user-words list, find user_words_suffix in dict.h and dict.cpp
and change the "" to "user-words":

// dict.h
STRING_VAR_H(user_words_suffix, "user-words",
             "A list of user-provided words.");

// dict.cpp
STRING_INIT_MEMBER(user_words_suffix, "user-words",
                   "A list of user-provided words.",
                   getImage()->getCCUtil()->params()),

This assumes, then, that your tessdata folder contains a file named
"eng.user-words" with your user-made word list.

.bj.
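
Why the default "" means the file is never looked at can be illustrated with a small sketch. This is not Tesseract's actual code; it only assumes, based on the file name "eng.user-words" above, that the word list path is built from the tessdata folder, the language code and the suffix. UserWordsPath is a hypothetical name.

```cpp
#include <string>

// Hypothetical illustration (not Tesseract's actual code): builds the word
// list path from the tessdata directory, the language code and the suffix.
// An empty suffix yields an empty path, meaning "no user word list".
std::string UserWordsPath(const std::string& tessdata_dir,
                          const std::string& lang,
                          const std::string& suffix) {
  if (suffix.empty()) {
    return "";
  }
  return tessdata_dir + "/" + lang + "." + suffix;
}
```

With the suffix "user-words" and the language "eng", this gives "tessdata/eng.user-words"; with the empty default it gives nothing to open.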

Slavko Kocjancic

Oct 3, 2011, 3:20:05 AM
to tesser...@googlegroups.com
I have 3.01 from svn too.
Those fields were empty, so I modified them as you suggested. But I see no
difference in the OCR. The confidence is still low, and the misread words
are still misread.
If I remove 'eng.user-words', Tesseract aborts with an error about the
missing eng.user-words, so I assume that the file is opened and used.

So is there someone smart enough to explain how that
('lang.user-words') works?
And another thing: is there someone smart enough to change the source in
svn so that this is included, but only checks whether user-words exists
instead of raising an error? (As far as I know, lang.user-words is optional,
so it should stay that way.)

Thanks...
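
The behavior asked for above (use lang.user-words when it is present, carry on silently when it is not) could look roughly like this. It is a minimal sketch and not Tesseract's code; LoadOptionalWordList is a hypothetical name, and the one-word-per-line format is taken from the first message in this thread.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of Tesseract: reads one word per line from
// `path`. A missing or unreadable file is treated as an empty list instead of
// an error, which keeps the word list optional.
std::vector<std::string> LoadOptionalWordList(const std::string& path) {
  std::vector<std::string> words;
  std::ifstream in(path);
  if (!in) {
    return words;  // No file: nothing to add, no error.
  }
  std::string line;
  while (std::getline(in, line)) {
    // Drop a trailing carriage return from files saved with Windows line ends.
    if (!line.empty() && line.back() == '\r') {
      line.pop_back();
    }
    if (!line.empty()) {
      words.push_back(line);
    }
  }
  return words;
}
```

The caller can then add whatever comes back to its dictionary, and an empty result simply means there is nothing extra to add.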

Samuel backus

May 31, 2017, 3:20:12 AM
to tesseract-ocr, esl...@gmail.com
I had to recompile tesseract after updating dict.h and dict.cpp for this change to take effect. 

ShreeDevi Kumar

May 31, 2017, 5:16:21 AM
to tesser...@googlegroups.com
Samuel,

Do the user-words work as expected after making this change?

Which version of tesseract are you using?

ShreeDevi
