Bonny, is the language you're trying to improve using a different set
of characters (alphabet)? If so, you'll need to do a lot of training
as Calomer described. Otherwise you'll just need some tweaks. The font
may be an issue.
--Sven
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to tesser...@googlegroups.com
> To unsubscribe from this group, send email to
> tesseract-oc...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
--
“All that is gold does not glitter,
not all those who wander are lost;
the old that is strong does not wither,
deep roots are not reached by the frost.
From the ashes a fire shall be woken,
a light from the shadows shall spring;
renewed shall be blade that was broken,
the crownless again shall be king.”
It seems I wasn't clear enough, or my English just isn't good enough,
so I'll try to explain again.
I have scans of English text, but the text contains a lot of foreign
names (written with English characters only).
When I run the OCR, the text is recognized without problems, but the
names are often wrong, and the confidence (I use the command line with
hOCR output) is low on those words (the names).
Since I want to proofread the text, I wrote an application that shows
the text in an editor and the image in another window. I take the
confidence from the hOCR output to highlight text that tess thinks may
be wrong. All the names are marked red in the example because they are
not in the dictionary (I use the prebuilt eng.traineddata). The attached
page is just the index, and those names appear in the book many times.
So I wonder whether I can put those words (names) in eng.user-words to
improve the confidence. I don't want to train new characters or a new
font; I just want to add new words to the dictionary, and only for this
particular book. Is that possible?
As far as I have discovered, just adding the text file eng.user-words
has no effect. So what steps are required to enable it?
Hopefully it's clear enough now.
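
To make the idea concrete, here is a minimal sketch of how the candidate
names could be collected from the hOCR output and written out as a word
list, one word per line. It assumes each word appears as a span with
class ocrx_word whose title attribute carries the confidence as x_wconf;
the function names, the threshold and the output path are made up for
illustration. It only prepares the list and says nothing about whether
tess actually picks the file up.

```python
import re
from html import unescape

# Assumption: each word in the hOCR looks roughly like
#   <span class='ocrx_word' ... title='bbox 1 2 3 4; x_wconf 45'>Word</span>
# with the class attribute appearing before the title attribute.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"][^'\"]*x_wconf\s+(-?\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def low_confidence_words(hocr_text, threshold=70):
    """Return the sorted, unique words whose x_wconf is below threshold."""
    words = set()
    for conf, inner in WORD_RE.findall(hocr_text):
        # Drop nested markup such as <strong>/<em> and decode entities.
        text = unescape(TAG_RE.sub("", inner)).strip()
        # Remove punctuation that sticks to the word in running text.
        text = text.strip(".,;:!?()[]\"'")
        if text and int(conf) < threshold:
            words.add(text)
    return sorted(words)


def write_user_words(words, path):
    """Write one word per line (hypothetical target: eng.user-words)."""
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")
```

The threshold of 70 is arbitrary, and the confidence scale may differ
between versions, so the resulting list would still need a manual pass
to throw out ordinary misreads before it is used as a dictionary.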
I have 3.01 from svn too.
Those fields are empty, so I modified them as you suggested, but I see
no difference in the OCR. The confidence is still low and the misread
words are still misread.
If I remove 'eng.user-words', tess just aborts with messages about the
missing eng.user-words, so I assume the file is opened and used.
So is there someone smart enough to explain how 'lang.user-words'
works?
One other thing: is there someone smart enough to change the source in
svn so that this is included, but so that it only checks whether the
user-words file exists instead of raising an error? (As far as I know,
lang.user-words is optional, so it should stay that way.)
Thanks...
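
Until the source handles this, a wrapper on the calling side could work
around the abort. The sketch below only builds the command line and
adds the extra config when the word list is really there. It assumes
the usual 3.x invocation (tesseract image outputbase -l lang
configfile...), the stock hocr config, and a config file of your own
(here called user_words, a made-up name) that switches the list on,
which I believe is done through the user_words_suffix variable.

```python
import os


def build_tesseract_command(image, output_base, tessdata_dir, lang="eng",
                            user_words_config="user_words"):
    """Build the argument list for tesseract with hOCR output.

    user_words_config is a hypothetical config file name; it is appended
    only when <tessdata_dir>/<lang>.user-words exists, so a missing word
    list never reaches tesseract.
    """
    cmd = ["tesseract", image, output_base, "-l", lang, "hocr"]
    user_words = os.path.join(tessdata_dir, lang + ".user-words")
    if os.path.isfile(user_words):
        cmd.append(user_words_config)
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). The
same existence check inside tess itself would be the real fix; this
just keeps the file optional from the outside.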