retrieve words not matching the dictionary


Meenal Goyal

Jun 30, 2014, 4:40:10 AM
to tesser...@googlegroups.com
Hi,

When I run Tesseract on my image, it produces some words that are not present in the dictionary. Is there some way to get a list of these words directly and to prevent Tesseract from showing them in the output?
Examples of such words are: fiJfifilnlflfiflhu-«fifllfllfilfi, neefls», oscxmwxufis, etc.

Nick White

Jun 30, 2014, 11:14:29 AM
to tesser...@googlegroups.com
Hi Meenal,
Do you mean you want Tesseract to only match dictionary words, or
recognise, but not print, words that aren't in the dictionary?

Nick

Meenal Goyal

Jul 1, 2014, 5:04:36 AM
to tesser...@googlegroups.com
Hi Nick,

When I try to OCR an image, Tesseract also produces some noise alongside the meaningful words. An example output for an image is:

All women become

like their’ mqthers. _ ' 1"’ '

- —T at-{rs their tragedy. ” "R"-‘»“T‘*'-.
‘ .

/

 

N man does“

That's‘his. ‘ '

os'cAR»w;L'15E ‘ 9

So I want something that removes the noise in the text, or at least reduces it, and produces correct output.

Nick White

Jul 1, 2014, 11:22:35 AM
to tesser...@googlegroups.com
Hi Meenal,

On Tue, Jul 01, 2014 at 02:04:36AM -0700, Meenal Goyal wrote:
> When I try to ocr an image, it also produces some noise apart from the
> meaningful words. An example output for an image is:
>
> All women become
>
> like their’ mqthers. _ ' 1"’ '
>
> - —T at-{rs their tragedy. ” "R"-‘»“T‘*'-.
> ‘ .
>
> /
>
>
>
> N man does“
>
> That's‘his. ‘ '
>
> os'cAR»w;L'15E ‘ 9
>
> So, I wanted something which removes the noise in the text or at least reduce
> it and produce correct output.

I see. The best plan would be to preprocess the image to clean it
up, so that Tesseract isn't seeing all that noise in the first
place. Check out this wiki page:
https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality

If you want to send a specific example image to the mailing list, we
can try to offer more specific advice.

Nick

Meenal Goyal

Jul 2, 2014, 2:48:07 AM
to tesser...@googlegroups.com
Hi Nick,

I had already read that page and also tried preprocessing the image. This is the input image http://imgur.com/yCxOvQS,GD38rCa which after preprocessing gives this http://imgur.com/JzrDkug . I wanted to know if there is some way to improve the result in the post-processing phase. Right now I am using regex matching to filter the noise, but it doesn't work in all cases. For example:
"does‘?", "That's‘his.", "their’" are some words which should not be treated entirely as noise, but they get filtered out by the regex matching.

Also, is there any way to retrain Tesseract to improve results in such cases? Is there any feedback mechanism that can help it improve?
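
A minimal sketch of one post-processing alternative to the whole-token regex filtering described above: trim punctuation from the edges of each token before looking it up in a word list, so that "their’" survives as "their" and "That's‘his." is split into two words. The names `WORDS`, `EDGE_CHARS`, `SEPARATORS` and `filter_line` are hypothetical and not part of Tesseract; a real word list would be loaded from a dictionary file rather than hard-coded.

```python
import re
import string

# Hypothetical stand-in for a real word list loaded from a dictionary file.
WORDS = {"all", "women", "become", "like", "their", "man", "does", "that's", "his"}

# Characters trimmed from token edges: ASCII punctuation plus the curly
# quotes and guillemets that appear in the sample output (an assumption).
EDGE_CHARS = string.punctuation + "‘’“”«»"

# Split on whitespace and on quote marks that OCR sometimes glues between
# two words (as in "That's‘his."). The right curly quote is not a
# separator because it can be a genuine apostrophe.
SEPARATORS = re.compile(r"[\s‘“«»]+")


def filter_line(line, words=WORDS):
    """Return (kept, rejected) tokens for one line of OCR output."""
    kept, rejected = [], []
    for token in SEPARATORS.split(line):
        if not token:
            continue  # empty pieces at the ends of the line
        core = token.strip(EDGE_CHARS)
        if core.lower() in words:
            kept.append(core)       # dictionary word, edge noise removed
        else:
            rejected.append(token)  # reported unchanged for inspection
    return kept, rejected


kept, rejected = filter_line("That's‘his. ‘ '")
print(kept)
print(rejected)
```

The rejected list doubles as the "words not matching the dictionary" asked about at the start of the thread. Real words that are missing from the word list (names, for instance) would still be rejected, so this reduces noise rather than guaranteeing correct output.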



Nick White

Jul 2, 2014, 10:10:56 AM
to tesser...@googlegroups.com
That's a tough thing to preprocess. Take a look at this recent
thread on this list: "question about training tesseract".

Nick

Meenal Goyal

Jul 3, 2014, 1:26:17 AM
to tesser...@googlegroups.com
Hi Nick,

The thread "question about training tesseract" only suggests some pre-processing steps, including binarisation, and I have already tried them. I wanted to know if anything can be done to improve the output at a later stage, something like adding words to the dictionary used by Tesseract.

I have tried listing words in eng.user-words, but it wasn't very useful. Can you suggest anything of this sort that can train Tesseract over time and help improve the output?

Thanks,
Meenal
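
For readers trying the same thing, a minimal sketch of preparing a user-words file in the plain "one word per line" layout. Assumptions, to be checked against the installed version: Tesseract 3.x looks for the file as `<lang>.<suffix>` inside its tessdata directory, and only when the `user_words_suffix` variable is set in a config file. The helper names `write_user_words` and `write_suffix_config` are hypothetical, and the demo writes to a temporary directory rather than a real tessdata directory.

```python
import tempfile
from pathlib import Path


def write_user_words(tessdata_dir, words, lang="eng", suffix="user-words"):
    """Write unique words, one per line, to <tessdata_dir>/<lang>.<suffix>."""
    path = Path(tessdata_dir) / f"{lang}.{suffix}"
    unique = sorted({w.strip() for w in words if w.strip()})
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return path


def write_suffix_config(config_path, suffix="user-words"):
    """Write a config file that sets user_words_suffix (assumed requirement)."""
    path = Path(config_path)
    path.write_text(f"user_words_suffix {suffix}\n", encoding="utf-8")
    return path


# Demo in a temporary directory standing in for the real tessdata directory.
with tempfile.TemporaryDirectory() as tmp:
    words_path = write_user_words(tmp, ["Wilde", "Oscar", "Wilde"])
    config_path = write_suffix_config(Path(tmp) / "userwords_config")
    file_name = words_path.name
    written_words = words_path.read_text(encoding="utf-8")
    written_config = config_path.read_text(encoding="utf-8")

print(file_name)
print(written_words)
print(written_config)
```

This only makes the extra words available to the recogniser. It does not answer how strongly they are weighted against the built-in dictionaries.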

Nick White

Jul 3, 2014, 5:18:41 PM
to tesser...@googlegroups.com
On Wed, Jul 02, 2014 at 10:26:16PM -0700, Meenal Goyal wrote:
> The post about "question about training tesseract" only suggests some
> pre-processing steps which include binarisation and I have already tried them.
> I wanted to know if anything can be done to improve output at later stage,
> something like adding the words to the dictionary used by tesseract.

OK, I see. The reason I recommended binarisation is that I suspect
you'll have a lot more luck with that than anything else, for your
problems.

> I have tried listing words in eng.user-words but it wasn't much useful. Can you
> suggest anything of this sort which can train tesseract over the time and help
> improve the output.

If you're sure that all the words you will encounter will be in the
dictionary this should help somewhat:
https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?

Nick
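
A minimal sketch of what "increasing the strength of the dictionary" can look like in practice, assuming the FAQ entry refers to the Tesseract 3.x variables `language_model_penalty_non_dict_word` and `language_model_penalty_non_freq_dict_word`. The numeric values, the config file name `dict_strength` and both helper functions are illustrative assumptions, not recommendations; the command is only built, not executed.

```python
# Illustrative values only; suitable values have to be found by experiment.
PENALTIES = {
    "language_model_penalty_non_dict_word": 0.5,
    "language_model_penalty_non_freq_dict_word": 0.3,
}


def render_config(variables):
    """Render Tesseract config text: one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())


def build_command(image, output_base, config_name, lang="eng"):
    """Build the tesseract argument list; config names go last on the CLI."""
    return ["tesseract", str(image), str(output_base), "-l", lang, str(config_name)]


config_text = render_config(PENALTIES)
command = build_command("input.png", "output", "dict_strength")
print(config_text)
print(" ".join(command))
```

Raising these penalties pushes the recogniser toward dictionary words, so text containing words outside the dictionary may get worse rather than better.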

Meenal Goyal

Jul 4, 2014, 5:08:46 AM
to tesser...@googlegroups.com


On Friday, July 4, 2014 2:48:41 AM UTC+5:30, Nick White wrote:
> On Wed, Jul 02, 2014 at 10:26:16PM -0700, Meenal Goyal wrote:
> > The post about "question about training tesseract" only suggests some
> > pre-processing steps which include binarisation and I have already tried them.
> > I wanted to know if anything can be done to improve output at later stage,
> > something like adding the words to the dictionary used by tesseract.
>
> OK, I see. The reason I recommended binarisation is that I suspect
> you'll have a lot more luck with that than anything else, for your
> problems.

I have tried binarisation, and it certainly helped improve the output.

> > I have tried listing words in eng.user-words but it wasn't much useful. Can you
> > suggest anything of this sort which can train tesseract over the time and help
> > improve the output.
>
> If you're sure that all the words you will encounter will be in the
> dictionary this should help somewhat:
> https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?

The words won't always be in the dictionary, so I tried adding them to the file eng.user-words, but I am confused about the weight given to this file relative to the predefined dictionaries.
Also, I had read that FAQ entry about strengthening the dictionary earlier and tried modifying some variables in the configuration file. But then it starts recognizing wrong words; maybe it is a case of over-correcting.

Nick White

Jul 4, 2014, 10:15:54 AM
to tesser...@googlegroups.com
On Fri, Jul 04, 2014 at 02:08:46AM -0700, Meenal Goyal wrote:
> If you're sure that all the words you will encounter will be in the
> dictionary this should help somewhat:
> https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_
> increase_the_trust_in/strength_of_the_dictionary?
>
> The words won't always be in dictionary so I tried adding them in file
> eng.user-words but i m confused about the weightage given to this file against
> the already defined dictionaries.
> Also, I have read that post earlier about strengthening the dictionary and
> tried to modify some variables in the configuration file. But then it starts
> recognizing wrong words, may be its the case of over-correcting.

Yes, that's the problem with just emphasising the dictionary.
Ultimately if you're giving Tesseract a lot of noise, it's going to
be very hard to stop it producing garbage output. So I'm afraid
better binarisation is all I can recommend.
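
For readers unfamiliar with the term, binarisation turns a greyscale image into pure black and white before OCR. A minimal sketch using Otsu's global threshold, chosen here only because it is short; it is not the method either poster used, and local (adaptive) methods usually cope better with uneven backgrounds. The function names and the flat list of 0-255 integer pixel values are assumptions for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]            # pixels at or below t (background class)
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue               # no pixels in the background class yet
        w_fg = total - w_bg        # pixels above t (foreground class)
        if w_fg == 0:
            break                  # every pixel is already in the background
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarise(pixels, threshold):
    """Map pixels at or below the threshold to 0 (black), others to 255."""
    return [0 if p <= threshold else 255 for p in pixels]


# Tiny synthetic "image": four dark ink pixels and four light paper pixels.
pixels = [20, 22, 25, 30, 200, 210, 220, 230]
threshold = otsu_threshold(pixels)
binary = binarise(pixels, threshold)
print(threshold, binary)
```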

Meenal Goyal

Jul 7, 2014, 1:03:53 AM
to tesser...@googlegroups.com
Hi Nick,

I am using this technique for binarisation: http://liris.cnrs.fr/christian.wolf/software/binarize/ . Could you recommend anything better than this one?

Thanks.