Pharmaceutics OCR recognition project


elena bresciani

Jun 11, 2014, 6:49:34 AM
to tesser...@googlegroups.com
Hello everybody,

for the project I'm working on, I need to automatically recognize a drug from an image of its package.
I tried Tesseract, but the results are not very good. In particular, certain words (especially the drug names) are sometimes badly misrecognized, and other words (even ones printed in large fonts) are missing entirely.

How can I resolve these issues?
Maybe I have to train Tesseract with a "drug dictionary"?
And how can I solve the problem of completely missing words?

Thank you in advance

Cheers
Elena

Paul

Jun 14, 2014, 10:11:58 AM
to tesser...@googlegroups.com
Could you show us an example image that gives you bad results?

It would probably be useful to try a different technique for image binarization.
Tesseract uses Otsu's method. I would suggest using a method like this one by Kasar et al.
It can be helpful with colored imagery and with white text on a black or colored background.
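
For reference, here is a minimal sketch of what a global Otsu threshold does, assuming the image has already been flattened to a list of 8-bit grey levels. The function names are made up for illustration and this is not Tesseract's own code, which differs in detail; it only shows why a single threshold per image can struggle on packaging with several background colors.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximises between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0      # number of pixels with value <= t
    sum_low = 0    # sum of grey levels of those pixels
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue            # no pixels in the low class yet
        w_high = total - w_low
        if w_high == 0:
            break               # every pixel is already in the low class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (value <= threshold) or 255 (value > threshold)."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Toy "image": dark text pixels (20) on a light background (200).
    sample = [20] * 50 + [200] * 50
    t = otsu_threshold(sample)
    print("threshold:", t)
    print("binarized classes:", sorted(set(binarize(sample, t))))
```

Because the threshold is chosen once for the whole image, light text on a colored band can land in the same class as the background around dark text elsewhere on the package. Locally adaptive methods pick the threshold per region instead.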

Your idea of adding a drug dictionary could also be beneficial. You don't necessarily need to start a new training run, though.
Using the bazaar config with your own "eng.user-words" file might be enough (see http://tesseract-ocr.googlecode.com/svn-history/r1116/trunk/doc/tesseract.1.html).
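
A rough sketch of how that could be scripted, assuming the user-words file is plain text with one word per line and lives in the tessdata directory as `<lang>.user-words`. The helper names, the sample drug names, and the file names are placeholders, not part of Tesseract.

```python
import tempfile
from pathlib import Path


def write_user_words(words, tessdata_dir, lang="eng"):
    """Write a cleaned word list to <tessdata_dir>/<lang>.user-words.

    Words are stripped, de-duplicated, sorted, and written one per line.
    """
    cleaned = sorted({w.strip() for w in words if w.strip()})
    path = Path(tessdata_dir) / f"{lang}.user-words"
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return path


def tesseract_command(image, output_base, lang="eng"):
    """Build the argument list: tesseract <image> <outputbase> -l <lang> bazaar."""
    return ["tesseract", str(image), str(output_base), "-l", lang, "bazaar"]


if __name__ == "__main__":
    # A temporary directory stands in for the real tessdata directory here.
    with tempfile.TemporaryDirectory() as tessdata:
        words_file = write_user_words(["Paracetamol", "Ibuprofene"], tessdata, lang="ita")
        print(words_file.read_text(encoding="utf-8"), end="")
    # The command is only printed, not executed.
    print(" ".join(tesseract_command("package.png", "result", lang="ita")))
```

The command list can be passed to `subprocess.run` once the file is in the tessdata directory of the actual installation.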

elena bresciani

Jun 17, 2014, 5:43:06 AM
to tesser...@googlegroups.com
I tried using bazaar with my user-words and the results are much better. Working on image pre-processing also helped to improve the output.

I have another issue now: I expanded my list of user-words to about 7000 words, but I get this error:

 >>Error: word '......' not in DAWG after adding it
 >>Error: failed to load /usr/local/share/tessdata/ita.user-words

I found a report of the problem here: https://code.google.com/p/tesseract-ocr/issues/detail?id=1020
but I still don't know how to solve it. Reading through the source code (in dict.h), I found the following, as mentioned in the report:

   static const int kMaxUserDawgEdges = 50000;

Is this what causes the error? But my list has only 7000 words, which is far fewer than 50000...
I don't understand.

Thank you very much.

Elena

Paul

Jun 18, 2014, 10:30:03 AM
to tesser...@googlegroups.com
The number of edges in the DAWG is not equivalent to the number of words in your dictionary.
Here's some information about DAWGs: http://tesseract-ocr.repairfaq.org/allaboutdawg.html

That upper bound might actually be the root of your problem. If you've already compiled Tesseract yourself,
try a larger value for kMaxUserDawgEdges. If you have not, you could either reduce the number of
words in your dictionary or add the dictionary during training.
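
To illustrate why the edge count can be much larger than the word count, here is a small sketch that counts the edges of a plain prefix tree (one edge per distinct non-empty prefix). This is a simplification for illustration only: Tesseract's internal structure is more involved and its bookkeeping may count edges differently. The function name and the sample words are made up; the constant is the value quoted from dict.h above.

```python
K_MAX_USER_DAWG_EDGES = 50000  # value quoted from dict.h earlier in the thread


def count_trie_edges(words):
    """Count edges in a plain prefix tree built from the given words.

    Every distinct non-empty prefix corresponds to exactly one edge,
    so shared prefixes are counted only once.
    """
    prefixes = set()
    for word in words:
        for end in range(1, len(word) + 1):
            prefixes.add(word[:end])
    return len(prefixes)


if __name__ == "__main__":
    words = ["aspirina", "aspirinetta", "tachipirina"]
    edges = count_trie_edges(words)
    print(len(words), "words ->", edges, "edges")
    print("within limit:", edges <= K_MAX_USER_DAWG_EDGES)
```

In this simplified model, a list of 7000 words averaging 8 letters needs up to 56000 edges if few prefixes are shared, which would already exceed 50000.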

Regards,
Paul

Nick White

Jun 19, 2014, 12:26:31 PM
to tesser...@googlegroups.com
On Wed, Jun 18, 2014 at 07:30:03AM -0700, Paul wrote:
> That upper bound actually might be the root of your problem. If you've already
> compiled Tesseract on your own, try to use a greater number for
> kMaxUserDawgEdges. If you have not, you could either reduce the number of
> words in your dictionary or add the dictionary during training.

As stated in the report, the problem is now fixed in recent SVN
versions of the code, so another solution would be to
just compile the latest SVN.

Nick

elena bresciani

Jun 20, 2014, 4:27:45 AM
to tesser...@googlegroups.com
Yep, that was the solution.
Now it works fine even with bigger dictionaries.

Thank you guys.

Elena




Paul

Jun 23, 2014, 6:45:01 AM
to tesser...@googlegroups.com
Great to hear that.

Paul