Tesseract configuration for alphanumeric strings: mixes up 2, Z, 6 and G

Timothy Korse

Jun 26, 2016, 4:27:48 PM
to tesseract-ocr

I'm trying to configure Tesseract to recognize alphanumeric strings that are 10 characters long (all uppercase).


This works pretty well, except that it often mixes up the following characters:

  • 2 and Z
  • 6 and G
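
A minimal sketch of how such results could be screened after OCR, assuming the output is available as a plain string. It checks the expected shape (10 uppercase letters or digits) and lists the positions holding one of the characters reported as confusable, so those positions can be reviewed. The class name, method names and the sample string are made up for illustration and are not part of Tesseract or Tess4J.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ConfusableCheck {
    // Expected shape from the post: exactly 10 uppercase letters or digits.
    private static final Pattern CODE = Pattern.compile("[A-Z0-9]{10}");
    // Characters reported as being mixed up: 2/Z and 6/G.
    private static final String CONFUSABLE = "2Z6G";

    /** Returns true if the text is exactly 10 uppercase alphanumeric characters. */
    static boolean hasExpectedShape(String text) {
        return CODE.matcher(text).matches();
    }

    /** Returns the zero-based positions that hold a confusable character. */
    static List<Integer> confusablePositions(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (CONFUSABLE.indexOf(text.charAt(i)) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String result = "A2B6CZ9GQ1"; // made-up OCR output, not from the thread
        System.out.println("shape ok: " + hasExpectedShape(result));
        System.out.println("review positions: " + confusablePositions(result));
    }
}
```

This does not improve recognition itself; it only shows which characters of a result deserve a second look.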


Examples of images are:

These are preprocessed. I think the preprocessing was successful, but I'd be glad to hear otherwise.


This is how I run Tesseract:


tesseract = new Tesseract();                      
tesseract.setOcrEngineMode(TessAPI.TessOcrEngineMode.OEM_TESSERACT_ONLY);
tesseract.setPageSegMode(7);
tesseract.setTessVariable("load_system_dawg", "0");
tesseract.setTessVariable("load_freq_dawg", "0");
tesseract.setTessVariable("load_punc_dawg", "0");
tesseract.setTessVariable("load_number_dawg", "0");
tesseract.setTessVariable("load_unambig_dawg", "0");
tesseract.setTessVariable("load_bigram_dawg", "0");
tesseract.setTessVariable("load_fixed_length_dawgs", "0");

tesseract.setTessVariable("classify_enable_learning", "0");
tesseract.setTessVariable("classify_enable_adaptive_matcher", "0");

tesseract.setTessVariable("segment_penalty_garbage", "0");
tesseract.setTessVariable("segment_penalty_dict_nonword", "0");
tesseract.setTessVariable("segment_penalty_dict_frequent_word", "0");
tesseract.setTessVariable("segment_penalty_dict_case_ok", "0");
tesseract.setTessVariable("segment_penalty_dict_case_bad", "0");


Note that this is Java code, but my question is not limited to Java.
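
Since the question is not limited to Java: Tesseract can also read variables from a plain-text config file with one `name value` pair per line. The sketch below only turns a map of variables into such lines. The class and method names are hypothetical, and only three of the variables from the code above are shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TessConfigLines {
    /** Formats each variable as one "name value" line, keeping insertion order. */
    static List<String> toConfigLines(Map<String, String> variables) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            lines.add(entry.getKey() + " " + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        // A few of the variables from the post; the rest follow the same pattern.
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("load_system_dawg", "0");
        variables.put("load_freq_dawg", "0");
        variables.put("classify_enable_learning", "0");
        for (String line : toConfigLines(variables)) {
            System.out.println(line);
        }
    }
}
```

The printed lines could be saved to a file and given to the command-line tool as a config file, which keeps the settings identical between the Java wrapper and other front ends.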

I am not very experienced with Tesseract and find the documentation very unclear. I hope someone can help me out.


To give some more context:


How do I train Tesseract?


I train Tesseract by combining over 200 images into one image. Every image contains 10 alphanumeric characters. Also, I am sure the box file is correct.
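
One way to double-check the training data is to count how many samples each character has in the box file, since sparse coverage of 2, Z, 6 or G could contribute to the confusion. This is a rough sketch, assuming the usual Tesseract 3 box format of `symbol left bottom right top page` per line. The class name and the sample lines are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BoxFileStats {
    /** Counts how often each symbol (first field of a box line) occurs. */
    static Map<String, Integer> countSymbols(List<String> boxLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : boxLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String symbol = trimmed.split("\\s+")[0];
            counts.merge(symbol, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample lines; a real run would read the lines of the box file instead.
        List<String> sample = Arrays.asList(
                "2 10 5 40 60 0",
                "Z 50 5 80 60 0",
                "6 90 5 120 60 0",
                "G 130 5 160 60 0",
                "2 170 5 200 60 0");
        System.out.println(countSymbols(sample));
    }
}
```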


I build the final language by executing the following batch script:

tesseract qwe.combined.jpg qwe.combined.box nobatch box.train

echo combined 1 0 0 0 0 > font_properties

unicharset_extractor qwe.combined.box

shapeclustering -F font_properties -U unicharset qwe.combined.box.tr

mftraining -F font_properties -U unicharset -O qwe.unicharset qwe.combined.box.tr

cntraining qwe.combined.box.tr

copy inttemp qwe.inttemp
copy normproto qwe.normproto
copy pffmtable qwe.pffmtable
copy shapetable qwe.shapetable

combine_tessdata qwe.

How can I make Tesseract discriminate better between 2 and Z, and between 6 and G?

Allistair

Jun 26, 2016, 4:49:09 PM
to tesser...@googlegroups.com
Did you ever look at incorporating the unicharambigs file into your training?


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/bba1f122-6bb2-43f6-9a7d-9daa75f5323e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Timothy Korse

Jun 27, 2016, 4:37:36 AM
to tesseract-ocr
Hi Allistair,

Thank you for your response. Yes, I actually tried that, without luck. I think unicharambigs is useful when using dictionaries, which I do not use. I can't simply replace every 2 with a Z, because the character might just as well really be a 2.

I tried the following format:

v1
1 Z 1 2 x
1 2 1 Z x
1 G 1 6 x
1 6 1 G x
1 M 1 H x
1 H 1 M x

Where x is of course the mode. For this setting I tried 0, 1 and 3. Unfortunately, modes other than 0 and 1 are not documented. From looking at the Tesseract source code I thought that 3 might do the trick, but it didn't.
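
For trying several modes without editing the file by hand, the entries can be generated. The sketch below copies the space-separated layout shown above (a `v1` header, then one line per direction of each pair); whether Tesseract accepts exactly this layout is not verified here. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class AmbigsLines {
    /**
     * Builds the lines of a unicharambigs-style file in the layout used in the post:
     * a "v1" header, then "1 A 1 B mode" and "1 B 1 A mode" for every pair {A, B}.
     */
    static List<String> ambigLines(String[][] pairs, int mode) {
        List<String> lines = new ArrayList<>();
        lines.add("v1");
        for (String[] pair : pairs) {
            lines.add("1 " + pair[0] + " 1 " + pair[1] + " " + mode);
            lines.add("1 " + pair[1] + " 1 " + pair[0] + " " + mode);
        }
        return lines;
    }

    public static void main(String[] args) {
        // The three pairs listed in the post.
        String[][] pairs = { {"Z", "2"}, {"G", "6"}, {"M", "H"} };
        for (String line : ambigLines(pairs, 0)) {
            System.out.println(line);
        }
    }
}
```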

Am I doing something wrong?


On Sunday, June 26, 2016 at 22:49:09 UTC+2, Allistair C wrote:

Allistair

Jun 27, 2016, 4:52:58 AM
to tesser...@googlegroups.com
Have you tried the generally useful "increase your image sizes until it works" approach? I'm not sure whether the samples you posted were at actual size, but I have read in the past that this problem *can* lessen with larger image sizes, even with artificially upsampled images.

Timothy Korse

Jun 27, 2016, 5:43:52 AM
to tesseract-ocr
Yes, these images are actually upsampled. The height of the characters in the input source is about 20 pixels; now they are about 100 pixels. I can see the difference between, for instance, the 2 and the Z quite clearly, so I am sure that Tesseract can too.

Please let me know if someone needs more information in order to help me out.

I really appreciate your help!

On Monday, June 27, 2016 at 10:52:58 UTC+2, Allistair C wrote:

Allistair

Jun 27, 2016, 5:50:58 AM
to tesser...@googlegroups.com
Have you perhaps tried upsizing without sampling? That gives you harsher edges at the larger image size, which may allow Tesseract to fit its classifications better.
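
If "upsizing without sampling" is read as enlarging without interpolation (an assumption about what is meant here), that corresponds to nearest-neighbour scaling: every source pixel becomes a solid block, so no intermediate grey values appear at the edges. A minimal sketch using only `java.awt.image.BufferedImage`; the class name and the tiny test image are made up.

```java
import java.awt.image.BufferedImage;

final class NearestNeighborUpscale {
    /** Enlarges the image by an integer factor, copying pixels without interpolation. */
    static BufferedImage upscale(BufferedImage src, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        int width = src.getWidth() * factor;
        int height = src.getHeight() * factor;
        BufferedImage dst = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Each destination pixel takes the colour of the source pixel it falls in.
                dst.setRGB(x, y, src.getRGB(x / factor, y / factor));
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        // Tiny made-up image: one black and one white pixel.
        BufferedImage src = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        src.setRGB(0, 0, 0x000000);
        src.setRGB(1, 0, 0xFFFFFF);
        // Factor 5 matches the 20 px to 100 px enlargement mentioned in the thread.
        BufferedImage big = upscale(src, 5);
        System.out.println(big.getWidth() + " x " + big.getHeight());
    }
}
```

Whether hard edges or smoothly interpolated edges work better for a given font is something to test on the actual images.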

Timothy Korse

Jun 27, 2016, 5:53:47 AM
to tesser...@googlegroups.com
I'm afraid I misunderstood `upsampling`. I just upsized the ROI. But I'll definitely look into upsampling!


Eric Medin

Nov 23, 2016, 1:48:47 PM
to tesseract-ocr
Timothy,

Did you ever get this working? It looks like you and I are working on a similar problem.