Strange recognition


Gennady Goncharov

Oct 31, 2014, 9:03:28 AM
to tesser...@googlegroups.com
Hi, first of all I must say thank you to all the developers who did so much work to create Tesseract OCR! That's really cool!

But I have one problem using it that I can't solve by myself, so please help me!

The result of recognizing the attached image is 105 500. Why is it 500 and not 600? I tried a lot of ways to solve it:
- I used tesseract2 and tesseract3;
- I tried to train it on the special font with jTessBoxEditor (after that it doesn't work at all: I get an error about attempting to read data);
- I tried scaling the image up four times with HighQualityBicubic interpolation.

Nothing works. So I have a question: could anybody train tesseract3 for the font supercell-magic? It is attached. I need only digits.

Any help would be greatly appreciated!
333.bmp
Supercell-magic-webfont.ttf

ShreeDevi Kumar

Oct 31, 2014, 10:20:18 AM
to tesser...@googlegroups.com
Change the image to 300 DPI.

Try VietOCR in screenshot mode.
Try it with the Vietnamese traineddata.

----------------

With command-line Tesseract, pass the 'digits' config file as a parameter:

"Recognizing only numbers is actually answered on the Tesseract FAQ page. See that page for more info, but if you have the version 3 package, the config files are already set up. You just specify on the command line:
tesseract image.tif outputbase nobatch digits"


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Gennady Goncharov

Oct 31, 2014, 10:44:38 AM
to tesser...@googlegroups.com
Thank you very much for the answer! But as I said, VietOCR doesn't produce working files for me and I don't know why. Even if I use the "vie" language data made by that program without any changes, I get the error "Attempted to read or write protected memory" on the line
if (processor.Init(@".\tessdata\\", "vie", (int)eOcrEngineMode.OEM_DEFAULT))

By the way, do you see the two trailing slashes? :) I got the same error with the native "eng" language when I used one slash or no slashes at all. I tried two slashes and it began to work! But for the "vie" language data that trick doesn't work.

As for recognizing only numbers, I prefer to use

processor.SetVariable("tessedit_char_whitelist", "0123456789");

It works fine, but I sometimes get wrong numbers.
Thanks for your advice about resolution, I'll try it.

On Friday, October 31, 2014 at 17:20:18 UTC+3, shree wrote:

ShreeDevi Kumar

Oct 31, 2014, 11:03:26 AM
to tesser...@googlegroups.com
Quan will be able to answer the question regarding the 'memory fault'.
Are you using the Java version of VietOCR?
Since you have tried multiple versions of Tesseract, there may be a conflict between TESSDATA_PREFIX paths.


The 'digits' config file just contains

tessedit_char_whitelist 0123456789-.

so it does what you want from the command line.

-----------
Here's what I get with the Java version (4.0 beta) of VietOCR:

Inline image 1

ShreeDevi

Gennady Goncharov

Nov 1, 2014, 4:56:16 AM
to tesser...@googlegroups.com
Thanks a lot for your attention. I found a good .NET Tesseract wrapper on GitHub, and with it I now use the latest version of the OCR engine, so the number from the original post (105600) is recognized correctly.

But I have another problem: even VietOCR can't recognize the number in the image below! It's very strange that it gives an empty result. Could you check, please?

333.bmp

Quan Nguyen

Nov 1, 2014, 1:38:20 PM
to tesser...@googlegroups.com
The image is really small -- it needs 300 DPI.

Nevertheless, VietOCR 4.0 beta, which uses Tesseract 3.03 RC, can pick it up without any problem.

If you use the .NET version, be sure to scale the image first.
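
The scaling step could look roughly like the following. This is a minimal sketch, assuming System.Drawing is available; the class name OcrPreprocessing, the helper name ScaleForOcr and its factor parameter are hypothetical and are not part of Tesseract or any wrapper. The factor of 4 and the HighQualityBicubic mode are the values mentioned earlier in this thread.

```csharp
using System.Drawing;
using System.Drawing.Drawing2D;

static class OcrPreprocessing
{
    // Hypothetical helper (not a Tesseract API): returns an enlarged copy of
    // `source`, resampled with bicubic interpolation and tagged as 300 DPI.
    public static Bitmap ScaleForOcr(Bitmap source, int factor)
    {
        var scaled = new Bitmap(source.Width * factor, source.Height * factor);
        using (var g = Graphics.FromImage(scaled))
        {
            g.InterpolationMode = InterpolationMode.HighQualityBicubic;
            g.PixelOffsetMode = PixelOffsetMode.HighQuality;
            // Stretch the source over the whole enlarged canvas.
            g.DrawImage(source, 0, 0, scaled.Width, scaled.Height);
        }
        // Only changes the DPI metadata; the pixel data is left untouched.
        scaled.SetResolution(300f, 300f);
        return scaled;
    }
}
```

A call such as OcrPreprocessing.ScaleForOcr(b, 4) would then produce the bitmap to hand to the OCR engine; the caller is responsible for disposing it.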

Gennady Goncharov

Nov 2, 2014, 4:05:33 AM
to tesser...@googlegroups.com
I got it: the problem appears when I don't specify the recognition region. For some reason it can't recognize the text fully, but if I put a border around the text, everything is fine! I did the same in my code:

    // b is the bitmap to recognize; number is an int declared earlier.
    // First pass: treat the whole image as a single word.
    using (var engine = new TesseractEngine(Application.StartupPath + @"\tessdata", "coc", EngineMode.Default))
    {
        engine.SetVariable("tessedit_char_whitelist", "0123456789");
        engine.DefaultPageSegMode = PageSegMode.SingleWord;
        var result = engine.Process(b).GetText();
        int.TryParse(result, out number);
    }
    // int.TryParse leaves number at 0 when parsing fails, so this also catches empty results.
    if (number < 10)
    {
        // Second pass: same settings, but on a region that skips 4 pixels at the top and bottom.
        using (var engine = new TesseractEngine(Application.StartupPath + @"\tessdata", "coc", EngineMode.Default))
        {
            engine.SetVariable("tessedit_char_whitelist", "0123456789");
            var result = engine.Process(b, new Rect(0, 4, b.Width, b.Height - 8), PageSegMode.SingleWord).GetText();
            int.TryParse(result, out number);
        }
    }
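
The retry logic itself can be separated from the OCR calls. The following is only a sketch of that idea: OcrFallback, FirstPlausibleNumber and minValid are hypothetical names, and each pass is simply a function that returns recognized text, so no Tesseract API is assumed here. The threshold of 10 comes from the code above.

```csharp
using System;

static class OcrFallback
{
    // Hypothetical helper: runs the passes in order and returns the first
    // result that parses as an integer of at least minValid.
    public static int FirstPlausibleNumber(int minValid, params Func<string>[] passes)
    {
        int number = 0;
        foreach (var pass in passes)
        {
            // int.TryParse tolerates the trailing newlines OCR output usually has
            // and sets number to 0 when the text is not a valid integer.
            if (int.TryParse(pass(), out number) && number >= minValid)
                return number;
        }
        // No pass was plausible: fall back to whatever the last pass produced.
        return number;
    }
}
```

With this shape, the whole-image pass and the cropped-region pass would each be passed in as a lambda, and later passes run only when the earlier ones give an implausible value.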

Now it is much more stable! I also finally trained data for the special font, and the recognition accuracy is perfect now. Thanks to everyone!

On Saturday, November 1, 2014 at 20:38:20 UTC+3, Quan Nguyen wrote: