Android OCR application looking for quality improvements


Pierre-Henri DAUVERGNE

Jul 17, 2014, 8:36:50 AM
to tesser...@googlegroups.com
Hello
I am relatively new to Android development, and I am working on an OCR app that takes a picture of a document and extracts the text from it (the cameras may be those of relatively old phones). During my research, I found that Tesseract was the best API to use, so here I am :)

I understand that, to get the best results, the image needs the following:
- Binarization (having a picture in black and white)
- Without border (I'm using another library to crop the photo and process only the part I want)
- Deskewing
- Training

Other parameters that I guess would matter are scaling and recognizing one character at a time (I haven't looked into these much).
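To make the binarization step concrete, here is a minimal sketch of global Otsu thresholding in plain Java. It is an illustration only: the class and method names are hypothetical, it works on a flat array of 8-bit grayscale values rather than on a Leptonica Pix, and it uses one global threshold, whereas a locally adaptive variant computes a threshold per tile.

```java
/** Sketch: global Otsu thresholding on 8-bit grayscale values (0..255). Names are hypothetical. */
final class OtsuSketch {

    /** Returns the threshold t that maximizes between-class variance; values <= t form the dark class. */
    static int otsuThreshold(int[] gray) {
        int[] hist = new int[256];
        for (int v : gray) {
            hist[v]++;
        }
        long total = gray.length;
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (long) i * hist[i];
        }
        long countDark = 0;
        long sumDark = 0;
        double bestVariance = -1.0;
        int bestThreshold = 0;
        for (int t = 0; t < 256; t++) {
            countDark += hist[t];
            sumDark += (long) t * hist[t];
            if (countDark == 0) {
                continue; // no dark pixels yet
            }
            long countLight = total - countDark;
            if (countLight == 0) {
                break; // every pixel is already in the dark class
            }
            double meanDark = (double) sumDark / countDark;
            double meanLight = (double) (sumAll - sumDark) / countLight;
            double diff = meanDark - meanLight;
            // Between-class variance, up to a constant factor of 1 / total^2.
            double variance = (double) countDark * countLight * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    /** Marks each pixel as ink (true) when it is at or below the Otsu threshold. */
    static boolean[] binarize(int[] gray) {
        int t = otsuThreshold(gray);
        boolean[] ink = new boolean[gray.length];
        for (int i = 0; i < gray.length; i++) {
            ink[i] = gray[i] <= t;
        }
        return ink;
    }
}
```

A global threshold like this tends to struggle with uneven lighting in phone photos, which is the usual reason to normalize the background first or to threshold per tile.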

But I can't find any documentation, or anyone having the same issue as I have:
I added "eng.traineddata" to my project, but I don't feel like it's being used at all. I just added the file I found online and haven't done anything else, and Tesseract seems to have trouble reading characters that appear to be fine (well, at least not that inaccurate). I can't find any guide or tutorial online on "how to train Tesseract for Android". Could anyone help? I understand that it would take time, but I'm willing to do it on my own.

The other thing is deskewing. Same story: no guides or tutorials online, and the Skew class doesn't seem to work properly, as it always returns 0.0. Could anyone help? ^^
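For reference, one common way to estimate skew without relying on a library class is a projection-profile search. The sketch below is a plain-Java illustration with hypothetical names. It is not the algorithm behind the Skew class, and it approximates rotation by a vertical shear, which is only reasonable for small angles. It assumes a binarized image given as boolean[row][column] with true for ink, and a search range below 45 degrees.

```java
/** Sketch: projection-profile skew estimation on a binarized image. Names are hypothetical. */
final class SkewSketch {

    /**
     * Tries candidate angles in [-maxDegrees, +maxDegrees] and returns the one whose sheared
     * row profile is sharpest (largest sum of squared row counts). A positive result means
     * text lines drift downward to the right in image coordinates.
     * Assumes maxDegrees < 45 so the shear never exceeds the image width.
     */
    static double estimateSkewDegrees(boolean[][] ink, double maxDegrees, double stepDegrees) {
        int height = ink.length;
        int width = ink[0].length;
        int steps = (int) Math.round(2 * maxDegrees / stepDegrees);
        double bestAngle = 0.0;
        long bestScore = -1;
        for (int i = 0; i <= steps; i++) {
            double angle = -maxDegrees + i * stepDegrees;
            double slope = Math.tan(Math.toRadians(angle));
            // Offset by width so sheared row indices never go negative.
            long[] profile = new long[height + 2 * width];
            for (int y = 0; y < height; y++) {
                for (int x = 0; x < width; x++) {
                    if (ink[y][x]) {
                        int row = y - (int) Math.round(slope * x) + width;
                        profile[row]++;
                    }
                }
            }
            long score = 0;
            for (long count : profile) {
                score += count * count;
            }
            if (score > bestScore) {
                bestScore = score;
                bestAngle = angle;
            }
        }
        return bestAngle;
    }
}
```

When the candidate angle matches the real skew, each text line collapses into a few rows of the profile, so the sum of squares peaks. The image would then be rotated by the negative of the returned angle.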

Thank you for your help, I hope I'm clear enough on my issues.

I've attached one of the photos I'm taking, the cropped and binarized result, and the returned string (sorry, it's not in English, but you can see it's not really good :x)


Do you know how I could improve my picture preprocessing? As you can see, there's still a lot of noise around the characters.



This is what I'm doing so far:

photo = WriteFile.writeBitmap(AdaptiveMap.backgroundNormMorph(ReadFile.readBitmap(photo))); // locally adaptive; preparation to binarize
photo = WriteFile.writeBitmap(Binarize.otsuAdaptiveThreshold(ReadFile.readBitmap(photo))); // locally adaptive; special binarization methods
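photo = WriteFile.writeBitmap(Enhance.unsharpMasking(ReadFile.readBitmap(photo), 1, (float) 0.5));  // I'm not sure about those parameters

ocr_engine.setVariable("textord_max_noise_size", "3");
ocr_engine.setVariable("textord_heavy_nr", "1"); // no trailing space in the variable name
// setPageSegMode expects a page segmentation mode; OEM_TESSERACT_CUBE_COMBINED is an engine mode
// and belongs in ocr_engine.init(...). PSM_AUTO here assumes tess-two's PageSegMode constants.
ocr_engine.setPageSegMode(TessBaseAPI.PageSegMode.PSM_AUTO);
ocr_engine.setImage(photo);
String recognizedText = ocr_engine.getUTF8Text();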


Thank you for any help
photo.jpg
result.txt
IMG_20140717_143218.jpg

Pierre-Henri DAUVERGNE

Jul 21, 2014, 8:26:16 AM
to tesser...@googlegroups.com
Well, I have managed to significantly increase the quality of the output by rescaling my input photo (I scaled both the width and the height by 3), but I'm still having issues with some characters.
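The rescaling step can be sketched as follows. This is a plain-Java illustration with hypothetical names, operating on a 2D array of grayscale values with bilinear interpolation. On Android itself, Bitmap.createScaledBitmap(source, width * 3, height * 3, true) should give a comparable filtered result without custom code.

```java
/** Sketch: upscale a grayscale image (int[row][column], 0..255) by an integer factor. Names are hypothetical. */
final class UpscaleSketch {

    static int[][] upscaleBilinear(int[][] src, int factor) {
        int h = src.length;
        int w = src[0].length;
        int[][] dst = new int[h * factor][w * factor];
        for (int y = 0; y < h * factor; y++) {
            double sy = (double) y / factor;   // position in source coordinates
            int y0 = (int) sy;
            int y1 = Math.min(y0 + 1, h - 1);  // clamp at the bottom edge
            double fy = sy - y0;
            for (int x = 0; x < w * factor; x++) {
                double sx = (double) x / factor;
                int x0 = (int) sx;
                int x1 = Math.min(x0 + 1, w - 1); // clamp at the right edge
                double fx = sx - x0;
                // Blend the four surrounding source pixels.
                double top = (1 - fx) * src[y0][x0] + fx * src[y0][x1];
                double bottom = (1 - fx) * src[y1][x0] + fx * src[y1][x1];
                dst[y][x] = (int) Math.round((1 - fy) * top + fy * bottom);
            }
        }
        return dst;
    }
}
```

Interpolating before binarization generally gives smoother character edges than enlarging an already binarized image.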

For instance, capital I and l are often misread, and the same goes for O and 0, 5 and s, etc.

I would like to force Tesseract to pick the most appropriate word from the language dictionary I'm using. Does anyone know if that's possible?
Is there anything else I could do to improve the quality of the OCR? (I've thought of deskewing, but I can't find anything online, and it doesn't seem to be that big of a problem.)
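Independently of what Tesseract does internally, the confusions listed above (I/l, O/0, 5/s) can also be handled after recognition. The sketch below is a hypothetical post-processing step in plain Java, not a Tesseract feature. It assumes the caller supplies a word list as a Set<String>, tries every substitution within the confusable groups, and returns the first variant found in that list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch: snap an OCR word to a dictionary entry by swapping look-alike characters. Names are hypothetical. */
final class ConfusableFixSketch {

    // Groups of characters taken from the confusions described above.
    private static final String[] GROUPS = {"Il", "O0", "5s"};
    private static final int MAX_VARIANTS = 4096; // guard against combinatorial blow-up

    /** Returns the word itself if known, else the first confusable variant in the dictionary, else the word. */
    static String correct(String word, Set<String> dictionary) {
        if (dictionary.contains(word)) {
            return word;
        }
        List<String> variants = new ArrayList<>();
        expand(word.toCharArray(), 0, variants);
        for (String variant : variants) {
            if (dictionary.contains(variant)) {
                return variant;
            }
        }
        return word; // nothing matched, keep the OCR output unchanged
    }

    private static void expand(char[] chars, int index, List<String> out) {
        if (out.size() >= MAX_VARIANTS) {
            return;
        }
        if (index == chars.length) {
            out.add(new String(chars));
            return;
        }
        char original = chars[index];
        String group = groupOf(original);
        if (group == null) {
            expand(chars, index + 1, out);
            return;
        }
        for (int i = 0; i < group.length(); i++) {
            chars[index] = group.charAt(i);
            expand(chars, index + 1, out);
        }
        chars[index] = original; // restore before returning to the caller
    }

    private static String groupOf(char c) {
        for (String group : GROUPS) {
            if (group.indexOf(c) >= 0) {
                return group;
            }
        }
        return null;
    }
}
```

For example, with a dictionary containing "Hello", the input "HeIIo" would come back as "Hello". The lookup is case-sensitive, and real text would likely need more groups than the three mentioned here.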

Anirban Jana

Feb 25, 2015, 8:26:24 AM
to tesser...@googlegroups.com
Can you please help me? Where did you add these lines of code?
photo = WriteFile.writeBitmap(AdaptiveMap.backgroundNormMorph(ReadFile.readBitmap(photo))); // locally adaptive; preparation to binarize
photo = WriteFile.writeBitmap(Binarize.otsuAdaptiveThreshold(ReadFile.readBitmap(photo))); // locally adaptive; special binarization methods
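photo = WriteFile.writeBitmap(Enhance.unsharpMasking(ReadFile.readBitmap(photo), 1, (float) 0.5));  // I'm not sure about those parameters

ocr_engine.setVariable("textord_max_noise_size", "3");
ocr_engine.setVariable("textord_heavy_nr", "1"); // no trailing space in the variable name
// setPageSegMode expects a page segmentation mode; OEM_TESSERACT_CUBE_COMBINED is an engine mode
// and belongs in ocr_engine.init(...). PSM_AUTO here assumes tess-two's PageSegMode constants.
ocr_engine.setPageSegMode(TessBaseAPI.PageSegMode.PSM_AUTO);
ocr_engine.setImage(photo);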
String recognizedText = ocr_engine.getUTF8Text();



Daniel

Mar 9, 2015, 12:56:20 AM
to tesser...@googlegroups.com
Hey Pierre, I'm trying to accomplish the same thing as you for my thesis. Could you tell me if you managed to preprocess images enough to read the documents? So far I've applied Unsharp Mask and Threshold as you did, I even fixed the skew angle by following this link http://felix.abecassis.me/2011/10/opencv-rotation-deskewing/ but haven't gotten acceptable results.

What other filters can I use to make the image more readable for Tesseract?
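One filter that is often tried before thresholding is a small median filter, which suppresses isolated speckles while keeping edges fairly sharp. The following is a minimal plain-Java sketch with hypothetical names, working on a 2D grayscale array. Whether it actually helps on these particular images is an assumption that would need testing.

```java
import java.util.Arrays;

/** Sketch: 3x3 median filter on a grayscale image (int[row][column]). Names are hypothetical. */
final class MedianFilterSketch {

    static int[][] median3x3(int[][] src) {
        int h = src.length;
        int w = src[0].length;
        int[][] dst = new int[h][w];
        int[] window = new int[9];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        // Clamp coordinates so border pixels reuse their nearest neighbors.
                        int yy = Math.min(Math.max(y + dy, 0), h - 1);
                        int xx = Math.min(Math.max(x + dx, 0), w - 1);
                        window[n++] = src[yy][xx];
                    }
                }
                Arrays.sort(window);
                dst[y][x] = window[4]; // middle of the 9 sorted values
            }
        }
        return dst;
    }
}
```

Applied to the grayscale image before sharpening and thresholding, this removes single-pixel noise, though a larger window would start to erode thin character strokes.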

Attached is the image I'm testing (cropped and rotated), the two outcomes from the filters and the text resulting from Tesseract.
0.img02.png
1.img02-sharpened.png
2.img02-sharpened-threshold1.png
out.txt

ShreeDevi Kumar

Mar 9, 2015, 4:49:09 AM
to tesser...@googlegroups.com
have you followed the suggestions given on


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/c4e390d2-cb38-4b08-b713-39650fb45c34%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

zdenko podobny

Mar 9, 2015, 5:52:13 AM
to tesser...@googlegroups.com
Have a look at Text Fairy (OCR)[1].
I have had good experience with it (I use it for extracting text from books for quotations, e.g. just a few lines).
The code is available on GitHub[2] (I am not sure if it is up to date).

Zdenko
