Extract text from simple image


Vlad Vasilov

Sep 30, 2017, 2:07:40 PM
to tesseract-ocr
Hello,

I'm trying to extract text from a simple image, but I can't get it to work. Please help me with some advice.


This is the image:


What I have tried so far:

Preparing the image with ImageMagick:

1) convert -units PixelsPerInch petrom.png -resample 300 p.png
2) convert -colorspace gray -density 300 -sigmoidal-contrast 3,0%  -depth 8 petrom.png p.png


But after running:
tesseract p.png stdout

I get:


With no image preparation:

Warning. Invalid resolution 0 dpi. Using 70 instead.

mummy mm"... stum-m ann . Enn/on mm:


 


2m7ru973n 19 35 3 93m RON 5 22m RON



For option 1:

Motorinl Stlndlrd Motorinl ExtrllOMIl' Diesel



For option 2:

i


a mu,

2m mu «5 aa

Dmitri Silaev

Sep 30, 2017, 4:22:10 PM
to tesser...@googlegroups.com
Did you know that the upper part of the image has a transparent background? Moreover, the text is antialiased, and some of the character pixels do not carry the correct transparency and/or color. Instead, they look as if they were rendered on a white background. See attached "petrom_gimp.png".

It's not over yet. The image is PNG in indexed color (has a palette). So if you want white background (and it *must* be white because of the above), you can't have it, because the palette is jam packed. Need to convert to RGB first.

Until you get rid of all those problems, you won't get any further.

Now, once you fix everything (see "petrom_x.png"), your Option #1 will work relatively well, though not so well for the "Data/Timp" text. For that I suggest custom binarization; "screenmine_ss02.png" shows what you can get with it.

Not sure what you wanted to achieve with option #2. To me, it seems useless.

All above image conversions can be made with ImageMagick. I'm not quite sure about custom binarization, but maybe you can get away without it.

Best regards,
Dmitri Silaev
www.CustomOCR.com
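
The fixes described in the reply above (flatten the transparency onto white, leave indexed color, resample to 300 DPI) could be combined roughly as follows. This is only a sketch, assuming an ImageMagick 6-style `convert` binary (ImageMagick 7 uses `magick`). The function name `prep_for_ocr` and the `CONVERT` override variable are hypothetical and not part of the thread.

```shell
#!/bin/sh
# Sketch: prepare a screenshot for Tesseract.
# prep_for_ocr and CONVERT are illustrative names, not from the thread.
prep_for_ocr() {
    # $1 = input image, $2 = output image
    # -background white -alpha remove : composite transparent pixels onto white
    # -alpha off                      : drop the now-unused alpha channel
    # -type TrueColor                 : ask for RGB output instead of a palette
    # -units/-resample                : scale the image to 300 pixels per inch
    "${CONVERT:-convert}" "$1" \
        -background white -alpha remove -alpha off \
        -type TrueColor \
        -units PixelsPerInch -resample 300 \
        "$2"
}

# Example use (needs ImageMagick and Tesseract installed):
#   prep_for_ocr petrom.png p.png && tesseract p.png stdout
```

Setting `CONVERT=echo` prints the arguments instead of running ImageMagick, which is a cheap way to check the option order before touching any files.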





--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/35f5e01c-c932-449c-a2b4-f6341f68235a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

petrom_gimp.png
screenmine_ss02.png
petrom_x.png

Vlad Vasilov

Oct 1, 2017, 4:35:35 AM
to tesseract-ocr
> Did you know that the upper part of the image has a transparent background? Moreover, the text is antialiased, and some of the character pixels do not carry the correct transparency and/or color. Instead, they look as if they were rendered on a white background. See attached "petrom_gimp.png".

I didn't know that.

I converted to RGB, removed the alpha channel, and resampled:
convert "$1" -colorspace RGB -alpha remove -units PixelsPerInch -resample 300 "$2"

it works like a charm!
Thanks a lot.


> Now, once you fix everything (see "petrom_x.png"), your Option #1 will work relatively well, though not so well for the "Data/Timp" text. For that I suggest custom binarization; "screenmine_ss02.png" shows what you can get with it.

This part is not so important, but I made a few tests:

1)
convert in.png -threshold 38.039% out.png

Data/Timp Motorina Standard Motorina ExtraIOMV Diesel 2017-09-3019:38 4.930 RON 5.220 RON



2)
convert in.png -threshold 38.038% out.png

Dataffimp Motorina Standard Motorina Extra/OMV Diesel 2017-09-3019:38 4.930 RON 5.220 RON



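For thresholds >= 38.039%, "Data/Timp" is recognized correctly but "Extra/OMV" comes out as "ExtraIOMV".
For test 2 (38.038%) it is the reverse: "Extra/OMV" is correct but "Data/Timp" comes out as "Dataffimp".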


Thanks again!
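
Since the two tests above differ by only 0.001% and still flip which word is misread, it may be easier to sweep a range of thresholds and compare the OCR output side by side. The following is a rough sketch, not a tuned procedure: the helper names `threshold_steps` and `sweep_thresholds`, the temporary file name, and the 30% to 46% range are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: try several global thresholds and print what Tesseract reads for each.
# All names below are illustrative and not from the thread.

# Print integer percentages from $1 to $2 in steps of $3, one per line.
threshold_steps() {
    pct=$1
    while [ "$pct" -le "$2" ]; do
        printf '%s%%\n' "$pct"
        pct=$((pct + $3))
    done
}

# Binarize $1 at each threshold and run OCR on the result.
sweep_thresholds() {
    for level in $(threshold_steps 30 46 4); do
        # ocr_sweep_tmp.png is a scratch file in the current directory
        "${CONVERT:-convert}" "$1" -threshold "$level" ocr_sweep_tmp.png
        printf 'threshold %s: ' "$level"
        "${TESSERACT:-tesseract}" ocr_sweep_tmp.png stdout
    done
    rm -f ocr_sweep_tmp.png
}

# Example use (needs ImageMagick and Tesseract installed):
#   sweep_thresholds in.png
```

A single global threshold may never get both words right, which fits the earlier suggestion to binarize the "Data/Timp" text separately.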

Dmitri Silaev

Oct 1, 2017, 9:45:28 AM
to tesser...@googlegroups.com
You may want to check the way you grab and save your input image. It might be the app or component that messes up the transparency. Some components take the upper-left pixel as a reference and make every pixel of the same color fully transparent. You can also save as RGB from the get-go, which avoids these image-related issues altogether.

-Dmitri
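
One way to check what the capture step actually produced is to ask ImageMagick for the channel layout and image type before running OCR. This is a small sketch that assumes the `identify` tool and its `%[channels]` and `%[type]` format escapes are available in the installed version. The function name `describe_image` and the `IDENTIFY` override variable are hypothetical.

```shell
#!/bin/sh
# Sketch: report whether an image carries an alpha channel or a palette.
# describe_image and IDENTIFY are illustrative names, not from the thread.
describe_image() {
    # %f = file name, %[channels] = channel layout, %[type] = image type
    "${IDENTIFY:-identify}" -format '%f: channels=%[channels] type=%[type]\n' "$1"
}

# Example use (needs ImageMagick installed):
#   describe_image petrom.png
```

A channel list that includes alpha, or a palette-style type, would suggest the file still needs the flatten-and-convert step before it goes to Tesseract.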




