Unable to detect number in box


smarty pokemon

Mar 26, 2020, 3:03:48 PM
to tesseract-ocr
Hi All,

I am trying to convert the following images to text with tesseract, but I have been unable to do so after multiple attempts.
I tried different versions of the images (binarized and color-inverted). I only want to extract the number in the box,
but I have had no luck after several attempts. I am using:
ubuntu 16.04 server
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.2
locale set to English (US)
I am using the CLI command `tesseract image_1.jpg stdout`
and have tried all the -psm modes as well.
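
For readers scripting the same experiment, here is a minimal sketch of how the command line could be assembled from Python. The helper name `tesseract_command` and its parameters are hypothetical, and nothing is executed here. It assumes the page segmentation flag is spelled `-psm` on 3.x and `--psm` on 4.x and later, and it uses the `tessedit_char_whitelist` config variable to restrict output to digits; whether that variable is honoured depends on the tesseract version and OCR engine in use.

```python
def tesseract_command(image_path, psm=7, digits_only=True, legacy_flag=False):
    """Build an argument list for the tesseract CLI (hypothetical helper, not executed)."""
    # tesseract 3.x spells the flag "-psm"; 4.x and later use "--psm".
    psm_flag = "-psm" if legacy_flag else "--psm"
    cmd = ["tesseract", image_path, "stdout", psm_flag, str(psm)]
    if digits_only:
        # Ask the engine to consider digits only.
        cmd += ["-c", "tessedit_char_whitelist=0123456789"]
    return cmd


# Modes worth trying on a tightly cropped number:
# 6 = single uniform block, 7 = single text line, 8 = single word.
commands = [tesseract_command("image_1.jpg", psm=p, legacy_flag=True) for p in (6, 7, 8)]
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where tesseract is installed.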

Can someone help me understand what I am doing wrong, or whether the images have some issue?
Thanks in advance.
image_1.jpg
4237229115.jpg
image (2).jpg
image (1).jpg
image (4).jpg

Lorenzo Bolzani

Mar 27, 2020, 6:18:16 AM
to tesser...@googlegroups.com
Hi,
an easy trick to remove closed borders is to fill the outside area with the border color and then with the opposite one. See the attached example.
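
The fill trick can be sketched in plain Python on a small binary grid. This is not the attached remove_border.py, which is not reproduced in the thread; it is an illustration under these assumptions: the image is already thresholded, ink is 1 and paper is 0, pixel (0, 0) lies outside the box, and the digits do not touch the border. The function names are hypothetical. On real images OpenCV's `cv2.floodFill` would normally play the role of `flood_fill`.

```python
from collections import deque


def flood_fill(img, seed, new_value):
    """Fill the 4-connected region containing `seed` with `new_value`, in place."""
    rows, cols = len(img), len(img[0])
    r0, c0 = seed
    old_value = img[r0][c0]
    if old_value == new_value:
        return
    img[r0][c0] = new_value
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and img[nr][nc] == old_value:
                img[nr][nc] = new_value
                queue.append((nr, nc))


def remove_closed_border(img, ink=1, paper=0):
    """Return a copy of `img` with the closed border around the content erased."""
    out = [row[:] for row in img]
    flood_fill(out, (0, 0), ink)    # step 1: the outside merges with the border
    flood_fill(out, (0, 0), paper)  # step 2: outside and border become paper
    return out


boxed = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0],  # the single 1 in the middle stands in for a digit
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = remove_closed_border(boxed)
```

The paper inside the box is never reached by the first fill, so it keeps the digit separated from the merged outside-plus-border region that the second fill erases.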

Image 2 is more complex. You can crop the image a little to remove the external borders, and paint a rectangle over the middle line if its location is approximately fixed.

Otherwise use morphological transformations to merge the numbers into blobs:

dilate to join the letters and later erode to delete the lines (or the opposite, depending on whether the background is black or white).
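
A minimal pure-Python sketch of that dilate-then-erode step, assuming a binary image where foreground (ink) is 1 and a square structuring element. The function names and the radii are illustrative choices, not values from the thread; on real images `cv2.dilate` and `cv2.erode` would do the same job much faster.

```python
def dilate(img, radius=1):
    """Binary dilation: a pixel is set if any pixel in its square window is set."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = int(any(
                img[nr][nc]
                for nr in range(max(0, r - radius), min(rows, r + radius + 1))
                for nc in range(max(0, c - radius), min(cols, c + radius + 1))
            ))
    return out


def erode(img, radius=1):
    """Binary erosion: a pixel stays set only if its whole window is set.

    Pixels outside the image are treated as background.
    """
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(radius, rows - radius):
        for c in range(radius, cols - radius):
            out[r][c] = int(all(
                img[nr][nc]
                for nr in range(r - radius, r + radius + 1)
                for nc in range(c - radius, c + radius + 1)
            ))
    return out


rows, cols = 12, 13
img = [[0] * cols for _ in range(rows)]
img[1] = [1] * cols                    # a thin ruling line
for r in range(5, 9):
    for c in (3, 4, 5, 7, 8, 9):       # two "digits" with a one-pixel gap
        img[r][c] = 1

# Dilate a little to join the digits, then erode harder to delete the line.
blobs = erode(dilate(img, radius=1), radius=2)
```

The erosion radius is larger than the dilation radius on purpose: the one-pixel line is only three pixels thick after dilation and vanishes, while the joined digits are thick enough to leave a blob behind.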

Now do component analysis to find the remaining blobs and crop those regions from the original image with some margin.
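
The component analysis and cropping step might look like the following sketch. It assumes a binary mask (1 = blob) with the same size as the original image, uses 8-connectivity, and returns inclusive bounding boxes; the function names and the margin value are hypothetical. With OpenCV, `cv2.connectedComponentsWithStats` provides the bounding boxes directly.

```python
from collections import deque


def find_blobs(mask):
    """Return inclusive (top, left, bottom, right) boxes of 8-connected blobs."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            top, left, bottom, right = r, c, r, c
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny in (y - 1, y, y + 1):
                    for nx in (x - 1, x, x + 1):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def crop_with_margin(img, box, margin=2):
    """Crop `box` from `img`, enlarged by `margin` pixels and clipped to the image."""
    rows, cols = len(img), len(img[0])
    top, left, bottom, right = box
    top, left = max(0, top - margin), max(0, left - margin)
    bottom, right = min(rows - 1, bottom + margin), min(cols - 1, right + margin)
    return [row[left:right + 1] for row in img[top:bottom + 1]]


mask = [[0] * 10 for _ in range(8)]
mask[0][0] = 1                          # a stray speck
for r in (3, 4):
    for c in (4, 5, 6):
        mask[r][c] = 1                  # the blob left where the number was

original = [[r * 10 + c for c in range(10)] for r in range(8)]  # stand-in pixels
boxes = find_blobs(mask)
crops = [crop_with_margin(original, box) for box in boxes]
```

In practice tiny boxes such as the speck would be filtered out by area before the crops are passed to tesseract.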

Now you have the numbers, but I do not know a simple, reliable way to fill them. Maybe tesseract is able to read them as they are. Otherwise I would try a slight dilation to make them thicker; it might help.


Bye

Lorenzo

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/14602759-68be-4a71-b6d2-43fa9b2a8081%40googlegroups.com.
remove_border.py

smarty pokemon

Mar 27, 2020, 7:11:05 AM
to tesseract-ocr
Hi Lorenzo,

Thanks for your response. I will definitely try this, since the box position is dynamic and it can be anywhere.
Is there no other way to detect the numbers in, e.g., image (1).jpg and image (4).jpg?
Does the box create an issue for tesseract?

smarty pokemon

Mar 31, 2020, 3:09:54 PM
to tesseract-ocr
Using the new version of tesseract (5.0 alpha), I am trying to get the value in the Current column, but everything I have tried fails.

This has to be done from a backend script. Can anyone suggest something, please? The number in the column can be anything.

tesseract is able to extract all the other information from the file except the numbers in the box.
I tried the following commands as well, with no luck:

convert -colorspace gray -fill white  -resize 480%  -sharpen 0x1  file.png file.jpg
tesseract file.jpg file

I am getting output like this:


Current | Potential Current | Potential
Very energy efficient - lower running costs Very environmentally friendly - lower CO2 emissions
(92-100) (92-100) /\
(81-91) € (81-91) B)
(69-80) (ey
«i (55-68) D
(39-54) Ee
Not energy efficient - higher running costs Not environmentally friendly - higher CO2 emissions
England, Scotland & Wales voi: a England, Scotland & Wales Soonce Eee


I have installed all the languages though.
EPC_91958096.png
EPC_68906690 (1).png
EPC_12121212.png

smarty pokemon

Mar 31, 2020, 3:29:38 PM
to tesseract-ocr
Neither of these worked after converting them to monochrome and inverting the colors.

check_.jpg
check_.jpg

Lorenzo Bolzani

Apr 3, 2020, 1:12:35 PM
to tesser...@googlegroups.com

Yes, I think this kind of box may be a problem for tesseract.

But the script I posted removes the box and solves the problem. To remove the box you fill the area around the box with the same color as the box so they merge. It's easier if you do it on a thresholded image.
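
For the thresholding step, here is a minimal sketch using Otsu's method on a grayscale image stored as rows of integers in 0..255. Otsu is one common choice rather than something the thread prescribes, and the function names are hypothetical. It assumes dark ink on a light background and maps ink to 1. With OpenCV, `cv2.threshold` with the `cv2.THRESH_OTSU` flag does the equivalent.

```python
def otsu_threshold(gray):
    """Return the level t that best separates pixels <= t from pixels > t."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance, up to a constant factor.
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]


gray = [
    [250, 250, 250, 250],
    [250, 20, 30, 250],
    [250, 250, 250, 250],
]
binary = binarize(gray)
```

Once the image contains only two values, a fill has no near-border shades to leave behind, which is why the border merges cleanly with the filled area.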




Bye

Lorenzo

