Reading Device labels to get model number

Bill Garrison

Nov 12, 2014, 5:00:03 PM

to tesser...@googlegroups.com

If someone sends in labels like the attached ones, I need to grab the model number. So far, results from plain Tesseract usage are dismal. I used an ImageMagick library to clean up the image a bit before sending it in, but if it's rotated at ALL the results are still dismal. Overall, I am just looking to increase accuracy.

Steps I have taken:

1) Used a pre-processing library to clean up the image.

2) Added a new config that turns off the dictionary and loads a words file containing all the different Samsung model numbers.

3) Took my most promising pre-processed image, created a box file, and ran "tesseract <image_name> <box_file_name> nobatch box.train" to train Tesseract not to miss the two characters it missed. This caused a segmentation fault.
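For reference, a config like the one in step 2 might look roughly like the output of the sketch below. The actual config used is not shown in this thread, so this is an assumption: it uses the Tesseract 3.x variables `load_system_dawg`, `load_freq_dawg` and `user_words_suffix`, and the optional `tessedit_char_whitelist` line is an extra idea, not something from the steps above. The function names are made up for illustration.

```python
def build_tesseract_config(whitelist=None):
    """Return the text of a Tesseract 3.x config file that switches off the
    built-in dictionaries and points at a user words list.

    Assumption: the words list is stored as tessdata/<lang>.user-words,
    which is where user_words_suffix makes Tesseract look for it.
    """
    lines = [
        "load_system_dawg F",            # do not load the main dictionary
        "load_freq_dawg F",              # do not load the frequent-words dictionary
        "user_words_suffix user-words",  # load <lang>.user-words instead
    ]
    if whitelist:
        # Optional: restrict recognition to the characters model numbers use.
        lines.append("tessedit_char_whitelist " + whitelist)
    return "\n".join(lines) + "\n"


def write_config(path, whitelist=None):
    """Write the config text to `path` (hypothetical helper)."""
    with open(path, "w") as handle:
        handle.write(build_tesseract_config(whitelist))
```

The resulting file would normally go under tessdata/configs and be named as the last argument on the tesseract command line, after the output base.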

Any hints or advice about how I can use tesseract to grab this information with at least 50% accuracy would be GREATLY appreciated.

Thanks!!

RF28HMEDBSR.png

RF28HMEDBSR_gt30.png

RF34H9960S4.png

ShreeDevi Kumar

Nov 13, 2014, 4:34:25 AM

to tesser...@googlegroups.com

Straighten the image before sending it to Tesseract. You can use Scan Tailor or unpaper.

ImageMagick may also have an option for this; you'll have to look.
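On the ImageMagick side, there is a `-deskew` option that may be enough for mildly rotated labels. The sketch below only builds and runs that command from Python; the helper names are hypothetical, the 40% threshold is the value commonly suggested in the ImageMagick documentation, and it has not been tried on the images in this thread.

```python
import shutil
import subprocess


def deskew_command(src, dst, threshold="40%"):
    """Build an ImageMagick command line that straightens a skewed image.

    Assumption: the ImageMagick 6 `convert` binary is used; ImageMagick 7
    installs the same tool as `magick`.
    """
    # +repage resets the virtual canvas after the rotation.
    return ["convert", src, "-deskew", threshold, "+repage", dst]


def deskew(src, dst):
    """Run the deskew command, failing loudly if ImageMagick is missing."""
    if shutil.which("convert") is None:
        raise RuntimeError("ImageMagick 'convert' not found on PATH")
    subprocess.run(deskew_command(src, dst), check=True)
```

Calling `deskew("label.png", "label_straight.png")` would write a straightened copy that can then be passed to Tesseract; the file names here are placeholders.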

See the attached images, which are output from Scan Tailor and were then OCRed using VietOCR (a GUI frontend to Tesseract):

MODEL NAME 7

MOORE RF28HMEDBSR

ml.“

| mt RFQBHMEDBSH

MODEL NAML I

MODELE I RF34H996084
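Since the set of valid model numbers is known in advance, near misses like the ones above could be snapped to the closest known model after OCR. The following is a minimal sketch using Python's standard `difflib`; the function name, the two-entry model list and the 0.6 cutoff are assumptions chosen for illustration, not something that was run as part of this reply.

```python
import difflib

# Assumption: in practice this list would be loaded from the words file
# containing all known model numbers.
KNOWN_MODELS = ["RF28HMEDBSR", "RF34H9960S4"]


def best_model(ocr_text, known_models=KNOWN_MODELS, cutoff=0.6):
    """Return (model, score) for the known model most similar to any
    whitespace-separated token in the OCR output, or None if no token
    scores above the cutoff."""
    best = None
    best_score = cutoff
    for token in ocr_text.upper().split():
        for model in known_models:
            score = difflib.SequenceMatcher(None, model, token).ratio()
            if score > best_score:
                best, best_score = model, score
    return None if best is None else (best, best_score)


# Example with two of the OCR lines quoted above.
for line in ["| mt RFQBHMEDBSH", "MODELE I RF34H996084"]:
    print(line, "->", best_model(line))
```

The cutoff trades false matches against misses, so it would need tuning against a larger set of real labels.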

ShreeDevi


RF28HMEDBSR.tif

RF28HMEDBSR_gt30.tif

RF34H9960S4.tif

Allistair C

Nov 13, 2014, 11:42:47 AM

to tesser...@googlegroups.com

I think the table lines are not helping.

I up-sized your image to 1000px wide, then ran it through Tesseract with PSM=6 and got mostly rubbish.

Then I removed the table lines manually in Photoshop, up-sized the image to 1000px wide again, and ran it through Tesseract with PSM=6:

RFZBHMEDBSR

R 134a/ 160 g(5.64 oz)

AC 115 VI 60 Hz

6.0 A

230 PSI I 103 PSI

NOV. 2013

35 96 x 36 % x 70
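The upsize-then-PSM-6 run could be scripted roughly as below. This is a sketch under assumptions: the resize is done with ImageMagick's `convert` rather than Photoshop, the helper names are made up, and the page segmentation flag is spelled `-psm` as in Tesseract 3.x (later releases use `--psm`). The manual table-line removal is not covered.

```python
def upsize_command(src, dst, width=1000):
    """ImageMagick command that scales an image to `width` pixels wide,
    keeping the aspect ratio (the height is left for ImageMagick to pick)."""
    return ["convert", src, "-resize", "%dx" % width, dst]


def tesseract_command(image, out_base, psm=6, legacy_flag=True):
    """Tesseract command line for a given page segmentation mode.

    PSM 6 treats the image as a single uniform block of text.
    legacy_flag=True emits the 3.x spelling `-psm`; set it to False
    for the `--psm` spelling used by later releases.
    """
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image, out_base, flag, str(psm)]
```

Each list can be handed to `subprocess.run(..., check=True)`; Tesseract then writes its text to `out_base` plus a `.txt` extension.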

Food for thought.

Allistair C

Nov 13, 2014, 11:45:34 AM

to tesser...@googlegroups.com

Do you have higher-resolution images to work with? That's one issue here: the edges of your text are very fuzzy, and at that resolution it's pretty hard for Tesseract. You can also experiment with thresholding and opening (erosion/dilation) to thicken some of your lines (using e.g. ImageMagick or OpenCV) before running Tesseract.
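To make the thresholding and opening idea concrete, here is a toy pure-Python sketch on a tiny grayscale grid. It is only meant to show what the operations do; real images would go through ImageMagick (`-threshold`, `-morphology Open`) or OpenCV (`cv2.morphologyEx`) instead. The function names, the 3x3 neighbourhood and the cutoff of 128 are assumptions.

```python
def threshold(gray, cutoff=128):
    """Binarize a grayscale grid: 1 = ink (dark pixel), 0 = background."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def _morph(img, combine):
    """Apply `combine` (min or max) over each pixel's 3x3 neighbourhood.
    Neighbourhoods are clipped at the image border."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[ny][nx]
                      for ny in range(max(0, y - 1), min(h, y + 2))
                      for nx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = combine(window)
    return out


def erode(img):
    """Ink survives only where the whole neighbourhood is ink."""
    return _morph(img, min)


def dilate(img):
    """Ink spreads to every pixel that has an ink neighbour."""
    return _morph(img, max)


def opening(img):
    """Erosion followed by dilation: removes specks, keeps solid shapes."""
    return dilate(erode(img))


# 7x7 white image with a 3x3 dark block and one isolated dark speck.
gray = [[255] * 7 for _ in range(7)]
for y in range(1, 4):
    for x in range(1, 4):
        gray[y][x] = 0
gray[5][5] = 0

binary = threshold(gray)
opened = opening(binary)
```

In this example the opening removes the isolated speck and leaves the 3x3 block intact. Opening by itself cleans up noise rather than thickening strokes; applying `dilate` to the ink afterwards is the step that would make thin strokes heavier.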


shree

Nov 13, 2014, 1:08:53 PM

to tesser...@googlegroups.com

Also take a look at the pre-processing method described at https://github.com/tleyden/open-ocr/wiki/Stroke-Width-Transform-In-Action
