High error rate even with a good-quality, low-noise image

Alex Szeto

Mar 30, 2016, 11:34:14 AM

to tesseract-ocr

I am working on a license plate recognition project and am having trouble improving the accuracy of the OCR.

Attached is one of the images I used; the result is very poor.

Tesseract version: 3.0.3

The command I used: tesseract Untitled.jpg out -psm 9

The result is SXUSBBB, while I am expecting 5X0S888.

I have run some experiments and found that some character pairs are easily confused by Tesseract.

For example: '0' becomes 'U'; '5' and 'S'; 'B' and '8'.

Are there any methods or parameters I can set to improve the result?

Thanks a lot; I would really appreciate any advice.

Untitled.png

Tom Morris

Mar 30, 2016, 6:43:11 PM

to tesseract-ocr

Looking at the image and result, it's pretty easy to see what the confusion is, particularly for a recognizer tuned to deal with a wide variety of fonts, and given the fact that you're not attempting to recognize actual words, but arbitrary strings of symbols.

Have you considered building something on OpenCV or a similar tool, where you could take advantage of a) the very small number of symbols and their specific shapes and b) knowledge of the specific ordering of numbers and letters, plus any other domain knowledge that's available?

Tom

Art Rhyno.

Mar 31, 2016, 2:35:26 PM

to tesser...@googlegroups.com

Hi,

Tesseract is at least detecting the blob for each character correctly. One trick is to use the coordinates of each character to extract individual images, invert the colours, and use single-character mode (-psm 10) for the recognition. I think you have to dig into the API to get the character coordinates, or use the makebox option (e.g. tesseract license.png license makebox). If you isolate each character, Tesseract usually recognizes it. This is not recommended for a lot of text, but it may be worthwhile in this case.

art
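A minimal sketch of turning makebox output into crop rectangles, for anyone trying the trick above. It assumes the usual box-file layout of one symbol per line as "char left bottom right top page", with y measured upward from the bottom edge of the image; the sample lines and the 60-pixel image height are made up for illustration.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    top: int      # measured from the top edge, as image libraries expect
    right: int
    bottom: int


def parse_box_lines(text: str, image_height: int) -> List[CharBox]:
    """Convert box-file lines into top-left-origin crop rectangles."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        char = parts[0]
        left, bottom, right, top = (int(p) for p in parts[1:5])
        # Box files count y upward from the bottom edge, so flip both values.
        boxes.append(CharBox(char, left, image_height - top,
                             right, image_height - bottom))
    return boxes


# Hypothetical box output for a 60-pixel-tall plate image.
sample = "5 10 8 40 52 0\nX 45 8 75 52 0\n"
for box in parse_box_lines(sample, image_height=60):
    print(box)
```

Each CharBox can then be used to slice one character out of the plate image before running single-character recognition on it.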

Alex Szeto

Apr 1, 2016, 4:57:12 AM

to tesseract-ocr

Thank you for your advice. I am actually using OpenCV for my project.

a: Can I have more detail on how to take advantage of the symbols and their specific shapes?

I have used the whitelist option in Tesseract to eliminate some impossible results.

Recently I used OpenCV to make the font thinner (more like a normal font), and the result improved for characters like '8'. For '0', however, there is still a 50% chance of getting 'U'. I really have no clue why it gets a 'U' instead of 'D' ('O' is eliminated).

b: Unfortunately, in my case, Hong Kong license plates have no fixed ordering of letters and numbers, so no prior knowledge like this can be used.
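For readers who have not used the whitelist mentioned above, here is a rough sketch of how such a command line could be assembled. It assumes the Tesseract 3.x option spelling (-psm, plus -c to set tessedit_char_whitelist); the file names and the exact alphabet are placeholders, with only 'O' removed because that is the one exclusion stated in this thread.

```python
import string
from typing import List


def build_tesseract_command(image_path: str, output_base: str,
                            psm: int, whitelist: str) -> List[str]:
    """Build a Tesseract 3.x command line with a character whitelist."""
    return [
        "tesseract", image_path, output_base,
        "-psm", str(psm),  # 3.x spelling; later releases use --psm
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]


# Assumed plate alphabet: digits plus capitals, minus 'O' as in the post.
PLATE_CHARS = string.digits + string.ascii_uppercase.replace("O", "")

cmd = build_tesseract_command("char.png", "out", 10, PLATE_CHARS)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True), with tesseract on PATH.
```

A whitelist only removes impossible outputs. It cannot separate pairs such as '5' and 'S' when both are legal plate characters.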

Alex Szeto

Apr 1, 2016, 5:09:51 AM

to tesseract-ocr

Hi art, in fact my program already does your trick: isolating each character and using -psm 10. However, the result hasn't gotten better.

I have one question about this. When using -psm 10, what background color should be used? I suspect Tesseract sometimes does not know whether black or white is the background, and then gets a bad result.

Is there an option in Tesseract for setting the background color or text color? I have found some related parameters, but I don't know what values should be input.

For example, the preset values don't make much sense to me: why is it '2' for editor_image_text_color, etc.? I would really appreciate it if you could help. Thank you.

name	value	description
editor_image_word_bb_color	7	Word bounding box colour
editor_image_blob_bb_color	4	Blob bounding box colour
editor_image_text_color	2	Correct text colour

ref : http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version

Alex

Art Rhyno.

Apr 1, 2016, 7:20:46 AM

to tesser...@googlegroups.com

> I have one question about this. when using -psm 10, what background color should be used?

Hi Alex,

I did a simple invert but I have never burrowed very deeply into why it seems to make a difference for single characters. I usually add a margin to the character when extracted as well but opencv is probably the way to go in this case. Good luck!

art

Alex Szeto

Apr 1, 2016, 10:28:52 AM

to tesseract-ocr

Hi Art,

Thank you very much for the reply.

Does adding a margin to the character mean adding a border to the image, so that the image edge won't touch the character?

If yes, I have done this trick with OpenCV.

Alex

Art Rhyno.

Apr 1, 2016, 10:44:03 AM

to tesser...@googlegroups.com

Hi Alex,

Yes, some spacing around the character seems to help.

art
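The invert-and-margin preparation discussed above can be sketched in plain Python on a tiny grayscale grid. This is only an illustration: the margin width and fill value are assumptions (the thread does not give numbers), and on real images OpenCV's copyMakeBorder would be the more natural tool for the padding step.

```python
from typing import List

Image = List[List[int]]  # rows of 8-bit grayscale pixels, 0 = black


def invert(image: Image) -> Image:
    """Swap dark and light: each pixel p becomes 255 - p."""
    return [[255 - p for p in row] for row in image]


def add_margin(image: Image, margin: int, fill: int) -> Image:
    """Surround the image with `margin` pixels of the `fill` value."""
    width = len(image[0]) + 2 * margin
    blank_rows = [[fill] * width for _ in range(margin)]
    padded = [[fill] * margin + row + [fill] * margin for row in image]
    return blank_rows + padded + [row[:] for row in blank_rows]


# A tiny 2x2 "character": white strokes on a black background.
glyph = [[255, 0],
         [0, 255]]
prepared = add_margin(invert(glyph), margin=1, fill=255)
for row in prepared:
    print(row)
```

The fill value should match whatever the background is after inverting, so that the border blends in rather than forming a frame around the character.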

Alex Szeto

Apr 2, 2016, 3:13:26 AM

to tesseract-ocr

Hi art,

Yes, it does help. Thank you.

Alex

Alex Szeto

Apr 2, 2016, 3:21:30 AM

to tesseract-ocr

Thank you, Tom and art, for your kind advice.

Eventually I obtained a much better result by using training data from the openalpr project.

In fact, the font used here (HK) is different from the US one (the HK font is more square), so I changed my character images' aspect ratio to be more like that of US plates.

I am really fortunate that this little trick worked.
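As a rough sketch of the aspect-ratio adjustment described above: keep the character's height and compute a new width for a chosen width-to-height ratio. The helper name and the 0.5 ratio are hypothetical; the thread does not say which ratio was actually used.

```python
from typing import Tuple


def size_for_aspect_ratio(width: int, height: int,
                          target_ratio: float) -> Tuple[int, int]:
    """Return (new_width, height) so that new_width / height ~= target_ratio."""
    if width <= 0 or height <= 0 or target_ratio <= 0:
        raise ValueError("dimensions and ratio must be positive")
    new_width = max(1, round(height * target_ratio))
    return new_width, height


# Hypothetical numbers: a 40x50 character squeezed to a 0.5 width/height ratio.
print(size_for_aspect_ratio(40, 50, 0.5))
```

With OpenCV, the resulting (width, height) pair can be passed as the dsize argument of cv2.resize to resample the character image.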
