Possible to prioritise some characters over others during OCR?

Diederik Hattingh

May 31, 2016, 2:10:28 PM

to tesseract-ocr

I have a case where my tesseract isn't detecting URLs as expected. (More details in my SO question.)

The http:// part is being recognised as http:II. If I specify a whitelist of characters that doesn't include capital I, Tesseract recognises the string correctly.
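As a rough illustration of the whitelist workaround described above, the sketch below builds a Tesseract command line that restricts recognition to URL characters minus capital I. It assumes the `tesseract` CLI accepts `-c tessedit_char_whitelist=...`; the file name `url.png`, the output base `out`, the helper names, and the exact character set are all hypothetical choices made for illustration.

```python
import string


def url_whitelist(exclude="I"):
    """Return characters commonly seen in URLs, minus the excluded ones."""
    # Assumed character set; adjust it to the URLs you expect to see.
    allowed = string.ascii_letters + string.digits + ":/.-_~?&=%#+"
    return "".join(ch for ch in allowed if ch not in exclude)


def build_tesseract_command(image_path, whitelist, output_base="out"):
    """Build (but do not run) an argv list for the tesseract CLI."""
    return [
        "tesseract", image_path, output_base,
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]


if __name__ == "__main__":
    # Pass the list to subprocess.run() rather than a shell string,
    # because the whitelist contains shell metacharacters.
    print(build_tesseract_command("url.png", url_whitelist()))
```

The drawback noted in the post still applies: a whitelist removes capital I everywhere, not only after `http:`.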

Is it possible for me to specify a priority of characters to recognize?

Any other ideas on how to tweak the parameters to increase my accuracy?

The image is incorrectly read as:

http:II11111111111111111111111111111111111
1111111111111111111.coml

Ashish Goel

Jun 1, 2016, 1:01:22 AM

to tesseract-ocr

I would also like to find a way to avoid such cases. I too am seeing extra white space, upper/lower case mismatches, and wrongly detected characters.

Stef

Jun 2, 2016, 5:21:51 PM

to tesseract-ocr

You can resolve the ambiguity using the unicharambigs file. For details, see my SO answer to your SO question.

Stef

John Muccigrosso

Jun 3, 2016, 12:31:47 PM

to tesseract-ocr

On Thursday, June 2, 2016 at 5:21:51 PM UTC-4, Stef wrote:

> You can resolve the ambiguity using the unicharambigs file, for details see my SO answer to your SO question.
>
> Stef

I'm curious about this as well. Could you post a link to this discussion?

Thanks.

Stef

Jun 3, 2016, 12:39:38 PM

to tesseract-ocr

Here you are: SO answer.

Diederik Hattingh

Jun 15, 2016, 3:58:15 PM

to tesseract-ocr

Hi Stef,
Thanks for the reply (here and on SO).

The fix mostly works, but unfortunately Tesseract still sometimes ignores the unicharambigs file I set for it.

For example, I have two images. The only difference between the files is the border (spacing) around the text.

In my eng.unicharambigs file I have added the following lines:

3    : I I    3    : / /    1
3    : / I    3    : / /    1
3    : I /    3    : / /    1
5    . c o m l    5    . c o m /    1
3    : / l    3    : / /    1
3    : l /    3    : / /    1
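The entries above appear to follow the v1 unicharambigs layout: tab-separated fields giving the length of the misread sequence, its characters separated by spaces, the length of the replacement, its characters, and a type flag where 1 requests a mandatory substitution. A minimal sketch for generating such lines follows; the helper name `ambig_line` is hypothetical, and the layout should be checked against the unicharambigs documentation for your Tesseract version.

```python
def ambig_line(wrong, correct, mandatory=True):
    """Format one unicharambigs entry as tab-separated fields."""
    # len() counts code points, which matches Tesseract's character
    # count only for simple text such as the ASCII used here.
    fields = [
        str(len(wrong)), " ".join(wrong),
        str(len(correct)), " ".join(correct),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


# The six substitutions listed above, as (misread, intended) pairs.
PAIRS = [
    (":II", "://"), (":/I", "://"), (":I/", "://"),
    (".coml", ".com/"), (":/l", "://"), (":l/", "://"),
]

if __name__ == "__main__":
    for wrong, correct in PAIRS:
        print(ambig_line(wrong, correct))
```

Generating the lines this way keeps the counts in step with the strings and guarantees real tab separators, which are easy to lose when copying entries by hand.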

When I run Tesseract on the file without spacing, I get the following output:

http:II11111111111111111111111111111111111111111
1111111111111111111.com/

When I run Tesseract on the file with spacing, I get the correct output:

http://11111111111111111111111111111111111111111
1111111111111111111.com/

Another example of spacing (or something else?) making a difference:

Smaller border

Larger border:

Both of these files have spacing around the text, with the first image having less. (The font also differs very slightly between the two images.)

Running Tesseract on the first file gives a mostly correct result: http://alphaGl.com/primenumbershittingbearl (except for 6 -> G and the last / becoming l).

On the second image I get the output http://alpha61.comIprimenumbershittingbearl. It seems as if the unicharambigs file is ignored for the .com/ case. It doesn't do the substitution as specified.

Can you think of anything that would fix this problem?

Bojidar Stanchev

Jun 16, 2016, 6:36:37 AM

to tesseract-ocr

Well, you could just run a simple program on Tesseract's output to find and correct those mistakes.

In your case, if you expect http:// and you see http:II, then it should be a no-brainer to change it to http://. It's an easy case because those two slashes are always there.

Another thing is that after .com there is probably either nothing or a slash, so there are not many cases there either.

Give it a quick search; maybe there is already a program that checks URLs.

tl;dr: with URLs being standardised, it is pretty easy to create a program that detects errors in links and corrects them.
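A minimal sketch of the kind of post-processing suggested here, assuming the OCR output is a single http or https URL and that only the two confusions seen in this thread need fixing (the scheme separator, and the character right after `.com`). The function name `fix_ocr_url` and both patterns are illustrative assumptions, not an existing tool.

```python
import re

# "://" misread as any two of / I l | right after the scheme.
_SCHEME = re.compile(r"^(https?):[/Il|]{2}")
# A slash after ".com" misread as I, l or |.
_AFTER_COM = re.compile(r"\.com[Il|]")


def fix_ocr_url(text):
    """Repair common OCR confusions in a URL string."""
    text = _SCHEME.sub(r"\1://", text, count=1)
    text = _AFTER_COM.sub(".com/", text, count=1)
    return text


if __name__ == "__main__":
    print(fix_ocr_url("http:II1111.coml"))
    print(fix_ocr_url("http://alpha61.comIprimenumbershittingbearl"))
```

The second pattern would also rewrite a host whose name really continues with one of those letters after `.com`, and it cannot tell whether a trailing l at the end of a path was meant to be a slash, so this only suits input where the URL shape is known in advance.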

Diederik Hattingh

Jul 7, 2016, 2:33:41 PM

to tesseract-ocr, bojidar...@gmail.com

Hi Bojidar,
Thanks for your reply. Yes, you are right: I can rely on the well-defined structure of URLs to work around this problem. But having set the values in the unicharambigs file, I was expecting Tesseract's output to be reliable, and I'm disappointed that it is not. Another user with the same problem, but without such easy-to-fix output strings, would be more stuck than I am.

It still looks like a tesseract bug to me.

Anyway, thanks again for the effort.

Regards
