poor recognition of 'fi'

Rick Leir

Jun 8, 2015, 12:22:10 PM

to tesser...@googlegroups.com

This problem with ligatures or digraphs appears frequently. How can I avoid it? I want plain output text, without ligatures. It is possible that the 'f' and 'i' are touching in the image. Is there a way to pass hints to Tesseract? I am using version 3.03 on Linux. TIA

image text: fish
OCR: "\x{fb01}sh";
utf8: ﬁsh

image text: flambeau
OCR: "\x{fb02}ambeau,";
utf8: ﬂambeau,

OCR: "\x{fb01}xed";
utf8: ﬁxed

OCR: "arti\x{fb01}cial";
utf8: artiﬁcial

Greg Dunkel

Jun 8, 2015, 12:30:05 PM

to tesser...@googlegroups.com

Since 'fi' and other ligatures generally get OCRed to their own dedicated characters, I just run a post-OCR sed script on Linux to take care of them.

--

/greg

Rick Leir

Jun 9, 2015, 1:23:44 PM

to tesser...@googlegroups.com

Hi Greg,
Which ligatures do you run into with Tesseract that you need to fix after OCR?
Thanks -- Rick

For anyone having trouble with utf-8 in sed, see:
http://stackoverflow.com/questions/27072558/sed-and-utf-8-encoding

John Slade

Jun 10, 2015, 9:23:12 AM

to tesser...@googlegroups.com

Have a look at the options "tessedit_char_blacklist" and "tessedit_char_whitelist". You could blacklist any ligatures you aren't interested in.

Or go the other way and whitelist only the characters you want. For instance, you could whitelist just the printable ASCII characters.
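
A minimal sketch of how the blacklist suggestion might be passed on the command line. It assumes a Tesseract build that accepts "-c configvar=value" and that the "tesseract" binary is on the PATH. The helper name build_tesseract_command and the file names page.png / page are hypothetical. Whether blacklisting a ligature makes Tesseract emit the separate letters, rather than some other substitution, is not something this sketch can promise.

```python
import shlex

# The seven Latin ligatures U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, long s-t, st).
LATIN_LIGATURES = "".join(chr(cp) for cp in range(0xFB00, 0xFB07))


def build_tesseract_command(image_path, output_base, blacklist=LATIN_LIGATURES):
    """Return an argv list asking Tesseract not to emit the blacklisted characters.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        "tessedit_char_blacklist=" + blacklist,
    ]


cmd = build_tesseract_command("page.png", "page")
# Show the command; to run it, pass the list to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Passing the arguments as a list (no shell) sidesteps quoting problems with the non-ASCII ligature characters.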

John

Rick Leir

Jun 15, 2015, 1:55:08 PM

to tesser...@googlegroups.com

John: thanks, I had not seen that! How does "tessedit_char_blacklist" affect OCR speed? Accuracy? I want to use it, but feel as if I am walking on thin ice.

Here is a list of Latin ligatures from the Unicode names list (http://www.unicode.org/Public/UNIDATA/NamesList.txt). Which ones do you commonly see in Tesseract output?

FB00    LATIN SMALL LIGATURE FF
    # 0066 0066
FB01    LATIN SMALL LIGATURE FI
    # 0066 0069
FB02    LATIN SMALL LIGATURE FL
    # 0066 006C
FB03    LATIN SMALL LIGATURE FFI
    # 0066 0066 0069
FB04    LATIN SMALL LIGATURE FFL
    # 0066 0066 006C
FB05    LATIN SMALL LIGATURE LONG S T
    # 017F 0074
FB06    LATIN SMALL LIGATURE ST
    # 0073 0074

There are also Armenian, Hebrew, and Arabic ligatures, among others.
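
One way to answer the "which ones show up" question empirically is to count them in existing OCR output. The sketch below assumes the output has already been decoded to a Unicode string. It flags every character whose Unicode name contains the word LIGATURE, so it also catches the Armenian, Hebrew, and Arabic forms, and it will report characters such as U+0153 (LATIN SMALL LIGATURE OE) that may be legitimate in some languages. The function name count_ligatures is made up for this example.

```python
import unicodedata
from collections import Counter


def count_ligatures(text):
    """Count characters whose Unicode name contains the word LIGATURE."""
    counts = Counter()
    for ch in text:
        # The "" default avoids a ValueError for characters that have no name.
        name = unicodedata.name(ch, "")
        if "LIGATURE" in name:
            counts[(f"U+{ord(ch):04X}", name)] += 1
    return counts


# The sample strings reported at the top of this thread.
sample = "\ufb01sh \ufb02ambeau, \ufb01xed arti\ufb01cial"
for (codepoint, name), n in count_ligatures(sample).most_common():
    print(codepoint, name, n)
# prints:
# U+FB01 LATIN SMALL LIGATURE FI 3
# U+FB02 LATIN SMALL LIGATURE FL 1
```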

Tom Morris

Jun 16, 2015, 12:42:59 PM

to tesser...@googlegroups.com

It's difficult to tell what the problem is without any example images. Are you saying that there are ligatures in the image and you don't want them recognized as such, or that there are no ligatures but the characters are touching due to low resolution, a poor-quality scan, over-inking, very tight kerning, or ...?

If everything else is satisfactory except for the occasional composed character being generated, why not just add a simple post-processing step to decompose the ligatures into their constituent characters? It's a straight string substitution for characters which are not confusable with anything else.
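
As a rough illustration of that substitution step, here is a sketch in Python. The mapping covers only the seven Latin ligatures from the Unicode names list quoted earlier in the thread. The names LIGATURE_MAP and decompose_ligatures are invented for this example, and the input is assumed to be an already-decoded Unicode string.

```python
# Ligature code point -> constituent letters, per the Unicode names list.
LIGATURE_MAP = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
    0xFB05: "\u017ft",  # long s + t
    0xFB06: "st",
}


def decompose_ligatures(text):
    """Replace each Latin ligature with its constituent letters."""
    # str.translate accepts a dict keyed by code point with string replacements.
    return text.translate(LIGATURE_MAP)


print(decompose_ligatures("\ufb01sh, \ufb02ambeau, arti\ufb01cial"))
# prints: fish, flambeau, artificial
```

unicodedata.normalize("NFKC", text) would also expand these ligatures, but it rewrites many other compatibility characters too, so an explicit table is the more conservative choice.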

Tom

Rick Leir

Jun 16, 2015, 2:53:27 PM

to tesser...@googlegroups.com

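use Encode qw(decode encode);
...
    # Decode the raw hOCR bytes into characters so \x{...} matches code points.
    $hocr = decode( 'UTF-8', $rawhocr );

    # Replace each ligature with its constituent letters.
    $hocr =~ s/\x{FB00}/ff/g;
    $hocr =~ s/\x{FB01}/fi/g;
    $hocr =~ s/\x{FB02}/fl/g;

    # Re-encode to UTF-8 octets for output.
    $octets = encode( 'UTF-8', $hocr );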

Ryan Baumann

Jun 18, 2015, 9:53:41 AM

to tesser...@googlegroups.com

For Latin OCR, I found I got vastly better results using unicharambigs with mandatory replacements, e.g.: https://github.com/ryanfb/latinocr-lattraining/blob/master/ligatures.unicharambigs
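
For readers who have not seen a unicharambigs file, the sketch below generates entries in what I understand to be the Tesseract 3.x "v1" layout: tab-separated fields giving the source character count, the source characters, the target character count, the target characters (space-separated), and a type flag where 1 is assumed to mean a mandatory replacement. It illustrates the general format only and is not a copy of the linked file. The function name unicharambigs_lines is hypothetical. The target letters would need to exist in the language's unicharset, and the file would still have to be packed into the traineddata.

```python
# Ligature -> constituent letters; only the f-ligatures are shown here.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

MANDATORY = 1  # assumed meaning of the last field: 1 = always replace


def unicharambigs_lines(mapping):
    """Build v1-style lines: source count, source, target count, target, type."""
    lines = ["v1"]
    for ligature, letters in mapping.items():
        fields = [
            "1",                # one source character (the ligature)
            ligature,
            str(len(letters)),  # number of target characters
            " ".join(letters),  # target characters, space-separated
            str(MANDATORY),
        ]
        lines.append("\t".join(fields))
    return lines


print("\n".join(unicharambigs_lines(LIGATURES)))
```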