John: thanks, I had not seen that! How does "tessedit_char_blacklist" affect OCR speed? And accuracy? I want to use it, but I feel as if I am walking on thin ice.
Here is a list of Latin ligatures from
http://www.unicode.org/Public/UNIDATA/NamesList.txt . Which ones do you commonly see in Tesseract output?
FB00 LATIN SMALL LIGATURE FF
# 0066 0066
FB01 LATIN SMALL LIGATURE FI
# 0066 0069
FB02 LATIN SMALL LIGATURE FL
# 0066 006C
FB03 LATIN SMALL LIGATURE FFI
# 0066 0066 0069
FB04 LATIN SMALL LIGATURE FFL
# 0066 0066 006C
FB05 LATIN SMALL LIGATURE LONG S T
# 017F 0074
FB06 LATIN SMALL LIGATURE ST
# 0073 0074
There are also Armenian, Hebrew and Arabic ligatures.
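
If blacklisting turns out to be too risky, one alternative might be to leave recognition alone and expand the ligatures afterwards. Here is a minimal sketch in Python, assuming the OCR output is already available as a Unicode string. The names LATIN_LIGATURES and expand_ligatures are made up for illustration and are not part of Tesseract; the mapping is just the seven NamesList.txt entries above.

```python
# Hypothetical post-processing helper, not part of Tesseract.
# Keys are the ligature code points, values are the decompositions
# given in the NamesList.txt entries quoted above.
LATIN_LIGATURES = {
    0xFB00: "\u0066\u0066",        # ff
    0xFB01: "\u0066\u0069",        # fi
    0xFB02: "\u0066\u006C",        # fl
    0xFB03: "\u0066\u0066\u0069",  # ffi
    0xFB04: "\u0066\u0066\u006C",  # ffl
    0xFB05: "\u017F\u0074",        # long s + t
    0xFB06: "\u0073\u0074",        # st
}


def expand_ligatures(text: str) -> str:
    """Replace each Latin ligature with its component letters."""
    # str.translate accepts a dict mapping code points to replacement strings.
    return text.translate(LATIN_LIGATURES)


if __name__ == "__main__":
    sample = "o\uFB03ce \uFB02oor"
    print(expand_ligatures(sample))
```

Python's unicodedata.normalize("NFKC", text) would also expand these, along with many of the Armenian, Hebrew and Arabic forms, but it rewrites a lot of other compatibility characters too and turns the long s into a plain s, so an explicit table seems safer if only these seven matter.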