text2image creates char boxes for 'fi' and 'fl'. Why?

Brais Gabín Moreira

Sep 3, 2016, 5:23:55 AM

to tesseract-ocr

Hi, I'm trying to train Tesseract, but text2image creates a single box for 'fi' or 'fl'. Why does it treat 'fi' and 'fl' as a single character instead of two? How can I fix this?

fuzzy7k

Sep 3, 2016, 1:45:21 PM

to tesseract-ocr

It's a language thing: https://en.wikipedia.org/wiki/Typographic_ligature
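To illustrate the ligature point, here is a minimal Python sketch (not from the thread). It assumes the glyphs in question are the Unicode code points U+FB01 and U+FB02, and uses the standard `unicodedata` module to show that each is a single character which NFKC normalization expands to two letters. The helper name `expand_ligatures` is made up for this example.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as U+FB01/U+FB02 via NFKC.

    Hypothetical helper for illustration; NFKC also rewrites other
    compatibility characters, not only ligatures.
    """
    return unicodedata.normalize("NFKC", text)


for lig in ("\ufb01", "\ufb02"):
    expanded = expand_ligatures(lig)
    # Each ligature is one code point; its expansion is two letters.
    print(f"U+{ord(lig):04X} {unicodedata.name(lig)}: "
          f"{len(lig)} code point -> {expanded!r} ({len(expanded)} letters)")
```

So a box labelled with the ligature glyph and a box labelled with the two plain letters can both describe the same single shape in the rendered image.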

Try specifying a specific language?

This parameter might be related, since its description mentions glyphs:
segment_penalty_dict_nonword 1.25 Score multiplier for glyph fragment segmentations which do not match a dictionary word (lower is better).

Let me know what you find. I ran into this recently, but I have been chasing other issues and haven't verified a solution.

fuzzy7k

Sep 4, 2016, 4:19:34 PM

to tesseract-ocr

My earlier successes were definitely font related. Use a blacklist or a whitelist:

-c tessedit_char_blacklist=ﬁﬂ

https://groups.google.com/d/topic/tesseract-ocr/jO_4ZMMK9xw/discussion

Brais Gabín Moreira

Sep 8, 2016, 5:18:17 AM

to tesseract-ocr

How can I pass a blacklist to text2image? -c tessedit_char_blacklist=ﬁﬂ doesn't work for me.

My problem is that text2image writes things like this:

fl 133 162 159 199 5

I tried --ligatures=true, but the result is:

ﬂ 133 162 159 199 5

I'll continue with my research...
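One possible workaround, sketched in Python under stated assumptions: post-process the box file and split any box whose label expands to more than one letter. This assumes the usual box line layout `glyph left bottom right top page`, assumes Latin text, and divides the box width evenly between the letters, which is a crude guess at where one letter ends and the next begins. The names `Box`, `parse_box_line`, `is_multi_letter` and `split_evenly` are hypothetical, and whether Tesseract training actually benefits from the result is untested here.

```python
import unicodedata
from typing import List, NamedTuple


class Box(NamedTuple):
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line, assuming 'glyph left bottom right top page'."""
    glyph, *nums = line.strip().rsplit(" ", 5)
    return Box(glyph, *(int(n) for n in nums))


def is_multi_letter(box: Box) -> bool:
    """True for 'fl' as well as for the single ligature code point."""
    return len(unicodedata.normalize("NFKC", box.glyph)) > 1


def split_evenly(box: Box) -> List[Box]:
    """Split a box into one box per letter, dividing the width evenly.

    Assumption: equal widths. Real glyph boundaries will differ.
    """
    letters = unicodedata.normalize("NFKC", box.glyph)
    width = box.right - box.left
    n = len(letters)
    return [
        Box(ch, box.left + width * i // n, box.bottom,
            box.left + width * (i + 1) // n, box.top, box.page)
        for i, ch in enumerate(letters)
    ]


for line in ("fl 133 162 159 199 5", "\ufb02 133 162 159 199 5"):
    box = parse_box_line(line)
    parts = split_evenly(box) if is_multi_letter(box) else [box]
    for part in parts:
        print(" ".join(str(field) for field in part))
```

Both sample lines from the post above go through the same path, since the plain 'fl' label and the ligature label normalize to the same two letters.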
