text2image creates char boxes for 'fi' and 'fl'. Why?

Brais Gabín Moreira

Sep 3, 2016, 5:23:55 AM
to tesseract-ocr
Hi, I'm trying to train Tesseract, but text2image creates a single box for 'fi' or 'fl'. Why does it treat 'fi' and 'fl' as a single character instead of two? How can I fix this?

fuzzy7k

Sep 3, 2016, 1:45:21 PM
to tesseract-ocr
It's a language thing: https://en.wikipedia.org/wiki/Typographic_ligature

Try specifying a language explicitly?

This parameter might be related, since its description mentions glyphs:
segment_penalty_dict_nonword    1.25    Score multiplier for glyph fragment segmentations which do not match a dictionary word (lower is better).

Let me know what you find. I had this occur recently but have been chasing other issues and haven't verified a solution.

fuzzy7k

Sep 4, 2016, 4:19:34 PM
to tesseract-ocr
My earlier successes were definitely font-related. Use a blacklist or a whitelist:

-c tessedit_char_blacklist=fifl

https://groups.google.com/d/topic/tesseract-ocr/jO_4ZMMK9xw/discussion
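
For reference, a minimal sketch of how that option could be passed to the tesseract command line from Python. It only builds the argument list. The helper name `tesseract_blacklist_cmd` and the file names are hypothetical, and it assumes the intent is to blacklist the single ligature code points U+FB01 and U+FB02 rather than the ASCII letters f, i and l (a blacklist written with plain ASCII letters would exclude those letters everywhere). Note that `tessedit_char_blacklist` is a recognition variable of `tesseract` itself; nothing in this thread suggests that text2image accepts it.

```python
def tesseract_blacklist_cmd(image, output_base, blacklist="\ufb01\ufb02"):
    """Build a tesseract command line that sets a character blacklist.

    The default blacklist holds the two ligature code points
    U+FB01 (fi ligature) and U+FB02 (fl ligature); this default is an
    assumption about what the thread intends, not something it states.
    """
    return [
        "tesseract",
        image,
        output_base,
        "-c",
        f"tessedit_char_blacklist={blacklist}",
    ]


# Hypothetical file names, for illustration only.
cmd = tesseract_blacklist_cmd("page.tif", "page")
# The list can be handed to subprocess.run(cmd, check=True)
# on a machine where tesseract is installed.
```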

Brais Gabín Moreira

Sep 8, 2016, 5:18:17 AM
to tesseract-ocr
How can I set a blacklist for text2image? -c tessedit_char_blacklist=fifl doesn't work for me.

My problem is that text2image writes things like this:

fl 133 162 159 199 5

I tried with --ligatures=true but the result is this one:

fl 133 162 159 199 5

I'll continue with my research...
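
To see which entries of a generated box file are affected, a small sketch like the following may help. It assumes the usual box line layout (glyph, left, bottom, right, top, page, separated by single spaces) and flags any box whose glyph stands for more than one character after Unicode NFKC normalization, which covers both a literal two-letter glyph and a single ligature code point such as U+FB02. The names `Box`, `parse_box_line` and `is_merged_glyph` are made up for this example, and it only detects such boxes; it does not split them.

```python
import unicodedata
from typing import NamedTuple


class Box(NamedTuple):
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box file line: glyph left bottom right top page."""
    # Split from the right so the five numeric fields are taken first
    # and whatever remains is the glyph.
    glyph, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(glyph, int(left), int(bottom), int(right), int(top), int(page))


def is_merged_glyph(box: Box) -> bool:
    """Return True if a single box represents more than one character."""
    # NFKC expands compatibility ligatures (U+FB02 becomes 'f' + 'l').
    # It also expands some other characters, e.g. vulgar fractions,
    # so treat a hit as a candidate to inspect, not as proof.
    expanded = unicodedata.normalize("NFKC", box.glyph)
    return len(expanded) > 1


box = parse_box_line("fl 133 162 159 199 5")
print(box.glyph, is_merged_glyph(box))
```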