Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

Zdenko Podobný

Mar 31, 2012, 4:12:50 PM
to tesser...@googlegroups.com

On 31.03.2012 16:17, klo wrote:
In my simple testing, I find this is the most common problem. Is there a way to
instruct tesseract not to use those glyphs without limiting it to ASCII?

I use tesseract 3.01 BTW

Put them on the blacklist with the variable tessedit_char_blacklist (search the forum if you do not know how).

Zdenko

klo

Apr 1, 2012, 5:16:59 AM
to tesser...@googlegroups.com, zde...@gmail.com
Thanks. I added it to my tesseract configuration file and it works great.

Cheers

klo uo

Apr 29, 2013, 9:21:16 AM
to tesser...@googlegroups.com
Michael,

For example, add this line to your config file:

tessedit_char_blacklist    ﬁﬂ

I don't know how Gmail will represent these characters, but make sure the file is in UTF-8, I guess.
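
A minimal sketch of one way to write such a config file from Python so that the encoding is explicit. The helper name `write_blacklist_config` is hypothetical; only the `tessedit_char_blacklist` variable comes from this thread. The ligatures are spelled as escapes so that no editor or mail client can silently turn them into plain letters (which would blacklist f, i and l instead).

```python
from pathlib import Path

# U+FB01 (fi ligature) and U+FB02 (fl ligature), written as escapes so this
# source stays pure ASCII no matter how it is copied around.
LIGATURES = "\ufb01\ufb02"


def write_blacklist_config(path, chars=LIGATURES):
    """Write a one-line tesseract config file, always encoded as UTF-8.

    Hypothetical helper: returns the line that was written so the caller
    can log or inspect it.
    """
    line = "tessedit_char_blacklist " + chars + "\n"
    Path(path).write_text(line, encoding="utf-8")
    return line
```

The resulting file would then be given to tesseract as a trailing config argument, for example `tesseract page.png out -l eng noligs` (where `noligs` is a made-up file name). Where tesseract looks for that file may depend on the version and install, so treat the invocation as an assumption to check locally.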


On Mon, Apr 29, 2013 at 9:45 AM, Michael Sander <michael...@gmail.com> wrote:
How did you format your config file? I tried adding the following line and it doesn't seem to work:

tessedit_char_blacklist ﬁ
--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
 

Nick White

Apr 29, 2013, 11:09:32 AM
to tesser...@googlegroups.com
On Mon, Apr 29, 2013 at 07:00:47AM -0700, Michael Sander wrote:
> On a related note, why is tesseract even generating these characters in the
> first place given the fact that I chose English as the training data?

They are English characters. They're ligatures, used in printed
English a lot. Look closely at the nicest printed books you have for
fi and fl and you'll find they're joined in a different way to if
they had just been separate letters.

So it is reasonable for Tesseract to try to recognise when they're
used, as its goal is recognising printed text.

Nick

Greg Dunkel

Apr 29, 2013, 7:48:48 PM
to tesser...@googlegroups.com
I couldn't get the config to work on Ubuntu so I wrote a post-processing sed script to convert the ligatures to two characters.


--
/greg

Michael Sander

Apr 29, 2013, 8:39:57 PM
to tesser...@googlegroups.com
Yes, I'm doing something similar in Python. Do you know of a list of ligatures so I can convert them to ASCII? I know fi and fl are the most popular, but there are probably many more.
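
A minimal sketch of that kind of post-processing step, assuming the OCR output is already available as a Python string. The table covers only the seven Latin ligatures at U+FB00 to U+FB06; the names `LIGATURE_MAP` and `expand_ligatures` are made up for this example.

```python
# Latin ligatures U+FB00..U+FB06 and their plain-letter spellings.
LIGATURE_MAP = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature, simplified to plain "st"
    "\ufb06": "st",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(LIGATURE_MAP)


def expand_ligatures(text):
    """Replace each ligature code point with its separate letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "\ufb01nal \ufb02oor o\ufb03ce"
    print(expand_ligatures(sample))  # prints: final floor office
```

Text that contains no ligatures passes through unchanged, so the function can be applied to every page of output unconditionally.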


Sven Pedersen

Apr 29, 2013, 10:54:56 PM
to tesser...@googlegroups.com
You appear to be a fellow Ithacan! (I no longer live there, but remember it fondly.)

Anyway, other common ligatures include ff, ffi, ffl, fb, fy, ft
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”

Michael Sander

Apr 30, 2013, 3:00:34 AM
to tesseract-ocr
Thanks for the article, very helpful.

And yes, I too remember Ithaca, though most of it was spent downing bottles of Diet Coke late at night in Phillips Hall.

Tom Morris

Apr 30, 2013, 2:06:39 PM
to tesser...@googlegroups.com, me...@cornell.edu
On Monday, April 29, 2013 8:39:57 PM UTC-4, Michael Sander wrote:
Yes, I'm doing something similar in Python. Do you know of a list of ligatures so I can convert them to ASCII? I know fi and fl are the most popular, but there are probably many more.


The list of Unicode ligatures is here: http://www.unicode.org/charts/PDF/UFB00.pdf
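
Python's standard `unicodedata` module carries the decomposition data for the characters in that chart, so the table does not have to be typed by hand. The sketch below is one possible way to list the Latin ligatures with their compatibility decompositions. Restricting the range to U+FB00 through U+FB06 is my own choice, since the rest of that block holds presentation forms for other scripts, and `latin_ligature_table` is a hypothetical name.

```python
import unicodedata


def latin_ligature_table(first=0xFB00, last=0xFB06):
    """Map each ligature in [first, last] to its plain-letter spelling."""
    table = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        # NFKD applies compatibility decompositions recursively, so the
        # long-s ligature U+FB05 also ends up as plain "st".
        table[ch] = unicodedata.normalize("NFKD", ch)
    return table


if __name__ == "__main__":
    for ch, plain in latin_ligature_table().items():
        print("U+%04X  %s  ->  %s" % (ord(ch), unicodedata.name(ch), plain))
```

Running NFKC or NFKD over the whole OCR output would also expand these ligatures, but it rewrites other compatibility characters too (superscripts, fullwidth forms and so on), which is why this sketch only builds a narrow table.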

Go Big Red!