Nick White
May 1, 2014, 2:36:37 PM
to tesser...@googlegroups.com
Hi all,
I noticed recently that my training doesn't do a good job of detecting
superscripted numbers (which occur frequently in the texts I'm
working with, to point to footnotes). They're often misrecognised as
speech marks (e.g. ”).
They will always be difficult, as they're small, and (particularly
with the old books I'm working with) not very clearly printed
compared to the surrounding text. However, I suspect Tesseract can do
a better job.
My current plan is to train variants of the numbers that are
superscripted (smaller and raised above the baseline). This should
help, as Tesseract uses a character's position on the line to help
identify it.
If testing shows this works, I'll add a mode to the text2image
tool to enable superscripted rendering of selected characters.
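To make "smaller and above the baseline" concrete, here is a rough sketch of the geometry only, not of anything text2image does today. It takes a glyph's box in the box-file layout (char left bottom right top page, origin at the bottom-left of the image) and returns the box a superscripted version would occupy. The names Box and superscript_box and the scale and raise_frac defaults are made up for illustration; real values would depend on the font and the books in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Glyph bounding box, box-file style (origin at bottom-left of the image)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int = 0

    def to_line(self) -> str:
        # One box-file line: "char left bottom right top page"
        return f"{self.char} {self.left} {self.bottom} {self.right} {self.top} {self.page}"


def superscript_box(box: Box, baseline: int,
                    scale: float = 0.6, raise_frac: float = 0.5) -> Box:
    """Return the box of `box` shrunk by `scale` and lifted above `baseline`.

    scale and raise_frac are illustrative assumptions: raise_frac is the
    fraction of the original glyph height by which the bottom of the
    superscript sits above the baseline.
    """
    width = box.right - box.left
    height = box.top - box.bottom
    new_width = round(width * scale)
    new_height = round(height * scale)
    new_bottom = baseline + round(height * raise_frac)
    return Box(box.char, box.left, new_bottom,
               box.left + new_width, new_bottom + new_height, box.page)


if __name__ == "__main__":
    normal = Box("1", 100, 200, 120, 240)
    print(normal.to_line())
    print(superscript_box(normal, baseline=200).to_line())
```

The point is just that a superscripted digit ends up with a noticeably different height and vertical offset from a normal one, which is the positional information I'm hoping the training will pick up on.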
I wanted to check if anyone else has encountered issues with
superscripted words or characters, and if anybody has tried any
techniques to improve recognition. Conversely, if superscripted
words and characters generally work perfectly for you, that's useful
to know too.
Thanks in advance,
Nick