Improving recognition of superscripted numbers

Nick White

May 1, 2014, 2:36:37 PM
to tesser...@googlegroups.com
Hi all,

I noticed recently that my training doesn't do a good job of detecting
superscripted numbers (which occur frequently in the texts I'm
working with, to point to footnotes). They're often misrecognised as
speech marks (e.g. ”).

They will always be difficult, as they're small, and (particularly
with the old books I'm working with) not very clearly printed
compared to the surrounding text. However, I suspect Tesseract can do
a better job.

My current plan is to train variants of numbers that are
superscripted (smaller and above the baseline). This should help,
as Tesseract uses information on the location of a character on
the line to help identify it. If this works in testing, I'll add a
mode to the text2image tool to enable superscripted rendering of
selected characters.
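
To make "smaller and above the baseline" concrete, here is a rough
Python sketch of the geometry involved. It is not how text2image works
and not a proposed patch. The Box class, the helper names, and the
scale and raise values are all assumptions chosen for illustration.
The only thing taken from Tesseract is the box file line layout
(symbol, left, bottom, right, top, page, with the origin at the
bottom-left of the image).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Glyph bounding box in pixels, origin at the bottom-left of the page."""
    left: int
    bottom: int
    right: int
    top: int


def superscript_box(box: Box, scale: float = 0.6, raise_frac: float = 0.5) -> Box:
    """Shrink a baseline glyph box and lift it above the baseline.

    scale and raise_frac are illustrative guesses, not measured values.
    """
    width = box.right - box.left
    height = box.top - box.bottom
    new_width = round(width * scale)
    new_height = round(height * scale)
    # Lift the bottom edge by a fraction of the original glyph height.
    new_bottom = box.bottom + round(height * raise_frac)
    return Box(box.left, new_bottom, box.left + new_width, new_bottom + new_height)


def box_line(char: str, box: Box, page: int = 0) -> str:
    """Format one line in the 'symbol left bottom right top page' layout."""
    return f"{char} {box.left} {box.bottom} {box.right} {box.top} {page}"


normal = Box(left=100, bottom=200, right=120, top=240)
raised = superscript_box(normal)
print(box_line("3", normal))
print(box_line("3", raised))
```

The point is only that the superscripted variant of a digit ends up
with a noticeably different size and vertical position from the
normal one, which is the positional information I'm hoping the
training can pick up on.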

I wanted to check if anyone else has encountered issues with
superscripted words or characters, and if anybody has tried any
techniques to improve recognition. Conversely, if superscripted
words and characters generally work perfectly for you, that's useful
to know too.

Thanks in advance,

Nick