Train Tesseract with sample word images


Raj Julha

Jul 6, 2011, 11:42:37 AM
to tesseract-ocr
Hi

I'm planning to train Tesseract on handwritten text, mainly from
historical documents. Because the handwriting is cursive, it is
difficult to isolate single characters, so I was planning to create
images of whole words and use a list of words as the training source.
Alternatively, I could create a text file containing the transcription
of the handwriting together with the coordinates of each word on the
image. Can I use that as input for Tesseract training? I'm mainly
interested in using the command-line version.

Cheers

Raj

Dmitri Silaev

Jul 6, 2011, 2:31:21 PM
to tesser...@googlegroups.com
Yes, this is possible, at least in theory. In box files you can map
arbitrary glyphs to character sequences. However, chances are high
that you'll run into accuracy problems. Two come to mind right now.
First, although Tesseract tolerates some glyph variation, the
variation in handwritten text can be quite large. Second, Tesseract
internally scales every glyph (this is called normalization), so many
word glyphs that look obviously different to the human eye can be
recognized as the same word. For the same reason, Tess may confuse
word glyphs if their lengths vary a lot and some words are very long.
What counts as "vary a lot" and "very long" has to be determined
experimentally, though.
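
To make the box-file idea concrete, here is a minimal Python sketch that turns a list of word transcriptions plus coordinates into box-file lines. It assumes the Tesseract 3.x box format of "text left bottom right top page" with the origin at the bottom-left corner of the image, and it assumes your word coordinates are measured from the top-left corner, as most image tools report them. The WordBox type, the to_box_lines helper and the sample words are made up for illustration; they are not part of Tesseract.

```python
from typing import List, NamedTuple


class WordBox(NamedTuple):
    """One transcribed word with pixel coordinates (origin at top-left)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def to_box_lines(words: List[WordBox], image_height: int, page: int = 0) -> List[str]:
    """Convert top-left-origin word boxes into box-file lines.

    Assumed line layout: <text> <left> <bottom> <right> <top> <page>,
    with y measured upward from the bottom edge of the image.
    """
    lines = []
    for w in words:
        # Fields in a box line are space-separated, so the text itself
        # must not be empty or contain whitespace.
        if not w.text or any(ch.isspace() for ch in w.text):
            raise ValueError(f"invalid box text: {w.text!r}")
        box_bottom = image_height - w.bottom  # flip the y axis
        box_top = image_height - w.top
        lines.append(f"{w.text} {w.left} {box_bottom} {w.right} {box_top} {page}")
    return lines


# Hypothetical sample: two words on an image that is 600 pixels high.
words = [
    WordBox("Dear", left=40, top=30, right=180, bottom=90),
    WordBox("Sir", left=200, top=35, right=290, bottom=88),
]
box_lines = to_box_lines(words, image_height=600)
print("\n".join(box_lines))
```

Writing these lines to a .box file next to the matching image gives the training tools one box per word instead of one per character. Whether the resulting accuracy is usable is exactly the open question above.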

BTW I suppose you mean that your historic documents use a connected
script, as not all cursive is necessarily connected, see
http://en.wikipedia.org/wiki/Cursive. With letters that are only
sloppy but not connected, the problem is much easier, and imho it
makes sense to spend some time devising a good segmentation algorithm
and pre- and post-processing logic so that you can use Tess in the
more traditional, character-by-character way.
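
As a starting point for such a segmentation step, here is a rough Python sketch of a column-projection split: it looks for runs of blank columns in a binarized line image and cuts there. This is only one simple approach and not what Tesseract does internally. The split_columns function, the min_gap parameter and the tiny bitmap are all made up for illustration, and the bitmap is assumed to be a list of rows with 1 for ink and 0 for background.

```python
from typing import List, Sequence, Tuple


def split_columns(bitmap: Sequence[Sequence[int]], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges that contain ink, end exclusive.

    A segment is closed once at least `min_gap` consecutive blank
    columns follow it; shorter gaps stay inside the segment.
    """
    if not bitmap:
        return []
    width = len(bitmap[0])
    # A column has ink if any row has a non-zero pixel in it.
    ink = [any(row[x] for row in bitmap) for x in range(width)]

    segments = []
    start = None  # first column of the segment being built
    gap = 0       # blank columns seen since the last ink column
    for x, has_ink in enumerate(ink):
        if has_ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                segments.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close the last segment, dropping any trailing blank columns.
        segments.append((start, width - gap))
    return segments


# Hypothetical 2-row bitmap with three ink runs separated by blank columns.
bitmap = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
]
print(split_columns(bitmap, min_gap=1))
print(split_columns(bitmap, min_gap=2))
```

A larger min_gap keeps glyphs with small internal gaps together, at the cost of merging letters that sit close to each other, so the value has to be tuned on your own scans.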

HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com

Raj Julha

Jul 7, 2011, 2:03:46 AM
to tesseract-ocr
Thanks for your input Dmitri.

Raj
