Ray,
can you explain what you mean by skipping the text line and word finding,
i.e. how to enable or disable this correctly in Tesseract?
I've had mixed results with the standard Tesseract 2.03 (Debian,
default options) on mathematical documents. Most sentences with simple
formulas or isolated mathematical symbols can be read reasonably well
after training on some sample pages, but displayed equations and
formulas (i.e. on their own line(s)) are usually completely garbled.
Even moderately simple symbols that carry both a superscript and a
subscript usually cannot be recognized at all. Also, having both
superscripts and subscripts somewhere in a single formula can confuse
Tesseract into assigning the superscript to the previous line or to an
"extra" line in between. I've also observed that the same symbol is
sometimes recognized easily in a subscript position, but is often
misrecognized in a superscript position.
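To check that I understand the "connected components" idea from your
message, here is a minimal sketch in plain Python of what I imagine it
means: find each blob of touching ink pixels and cut it out by its
bounding box, with no notion of text lines or words. This is not
Tesseract code. The names connected_components, crop and the toy page
are hypothetical, and the step that hands each cropped blob to the
classifier is left out because I don't know the right entry point.

```python
from collections import deque


def connected_components(bitmap):
    """Return bounding boxes (top, left, bottom, right) of the
    8-connected groups of foreground pixels (value 1)."""
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def crop(bitmap, box):
    """Cut the sub-image for one bounding box (inclusive coordinates)."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in bitmap[top:bottom + 1]]


# Toy "page": a base symbol with a superscript-like blob above and to
# the right, and a subscript-like blob below and to the right.
page = [
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]

boxes = connected_components(page)
glyphs = [crop(page, box) for box in boxes]  # one small image per blob
```

Each entry in glyphs would then be classified on its own, and the
boxes would be used afterwards to decide which blobs are superscripts
or subscripts of which base symbol.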
lab.
On Dec 12, 8:51 am, "Ray Smith" <theraysm...@gmail.com> wrote:
> This problem has not been attempted before with tesseract.
> The biggest thing to watch out for is to skip the text line and word
> finding. You might have significant success just running the classifier on
> the connected components.
> Training might be a bit tricky too, since it relies on the text line finder.
> Ray.