Mathematical Formulae recognition

jean

Dec 11, 2008, 1:45:27 AM
to tesseract-ocr
Hi,

I'm interested in developing an OCR engine to read math formulas, using
Tesseract as my platform. I have been trying to use Tesseract to read
image files rendered from LaTeX. For the square root of x+2, Tesseract
returned "vx+2". For sqrt(x + sqrt(2)), Tesseract returned "J@". No big
surprise, since Tesseract wasn't made for understanding the recursive
nature of math formulas.

So my question is: what progress has been made on a Tesseract-based
math OCR? And is there anything I need to watch out for?

--Jean
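
For reference, the two test expressions described above would look roughly like this in LaTeX. The exact source used to render the images is not given in the thread, so this is an assumed reconstruction:

```latex
\[
  \sqrt{x+2}
  \qquad
  \sqrt{x+\sqrt{2}}
\]
```

The second expression nests one radical inside another, which is the recursive structure mentioned above.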

74yrs old

Dec 11, 2008, 2:37:48 AM
to tesser...@googlegroups.com
Have you trained Tesseract on the maths formulas?

jean

Dec 11, 2008, 12:33:42 PM
to tesseract-ocr
Not yet. Has that been done successfully?

On Dec 10, 11:37 pm, "74yrs old" <withblessi...@gmail.com> wrote:
> Have you trained Tesseract on the maths formulas?
>

74yrs old

Dec 11, 2008, 12:51:07 PM
to tesser...@googlegroups.com
If you train on the maths formulas properly, following the training instructions in the wiki, I am sure you will succeed.

Ray Smith

Dec 11, 2008, 5:51:29 PM
to tesser...@googlegroups.com

This problem has not been attempted before with tesseract.
The biggest thing to watch out for is to skip the text line and word finding. You might have significant success just running the classifier on the connected components.
Training might be a bit tricky too, since it relies on the text line finder.
Ray.

Sent from my G1 Android Phone.
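
The idea of running a classifier directly on connected components can be sketched outside Tesseract. The following is a minimal, standalone Python illustration, not Tesseract code: `connected_components`, `classify_blobs`, and the `classify` callback are hypothetical names, and the image is assumed to be a binary grid given as a list of rows with 1 for ink.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (left, top, right, bottom) of the 8-connected
    ink regions in a binary image given as a list of rows (1 = ink)."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from this unvisited ink pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


def classify_blobs(image, classify):
    """Apply a caller-supplied classifier to each blob's bounding-box crop,
    with no text line or word grouping in between."""
    results = []
    for left, top, right, bottom in connected_components(image):
        # The crop is the full bounding box, so it may include pixels of
        # other blobs that overlap the box (e.g. symbols under a radical).
        crop = [row[left:right + 1] for row in image[top:bottom + 1]]
        results.append(((left, top, right, bottom), classify(crop)))
    return results
```

Each blob is classified on its own here, so nothing depends on a baseline or on neighbouring symbols; a real experiment would pass a trained character classifier as `classify`.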

Hussein

Dec 11, 2008, 6:56:05 PM
to tesser...@googlegroups.com
As a segmentation person myself, I would handle this in the preprocessing stage: recognize the square root sign's connected component (rule-based), remove it, and plug in its place a special, rarely used component. In postprocessing, that component is then translated back to the square root sign.

Hussein Al-Hussein
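
A rough Python sketch of this placeholder approach follows. The detection rule (a wide component whose bounding box fully contains another component) and all names, including `PLACEHOLDER`, `looks_like_radical`, and the `min_aspect` threshold, are assumptions made for illustration; the thread does not specify the actual rule.

```python
# Hypothetical stand-in character, assumed not to occur in the real input.
PLACEHOLDER = "\u00a7"


def looks_like_radical(box, others, min_aspect=1.5):
    """Assumed heuristic: a radical sign is a wide component whose bounding
    box fully contains at least one other component (the radicand).
    Boxes are (left, top, right, bottom), inclusive."""
    left, top, right, bottom = box
    width = right - left + 1
    height = bottom - top + 1
    if width < min_aspect * height:
        return False
    return any(l >= left and t >= top and r <= right and b <= bottom
               for (l, t, r, b) in others)


def tag_radicals(boxes):
    """Pair each box with True if the rule flags it as a radical sign."""
    tagged = []
    for i, box in enumerate(boxes):
        others = boxes[:i] + boxes[i + 1:]
        tagged.append((box, looks_like_radical(box, others)))
    return tagged


def restore_radicals(text, replacement="\u221a"):
    """Postprocessing step: map the placeholder back to a square root sign."""
    return text.replace(PLACEHOLDER, replacement)
```

In use, components flagged by `tag_radicals` would be removed from the image and replaced with the placeholder glyph before recognition, and `restore_radicals` would be applied to the recognized text afterwards.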


Date: Thu, 11 Dec 2008 14:51:29 -0800
From: thera...@gmail.com
To: tesser...@googlegroups.com
Subject: Re: Mathematical Formulae recognition

74yrs old

Dec 11, 2008, 10:11:12 PM
to tesser...@googlegroups.com
Please upload the maths formula for reference.
I shall try with it.

lab

Dec 12, 2008, 4:12:56 AM
to tesseract-ocr
Ray,

Can you explain what you mean by skipping text line and word finding,
i.e. how to enable or disable this correctly in Tesseract?

I've had mixed results with the standard Tesseract 2.03 (Debian,
default options) on mathematical documents. Most sentences with simple
formulas or isolated mathematical symbols can be read reasonably well
after training on some sample pages, but displayed equations and formulas
(i.e. on their own line(s)) are usually completely garbled. Moderately
simple symbols with both a superscript and a subscript usually cannot
be recognized at all. Also, having both superscripts and subscripts
somewhere in a single formula can confuse Tesseract so that it thinks
the superscript belongs to the previous line or to an "extra" line in
between. I've also observed that the same symbol is sometimes
recognized easily when it occurs in a subscript position, but is often
misrecognized when it occurs in a superscript position.

lab.

On Dec 12, 8:51 am, "Ray Smith" <theraysm...@gmail.com> wrote:
> This problem has not been attempted before with tesseract.
> The biggest thing to watch out for is to skip the text line and word
> finding. You might have significant success just running the classifier on
> the connected components.
> Training might be a bit tricky too, since it relies on the text line finder.
> Ray.
>
> Sent from my G1 Android Phone.
>

Ray Smith

Dec 16, 2008, 5:34:21 PM
to tesser...@googlegroups.com
You would need to cut out most of the code in the textord directory, and just run the classifier directly on the blobs, with the baseline correction feature disabled.

This means:
Bypass filter_blobs and textord_page in edges_and_textord, making fake words and text lines from individual blobs. The code in applybox.cpp might give you some idea of how to do this.
Set IntegerMatcherMultiplier to zero.

Ray.
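
The "fake words and text lines from individual blobs" step can be pictured with a small data-structure sketch. This is plain Python for illustration only: `FakeWord`, `FakeLine`, and `blobs_to_fake_lines` are hypothetical names and do not correspond to Tesseract's internal classes or to the code in applybox.cpp.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


@dataclass
class FakeWord:
    """A 'word' that holds exactly one blob."""
    blobs: List[Box]


@dataclass
class FakeLine:
    """A 'text line' that holds exactly one single-blob word."""
    words: List[FakeWord] = field(default_factory=list)


def blobs_to_fake_lines(blobs: List[Box]) -> List[FakeLine]:
    """Wrap every blob in its own word and its own line, so that no real
    line finding or word segmentation is needed before classification.
    Blobs are ordered by left edge, then top edge (tuple sort order)."""
    return [FakeLine(words=[FakeWord(blobs=[box])]) for box in sorted(blobs)]
```

Because every line contains a single blob, there is no shared baseline to estimate, which fits the advice to disable baseline correction.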

jean

Dec 27, 2008, 2:14:50 PM
to tesseract-ocr
Thanks for all the help. I found that the square root sign is being cut
up (?) by the method rotate_cblob inside blobbox.cpp. Can someone
explain the purpose of rotating the blob? Is it to match this
extra-long blob with some character? I'm thinking about bypassing this
method for now...

Leopold Hamminger

Oct 31, 2019, 9:48:41 AM
to tesseract-ocr
Hi,

I came across this conversation regarding formulae. May I ask whether you have made any progress?

I need a solution for this as well. I am glad to cooperate in testing, etc.

Greetings,
Leo Hamminger