math formulas

albert

Aug 26, 2010, 11:27:21 AM

to tesseract-ocr

Hi,

I need an open OCR library which is able to scan complex printed math
formulas (for example some formulas which were generated via LaTeX). I
want to get some LaTeX-like output (or just some AST-like data).

Can Tesseract do this? Is there something like this already? Or are
current OCR techniques only able to parse line-oriented text?

Thanks,
Albert

Jimmy O'Regan

Aug 27, 2010, 5:53:56 AM

to tesser...@googlegroups.com

Tesseract does not do that. There's an open enhancement request that
might have more information:
http://code.google.com/p/tesseract-ocr/issues/detail?id=270

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

Albert Zeyer

Aug 27, 2010, 9:06:30 AM

to tesser...@googlegroups.com

Am 27.08.10 11:53, schrieb Jimmy O'Regan:

On 26 August 2010 16:27, albert <albert...@rwth-aachen.de> wrote:

Hi,

I need an open OCR library which is able to scan complex printed math
formulas (for example some formulas which were generated via LaTeX). I
want to get some LaTeX-like output (or just some AST-like data).

Can Tesseract do this? Is there something like this already? Or are
current OCR techniques only able to parse line-oriented text?

Tesseract does not do that. There's an open enhancement request that
might have more information:
http://code.google.com/p/tesseract-ocr/issues/detail?id=270

Ah, but I am asking for more than just being able to scan math symbols. I want to have support to scan full formulas which can be quite complex. A combination of \frac, \int, \sum, etc. It must not only detect the symbols, it must also see how they belong together (for example the numerator and the denominator in a fraction).

Is it possible to extend Tesseract to be able to do this or is some heavy redesign of the whole engine needed (and some fundamentally different techniques) to do this?
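
For illustration, a formula of the kind meant here might look like the following LaTeX. This is a made-up example, not taken from any particular document. Recognising it means recovering the nesting: the sum sits inside the numerator of the fraction, and the whole fraction sits inside the integral.

```latex
\[
  \int_{0}^{1} \frac{\sum_{k=1}^{n} k\,x^{k}}{1 + x^{2}} \, dx
\]
```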

lab

Aug 28, 2010, 5:43:09 PM

to tesseract-ocr

Hi Albert,

Tesseract cannot read display formulas; its fundamental model is only
linear. Unless and until that changes, the best you can hope for is
recognizing symbols in text, and you will have to watch out for problems
with superscripts and subscripts.

There is a project which claims to have that capability (here:
http://www.inftyproject.org/en/software.html#InftyReader), but it isn't
Free Software and only runs on Windows machines, and I haven't any
personal experience with it. Caveat emptor.

Cheers,
Laird Breyer


Jimmy O'Regan

Aug 28, 2010, 7:09:32 PM

to tesser...@googlegroups.com

On 27 August 2010 14:06, Albert Zeyer <albert...@rwth-aachen.de> wrote:
> Am 27.08.10 11:53, schrieb Jimmy O'Regan:
>
> On 26 August 2010 16:27, albert <albert...@rwth-aachen.de> wrote:
>

[I don't know what e-mail client you're using, but it's completely
useless at quoting text]

>
> Ah, but I am asking for more than just being able to scan math symbols. I want
> to have support to scan full formulas which can be quite complex. A
> combination of \frac, \int, \sum, etc.

I realise that. If I had thought you wanted to recognise individual
symbols, I would have told you to retrain for those characters.

As it is, I pointed you to the enhancement request, which, as you seem
to not have read it, has some - admittedly, not much - extra
information on the topic.

> It must not only detect the symbols,
> it must also see how they belong together (for example the numerator and the
> denominator in a fraction).
>
> Is it possible to extend Tesseract to be able to do this or is some heavy
> redesign of the whole engine needed (and some fundamentally different
> techniques) to do this?
>

The only current system available for maths recognition - the link is
in the enhancement request - contains its maths recognition as a
separate engine. I don't think that's strictly necessary, but maths
would need to be processed in an entirely different way, and a formula
detection mechanism would be required to ensure it is handled in a
different way. At the very least, the formula would need to be
segmented into a grid, because relative position and size is much more
significant than in text - not just in detecting
superscripts/subscripts, but also in determining if pi means pi or
product, etc.
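
To make the point about relative position and size concrete, here is a minimal Python sketch. It is not Tesseract code: the `Box` type, the `relation` function and both thresholds are assumptions made up for illustration, and a real system would need to calibrate them per font and handle many more cases (limits on sums, fractions, roots).

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def center_y(self) -> float:
        return self.y + self.h / 2.0


def relation(base: Box, other: Box,
             size_ratio: float = 0.8, shift: float = 0.25) -> str:
    """Classify `other` relative to `base`.

    Both thresholds are illustrative guesses, not measured values:
    - `size_ratio`: `other` must be at most this fraction of the base height
    - `shift`: vertical offset of the centres, in units of the base height
    """
    smaller = other.h <= size_ratio * base.h
    offset = (other.center_y - base.center_y) / base.h
    if smaller and offset <= -shift:
        return "superscript"  # smaller and raised
    if smaller and offset >= shift:
        return "subscript"    # smaller and lowered
    return "inline"


base = Box(x=0, y=10, w=10, h=20)
print(relation(base, Box(x=10, y=5, w=6, h=10)))    # raised, smaller glyph
print(relation(base, Box(x=10, y=25, w=6, h=10)))   # lowered, smaller glyph
print(relation(base, Box(x=10, y=10, w=10, h=20)))  # same size, same line
```

The same kind of test, applied to the size of a pi glyph relative to its neighbours, is one way the "pi or product" question could be approached.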

Albert Zeyer

Aug 28, 2010, 7:46:41 PM

to tesser...@googlegroups.com

Hi Laird, hi Jimmy,

Thanks for your answers.

lab wrote:
> Tesseract cannot read display formulas, its fundamental model is only
> linear.
> Unless and until that changes, the best you can hope for is
> recognizing symbols
> in text, and you will have to watch out for problems with superscripts
> and subscripts.

That is what I thought.

> There is a project which claims to have that capability (here:
> http://www.inftyproject.org/en/software.html#InftyReader) but it isn't
> Free Software and only runs on Windows machines,
> and I haven't any personal experience with it. Caveat emptor.

Several people have linked that project now. However, I don't have
Windows to even test it, and I am searching esp. for a free/open and
cross-platform solution which I can use in my own projects.

Jimmy O'Regan wrote:
> [I don't know what e-mail client you're using, but it's completely
> useless at quoting text]

[Yea I know, that's Thunderbird...]

> As it is, I pointed you to the enhancement request, which, as you seem
> to not have read it, has some - admittedly, not much - extra
> information on the topic.

Ah sorry, I misunderstood the request. Its description is just about
symbols, which is why I thought the request was only about symbols.
The original poster only added in an additional comment that this
request may be extended to full formula recognition -- which is in my
eyes a very different request, so it would have fit better into another,
separate request.

Also, apart from the link to the Inftyproject and a comment that it is not
open source, the rest of the discussion is just about symbol recognition
(and esp. about how Detexify works).

>> Is it possible to extend Tesseract to be able to do this or is some heavy
>> redesign of the whole engine needed (and some fundamentally different
>> techniques) to do this?

> The only current system available for maths recognition - the link is
> in the enhancement request - contains its maths recognition as a
> separate engine. I don't think that's strictly necessary, but maths
> would need to be processed in an entirely different way, and a formula
> detection mechanism would be required to ensure it is handled in a
> different way. At the very least, the formula would need to be
> segmented into a grid, because relative position and size is much more
> significant than in text - not just in detecting
> superscripts/subscripts, but also in determining if pi means pi or
> product, etc.

Thanks for this evaluation.

I will see what I can do. Maybe I will try to play around with this a
bit myself. It was only for a small side project anyway, so I am
not sure yet how much time I want to invest in this. I will let you
know if I have something interesting for you.

Cya,
Albert

Jimmy O'Regan

Aug 29, 2010, 9:18:27 PM

to tesser...@googlegroups.com

On 29 August 2010 00:46, Albert Zeyer <albert...@rwth-aachen.de> wrote:

> Jimmy O'Regan wrote:
>> As it is, I pointed you to the enhancement request, which, as you seem
>> to not have read it, has some - admittedly, not much - extra
>> information on the topic.
>
> Ah sorry, I misunderstood the request. Its description is just about symbols,
> which is why I thought the request was only about symbols. The
> original poster only added in an additional comment that this request may be
> extended to full formula recognition -- which is in my eyes a very different
> request, so it would have fit better into another, separate request.
>
> Also, apart from the link to the Inftyproject and a comment that it is not open
> source, the rest of the discussion is just about symbol recognition (and
> esp. about how Detexify works).
>

Well, like I said, not much information, but the issue tracker is a
good place to keep notes, and there is a link from the Infty site to
papers on various aspects of their system.

>>> Is it possible to extend Tesseract to be able to do this or is some heavy
>>> redesign of the whole engine needed (and some fundamentally different
>>> techniques) to do this?
>>
>> The only current system available for maths recognition - the link is
>> in the enhancement request - contains its maths recognition as a
>> separate engine. I don't think that's strictly necessary, but maths
>> would need to be processed in an entirely different way, and a formula
>> detection mechanism would be required to ensure it is handled in a
>> different way. At the very least, the formula would need to be
>> segmented into a grid, because relative position and size is much more
>> significant than in text - not just in detecting
>> superscripts/subscripts, but also in determining if pi means pi or
>> product, etc.
>
> Thanks for this evaluation.
>
> I will see what I can do. Maybe I will try to play around with this a bit
> myself. It was only for a small side project anyway, so I am not sure
> yet how much time I want to invest in this. I will let you know if I have
> something interesting for you.

The only paper I've read on formula detection was rather dated, and
amounted basically to looking for a distribution of numbers,
individual letters, and math-like symbols (and that would erroneously
consider everything that looks like '(1)' to be maths).
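
A rough Python sketch of that kind of density heuristic, to show why it misfires. This is a reconstruction for illustration only, not the method from the paper: the symbol set, the tokenisation and the 0.6 threshold are all assumptions.

```python
import re

# Illustrative set of "math-like" characters; an assumption, not from the paper.
MATH_SYMBOLS = set("+-=*/^<>()[]{}|")


def math_like_ratio(line: str) -> float:
    """Fraction of tokens that are numbers, single letters or math symbols."""
    # Tokens: runs of digits, runs of ASCII letters, or any other single character.
    tokens = re.findall(r"\d+|[A-Za-z]+|\S", line)
    if not tokens:
        return 0.0

    def is_math_like(tok: str) -> bool:
        single = len(tok) == 1
        return tok.isdigit() or (single and (tok.isalpha() or tok in MATH_SYMBOLS))

    return sum(is_math_like(t) for t in tokens) / len(tokens)


def looks_like_math(line: str, threshold: float = 0.6) -> bool:
    """Flag a line as maths when enough of its tokens are math-like."""
    return math_like_ratio(line) >= threshold


print(looks_like_math("x = y + 2"))            # a real formula
print(looks_like_math("The quick brown fox"))  # plain text
print(looks_like_math("(1)"))                  # an equation number or list label
```

A bare `(1)` consists entirely of math-like tokens, so it is flagged even though it is only a label.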

Look at the tab detection code - if you have a bunch of math-like
symbols *and* a set of vertically aligned equals signs, then you
probably have a proof; if you have large [] containing aligned
math-likes, you probably have a matrix. You'd have something that's
surely more reliable than checking for a bunch of random numbers and
letters, plus you can turn a large task into a set of small tasks.
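
The aligned-equals-signs idea could be sketched like this in Python. It does not use Tesseract's tab detection code; it assumes the symbols have already been recognised and assigned to text rows, and the `has_aligned_equals` name, the input tuple layout, the pixel tolerance and the minimum row count are all hypothetical.

```python
from typing import Sequence, Tuple

# (character, x of left edge in pixels, index of the text row); assumed input format.
Symbol = Tuple[str, float, int]


def has_aligned_equals(symbols: Sequence[Symbol],
                       tolerance: float = 3.0,
                       min_rows: int = 3) -> bool:
    """Return True if '=' appears at roughly the same x in at least `min_rows` rows."""
    equals = sorted((x, row) for ch, x, row in symbols if ch == "=")
    best = 0
    for i, (x0, _) in enumerate(equals):
        # Distinct rows with an '=' whose x lies within `tolerance` of x0.
        rows = {row for x, row in equals[i:] if x - x0 <= tolerance}
        best = max(best, len(rows))
    return best >= min_rows


derivation = [("x", 10.0, 0), ("=", 50.0, 0),
              ("=", 51.0, 1),
              ("=", 49.5, 2)]
scattered = [("=", 10.0, 0), ("=", 50.0, 1), ("=", 90.0, 2)]

print(has_aligned_equals(derivation))  # three rows share an '=' column
print(has_aligned_equals(scattered))   # no common column
```

A region that passes this check can then be handed to a dedicated formula recogniser row by row, which is the "set of small tasks" mentioned above.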
