On 26 August 2010 16:27, albert <albert...@rwth-aachen.de> wrote:
> Hi, I need an open OCR library which is able to scan complex printed math
> formulas (for example some formulas which were generated via LaTeX). I want
> to get some LaTeX-like output (or just some AST-like data). Can Tesseract do
> this? Is there something like this already? Or are current OCR techniques
> only able to parse line-oriented text?
Tesseract does not do that. There's an open enhancement request that
might have more information:
http://code.google.com/p/tesseract-ocr/issues/detail?id=270
[I don't know what e-mail client you're using, but it's completely
useless at quoting text]
>
> Ah, but I am asking for more than just being able to scan math symbols. I want
> to have support to scan full formulas which can be quite complex. A
> combination of \frac, \int, \sum, etc.
I realise that. If I had thought you wanted to recognise individual
symbols, I would have told you to retrain for those characters.
As it is, I pointed you to the enhancement request, which, as you seem
to not have read it, has some - admittedly, not much - extra
information on the topic.
> It must not only detect the symbols,
> it must also see how they belong together (for example the numerator and the
> denominator in a fraction).
>
> Is it possible to extend Tesseract to be able to do this or is some heavy
> redesign of the whole engine needed (and some fundamentally different
> techniques) to do this?
>
The only current system available for maths recognition - the link is
in the enhancement request - contains its maths recognition as a
separate engine. I don't think that's strictly necessary, but maths
would need to be processed in an entirely different way, and a formula
detection mechanism would be required to ensure it is handled in a
different way. At the very least, the formula would need to be
segmented into a grid, because relative position and size are much more
significant than in text - not just in detecting
superscripts/subscripts, but also in determining if pi means pi or
product, etc.
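To make the point about relative position and size concrete, here is a minimal sketch of how a symbol's role might be guessed from its bounding box alone. Everything in it is an assumption for illustration: the `Box` type, the function names and the thresholds are hypothetical, and none of it is Tesseract code or an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of one symbol; y grows downward, as in image coordinates."""
    x: float
    y: float  # top edge
    w: float
    h: float

    @property
    def cy(self) -> float:
        """Vertical centre of the box."""
        return self.y + self.h / 2


def relative_role(base: Box, other: Box,
                  size_ratio: float = 0.8, shift: float = 0.25) -> str:
    """Guess the role of `other` relative to `base` (thresholds are invented)."""
    smaller = other.h < size_ratio * base.h
    # Centre offset in units of the base height; negative means higher up.
    offset = (other.cy - base.cy) / base.h
    if smaller and offset < -shift:
        return "superscript"
    if smaller and offset > shift:
        return "subscript"
    return "inline"


def pi_or_product(pi_box: Box, typical_height: float, big: float = 1.5) -> str:
    """Treat a pi-shaped glyph much taller than its neighbours as a product sign."""
    return "product" if pi_box.h > big * typical_height else "pi"


base = Box(x=0, y=10, w=10, h=20)
print(relative_role(base, Box(x=11, y=5, w=5, h=10)))
print(relative_role(base, Box(x=11, y=25, w=5, h=10)))
print(pi_or_product(Box(x=0, y=0, w=20, h=40), typical_height=20))
```

A real system would need far more context than one neighbouring box (fractions, limits on sums and integrals, nested scripts), which is why the layout has to be analysed in two dimensions rather than line by line.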
Thanks for your answers.
lab wrote:
> Tesseract cannot read display formulas, its fundamental model is only
> linear.
> Unless and until that changes, the best you can hope for is
> recognizing symbols in text, and you will have to watch out for
> problems with superscripts and subscripts.
That is what I thought.
> There is a project which claims to have that capability (here:
> http://www.inftyproject.org/en/software.html#InftyReader) but it isn't
> Free Software and only runs on Windows machines,
> and I haven't any personal experience with it. Caveat emptor.
Several people have linked that project now. However, I don't have
Windows to even test it, and I am searching especially for a free/open,
cross-platform solution which I can use in my own projects.
Jimmy O'Regan wrote:
> [I don't know what e-mail client you're using, but it's completely
> useless at quoting text]
[Yea I know, that's Thunderbird...]
> As it is, I pointed you to the enhancement request, which, as you seem
> to not have read it, has some - admittedly, not much - extra
> information on the topic.
Ah sorry, I misunderstood the request. Its description is just
about the symbols, which is why I thought the request was about symbols.
The original poster only added in an additional comment that this
request may be extended to full formula recognition -- which is in my
eyes a very different request, so it would have fit better into another,
separate request.
Also, apart from the link to the Infty project and a comment that it is
not open source, the rest of the discussion is just about symbol
recognition (and especially about how Detexify works).
>> Is it possible to extend Tesseract to be able to do this or is some heavy
>> redesign of the whole engine needed (and some fundamentally different
>> techniques) to do this?
> The only current system available for maths recognition - the link is
> in the enhancement request - contains its maths recognition as a
> separate engine. I don't think that's strictly necessary, but maths
> would need to be processed in an entirely different way, and a formula
> detection mechanism would be required to ensure it is handled in a
> different way. At the very least, the formula would need to be
> segmented into a grid, because relative position and size are much more
> significant than in text - not just in detecting
> superscripts/subscripts, but also in determining if pi means pi or
> product, etc.
Thanks for this evaluation.
I will see what I can do. Maybe I will try to play around with this a
bit myself. It is only a small side project for me anyway, so I am
not sure yet how much time I want to invest in this. I will let you
know if I have something interesting for you.
Cya,
Albert
Well, like I said, not much information, but the issue tracker is a
good place to keep notes, and there is a link from the Infty site to
papers on various aspects of their system.
>>> Is it possible to extend Tesseract to be able to do this or is some heavy
>>> redesign of the whole engine needed (and some fundamentally different
>>> techniques) to do this?
>>
>> The only current system available for maths recognition - the link is
>> in the enhancement request - contains its maths recognition as a
>> separate engine. I don't think that's strictly necessary, but maths
>> would need to be processed in an entirely different way, and a formula
>> detection mechanism would be required to ensure it is handled in a
>> different way. At the very least, the formula would need to be
>> segmented into a grid, because relative position and size are much more
>> significant than in text - not just in detecting
>> superscripts/subscripts, but also in determining if pi means pi or
>> product, etc.
>
> Thanks for this evaluation.
>
> I will see what I can do. Maybe I will try to play around with this a bit
> myself. It is only a small side project for me anyway, so I am not sure
> yet how much time I want to invest in this. I will let you know if I have
> something interesting for you.
The only paper I've read on formula detection was rather dated, and
amounted basically to looking for a distribution of numbers,
individual letters, and math-like symbols (and that would erroneously
consider everything that looks like '(1)' to be maths).
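For illustration, a rough sketch of that kind of token-distribution heuristic follows. It is a reconstruction under assumptions, not the method from the paper: the symbol set, the tokenisation and the 0.6 threshold are all invented here. It does reproduce the '(1)' false positive described above.

```python
import re

# Characters treated as "math-like"; this set is an arbitrary choice.
MATH_SYMBOLS = set("+-=*/^<>()[]{}|_")


def math_score(line: str) -> float:
    """Fraction of tokens that are numbers, single letters or math-like symbols."""
    # Words, digit runs, and any other single non-space character.
    tokens = re.findall(r"[A-Za-z]+|\d+|\S", line)
    if not tokens:
        return 0.0

    def mathy(tok: str) -> bool:
        single_letter = len(tok) == 1 and tok.isalpha()
        return tok.isdigit() or single_letter or tok in MATH_SYMBOLS

    return sum(mathy(t) for t in tokens) / len(tokens)


def looks_like_math(line: str, threshold: float = 0.6) -> bool:
    """Flag a line as a formula when most of its tokens look mathematical."""
    return math_score(line) >= threshold


for sample in ["x = 2 + y", "The cat sat on the mat.", "(1)"]:
    print(repr(sample), looks_like_math(sample))
```

The last sample is an ordinary list label or equation number, yet every one of its three tokens counts as math-like, so a purely statistical test flags it as a formula.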
Look at the tab detection code - if you have a bunch of math-like
symbols *and* a set of vertically aligned equals signs, then you
probably have a proof; if you have large [] containing aligned
math-likes, you probably have a matrix. You'd have something that's
surely more reliable than checking for a bunch of random numbers and
letters, plus you can turn a large task into a set of small tasks.
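The aligned-equals idea could be sketched roughly as below. This does not use Tesseract's tab detection code; it assumes you already have symbols with a left-edge x coordinate and a text-line index, and the `Symbol` type, the tolerance and the minimum line count are hypothetical choices.

```python
from typing import NamedTuple


class Symbol(NamedTuple):
    text: str
    x: float   # left edge of the symbol's bounding box
    line: int  # index of the text line the symbol sits on


def aligned_equals_lines(symbols, tolerance=3.0, min_lines=3):
    """Return the lines whose '=' signs share (nearly) the same x position.

    Slides a window of width `tolerance` over the equals signs sorted by x
    and keeps the window covering the most distinct lines. Returns an empty
    list if fewer than `min_lines` lines are aligned.
    """
    equals = sorted((s for s in symbols if s.text == "="), key=lambda s: s.x)
    best = set()
    start = 0
    for end in range(len(equals)):
        # Shrink the window until it is at most `tolerance` wide.
        while equals[end].x - equals[start].x > tolerance:
            start += 1
        lines = {s.line for s in equals[start:end + 1]}
        if len(lines) > len(best):
            best = lines
    return sorted(best) if len(best) >= min_lines else []


page = [
    Symbol("=", 100.0, 0),
    Symbol("=", 101.0, 1),
    Symbol("=", 99.0, 2),
    Symbol("=", 40.0, 3),   # an equals sign that is not aligned with the rest
    Symbol("x", 100.0, 4),  # aligned, but not an equals sign
]
print(aligned_equals_lines(page))
```

Combined with a check that the same lines are dense in math-like symbols, a non-empty result would be a hint that the block is a multi-line derivation; a similar test on the contents of a large bracket pair could hint at a matrix.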