How difficult would it be to use OCR to scan some
mathematics (such as mathematics produced using
the LaTeX typesetting system) and have a LaTeX
source file output as a result of applying OCR
software on the scanned in mathematics?
What free and commercial OCR software
is currently available for this purpose?
Finally does anyone know whether
Google uses its own home-brewed
OCR software to scan patents
and other documents or do
they use publicly available
free or commercial software
for their purposes?
Thanks,
John Goche
One program that tries to recognize formulas in scanned images is
http://www.inftyproject.org/en/software.html#InftyReader
Regards,
Christian
> How difficult would it be to use OCR to scan some
> mathematics (such as mathematics produced using
> the LaTeX typesetting system) and have a LaTeX
> source file output as a result of applying OCR
> software on the scanned in mathematics?
Next to impossible. OCR recognizes the characters, but not their meaning.
What, then, does InftyReader
http://www.inftyproject.org/en/software.html#InftyReader
do?
>OCR recognizes the characters, but not their meaning.
Well, "meaning" is a somewhat vague term. It is "only" the syntactical
structure of the mathematical formula, given a raster image of it, that
needs to be recognized. The "meaning" of the formula is, IMHO, something
else again. But, certainly, even recognizing "only" the syntactical
structure of a formula, given a raster image of it, is a non-trivial
problem, to say the least.
Regards,
Christian
Never looked at it, but for my guess, see below.
>> OCR recognizes the characters, but not their meaning.
>
> Well, "meaning" is a somewhat vague term. It is "only" the syntactical
> structure of the mathematical formula, given a raster image of it, that
> needs to be recognized. The "meaning" of the formula is, IMHO, something
> else again. But, certainly, even recognizing "only" the syntactical
> structure of a formula, given a raster image of it, is a non-trivial
> problem, to say the least.
It shouldn't be so hard to identify, say, a boldface 'v' and have it
output \mathbf{v} or something like that. I suppose that's what the
InftyReader program does. But it cannot identify that that particular
symbol is referring to a vector, and that many other similar ones are
also vectors.
It also cannot identify common constructions in the particular field
used in the document. To give a simple example, if it sees something
like $A^T$ in the scan it can output that, but it cannot identify it as
a notation for the transpose. So it cannot output something like
$\transp{A}$ (where \transp is a new command defined to output the
transpose of a matrix).
So you should be able to get a visual representation of the formula, but
no semantics. This means that if you intend to edit it later, things
will be somewhat hard, especially if you want to change a notation
(using $A^t$ for all transposes, for example).
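
To make the difference concrete, here is a minimal sketch. It assumes
\transp is a user-defined macro, as described above (it is not a standard
LaTeX command), and the definition shown is only one plausible way to
write it:

```latex
\documentclass{article}
% \transp is the hypothetical semantic macro discussed above;
% it is not part of standard LaTeX.
\newcommand{\transp}[1]{#1^{T}}
% To switch every transpose to a lowercase t, only this line changes:
% \renewcommand{\transp}[1]{#1^{t}}
\begin{document}
Visual markup, as an OCR tool could plausibly emit it: $A^T$.

Semantic markup, as an author might have written it: $\transp{A}$.
\end{document}
```

Both lines print the same thing, but only the second can be renotated
by editing a single definition. With the first, every $A^T$ has to be
found and checked by hand, since some T superscripts may be exponents.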
Maybe you are underestimating InftyReader. Why not have a
look at the examples that are found at
http://www.inftyproject.org/en/demo.html#0002
It seems that InftyReader can correctly recognize such things as
sums (Greek sigma signs with summation index, lower and upper limit)
and the like: which means that it apparently can recognize some
of the syntactical structure of mathematical formulas.
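
As an illustration of what "syntactical structure" means for such a
sum, compare the two lines in the sketch below. This is a hand-written
example, not actual InftyReader output, and the symbols a, i and n are
chosen arbitrarily:

```latex
\documentclass{article}
\begin{document}
% Characters only, with the two-dimensional layout lost:
$\Sigma\; i = 1\; n\; a\; i$

% Characters plus syntactical structure
% (limits attached to the sum, index attached to the summand):
\[ \sum_{i=1}^{n} a_i \]
\end{document}
```

Getting from the first form to the second requires deciding which
characters are limits and which are subscripts, and that is the part
that goes beyond plain character recognition.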
> It cannot also identify common constructions in the particular field
> used in the document. To give a simple example, if it sees output
> something like $A^T$ it can output that, but it cannot identify that as
> a notation for the transpose.
IMHO you fall into the trap of confusing recognition of syntactical
structure with recognition of "meaning" here: If the program can
recognize T as a superscript character, and correctly typeset it in
LaTeX as such, that is all you can reasonably expect from it.
You cannot expect a mathematician to always grasp "the" (intended)
meaning of some formula (qua syntactical structure) from a mere
scan of the formula either (in some cases, he would have to examine
the surrounding text, maybe from page 1 to page 499 or so...)
> So it cannot output something like
> $\transp{A}$ (and \transp is a new command defined to output the
> transpose of a matrix).
That's agreed (though for those cases, where notation is sufficiently
standardized - which is not always the case among mathematicians,
you know - it would not seem to be a major problem to do some
further "recognition" of such standard "meanings" assigned to
standardized syntax, once the syntactical structure has been recovered
from a mere raster image of the formula). But certainly, the program
does not understand the overall text that assigns "meaning" to
possibly quite "nonstandard syntax".
> So you should be able to get a visual representation of the formula,
I thought that the original poster (John Goche) wanted basically just
that.
> but no semantics.
Well, yes, in a way: and that's just what I wrote. Recognition of
syntax is already a good thing. Recognition of "meaning" from
a mere local[!] scan of a formula is not even generally possible
for a mathematician.
>Which means that if you intend to later edit it things
> will be somewhat hard, especially if you want to change a notation
> (using $A^t$ for all transposes, for example).
I think you are expecting too much. If InftyReader can deliver what
it claims, it is already doing a great job indeed.
Regards,
Christian
> IMHO you fall into the trap of confusing recognition of syntactical
> structure with recognition of "meaning" here: If the program can
> recognize T as a superscript character, and correctly typeset it in
> LaTeX as such, that is all you can reasonably expect from it.
> You cannot expect a mathematician to always grasp "the" (intended)
> meaning of some formula (qua syntactical structure) from a mere
> scan of the formula either (in some cases, he would have to examine
> the surrounding text, maybe from page 1 to page 499 or so...)
Yes, but a language like LaTeX (or MathML), if correctly applied, can
actually represent that meaning. That is why it is next to impossible to
generate the full meaning of a LaTeX source from a scan of its output.
You would need some sort of expert system that replaces the
mathematician's brain.
The same is of course true for text structure. An OCR system may recognise
that a line is written in 14 pt sans serif bold extended, but that this
means "heading of second order" it has no way to grasp. LaTeX represents
that meaning as \section{}, and an experienced human reader of the output
would intuitively grasp it.
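
A small sketch of the two kinds of markup for such a heading follows.
The heading text "Results" and the font size are assumptions made up
for the illustration, and the hand-set section number only imitates
what the scan would show:

```latex
\documentclass{article}
\begin{document}
% Visual markup: roughly what a scan reveals (large bold sans serif text).
\noindent{\sffamily\bfseries\large 1\quad Results}\par

% Semantic markup: what the author most likely typed.
\section{Results}
\end{document}
```

The two lines can look much alike on paper, but only the second one
shows up in the table of contents, takes part in automatic numbering,
and follows a change of document class.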