I think that Tesseract, in order to be a successful project, must
be much clearer about what it is offering.
Now many people believe it is "an OCR program" that can function as
an alternative to commercial end user products. Some open source
software in other fields (especially OpenOffice and Firefox) can
meet such expectations. So it's natural that complete beginners
come to this list with basic questions about what a bitmap image
is. The commercial end user products would not bother their
customers with such details.
But today's Tesseract is much more like a subroutine library
that requires or at least assumes that its users are programmers.
The experts on this list are not really interested in explaining
what a bitmap image is. This mismatch comes from the failure to
explain what Tesseract is.
--
Lars Aronsson (la...@aronsson.se)
Aronsson Datateknik - http://aronsson.se
From the README:
"About the Engine
This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO
OUTPUT FORMATTING, and NO UI."
What's unclear about that?
> Now many people believe it is "an OCR program" that can function as
> an alternative to commercial end user products.
Those people clearly haven't bothered to read the README.
> Some open source
> software in other fields (especially OpenOffice and Firefox) can
> meet such expectations. So it's natural that complete beginners
> come to this list with basic questions about what a bitmap image
No, it's not, really. Nobody comes to the Firefox mailing list asking
what a webpage is.
> is. The commercial end user products would not bother their
> customers with such details.
>
> But today's Tesseract is much more like a subroutine library
> that requires or at least assumes that its users are programmers.
There are a number of GUIs out there for Tesseract, both open source
and commercial. OCRFeeder is the last one I saw a demo of; it's quite
nice. If you want to point and click at things and not think about what
you're doing, maybe you should use that.
> The experts on this list are not really interested in explaining
> what a bitmap image is. This mismatch comes from the failure to
> explain what Tesseract is.
It comes from the failure to read the explanation of what it is.
People are lazy, sure, I understand that. But I for one don't intend
to spend a whole lot of time accommodating that.
In future, please do not hijack threads. Your interjection has nothing
to do with the question at hand -- that image would pose a similar
problem for commercial OCR systems, too. I'll bet you a beer that
FineReader will pick nothing out of that image either, and FineReader
does not make any attempt to rescale images.
--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
I'm going to go out on a limb here, and guess that you downloaded
these images from a digital library or some other online source. If
they make higher resolution images available, download those - but
it's likely they don't. Camus died in 1960; his works are covered by
copyright (in Europe) until 2031 - it's quite likely that the
resolution was chosen specifically so nobody would be able to use OCR
on the scans.
Just use pdfimages then (it comes with xpdf) to extract the page images,
and use ImageMagick's convert to turn them from pbm into tiff. The PDF
as is looks ideal for OCR (and the extracted pbm images will be the same).
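
If you'd rather script it, here is a rough Python sketch of the two steps above. It assumes pdfimages and ImageMagick's convert are both on your PATH (newer ImageMagick releases call the tool magick instead). The function names and the "scan.pdf" / "page" names are placeholders I made up, not anything from your setup. Only bilevel pages come out as .pbm; colour or greyscale pages get other extensions and this sketch ignores them.

```python
import glob
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, image_root):
    """Build the command that extracts the embedded images from a PDF.

    pdfimages writes bilevel images as <image_root>-NNN.pbm.
    """
    return ["pdfimages", str(pdf_path), str(image_root)]


def convert_command(pbm_path):
    """Build the command that converts one .pbm file to a .tiff beside it."""
    pbm_path = Path(pbm_path)
    return ["convert", str(pbm_path), str(pbm_path.with_suffix(".tiff"))]


def extract_and_convert(pdf_path, image_root):
    """Run pdfimages, then convert every extracted .pbm file to TIFF."""
    subprocess.run(pdfimages_command(pdf_path, image_root), check=True)
    # The number of digits in NNN differs between pdfimages builds,
    # so match on the root rather than on an exact file name.
    for pbm in sorted(glob.glob(f"{image_root}-*.pbm")):
        subprocess.run(convert_command(pbm), check=True)


# Example call with placeholder names (hypothetical, adjust to your files):
# extract_and_convert("scan.pdf", "page")
```

The resulting .tiff files can then be fed to Tesseract one page at a time.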