need help converting .jpg files to .tif for OCR

fontenot.1031

Jul 3, 2010, 2:23:00 AM
to tesseract-ocr
Hey. I have a bunch of .jpg files of the pages of the book L'Etranger
that I need to OCR. However, when I convert them into .tif files so
that tesseract can read them, it doesn't read anything (even though the
text is fairly clear).
I'm using this to convert the .jpg files into .tif files:

convert page-4.jpg -depth 2 page-4.tif

Then when I execute: tesseract page-4.tif page-4 -l fra I just get a
text file with two empty lines.

Here's a link to the exact .jpg I'm using:
http://imgur.com/j7f5E.jpg

Does anyone know what I might be doing wrong?

Eugene Reimer

Jul 3, 2010, 6:36:42 PM
to tesser...@googlegroups.com
You'll need to upscale the image before reducing it to
black-and-white. Reducing to B+W isn't essential.

fontenot.1031

Jul 4, 2010, 5:47:07 PM
to tesseract-ocr
> You'll need to upscale the image. Before reducing it to

Thanks for responding. I really appreciate it. Can you tell me what
upscaling is or how to do it with ImageMagick? I don't know that much
about images, jpeg or tiff. Thanks a lot. (also I think the imgur link
is messed up because the version on my computer is a lot bigger /
clearer).

nguyenq

Jul 4, 2010, 6:42:52 PM
to tesseract-ocr
If your images are at least 200 DPI, you can use VietOCR, which can
accept various common image formats as input -- no conversion to TIFF
is needed.

http://vietocr.sf.net
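
For scans below that resolution, here is a rough sketch of how an upscaling percentage could be worked out. The function name scale_percent is made up for illustration, and both DPI values are numbers you supply yourself; nothing here inspects the image. The result could be passed to ImageMagick's -resize option, though upscaling cannot restore detail that the scan never captured.

```shell
#!/bin/sh
# Sketch: estimate the resize percentage needed to bring a scan up to a
# target resolution. Both DPI values are inputs you supply; nothing here
# measures the image itself.

# scale_percent CURRENT_DPI TARGET_DPI
# Prints the percentage, rounded up, using integer arithmetic only.
scale_percent() {
    current=$1
    target=$2
    echo $(( (target * 100 + current - 1) / current ))
}

# Example call: scale_percent 72 200
```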

Eugene Reimer

Jul 4, 2010, 9:05:07 PM
to tesser...@googlegroups.com
Upscaling means scaling by a factor greater than one. Just google for
"imagemagick scaling".
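
A minimal sketch of what that could look like with ImageMagick's -resize option. The 300% default and the helper names tif_name and upscale_page are assumptions for illustration, not values from this thread, so adjust the factor to suit the scan. It leaves out -depth 2, so the grey levels are kept.

```shell
#!/bin/sh
# Sketch: upscale a page image before OCR. The 300% default is an
# assumption to tune per scan, not a value given in this thread.

# Build the output name by swapping the .jpg extension for .tif.
tif_name() {
    printf '%s\n' "${1%.jpg}.tif"
}

# upscale_page FILE.jpg [PERCENT]
# Resizes by PERCENT (default 300) and writes FILE.tif next to the input.
upscale_page() {
    convert "$1" -resize "${2:-300}%" "$(tif_name "$1")"
}

# Runs only when a file name is given, e.g.: sh upscale.sh page-4.jpg
if [ "$#" -gt 0 ]; then
    upscale_page "$@"
fi
```

The resulting page-4.tif can then be passed to tesseract exactly as in the original command.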

Lars Aronsson

Jul 5, 2010, 2:23:18 AM
to tesser...@googlegroups.com

I think that Tesseract, in order to be a successful project, must
be much more clear about what it is offering.

Now many people believe it is "an OCR program" that can function as
an alternative to commercial end user products. Some open source
software in other fields (especially OpenOffice and Firefox) can
meet such expectations. So it's natural that complete beginners
come to this list with basic questions about what a bitmap image
is. The commercial end user products would not bother their
customers with such details.

But today's Tesseract is much more like a subroutine library
that requires or at least assumes that its users are programmers.
The experts on this list are not really interested in explaining
what a bitmap image is. This mismatch comes from the failure to
explain what Tesseract is.


--
Lars Aronsson (la...@aronsson.se)
Aronsson Datateknik - http://aronsson.se


Jimmy O'Regan

Jul 5, 2010, 5:59:45 AM
to tesser...@googlegroups.com
On 5 July 2010 07:23, Lars Aronsson <la...@aronsson.se> wrote:
> On 07/04/2010 11:47 PM, fontenot.1031 wrote:
>>>
>>> You'll need to upscale the image.  Before reducing it to
>>
>> Thanks for responding. I really appreciate it. Can you tell me what
>> upscaling is or how to do it with ImageMagick? I don't know that much
>> about images, jpeg or tiff. Thanks a lot. (also I think the imgur link
>> is messed up because the version on my computer is a lot bigger /
>> clearer).
>
> I think that Tesseract, in order to be a successful project, must
> be much more clear about what it is offering.
>

From the README:

"About the Engine

This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO
OUTPUT FORMATTING, and NO UI."

What's unclear about that?

> Now many people believe it is "an OCR program" that can function as
> an alternative to commercial end user products.

Those people clearly haven't bothered to read the README.

> Some open source
> software in other fields (especially OpenOffice and Firefox) can
> meet such expectations. So it's natural that complete beginners
> come to this list with basic questions about what a bitmap image

No, it's not, really. Nobody comes to the Firefox mailing list asking
what a webpage is.

> is. The commercial end user products would not bother their
> customers with such details.
>
> But today's Tesseract is much more like a subroutine library
> that requires or at least assumes that its users are programmers.

There are a number of GUIs out there for Tesseract, both open source
and commercial. OCRFeeder is the last one I saw a demo of; it's quite
nice. If you want to point and click at things and not think about what
you're doing, maybe you should use that.

> The experts on this list are not really interested in explaining
> what a bitmap image is. This mismatch comes from the failure to
> explain what Tesseract is.

It comes from the failure to read the explanation of what it is.
People are lazy, sure, I understand that. But I for one don't intend
to spend a whole lot of time accommodating that.

In future, please do not hijack threads. Your interjection has nothing
to do with the question at hand -- that image would pose a similar
problem for commercial OCR systems, too. I'll bet you a beer that
FineReader will pick nothing out of that image either, and FineReader
does not make any attempt to rescale images.

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

Jimmy O'Regan

Jul 5, 2010, 6:18:13 AM
to tesser...@googlegroups.com
On 5 July 2010 11:02, Jimmy O'Regan <jor...@gmail.com> wrote:
> If at all possible, you should scan the pages again at a higher
> resolution. Rescaled images will give poor output.
>

I'm going to go out on a limb here, and guess that you downloaded
these images from a digital library or some other online source. If
they make higher resolution images available, download those - but
it's likely they don't. Camus died in 1960; his works are covered by
copyright (in Europe) until 2031 - it's quite likely that the
resolution was chosen specifically so nobody would be able to use OCR
on the scans.

fontenot.1031

Jul 5, 2010, 5:00:02 PM
to tesseract-ocr
Looks like I got a better result by using some different parameters
with imagemagick.

Using:

convert -trim -posterize 9 +matte -geometry 650 -linewidth 1 -identify -enhance +dither -colors 16 +contrast -density 88 -black-point-compensation -quality 90 -unsharp 0.7x1.1+2.0+0 CamusLetranger.pdf pages/page.jpg

I got .jpg files that look like these: http://imgur.com/iayVG.jpg
When I converted them to .tif and ran tesseract on them, I got this output:
I ,
ÀU]OURD’¥IIîî, mâlîlêlll est morte. Ou peubêtrc
hier, jc nc sais pas. _|‘al reçu un télégramme
de l'asile : u Mère décédée. Enterrement de-
main. Sentiments dnsninguès. sa Ccla nc veut
rien dire. C'étaiL peut-être hier.
L'asile de vieillards est à Mzircngo. à quatre-
vingls kilomètres d’Alger. jc prendrai l`aut¤-
bus à deux hemcs ct j'a11·ivcral dans I’après·
mrdi. Ainsi. je pourrai veiller et jc rcntrcrai
demain soir. fai demandé deux jours dû
congé à mon patron ct il ne pouvait pas mc
lcs rcfuscr avec une excuse parcillc. Mais il
n':wa,ît pas llair coment. je lui ai même dit. :
u Ce h"€5I pas de ma faute. sa Il n'a pas
répondu. fai pensé alors que jc n’:~.ura.is pas
dû lui dirc ccla,. En somme, je n'avais pas à

Which is okay-ish. I can re-interpret most of the original text and
fix the errors.

My question is: are there any better options to use when
converting from pdf to .jpg?

> it's quite likely that the resolution was chosen specifically so nobody would be able to use OCR on the scans.

The original PDF is of high quality. Here's a link to it:
http://www.lecanardduloir.com/Docs/CamusLetranger.pdf

Jimmy O'Regan

Jul 6, 2010, 1:18:42 PM
to tesser...@googlegroups.com
On 5 July 2010 22:00, fontenot.1031 <fonten...@gmail.com> wrote:
> My question is: are there any better options to use when
> converting from pdf to .jpg?
>
>> it's quite likely that the resolution was chosen specifically so nobody would be able to use OCR on the scans.
>
> The original PDF is of high quality. Here's a link to it:
> http://www.lecanardduloir.com/Docs/CamusLetranger.pdf

Just use pdfimages then (it comes with xpdf), and use ImageMagick's
convert to convert from pbm to tiff. The PDF as is looks like it's
ideal for OCR (and the pbm images extracted will be the same).
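
Roughly, those steps could be scripted as below. This is a sketch under assumptions: the pages/ output directory, the page prefix, and the helper names tif_for and extract_and_ocr are chosen for illustration. It expects pdfimages, convert and tesseract to be on the PATH.

```shell
#!/bin/sh
# Sketch of the pdfimages route. The pages/ directory and the "page"
# prefix are assumptions chosen for illustration.

# Map an extracted .pbm name to the .tif name handed to tesseract.
tif_for() {
    printf '%s\n' "${1%.pbm}.tif"
}

# extract_and_ocr FILE.pdf
extract_and_ocr() {
    pdf=$1
    mkdir -p pages
    # Extracts the embedded images as pages/page-NNN.*; 1-bit images
    # are written as .pbm files.
    pdfimages "$pdf" pages/page
    for pbm in pages/page-*.pbm; do
        # Skip the unexpanded pattern if no .pbm files were produced.
        [ -e "$pbm" ] || continue
        tif=$(tif_for "$pbm")
        convert "$pbm" "$tif"
        # Second argument is the output base name; tesseract adds .txt.
        tesseract "$tif" "${tif%.tif}" -l fra
    done
}

# Runs only when a PDF is given, e.g.: sh ocr.sh CamusLetranger.pdf
if [ "$#" -gt 0 ]; then
    extract_and_ocr "$1"
fi
```

Because the images are extracted as stored in the PDF rather than re-rendered, no resolution or sharpening options are needed.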

nguyenq

Jul 7, 2010, 11:19:35 PM
to tesseract-ocr
After splitting your CamusLetranger.pdf file into 50-page sections, I
fed them into VietOCR (2.0 Beta), which uses GhostScript to convert PDF to
PNG format, and got this result, which seems acceptable:

I
AuJoURD'HU1, maman est morte. Ou peut-être
hier, je ne sais pas. fai reçu un télégramme
de l'asi1e : << Mère décédée. Enterrement de-
main. Sentiments distingués. ›› Cela ne veut
rien dire. C'ótait peut-être hier.
I_'asi1e de vieillards est à Marengo, à quatre-
vingts kilomètres d'Alger. je prendrai l'auto-
bus à deux heures et j'arriverai dans l'après-
midi. Ainsi, je pourrai veiller et je rentrerai
demain soir. fai demandé deux jours de
congé à mon patron et il ne pouvait pas me
les refuser avec une excuse pareille. Mais il
n'avait pas l'air content. Je lui ai même dit :
<< Ce n'est pas de ma faute. ›› Il n'a pas
répondu. ]'ai pensé alors que je n'aurais pas
dû lui dire cela. En somme, je n'avais pas à
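
For anyone who wants to try the GhostScript step by hand, a sketch of rendering one 50-page section to PNG follows. The pnggray device, the 300 dpi resolution, the output naming and the helper names are all assumptions; the options VietOCR actually passes to GhostScript are not stated in this thread.

```shell
#!/bin/sh
# Sketch: render one 50-page section of a PDF to PNG with Ghostscript.
# pnggray and 300 dpi are assumptions, not VietOCR's actual settings.

# section_range N
# Prints the first and last page of section N (1-based, 50 pages each).
section_range() {
    first=$(( ($1 - 1) * 50 + 1 ))
    last=$(( $1 * 50 ))
    printf '%s %s\n' "$first" "$last"
}

# render_section FILE.pdf N
render_section() {
    pdf=$1
    section=$2
    range=$(section_range "$section")
    first=${range% *}
    last=${range#* }
    mkdir -p pages
    # The section number goes into the file name because Ghostscript
    # restarts its %03d page counter at 1 on every run.
    gs -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pnggray -r300 \
       -dFirstPage="$first" -dLastPage="$last" \
       -sOutputFile="pages/section${section}-%03d.png" "$pdf"
}

# Runs only when both arguments are given, e.g.:
#   sh render.sh CamusLetranger.pdf 1
if [ "$#" -ge 2 ]; then
    render_section "$1" "$2"
fi
```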