Tesseract 3.03: PDF-OCR generated PDFs show coding artefacts => do not use lossy (jpg) compression! Use lossless compression (png)!!


Tom

Jul 28, 2014, 3:52:50 AM
to tesser...@googlegroups.com
Using the PDF-OCR option, I noticed that the mixed-mode PDFs generated by Tesseract (original page image plus OCR-ed text) show compression artefacts which were not present in the input image files (I use ImageMagick convert to render one image (png or bmp) per input PDF page).

So I propose changing Tesseract's PDF-OCR mode so that, when rendering the final mixed-mode PDF output files, it
  • does not use lossy compression
  • uses lossless compression (png)


Tom

Jul 28, 2014, 4:00:53 PM
to tesser...@googlegroups.com
Command line:

# The convert command (part of ImageMagick) creates a clean, losslessly compressed image 1.png.
# If you already have a PNG with characters and digits in it, you do not need this command:
convert -density 300x300 -depth 8 1.pdf 1.png

# Tesseract is then called and creates a mixed-mode PDF named "1.png.pdf".
# This output shows coding artefacts between the characters and digits if you enlarge the view.
# I can supply you with images (on request).
tesseract -l eng 1.png 1.png pdf

Tom

Jul 28, 2014, 4:03:18 PM
to tesser...@googlegroups.com


On Monday, 28 July 2014 09:52:50 UTC+2, Tom wrote:

Sriranga(80yrs)

Jul 28, 2014, 11:35:31 PM
to tesser...@googlegroups.com
Thanks for sharing the command line for the benefit of the community. I would also like to have your images.


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5b80105f-8db1-42bb-bf2d-3806ea0c052f%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Tom

Jul 29, 2014, 4:48:33 AM
to tesser...@googlegroups.com

Please find attached a zip file with the two files:
  • 1.png (the input to Tesseract: no artefacts, zoom to 400%)
  • 1.png.pdf (the output from Tesseract in pdf mode: artefacts! zoom to 400%)

Remark: the 1.png file was too big (in resolution) to be uploaded here directly, so you will find the 1.png file (input) only inside the zip file. Let me know if you need more, and please confirm that you can see the compression artefacts between the characters in 1.png.pdf (zoom to 400%!).


With lossless compression, I am pretty sure that 1.png.pdf will have the same quality as the input. That is the goal to be achieved within the scope of this bug report.

1-image-png--and--ocr-pdf.zip
1.png.pdf

Tom

Jul 30, 2014, 1:02:02 PM
to tesser...@googlegroups.com

Today I dug into the code of Tesseract and also Leptonica, and I *found* the reason, or one could say the bug (bearing in mind that the application here is using Tesseract to OCR a text file, or a file with images and text such as flowcharts).

The (easy) fix will be supplied separately.

Tom

Jul 30, 2014, 5:57:28 PM
to tesser...@googlegroups.com

Jim O'Regan

Jul 30, 2014, 7:00:42 PM
to tesser...@googlegroups.com
On 30 July 2014 18:02, Tom <syr...@gmail.com> wrote:
>
> Today I dug into the code of Tesseract and also Leptonica, and I *found*
> the reason, we can say, the bug

I looked into it yesterday, before I reclassified your issue.

The problem is that your image, which visibly contains only black and
white, reports itself as full colour (the colour depth is high, so
Leptonica assumes the best compression will be JPEG). Fix the image,
and you ought to have lossless compression (the other two possible
schemes, Flate (= zip) and G4, are both lossless).
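
The selection rule described above can be sketched roughly as follows. This is only an illustration based on my understanding of Leptonica's default choice (1 bpp gives G4, colormapped or 2/4 bpp gives Flate, anything deeper gives JPEG); the function name guess_pdf_encoding is hypothetical and is not part of Leptonica or Tesseract.

```shell
#!/bin/sh
# Sketch (assumption): mimic how a default PDF image encoding is picked
# from the pixel depth. guess_pdf_encoding is a made-up helper name.
#   $1 = bits per pixel of the image
#   $2 = 1 if the image has a colormap (palette), else 0
guess_pdf_encoding() {
    depth=$1
    has_cmap=${2:-0}
    if [ "$depth" -eq 1 ]; then
        echo "g4"      # bilevel image: CCITT Group 4, lossless
    elif [ "$has_cmap" -eq 1 ] || [ "$depth" -eq 2 ] || [ "$depth" -eq 4 ]; then
        echo "flate"   # palette or low depth: Flate (zip), lossless
    else
        echo "jpeg"    # e.g. 8 bpp gray or 32 bpp colour: JPEG, lossy
    fi
}

guess_pdf_encoding 32 0   # black-and-white text stored as full colour
guess_pdf_encoding 1 0    # the same page reduced to 1 bit per pixel
```

The two calls print "jpeg" and "g4". One possible way to "fix the image" before OCR, assuming ImageMagick and a threshold value that suits the scan, would be something like `convert 1.png -colorspace Gray -threshold 50% -type Bilevel 1-bw.png`; whether the result is really stored at 1 bit per pixel should be checked before relying on it.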

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

Tom

Jul 30, 2014, 7:11:52 PM
to tesser...@googlegroups.com
For the application of Tesseract as an OCR engine for texts (with or without images, B/W or colour), anything other than lossless compression is stupid. So, respectfully stated, I cannot accept your "work-around".

Please see my patch (on GitHub). It fully fixes the issue (we are talking only about the PDF mode), and the resulting files are smaller (I have already checked this).

Tom

Jul 31, 2014, 3:27:46 PM
to tesser...@googlegroups.com
@jimregan

Dear Jim,

thanks for your explanation. I also studied the two code bases (one part is in Leptonica, the other, more important one in Tesseract). I think forcing "FLATE" just before the image is rendered into the PDF page is the best solution. I kindly ask you to try my (short and easy) patch and to inspect the generated files, which were also smaller in my test cases.

Please let me know if you want me to run some test cases with B/W and also colored pages (text plus images and so on), but if this step can be skipped, I would be happy because I don't have that much time. On the other hand, I really want to have my patch pulled in, or an additional command line parameter like "--force-lossless-compression" for the "pdf" mode.

zdenko podobny

Jul 31, 2014, 5:14:30 PM
to tesser...@googlegroups.com
I have not had time to look at this issue yet, but forcing the user to use lossless compression is not the right way IMO.
The right way is to implement an option that lets the user force Tesseract to use lossless compression, but this feature is not provided by your "patch"...

Zdenko



Jim O'Regan

Jul 31, 2014, 6:09:57 PM
to tesser...@googlegroups.com
On 31 July 2014 20:27, Tom <syr...@gmail.com> wrote:
> Dear Jim,
>

Tom, I want to say before anything else that I very much appreciate
this followup message. I'm glad that you took the time to rephrase
your position.

> thanks for your explanation. I also studied the two code bases (one part is
> in Leptonica, the other, more important one in Tesseract). I think forcing
> "FLATE" just before the image is rendered into the PDF page is the best
> solution. I kindly ask you to try my (short and easy) patch and to inspect
> the generated files, which were also smaller in my test cases.
>

For your use case, undoubtedly. For other use cases, I'm not quite
convinced. For many users of Tesseract, the input will be a full
colour scan of a page image, and the objective will be to have the
smallest file size. I'm quite sure that this use case is what led to
the PDF feature -- I don't think it's a coincidence that the Tesseract
team are located in the Google Books building, and Google Books offers
such PDFs!

As you've identified in this message, it wasn't my intention to offer
a work around -- in fact, I think there may be an extra issue here,
that Tesseract is perhaps a little too willing to believe the colour
depth reported in the image. That requires some more investigation.

Quite aside from the issue at hand, I think it's worth telling you
that, in general, sending a patch that comments out code to an open
source project will (usually) result in automatic rejection. Remove
the code, or don't -- don't leave ugly commented code.

> Please let me know if you want me to run some test cases with B/W and
> also colored pages (text plus images and so on), but if this step can be
> skipped, I would be happy because I don't have that much time. On the other
> hand, I really want to have my patch pulled in, or an additional command
> line parameter like "--force-lossless-compression" for the "pdf" mode.
>

Having it as an option is, I think, the best for everyone, and all use
cases. Otherwise, the tests are unavoidable. I think this would be a
good option for users to have -- that's why I commented on the issue
-- so I'd be happy to add it (I should have time over the weekend).

Tom

Jul 31, 2014, 11:52:04 PM
to tesser...@googlegroups.com


On Thursday, 31 July 2014 23:14:30 UTC+2, zdenop wrote:
I have not had time to look at this issue yet, but forcing the user to use lossless compression is not the right way IMO.
The right way is to implement an option that lets the user force Tesseract to use lossless compression, but this feature is not provided by your "patch"...

@zdenop
@jimregan
 
Dear zdenop, dear Jim

yes, thanks. I was thinking about an option --force-lossless-compression, but after having inspected the http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html manual page, I think that Tesseract supports only a few command-line options; instead, it mainly takes its options from a config file.

So I will modify my code so that lossless compression can be forced by enabling it by means of a switch in the config file.

Question 1
========

Please let me know whether you like my approach (config parameter), or whether you would also support my proposal for a command-line switch (--force-lossless-compression).

BTW, it was and is clear to me that a final patch must not contain commented-out (dead) code.


Question 2
========

While we are at it, I have a question: I may be wrong, but while inspecting the code I found some pieces that indicate "multi-page" handling. My question: does Tesseract also support OCR-ing a PDF with many pages?

At the moment I have a script (using pdftk/PDFToolkit) to split a PDF into single image files, which I then convert one by one via the Tesseract "* * pdf" option, and which I then have to collate again with another script into the final single mixed-mode output PDF file.

Are there initiatives to integrate this into Tesseract?
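
For readers who want to reproduce the split / OCR / collate workflow described above, a minimal sketch could look like the following. It is not the author's script: the function names run and ocr_pdf, the page-NNN file naming and the DRY_RUN switch are all assumptions made for illustration, and it presumes that ImageMagick (convert), tesseract and pdftk are installed.

```shell
#!/bin/sh
# Sketch of a split / OCR / collate workflow (illustrative names only).
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

#   $1 = input PDF, $2 = output PDF, $3 = existing working directory
ocr_pdf() {
    input=$1
    output=$2
    workdir=$3

    # 1. Render one lossless PNG per input page (page-000.png, page-001.png, ...).
    run convert -density 300x300 -depth 8 "$input" "$workdir/page-%03d.png"

    # 2. OCR each page image; tesseract appends ".pdf" to the output base.
    for png in "$workdir"/page-*.png; do
        [ -e "$png" ] || continue
        run tesseract -l eng "$png" "${png%.png}" pdf
    done

    # 3. Collate the per-page PDFs; zero-padded names keep the glob in page order.
    run pdftk "$workdir"/page-*.pdf cat output "$output"
}

# Example call (not executed here):
#   ocr_pdf input.pdf output.pdf "$(mktemp -d)"
```

Running it once with DRY_RUN=1 shows the exact commands before anything is written, which is useful when adapting the density or language settings.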

zdenko podobny

Aug 2, 2014, 10:34:03 AM
to tesser...@googlegroups.com
In general:

  • for a new question, please start a new email
  • regarding the issue tracker: attach the patch there. Do not post code or a link to a code change there.
Other comments are inline below...
Zdenko

On Fri, Aug 1, 2014 at 5:52 AM, Tom <syr...@gmail.com> wrote:


> On Thursday, 31 July 2014 23:14:30 UTC+2, zdenop wrote:
>> I have not had time to look at this issue yet, but forcing the user to use lossless compression is not the right way IMO.
>> The right way is to implement an option that lets the user force Tesseract to use lossless compression, but this feature is not provided by your "patch"...
>
> @zdenop
> @jimregan
>
> Dear zdenop, dear Jim
>
> yes, thanks. I was thinking about an option --force-lossless-compression, but after having inspected the http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html manual page, I think that Tesseract supports only a few command-line options; instead, it mainly takes its options from a config file.

Thanks for the heads-up; the man page has not been updated for a long time. If you run "tesseract --help" (in version 3.03) you will see many more options. Nick implemented the possibility to set a control parameter from the command line:

 -c configvar=value    set value for control parameter. Multiple -c arguments are allowed.

> So I will modify my code so that lossless compression can be forced by enabling it by means of a switch in the config file.
>
> Question 1
> ========
>
> Please let me know whether you like my approach (config parameter), or whether you would also support my proposal for a command-line switch (--force-lossless-compression).

IMO you need to implement a "config parameter" anyway if you want to create a command-line switch (which is not necessary, as you can pass a config parameter on the command line).
Implement it so that the user can force the type of encoding (jpeg, g4, flate; it would be great if Leptonica also supported jbig2[1] there ;-) ). As the default option, leave the current behavior (so selectDefaultPdfEncoding chooses the type of encoding).

Tom

Aug 2, 2014, 10:59:02 AM
to tesser...@googlegroups.com


On Saturday, 2 August 2014 16:34:03 UTC+2, zdenop wrote:
In general:

  • regarding the issue tracker: attach the patch there. Do not post code or a link to a code change there.

zdenko podobny

Aug 2, 2014, 2:17:10 PM
to tesser...@googlegroups.com
I know, and you put a link to your GitHub repo there...

Zdenko



Tom

Aug 7, 2014, 2:37:00 AM
to tesser...@googlegroups.com
Introduction of a new configuration variable "tessedit_lossless_compression"

See

for the solution (patch and discussion of the smaller file size). Lossless compression does not introduce coding artefacts when rendering PDF output files.
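
As a rough illustration of how such a variable could be used, the sketch below builds a command line that passes it with the -c option mentioned earlier in the thread. It assumes a Tesseract build that includes the patch and that the variable is a boolean accepting 1 or 0; the helper name tesseract_pdf_cmd is hypothetical and only prints the command.

```shell
#!/bin/sh
# Sketch (assumption): pass the proposed variable via "-c configvar=value".
# tessedit_lossless_compression is the name proposed in this thread; it is
# only available if the patch has been applied to your Tesseract build.
#   $1 = input image (also used as output base), $2 = 1 or 0 (default 1)
tesseract_pdf_cmd() {
    image=$1
    lossless=${2:-1}
    echo "tesseract -l eng $image $image -c tessedit_lossless_compression=$lossless pdf"
}

tesseract_pdf_cmd 1.png   # prints the command line without running it
```

Printing the command first, rather than running it directly, makes it easy to check the argument order against the output of "tesseract --help" for the installed version.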


Tom

Sep 18, 2014, 3:46:45 PM
to tesser...@googlegroups.com
See a 400% zoom here: https://i.imgur.com/37J4kSn.png



On Monday, 28 July 2014 09:52:50 UTC+2, Tom wrote: