jbig2 encoding in PDF output file

Spiel Maus

Jun 15, 2015, 3:55:08 PM

to tesser...@googlegroups.com

Hello,

Is there a way to tell Tesseract to use JBIG2 for image encoding in the PDF output file? This would result in smaller files for bitonal text scans, about half the size in my case.

Encoding the images with jbig2enc before running OCR has no effect. Tesseract seems to "destroy" the compression by re-encoding the image itself.

Thanks in advance.
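
One rough way to check which encoding ended up in a PDF is to look at the /Filter names on its streams. The sketch below is only an illustration and is not part of Tesseract: `count_filters` and `IMAGE_FILTERS` are hypothetical names, and the scan assumes the object dictionaries are stored uncompressed (no object streams). The filter names are the standard PDF ones. /FlateDecode is also used for non-image streams, so its count is only a hint.

```python
import re
from collections import Counter

# Standard PDF stream filter names mapped to a readable label.
IMAGE_FILTERS = {
    b"JBIG2Decode": "JBIG2",
    b"CCITTFaxDecode": "CCITT G3/G4",
    b"DCTDecode": "JPEG",
    b"JPXDecode": "JPEG 2000",
    b"FlateDecode": "Flate (zlib)",
}


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count /Filter entries by name in raw (uncompressed) PDF dictionaries."""
    counts = Counter()
    # Matches "/Filter /Name" and "/Filter [/Name ...]" (first name only).
    for match in re.finditer(rb"/Filter\s*\[?\s*/([A-Za-z0-9]+)", pdf_bytes):
        name = match.group(1)
        counts[IMAGE_FILTERS.get(name, name.decode("ascii"))] += 1
    return counts
```

For a file on disk this could be called as `count_filters(open("out.pdf", "rb").read())`, where `out.pdf` is a placeholder path. Comparing the result for the jbig2enc-based PDF and the Tesseract-produced PDF would show whether the image filter changed.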

supriya Das

Jun 15, 2015, 11:28:38 PM

to tesser...@googlegroups.com

Hello Spiel Maus,

I have never tried saving the output as PDF. Can you tell me which version supports it?

Jeff Breidenbach

Jun 29, 2015, 3:57:08 AM

to tesser...@googlegroups.com

Not available currently, and a pretty major effort would be required to make it happen, both in Leptonica and in Tesseract's PDF output module. No plans to work on this. For other formats we try hard not to re-encode during PDF generation whenever practical.

Tom Morris

Jul 1, 2015, 2:40:35 PM

to tesser...@googlegroups.com

There's a JBIG2 encoder, jbig2enc (https://github.com/agl/jbig2enc). Since it uses Leptonica for some of its internal operations, adding it to Leptonica might be a little cyclical (or require some restructuring).

While Jeff obviously has more experience than I, it seems like it should be a straightforward integration. Non-trivial to be sure, but certainly doable. The PDF output module already supports multiple encodings, including 1-bit G4, so it *seems* like it should mainly be a matter of filtering/transforming the segments in the JBIG2 stream and creating a stream for the global symbols, if needed.

Jeff - are there particular trouble spots that you foresee if someone were to tackle this?

Tom

Jeff Breidenbach

Jul 18, 2015, 12:11:44 AM

to tesser...@googlegroups.com

JBIG2 is a multipage image format, but it differs from, for example, multipage TIFF because the images are not independently compressed. They share compression data, specifically a symbol dictionary.
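
To make the "shared symbol dictionary" point concrete, here is a toy sketch. It is not the JBIG2 bitstream format and not how jbig2enc is implemented: glyph bitmaps are stood in for by plain strings, and `encode_pages` / `decode_page` are hypothetical names. It only shows why a page cannot be decoded without the data shared by all pages.

```python
def encode_pages(pages):
    """Toy model: every page stores indices into one shared symbol dictionary."""
    dictionary = []  # shared across all pages
    index_of = {}    # glyph -> position in the shared dictionary
    encoded = []
    for page in pages:
        refs = []
        for glyph in page:
            if glyph not in index_of:
                index_of[glyph] = len(dictionary)
                dictionary.append(glyph)
            refs.append(index_of[glyph])
        encoded.append(refs)
    return dictionary, encoded


def decode_page(dictionary, refs):
    """A page is only recoverable together with the shared dictionary."""
    return [dictionary[i] for i in refs]
```

In this model the second page reuses symbols first seen on the first page, so dropping or re-encoding one page independently is not possible the way it is with multipage TIFF.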

There are three possible approaches here:

1. Have Tesseract accept JBIG2 images produced by jbig2enc and embed them into the PDF without modification.

2. Have Tesseract actually do JBIG2 compression.

3. Have Tesseract do image segmentation, compress some parts of the page as JBIG2 and other parts as JP2K, and store the results in the PDF in a mixed raster format.

I'm only going to discuss #1 because it is the simplest and matches the current 'try to never transcode' philosophy. We'd need a JBIG2 decoder in Leptonica. That's probably straightforward but still a very solid chunk of work.

Then, there is what to do in Tesseract. The PDF rendering module would need to learn about the symbol dictionary (or dictionaries) and add it to the collection of PDF objects. It will need a much better understanding of what's going on than what we currently use, which is simply 'Hey, what image file belongs to this page? Let's try to inline it unchanged.'

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L811
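
As a rough picture of what "add it to the collection of PDF objects" could mean, the sketch below serializes the two kinds of objects involved: one stream holding the shared JBIG2 global segments, and a page image that points at it through /DecodeParms. The /JBIG2Decode filter and the /JBIG2Globals key come from the PDF specification. Everything else is an assumption for illustration: `jbig2_pdf_objects` is a hypothetical helper, not Tesseract's renderer code, the object numbers are chosen by the caller, and the cross-reference table, page objects, and the text layer are omitted.

```python
def jbig2_pdf_objects(globals_obj, image_obj, width, height,
                      globals_data, page_data):
    """Serialize a shared JBIG2 globals stream and one page image object."""
    # Stream with the global segments (the symbol dictionary), shared by pages.
    globals_stream = (
        b"%d 0 obj\n<< /Length %d >>\nstream\n" % (globals_obj, len(globals_data))
        + globals_data
        + b"\nendstream\nendobj\n"
    )
    # 1-bit image whose decode parameters reference the globals stream.
    image_dict = (
        b"<< /Type /XObject /Subtype /Image"
        b" /Width %d /Height %d"
        b" /ColorSpace /DeviceGray /BitsPerComponent 1"
        b" /Filter /JBIG2Decode"
        b" /DecodeParms << /JBIG2Globals %d 0 R >>"
        b" /Length %d >>"
    ) % (width, height, globals_obj, len(page_data))
    image_stream = (
        b"%d 0 obj\n" % image_obj
        + image_dict
        + b"\nstream\n"
        + page_data
        + b"\nendstream\nendobj\n"
    )
    return globals_stream, image_stream
```

Each additional page would get its own image object but reference the same globals object number, which is the extra bookkeeping a simple "inline the page image unchanged" step does not have to do.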

Now the good news is that the PDF rendering module is really small and is not cemented down by a whole bunch of unnecessary abstraction layers. And I know it's possible because I've personally done it with colleagues elsewhere.

But it is a pretty significant effort, and I'm honestly not sure it's worth putting inside Tesseract. Maybe a better approach is post-processing, with a PDF-to-PDF converter that uses approach #3. This is the winning strategy for linearization, which can be done on a Tesseract-produced PDF using QPDF.
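
For reference, the linearization post-processing step mentioned above can be scripted as a small wrapper around the `qpdf --linearize` command. This is a minimal sketch: the function names and file names are placeholders, and it assumes the `qpdf` binary is installed and on the PATH.

```python
import shutil
import subprocess


def linearize_command(src, dst):
    """Build the qpdf command that writes a linearized copy of src to dst."""
    return ["qpdf", "--linearize", src, dst]


def linearize(src, dst):
    """Run qpdf on a PDF, for example one produced by Tesseract."""
    if shutil.which("qpdf") is None:
        raise FileNotFoundError("qpdf not found on PATH")
    # check=True raises CalledProcessError on any non-zero exit status.
    subprocess.run(linearize_command(src, dst), check=True)
```

A call such as `linearize("ocr.pdf", "ocr-linear.pdf")` leaves the original file untouched and writes the linearized copy alongside it.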

Tom Morris

Jul 18, 2015, 1:48:20 AM

to tesser...@googlegroups.com

Thanks for the analysis and feedback, Jeff.

Unfortunately, I don't know much about QPDF (and SourceForge's storage problems are preventing me from learning any more), but doing #3 externally using a tool like QPDF, perhaps in conjunction with doing #1 in Tesseract itself, sound like reasonable options.

Tom
