Suggestions on running PDFs through Tesseract without losing vector graphics?

hmmwhat...@gmail.com

Aug 30, 2015, 2:04:19 PM

to tesseract-ocr

Hello everyone,

I have a digital copy of a book I own that was delivered to me in what might be the most inconvenient of formats - one PDF per page, with all non-image data on the page - text included - converted to vector shapes. While I can re-combine the pages together, add bookmarks/page numbers/etc. with jPDFTweak, this still leaves me with the problem of not being able to search the book, as all of the text has been converted to vector shapes.

I thought I would use Tesseract, but I can't seem to find the latest Windows binaries or determine whether or not there's some workflow for doing OCR on a PDF, then mixing the hOCR output back into the same PDF without having to convert the image to a TIFF first. I'd like to not have to convert the PDFs into TIFFs and merge the TIFF into a PDF, as this would cause the vector shapes to get converted to a raster format.

Can anyone provide some insight on how to do this without pulling my hair out?

Jeff Breidenbach

Sep 4, 2015, 2:01:38 AM

to tesseract-ocr

This would be ridiculously hard to implement.

Jeff Breidenbach

Sep 5, 2015, 12:38:20 AM

to tesseract-ocr

But I would like to see an example PDF - one of the simpler ones - just to see how the vector graphics were done. Please do not get your hopes up.

hmmwhat...@gmail.com

Sep 10, 2015, 2:31:03 AM

to tesseract-ocr

On Friday, September 4, 2015 at 9:38:20 PM UTC-7, Jeff Breidenbach wrote:

> But I would like to see an example PDF - one of the simpler ones - just to see how the vector graphics were done. Please do not get your hopes up.

I would upload a page, but unfortunately I'd be worried about running afoul of any copyright restrictions upon the book.

As far as I can tell, the text is implemented with each letter (or, in the case of dotted letters, contiguous portions of letters) being a single closed vector shape.

It's analogous to selecting paragraph/freeform text in a vector graphics/publishing program (CorelDRAW, Illustrator, etc.) and selecting "Convert to Curves" (or whatever the relevant option is named - that's what CorelDRAW calls it, I'm not 100% sure about Adobe products).

Tom Morris

Sep 10, 2015, 1:10:18 PM

to tesseract-ocr

On Thursday, September 10, 2015 at 2:31:03 AM UTC-4, hmmwhat...@gmail.com wrote:

> On Friday, September 4, 2015 at 9:38:20 PM UTC-7, Jeff Breidenbach wrote:
> > But I would like to see an example PDF - one of the simpler ones - just to see how the vector graphics were done. Please do not get your hopes up.
>
> I would upload a page, but unfortunately I'd be worried about running afoul of any copyright restrictions upon the book.

I suspect a single representative page used in this educational context would qualify for "fair use" under U.S. copyright law, but it's your call. Even if you don't publish a page, I'd be curious who the publisher/imprint is and whether this format is standard practice for them.

> As far as I can tell, the text is implemented with each letter (or, in the case of dotted letters, contiguous portions of letters) being a single closed vector shape.

Dotted letters?!?! I hope you're not hoping to recognize those too.

I agree with Jeff that this sounds like a difficult task and it seems like a lot of work for a one-off, but I think it's doable. A searchable PDF is basically an image layer with an invisible text layer registered on top of it. I suspect that, instead of a base image layer, you could have a base vector graphics layer with a registered invisible text layer over it.
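To make "invisible text layer" a bit more concrete: in a PDF content stream, the text rendering mode operator `Tr` with mode 3 draws text that is neither filled nor stroked, so it can be searched and selected but is not visible. The Python sketch below only builds such a fragment as a string. The function names, the font resource name `F1`, and the sample coordinates are assumptions for illustration; a real pipeline would still need a PDF library to define the font on the page and append the stream, and non-ASCII text needs more careful encoding than shown here.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def invisible_text_ops(text: str, x: float, y: float, font_size: float,
                       font_name: str = "F1") -> str:
    """Build content-stream operators that place `text` invisibly at (x, y).

    Coordinates are in PDF points with the origin at the bottom-left corner.
    `font_name` is a hypothetical resource name; the page must define it.
    """
    return (
        "BT\n"                                  # begin text object
        f"/{font_name} {font_size:g} Tf\n"      # select font and size
        "3 Tr\n"                                # rendering mode 3: invisible
        f"{x:g} {y:g} Td\n"                     # move to the start of the word
        f"({escape_pdf_string(text)}) Tj\n"     # show the text
        "ET"                                    # end text object
    )


if __name__ == "__main__":
    print(invisible_text_ops("Tesseract (OCR)", 72, 633.6, 12))
```

Because the visible page content is left alone, the same fragment should work whether the layer underneath is a raster image or vector shapes.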

My imagined pipeline would be something like:

- page segmentation - using either the PDF (depending on what info is available there) or a rasterized version of the page. This will give you a page layout breakdown by block type (text, image, drawings).

- rasterize - either just the text blocks or the entire page at a good resolution for OCR work

- OCR - get text along with coordinates for each word/line

- PDF assembly - crack open the original PDF, copy its contents, and insert the invisible text with the coordinates registered to the correct place on the underlying vector graphic text (see Tess sources for one example of how this is done)
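The registration part of the last step mostly comes down to a coordinate conversion. hOCR output gives each word a `bbox x0 y0 x1 y1` in pixels with the origin at the top-left of the rasterized page, while PDF uses points (1/72 inch) with the origin at the bottom-left. The sketch below is a minimal version of that conversion in Python. It assumes the whole page was rasterized at a known DPI, that the page is unrotated, and that its MediaBox starts at (0, 0); the names `PdfRect`, `parse_hocr_bbox`, and `pixels_to_pdf_rect` are made up for this example.

```python
import re
from typing import NamedTuple, Optional, Tuple


class PdfRect(NamedTuple):
    x0: float  # left edge, in points
    y0: float  # bottom edge, in points
    x1: float  # right edge, in points
    y1: float  # top edge, in points


BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_hocr_bbox(title: str) -> Optional[Tuple[int, ...]]:
    """Extract (left, top, right, bottom) pixels from an hOCR title attribute."""
    match = BBOX_RE.search(title)
    return tuple(int(g) for g in match.groups()) if match else None


def pixels_to_pdf_rect(bbox: Tuple[int, ...], dpi: float,
                       page_height_pt: float) -> PdfRect:
    """Convert a top-left-origin pixel box to a bottom-left-origin PDF box."""
    left, top, right, bottom = bbox
    scale = 72.0 / dpi  # points per pixel
    return PdfRect(
        x0=left * scale,
        y0=page_height_pt - bottom * scale,  # flip the y axis
        x1=right * scale,
        y1=page_height_pt - top * scale,
    )


if __name__ == "__main__":
    box = parse_hocr_bbox("bbox 300 600 900 660; x_wconf 95")
    print(pixels_to_pdf_rect(box, dpi=300, page_height_pt=792.0))
```

If the text ends up consistently shifted or scaled, the usual suspects are a DPI mismatch between the rasterizer and this conversion, or a page whose MediaBox or CropBox does not start at the origin.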

Hopefully you are either going to be searching for a LOT of words in the book to make this worthwhile or are willing to write off the time investment as a science experiment.

Tom

Jeff Breidenbach

Sep 10, 2015, 8:20:42 PM

to tesseract-ocr

If the PDF embedded vector graphics the same way it embeds rasters, then this becomes somewhat practical. For example, if the vector graphics were an embedded SVG then we'd pull that out (similar to how pdfimages from poppler can pull out embedded rasters). Then we'd teach Leptonica to read the SVG into a pix, which is a rasterization operation. At PDF generation time we'd throw out the pix and instead use the original SVG.

But I don't think it works that way at all. I think the vector graphics commands are likely to be direct PDF primitives and therefore way, way, way too hard to play with in this fashion.

I think a more likely approach (but still very unlikely!) is using a modified Tesseract to create a new PDF containing an invisible text layer and nothing else. Then hope someone has written a general purpose "composite two PDF files on top of each other" program and use it to merge. Such a merging program would be pretty difficult to write, and it is hard for me to imagine why it would exist.

hmmwhat...@gmail.com

Sep 11, 2015, 1:42:39 AM

to tesseract-ocr

I don't know a whole lot about how Tesseract/Leptonica/pdfimage/etc. work, but would it potentially be possible to dump the entire page into one large raster image and then use the segmentation data to cut out just the part that "looks like" text? I know that pdftk (well, jPDFTweak, but that's essentially a front-end for pdftk) and other similar libraries can shove a whole page into a single raster image file, which might be helpful for this situation.

I'll see if I can either post one of the page PDFs with enough of the text/images removed to make it more-or-less useless for anything other than researching how the text is stored, or just make a different PDF with the text converted to curves in a similar fashion. The issue is that it's a relatively big-name higher education publisher, and I'd rather not get smacked with some legal nasty-gram for doing something that could probably be construed as piracy if someone felt like trying to ruin my day.

Tom Morris

Sep 11, 2015, 11:13:46 AM

to tesser...@googlegroups.com

I don't know where all this complexity came from. PDF rasterizers have existed since the format was invented. Ghostscript is one popular open source option. It could either be used directly or through a tool that calls it, such as ImageMagick.

Tools like Apache PDFBox can be used to add the hidden text layer (i.e. an optional content group) back into the PDF.

You could write a custom program to call the APIs for the various components or you could just use shell scripts to string together a bunch of commands. I'm sure there are a lot of fiddly little details to get it to work, have the text be properly registered, etc, but I'm pretty sure it's possible to do.
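As a rough idea of what "stringing together a bunch of commands" could look like for the first two stages, here is a small Python sketch that builds the Ghostscript and Tesseract command lines and runs them in order. The Ghostscript switches (`-sDEVICE=png16m`, `-r`, `-sOutputFile=`) and the Tesseract `hocr` config are standard, but everything else is an assumption: the function names are invented, the executable is typically `gswin64c` rather than `gs` on Windows, and the hOCR output extension has varied between Tesseract versions. The PDF assembly step is deliberately left out.

```python
import subprocess
from typing import List


def ghostscript_raster_cmd(pdf_path: str, png_path: str, dpi: int = 300,
                           gs_exe: str = "gs") -> List[str]:
    """Command line that rasterizes a one-page PDF to a 24-bit PNG."""
    return [
        gs_exe, "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",             # 24-bit RGB PNG output device
        f"-r{dpi}",                    # resolution in dots per inch
        f"-sOutputFile={png_path}",
        pdf_path,
    ]


def tesseract_cmd(image_path: str, output_base: str, lang: str = "eng",
                  output_format: str = "hocr",
                  tesseract_exe: str = "tesseract") -> List[str]:
    """Command line that OCRs an image and writes word boxes as hOCR."""
    return [tesseract_exe, image_path, output_base, "-l", lang, output_format]


def run_page(pdf_path: str, work_base: str, dpi: int = 300) -> str:
    """Rasterize one page PDF and OCR it; returns the OCR output base name.

    Requires Ghostscript and Tesseract to be installed and on the PATH.
    """
    png_path = work_base + ".png"
    subprocess.run(ghostscript_raster_cmd(pdf_path, png_path, dpi), check=True)
    subprocess.run(tesseract_cmd(png_path, work_base), check=True)
    return work_base


if __name__ == "__main__":
    # Print the commands instead of running them, as a dry run.
    print(" ".join(ghostscript_raster_cmd("page001.pdf", "page001.png")))
    print(" ".join(tesseract_cmd("page001.png", "page001")))
```

Keeping the DPI in one place matters, since the same value is needed later to map OCR pixel coordinates back onto the original page.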

Tom


Jeff Breidenbach

Sep 11, 2015, 1:18:13 PM

to tesseract-ocr

It's the parsing and manipulation of PDF that scares me. Thanks for pointing out PDFBox, it looks pretty amazing. It even has the program that I was speculating about.

https://pdfbox.apache.org/1.8/commandline.html#overlayPDF
