Suggestions on running PDFs through Tesseract without losing vector graphics?

hmmwhat...@gmail.com

Aug 30, 2015, 2:04:19 PM
to tesseract-ocr
Hello everyone,

I have a digital copy of a book I own that was delivered to me in what might be the most inconvenient of formats - one PDF per page, with all non-image data on the page - text included - converted to vector shapes. While I can re-combine the pages together, add bookmarks/page numbers/etc. with jPDFTweak, this still leaves me with the problem of not being able to search the book, as all of the text has been converted to vector shapes.

I thought I would use Tesseract, but I can't seem to find the latest Windows binaries or determine whether or not there's some workflow for doing OCR on a PDF, then mixing the hOCR output back into the same PDF without having to convert the image to a TIFF first. I'd like to not have to convert the PDFs into TIFFs and merge the TIFF into a PDF, as this would cause the vector shapes to get converted to a raster format.

Can anyone provide some insight on how to do this without pulling my hair out?

Jeff Breidenbach

Sep 4, 2015, 2:01:38 AM
to tesseract-ocr
This would be ridiculously hard to implement.

Jeff Breidenbach

Sep 5, 2015, 12:38:20 AM
to tesseract-ocr
But I would like to see an example PDF - one of the simpler ones - just to see how the vector graphics were done. Please do not get your hopes up.

hmmwhat...@gmail.com

Sep 10, 2015, 2:31:03 AM
to tesseract-ocr
On Friday, September 4, 2015 at 9:38:20 PM UTC-7, Jeff Breidenbach wrote:
But I would like to see an example PDF - one of the simpler ones - just to see how the vector graphics were done. Please do not get your hopes up.

I would upload a page, but unfortunately I'd be worried about running afoul of any copyright restrictions upon the book.

As far as I can tell, the text is implemented with each letter (or, in the case of dotted letters, contiguous portions of letters) being a single closed vector shape.

It's analogous to selecting paragraph/freeform text in a vector graphics/publishing program (CorelDRAW, Illustrator, etc.) and selecting "Convert to Curves" (or whatever the relevant option is named - that's what CorelDRAW calls it, I'm not 100% sure on Adobe products).

Tom Morris

Sep 10, 2015, 1:10:18 PM
to tesseract-ocr
On Thursday, September 10, 2015 at 2:31:03 AM UTC-4, hmmwhat...@gmail.com wrote:
On Friday, September 4, 2015 at 9:38:20 PM UTC-7, Jeff Breidenbach wrote:
But I would like to see an example PDF - one of the simpler ones - just to see how the vector graphics were done. Please do not get your hopes up.

I would upload a page, but unfortunately I'd be worried about running afoul of any copyright restrictions upon the book.

I suspect a single representative page used in this educational context would qualify for "fair use" under U.S. copyright law, but it's your call.  Even if you don't publish a page, I'd be curious who the publisher/imprint is and whether this format is standard practice for them.

As far as I can tell, the text is implemented with each letter (or, in the case of dotted letters, contiguous portions of letters) being a single closed vector shape.

Dotted letters?!?!  I hope you're not hoping to recognize those too.

I agree with Jeff that this sounds like a difficult task and it seems like a lot of work for a one-off, but I think it's doable.  A searchable PDF is basically an image layer with an invisible text layer registered on top of it.  I suspect that, instead of a base image layer, you could have a base vector graphics layer with a registered invisible text layer over it.

My imagined pipeline would be something like:

- page segmentation - using either the PDF (depending on what info is available there) or a rasterized version of the page.  This will give you a page layout breakdown by block type (text, image, drawings).
- rasterize - either just the text blocks or the entire page at a good resolution for OCR work
- OCR - get text along with coordinates for each word/line
- PDF assembly - crack open the original PDF, copy its contents, and insert the invisible text with the coordinates registered to the correct place on the underlying vector graphic text (see Tess sources for one example of how this is done)
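
A minimal sketch of the registration arithmetic in the last step, assuming the OCR output is hOCR (whose `title` attributes carry `bbox x0 y0 x1 y1` in pixels, measured from the top-left corner of the raster) and that the page was rasterized at a known DPI. PDF user space defaults to 72 units per inch with the origin at the bottom-left, so the boxes need scaling and a vertical flip. The function names and the `PdfBox` type are made up for illustration; they are not part of Tesseract or any PDF library.

```python
import re
from typing import NamedTuple, Tuple


class PdfBox(NamedTuple):
    """Rectangle in PDF points, origin at the bottom-left of the page."""
    x0: float
    y0: float
    x1: float
    y1: float


# hOCR stores geometry in the title attribute, e.g. "bbox 300 600 900 660; x_wconf 93"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_hocr_bbox(title: str) -> Tuple[int, int, int, int]:
    """Extract (left, top, right, bottom) in pixels from an hOCR title string."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError(f"no bbox in hOCR title: {title!r}")
    left, top, right, bottom = (int(g) for g in match.groups())
    return (left, top, right, bottom)


def pixel_box_to_pdf(box: Tuple[int, int, int, int], dpi: float,
                     page_height_pt: float) -> PdfBox:
    """Convert a top-left-origin pixel box to a bottom-left-origin box in points."""
    left, top, right, bottom = box
    # 72 points per inch; dividing by dpi converts pixels to inches first.
    return PdfBox(
        x0=left * 72.0 / dpi,
        y0=page_height_pt - bottom * 72.0 / dpi,  # flip the vertical axis
        x1=right * 72.0 / dpi,
        y1=page_height_pt - top * 72.0 / dpi,
    )


if __name__ == "__main__":
    pixels = parse_hocr_bbox("bbox 300 600 900 660; x_wconf 93")
    # US Letter is 792 points tall; the 300 DPI figure is an assumption.
    print(pixel_box_to_pdf(pixels, dpi=300, page_height_pt=792.0))
```

This only covers the simple case of an unrotated page whose MediaBox starts at the origin; a page with a rotation entry or an offset MediaBox would need an extra transform.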

Hopefully you are either going to be searching for a LOT of words in the book to make this worthwhile or are willing to write off the time investment as a science experiment.

Tom

Jeff Breidenbach

Sep 10, 2015, 8:20:42 PM
to tesseract-ocr
If the PDF embedded vector graphics similar to how it does rasters, then 
this becomes somewhat practical. For example, if the vector graphics were 
an embedded SVG then we'd pull that out (similar to how pdfimages from
poppler can pull out embedded rasters.) Then we'd teach Leptonica to
read the SVG into a pix, which is a rasterization operation. At PDF generation
time we'd throw out the pix and instead use the original SVG.
But I don't think it works that way at all. I think the vector graphics 
commands are likely to be direct PDF primitives and therefore way, way, 
way too hard to play with in this fashion.

I think a more likely approach (but still very unlikely!) is using a 
modified Tesseract to create a new PDF containing an invisible text 
layer and nothing else. Then hope someone has written a general 
purpose "composite two PDF files on top of each other" program 
and use it to merge. Such a merging program would be pretty difficult
to write, and it is hard for me to imagine why it would exist.
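
For a sense of what an "invisible text and nothing else" page would contain, here is a rough sketch that builds the content-stream operators for it. In PDF, text rendering mode 3 (`3 Tr`) means the glyphs are neither filled nor stroked, so the text is selectable and searchable but not drawn. The helper names, the `/F1` font resource name, and the `(text, x, y, size)` word tuples are assumptions for illustration; this is not how Tesseract's own PDF renderer is written, and it ignores non-ASCII text and glyph width matching.

```python
from typing import Iterable, Tuple

# (text, x in points, y in points, font size in points)
Word = Tuple[str, float, float, float]


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def invisible_text_ops(words: Iterable[Word], font_name: str = "F1") -> str:
    """Build content-stream operators that place each word invisibly.

    Assumes the page's resource dictionary defines a font under `font_name`.
    """
    lines = ["BT", "3 Tr"]  # begin text object; render mode 3 = invisible
    for text, x, y, size in words:
        lines.append(f"/{font_name} {size:g} Tf")       # select font and size
        lines.append(f"1 0 0 1 {x:.2f} {y:.2f} Tm")     # move to the word's origin
        lines.append(f"({escape_pdf_string(text)}) Tj")  # show the string
    lines.append("ET")  # end text object
    return "\n".join(lines)


if __name__ == "__main__":
    print(invisible_text_ops([("Hello", 72.0, 633.6, 12), ("f(x)", 110.0, 633.6, 12)]))
```

A stream like this would still have to be wrapped in page and font objects and then composited onto the original page, which is the part that needs a real PDF library.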


hmmwhat...@gmail.com

Sep 11, 2015, 1:42:39 AM
to tesseract-ocr

I don't know a whole lot about how Tesseract/Leptonica/pdfimages/etc. work, but would it potentially be possible to dump the entire page into one large raster image and then use the segmentation data to cut out just the part that "looks like" text? I know that pdftk (well, jPDFTweak, but that's essentially a front-end for pdftk) and other similar libraries can shove a whole page into a single raster image file, which might be helpful for this situation.

I'll see if I can either post one of the page PDFs with enough of the text/images removed to make it more-or-less useless for anything other than researching how the text is stored or just make a differing PDF with the text converted to curves in a similar fashion. The issue is that it's a relatively big-named higher education publisher, and I'd rather not get smacked with some legal nasty-gram for doing something that could probably be construed as piracy if someone felt like trying to ruin my day.

Tom Morris

Sep 11, 2015, 11:13:46 AM
to tesser...@googlegroups.com
I don't know where all this complexity came from.  PDF rasterizers have existed since the format was invented.  Ghostscript is one popular open source option.  It could either be used directly or through a tool that embeds it such as ImageMagick.

Tools like Apache PDFBox can be used to add the hidden text layer (i.e., optional content group) back into the PDF.

You could write a custom program to call the APIs for the various components or you could just use shell scripts to string together a bunch of commands.  I'm sure there are a lot of fiddly little details to get it to work, have the text be properly registered, etc, but I'm pretty sure it's possible to do.
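
As a rough illustration of the "string together a bunch of commands" route, the sketch below covers only the first two stages: rasterizing one page with Ghostscript and running Tesseract on the result to get hOCR. It assumes `gs` and `tesseract` are on the PATH (on Windows the Ghostscript console binary is usually `gswin64c`), that 300 DPI grayscale TIFF is acceptable, and that the Tesseract version writes `<base>.hocr` (older releases wrote `.html`). The function names are invented for the example, and the final step of merging the text back into the original PDF is not shown.

```python
import subprocess
from pathlib import Path
from typing import List


def ghostscript_cmd(pdf: Path, tiff: Path, dpi: int = 300) -> List[str]:
    """Command line that rasterizes a PDF to a grayscale TIFF at the given DPI."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=tiffgray",        # 8-bit grayscale TIFF output device
        f"-r{dpi}",                 # resolution in dots per inch
        f"-sOutputFile={tiff}",
        str(pdf),
    ]


def tesseract_cmd(tiff: Path, out_base: Path, lang: str = "eng") -> List[str]:
    """Command line that OCRs an image and writes hOCR next to out_base."""
    return ["tesseract", str(tiff), str(out_base), "-l", lang, "hocr"]


def ocr_page(pdf: Path, workdir: Path, dpi: int = 300) -> Path:
    """Rasterize one single-page PDF and OCR it; returns the expected hOCR path."""
    workdir.mkdir(parents=True, exist_ok=True)
    tiff = workdir / (pdf.stem + ".tif")
    out_base = workdir / pdf.stem
    subprocess.run(ghostscript_cmd(pdf, tiff, dpi), check=True)
    subprocess.run(tesseract_cmd(tiff, out_base), check=True)
    return Path(str(out_base) + ".hocr")  # assumption: newer Tesseract naming


if __name__ == "__main__":
    # Hypothetical layout: one PDF per page in ./pages, scratch files in ./ocr
    for page_pdf in sorted(Path("pages").glob("*.pdf")):
        print(ocr_page(page_pdf, Path("ocr")))
```

Keeping the DPI as an explicit parameter matters because the same value is needed later to map the OCR pixel coordinates back onto the page.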

Tom

Jeff Breidenbach

Sep 11, 2015, 1:18:13 PM
to tesseract-ocr
It's the parsing and manipulation of PDF that scares me. Thanks for pointing
out PDFBox, it looks pretty amazing. It even has the program that I was 