Creating searchable image PDFs using Ghostscript

Geico Caveman

Sep 6, 2008, 8:12:21 PM

Hello,

What I have :

1. A bunch of pnm images generated by xsane from a manual.
2. Subsequently, these images are split up and cleaned using unpaper.
3. A working installation of ghostscript on Ubuntu.
4. A growing knowledge of the internal structure of PDFs as gleaned from the
Adobe PDF reference and the pdfmark reference.
5. A fine OCR tool (which is accurate enough for my purposes) - tesseract.
6. The following shell script that I wrote :

(just a segment)

convert -density 300 outputimage-$m.pgm outputimage-$m.tiff
tesseract outputimage-$m.tiff outputimage-$m
rm outputimage-$m.tiff
convert -density 300 outputimage-$n.pgm outputimage-$n.tiff
tesseract outputimage-$n.tiff outputimage-$n
rm outputimage-$n.tiff
potrace --level3 --backend postscript -r300 outputimage-$m.pgm -o page-$m.ps
potrace --level3 --backend postscript -r300 outputimage-$n.pgm -o page-$n.ps
echo "[ /Author (User)
/Creator (User)
/Producer (Ghostscript+potrace)
/Keywords (`cat outputimage-$m.txt`)
/DOCINFO pdfmark
/F (outputimage-$m.txt) (r) file def
[ /_objdef {mystream} /type /stream /OBJ pdfmark
[ {mystream} F /PUT pdfmark
[ /MyPrivateAnnotmyStreamData {mystream}
/Subtype /Text
/Rect [ 10 10 30 30 ]
/Contents (`cat outputimage-$m.txt`)
/SrcPg 1
/Open false
/Color [1 1 0]
/Title (Tesseract - OCR)
/ANN pdfmark
[ /Name <feff 0041 0073>
/FS<<
/Type /Filespec
/F (outputimage-$m.txt)
/EF << /F {fstream} >>
>>
/EMBED pdfmark
[ /PageMode /UseOutline
/Page 1 /View [/Fit]
/DOCVIEW pdfmark
[ /Subtype /Text
/Title (Tesseract - OCR)
/Alt (`cat outputimage-$m.txt`)
/StPNE pdfmark
[ {Catalog} <</MarkInfo <</Marked true>>>> /PUT pdfmark" > pdfmarks-$m

gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -r300x300 -sPAPERSIZE=letter -dDOPDFMARKS -dPDFSETTINGS=/ebook -dPermissions=-4 -sOutputFile=pageannotated-$m.pdf page-$m.ps pdfmarks-$m

pdftk pageannotated-*.pdf cat output pdfjoined.pdf
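
One thing to watch in the script above: the OCR output is pasted straight into PostScript string literals such as /Contents ( ... ). If tesseract emits an unbalanced parenthesis or a backslash, the pdfmark file may no longer parse. A minimal sketch of a workaround, assuming a hypothetical helper named ps_escape (not part of any of the tools above) that reads text on stdin and escapes those characters:

```shell
#!/bin/sh
# Hypothetical helper: escape text for use inside a PostScript ( ... ) string.
# Backslashes are doubled first, then each parenthesis gets a backslash.
ps_escape() {
    sed -e 's/\\/\\\\/g' -e 's/(/\\(/g' -e 's/)/\\)/g'
}

# Example: build a /Contents entry from a line of OCR-like text.
printf '/Contents (%s)\n' "$(printf '%s' 'see fig. 3) and C:\temp' | ps_escape)"
```

In the script, this would mean piping the text file through the helper (ps_escape < outputimage-$m.txt) wherever the raw cat output is currently interpolated. It does not make the PDF searchable; it only keeps the generated pdfmark file well-formed.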

What I do not have :

1. Properly created PDFs that are searchable.

Any ideas how one might embed OCR'ed text invisibly behind the page ?

I am aware of the amazing tool called gscan2pdf that does what I am seeking
to do here using a Perl library called PDF::API2. That tool, however, makes
temporary copies of all the pnm and other files on the way and requires a
gigantic amount of temporary storage space (think 10G+) for the number of
pages I am trying to put together in a searchable PDF manual (think 1400
pages).

Ken Starks

Sep 7, 2008, 4:42:09 AM

Geico Caveman wrote:
>
> 1. Properly created PDFs that are searchable.
>
> Any ideas how one might embed OCR'ed text invisibly behind the page ?

I believe Acrobat, when it embeds OCR'ed text invisibly, uses
'layers' -- they can be switched on and off.

http://www.acrobatusers.com/forums/ask_an_expert/questions/browse/layers/

I don't know about Ghostscript, which has advanced greatly since
I used it for any serious work, but here are a couple of
alternatives.

1. Solid PDF Creator Plus.
It claims to do searchable OCR embedding.

Not open source, but also not expensive. I have just
got a free copy because I had previously bought a copy of
an ebook from 'turbocash'

http://www.soliddocuments.com/features.htm?product=SolidPDFCreator#CreateScanToPDF

2. Using LaTeX and the 'attachfile' package. With this you can embed any
file into your pdf, and your users can later extract it.

http://www.ctan.org/tex-archive/help/Catalogue/entries/attachfile.html

Cheers,
Ken

Geico Caveman

Sep 7, 2008, 2:17:47 PM

Thanks for your response.

>>
>> Any ideas how one might embed OCR'ed text invisibly behind the page ?
>
> I believe Acrobat, when it embeds OCR'ed text invisibly, uses
> 'layers' -- they can be switched on and off.
>
> http://www.acrobatusers.com/forums/ask_an_expert/questions/browse/layers/

Since PDF is an open spec, this should be documented somewhere (the
technique and the markup and not just the tool).

Acrobat is not an option as we do not use Windows or Mac in this production
environment.

In any case, it's a GUI tool, and a very inefficient option for such large
documents.

>
>
> I don't know about Ghostscript, which has advanced greatly since
> I used it for any serious work, but here are a couple of
> alternatives.
>
> 1. Solid PDF Creator Plus.
> It claims to do searchable OCR embedding.
>
> Not open source, but also not expensive. I have just
> got a free copy because I had previously bought a copy of
> an ebook from 'turbocash'
>
>
>

This again is a Windows tool.

Your response raises questions about whether PDF is an open spec or not. If
PDF has a feature, it should be documented in the standard somewhere. I am
not asking about implementations - I am asking about the precise markup
needed.

If you refer to the original post, I have tried pdfmark with various
promising-sounding options, and they do not do what I am trying to do.

> 2. Using LaTeX and the 'attachfile' package. With this you can embed any
> file into your pdf, and your users can later extract it.
>
> http://www.ctan.org/tex-archive/help/Catalogue/entries/attachfile.html

This is irrelevant. I am looking to create searchable image PDFs, not
extractable ones.

Ken Starks

Sep 8, 2008, 4:34:43 AM

Is PDF an open spec or not?
The best answer is found, as far as I know, at

http://www.adobe.com/devnet/pdf/pdf_reference.html

(Roughly, PDF 1.7 is entirely open and equivalent to ISO 32000.
Adobe's later versions are not, but will be documented
by Adobe. We can hope that any third parties who provide
extensions will also document them.)

Although neither 'layers' nor 'searchable images' are mentioned as
such, I suspect that all the information you need to 'roll your own'
is there somewhere, most likely in the sections about transparency,
transparency groups, and so on.

Good luck,
Ken.

Aandi Inston

Sep 8, 2008, 5:17:19 AM

Geico Caveman <spammers...@spam.invalid> wrote:

>>> Any ideas how one might embed OCR'ed text invisibly behind the page ?

...

>Since PDF is an open spec, this should be documented somewhere (the
>technique and the markup and not just the tool).

PDF is fully documented, but the documentation doesn't say which
implementation choices are made by which tools. That is to say, which
of several possible techniques it chooses to use in a particular bit
of software.

But PDF is not a mark-up language, and to describe the technique as
"mark up" is a bit of an oversimplification.

I understand invisible text from OCR in Acrobat is done by setting the
text rendering mode to 3, see table 5.3 in the PDF Reference. This is
an old technique and predates the use of layers. Other tools may use
different techniques.
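
As a rough illustration of the rendering-mode approach: in a PDF content stream the Tr operator sets the text rendering mode, and mode 3 means the glyphs are neither filled nor stroked. The sketch below is a hypothetical shell helper (the name invisible_text_ops, the font resource name /F1, and the coordinates are assumptions, not anything taken from Acrobat or Ghostscript) that prints the operators for one invisible line of text. It produces only the content-stream fragment; building a valid page around it, with a font resource and the scanned image, is left out.

```shell
#!/bin/sh
# Hypothetical helper: emit PDF content-stream operators that place one line
# of invisible text. Text rendering mode 3 ("3 Tr") means neither fill nor
# stroke, so the text is present for searching but is not painted.
# Arguments: x y font_size text   (x, y in PDF points from the lower-left corner)
invisible_text_ops() {
    x=$1 y=$2 size=$3 text=$4
    # Escape backslashes first, then parentheses, as PDF literal strings require.
    escaped=$(printf '%s' "$text" | sed -e 's/\\/\\\\/g' -e 's/(/\\(/g' -e 's/)/\\)/g')
    printf 'BT\n'                       # begin text object
    printf '/F1 %s Tf\n' "$size"        # /F1 is an assumed font resource name
    printf '3 Tr\n'                     # rendering mode 3: invisible
    printf '%s %s Td\n' "$x" "$y"       # move to the text position
    printf '(%s) Tj\n' "$escaped"       # show the escaped string
    printf 'ET\n'                       # end text object
}

invisible_text_ops 72 720 10 'Chapter 1 (Introduction)'
```

For an OCR'ed page, one such fragment per recognized line (positioned from the OCR tool's coordinates, if it reports them) would be drawn on the same page as the scanned image.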

>>
>Your response raises questions about whether PDF is an open spec or not. If
>PDF has a feature, it should be documented in the standard somewhere.

It is. PDF is now an ISO Standard, the open standard everyone seemed
to want so much. ISO sell ISO 32000 for CHF 370 (about 330 US
dollars).

>If you refer to the original post, I have tried pdfmark with various
>promising-sounding options, and they do not do what I am trying to do.

PDFMark is an implementation detail, and nothing to do with the PDF
standard. I am not sure whether there is any method to make this kind
of invisible text with Distiller or Distiller-like products.
----------------------------------------
Aandi Inston
Please support usenet! Post replies and follow-ups, don't e-mail them.