Removing watermarks from pdfs (pdfparanoia)

Bryan Bishop

Feb 5, 2013, 3:20:22 PM
to diybio, open-science, chminf-l, Bryan Bishop, liberat...@lists.stanford.edu, Transhuman Tech
How about removing those pesky watermarks from pdfs? Sometimes they completely obfuscate the contents of a paper we're trying to read, or sometimes they have more sinister purposes.

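Working proof of concept:

https://github.com/kanzure/pdfparanoia
https://pypi.python.org/pypi/pdfparanoia

Discussion history:
https://groups.google.com/group/science-liberation-front/t/c68964cf55d8f6fa

People who could theoretically benefit from this:
http://scholar.google.com/scholar?q=%22Authorized+licensed+use+limited+to%22
http://scholar.google.com/scholar?q=%22Redistribution+subject+to+SEG+license+or+copyright%22
http://scholar.google.com/scholar?q=%22Redistribution+subject+to+AIP%22
http://scholar.google.com/scholar?q=%22Downloaded+from+http%3A%2F%2Fpubs.acs.org+on%22
http://scholar.google.com/scholar?q=%22Downloaded+*+*+2001..2013+to+*%22

To get source code:

git clone git://github.com/kanzure/pdfparanoia.git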

To install:

sudo pip install pdfparanoia

or:

sudo easy_install pdfparanoia

Right now there's IEEE and AIP support. I need more samples to work with.

- Bryan
http://heybryan.org/
1 512 203 0507

Bryan Bishop

Feb 5, 2013, 4:15:48 PM
to Peter Murray-Rust, Bryan Bishop, diybio, open-science, chminf-l, liberat...@lists.stanford.edu, Transhuman Tech
On Tue, Feb 5, 2013 at 3:09 PM, Peter Murray-Rust <pm...@cam.ac.uk> wrote:
> PDF2SVG should be able to do this (http://bitbucket.org/petermr/pdf2svg). It should also remove the side annotations about which library the PDF was downloaded from. Send me one and I'll see.

Is there an svg2pdf? The problem with using pdfquery is that it can only generate an xml format. At first it looks like pdfxml, except Adobe came up with a "standard" called pdfxml that looks completely different. So getting things back into pdf seems to be difficult.

Cathal Garvey

Feb 6, 2013, 5:56:32 AM
to diy...@googlegroups.com
Check this one out: https://mat.boum.org/

On 05/02/13 20:20, Bryan Bishop wrote:
> How about removing those pesky watermarks from pdfs? Sometimes they
> completely obfuscate the contents of a paper we're trying to read, or
> sometimes they have more sinister purposes.
>
> Working proof of concept:
>
> https://github.com/kanzure/pdfparanoia
> https://pypi.python.org/pypi/pdfparanoia
>
> Discussion history:
> https://groups.google.com/group/science-liberation-front/t/c68964cf55d8f6fa
>
> People who could theoretically benefit from this:
> http://scholar.google.com/scholar?q=%22Authorized+licensed+use+limited+to%22
> http://scholar.google.com/scholar?q=%22Redistribution+subject+to+SEG+license+or+copyright%22
> http://scholar.google.com/scholar?q=%22Redistribution+subject+to+AIP%22
> http://scholar.google.com/scholar?q=%22Downloaded+from+http%3A%2F%2Fpubs.acs.org+on%22
> http://scholar.google.com/scholar?q=%22Downloaded+*+*+2001..2013+to+*%22
>
> To get source code:
>
> git clone git://github.com/kanzure/pdfparanoia.git
>
> To install:
>
> sudo pip install pdfparanoia
>
> or:
>
> sudo easy_install pdfparanoia
>
> Right now there's IEEE and AIP support. I need more samples to work with.
>
> - Bryan
> http://heybryan.org/
> 1 512 203 0507
>
> --
> -- You received this message because you are subscribed to the Google
> Groups DIYbio group. To post to this group, send email to
> diy...@googlegroups.com. To unsubscribe from this group, send email to
> diybio+un...@googlegroups.com. For more options, visit this group
> at https://groups.google.com/d/forum/diybio?hl=en
> Learn more at www.diybio.org
> ---
> You received this message because you are subscribed to the Google
> Groups "DIYbio" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to diybio+un...@googlegroups.com.
> To post to this group, send email to diy...@googlegroups.com.
> Visit this group at http://groups.google.com/group/diybio?hl=en.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Bjonnh

Feb 6, 2013, 9:29:45 AM
to diy...@googlegroups.com
On Wed, Feb 06, 2013 at 10:56:32AM +0000, Cathal Garvey wrote:
> Check this one out: https://mat.boum.org/
>
Removing metadata is not enough: there are sometimes white-text
watermarks, classic text watermarks (like "downloaded from…"), and
comment watermarks (PDF-format comments, not the ones you see in
Adobe products), which are invisible unless you open the file with
a text/hex editor…
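
A rough sketch of how one might look for that last kind of watermark without a hex editor. In PDF syntax a `%` starts a comment, so this lists comment lines other than the `%PDF-` header and the `%%EOF` marker. The function name and the sample bytes are made up for illustration, and the scan is naive: it only recognises `\n`-terminated lines, it will also report the usual binary marker comment near the top of a file, and a `%` that happens to sit at the start of a line inside a binary stream shows up as a false positive.

```python
import re

# '%' at the start of a line introduces a PDF comment. The lookahead skips
# the "%PDF-x.y" header and the "%%EOF" trailer marker.
COMMENT_LINE = re.compile(rb"^%(?!PDF-|%EOF)[^\r\n]*", re.MULTILINE)


def find_comment_lines(data: bytes):
    """Return every comment line in the raw bytes of a PDF (naive scan)."""
    return [match.group(0) for match in COMMENT_LINE.finditer(data)]


if __name__ == "__main__":
    # Hypothetical file contents; 192.0.2.7 is a documentation-only address.
    sample = (
        b"%PDF-1.4\n"
        b"%downloaded-by 192.0.2.7\n"
        b"1 0 obj\n<< >>\nendobj\n"
        b"%%EOF\n"
    )
    for line in find_comment_lines(sample):
        print(line)
```

Anything this prints still needs a human to judge whether it is a watermark or just a harmless comment left by the producing software.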

I think the best way to find which kind of watermark you have is to
compare two files from two different providers if it's possible.
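
A minimal sketch of that comparison, assuming both copies of the same paper are already loaded as bytes. `differing_regions` is a hypothetical helper name; `difflib` is part of the Python standard library.

```python
import difflib


def differing_regions(a: bytes, b: bytes):
    """Return (tag, a_slice, b_slice) for every region where a and b differ."""
    # autojunk=False stops difflib from ignoring frequently repeated bytes.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Stand-ins for one paper fetched through two different providers.
    copy_a = b"BT (Downloaded by 192.0.2.1) Tj ET\nBT (Abstract) Tj ET\n"
    copy_b = b"BT (Downloaded by 192.0.2.2) Tj ET\nBT (Abstract) Tj ET\n"
    for tag, left, right in differing_regions(copy_a, copy_b):
        print(tag, left, right)
```

`SequenceMatcher` gets slow on multi-megabyte inputs, so for real papers a dedicated binary diff tool is probably more practical; the point is only that whatever differs between the two copies is a watermark candidate.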

Cathal Garvey

Feb 6, 2013, 1:12:44 PM
to diy...@googlegroups.com
Very true; MAT only specialises in finding binary metadata like what
software made the file, etc.: to remove "text" metadata like embedded
IPs, identifying front-pages, etc, you'd need to profile what if
anything is done by a particular publisher to their PDFs, and have a
tool that removes this data specially.

For example, to remove a frontpage, you might need to "explode" the PDF
into images, discard the first image, and recompress into a new PDF.

To remove text/images embedded on the bottom of each PDF page, you could
do the same except use imagemagick on each image before recompression.

Major disadvantage to this route is that it would convert a text +
images PDF (high compression ratio, easy to extract text for re-use)
into an images-only PDF (large file size, poor compression, impossible
to extract text without OCR).

If you can extract text of course, you could try extracting text +
images and perhaps script the creation of an entirely new PDF file. This
is the opposite approach; instead of blacklisting content ("This bit
contains IP address info"), you're whitelisting content ("These bits are
the text and images that form the actual paper").

Bjonnh

Feb 6, 2013, 1:15:48 PM
to diy...@googlegroups.com
On Wed, Feb 06, 2013 at 06:12:44PM +0000, Cathal Garvey wrote:
> Very true; MAT only specialises in finding binary metadata like what
> software made the file, etc.: to remove "text" metadata like embedded
> IPs, identifying front-pages, etc, you'd need to profile what if
> anything is done by a particular publisher to their PDFs, and have a
> tool that removes this data specially.
>
> For example, to remove a frontpage, you might need to "explode" the PDF
> into images, discard the first image, and recompress into a new PDF.
With software like ghostscript, pdftext and many others, you can just
remove the page without any conversion.
>
> To remove text/images embedded on the bottom of each PDF page, you could
> do the same except use imagemagick on each image before
> recompression.
You can also do a conversion to postscript and use a script to remove the
nasty part.
>
> Major disadvantage to this route is that it would convert a text +
> images PDF (high compression ratio, easy to extract text for re-use)
> into an images-only PDF (large file size, poor compression, impossible
> to extract text without OCR).
For sure!
>
> If you can extract text of course, you could try extracting text +
> images and perhaps script the creation of an entirely new PDF file. This
> is the opposite approach; instead of blacklisting content ("This bit
> contains IP address info"), you're whitelisting content ("These bits are
> the text and images that form the actual paper").
Hmmm, interesting idea! But some publishing software (LaTeX works well
for this) does strange things with sentence positions and stuff. So
you couldn't get a good text-flow without manual intervention. Maybe
the best way to do this would be to use the HTML version, but they
exist only for recent publications (and old pdf ones are pure bitmap
with sometimes an OCRed text overlay).

leaking pen

Feb 6, 2013, 1:23:54 PM
to diy...@googlegroups.com
sinister purposes?


Cathal Garvey

Feb 6, 2013, 1:24:04 PM
to diy...@googlegroups.com
> Hmmm, interesting idea! But some publishing software (LaTeX works well
> for this) does strange things with sentence positions and stuff. So
> you couldn't get a good text-flow without manual intervention. Maybe
> the best way to do this would be to use the HTML version, but they
> exist only for recent publications (and old pdf ones are pure bitmap
> with sometimes an OCRed text overlay).

Using HTML version is something I hadn't even considered, excellent
idea. Far more malleable than PDF!

WRT old bitmapped PDFs, there's less to lose by converting to images and
re-compressing after imagemagick.

One reason I suggested exploding/recompressing is that by doing so, you
will naturally destroy lots of metadata that you might not have realised
was there, otherwise. If you directly edit the file in a format with
compatible metadata, like postscript (is it compatible?), then the tools
might blindly copy metadata back and forth if you don't know to say "No,
delete that."; whereas the 'stupid' way, of just bitmapping, applying a
blind if necessary, and recompressing, gives you an apparently "brand
new" PDF consisting only of dumb images.

It's bloated and ugly, but it's only going to have the sort of watermark
that you can see with your naked eye; very easy to see if something is
slipping through your net!

Bjonnh

Feb 6, 2013, 1:31:17 PM
to diy...@googlegroups.com
> It's bloated and ugly, but it's only going to have the sort of watermark
> that you can see with your naked eye; very easy to see if something is
> slipping through your net!
Not exactly; take a look at steganography techniques like this one: https://en.wikipedia.org/wiki/Steganography#Digital

Cathal Garvey

Feb 6, 2013, 1:54:58 PM
to diy...@googlegroups.com
Ah. If one is worried about ultra-sneaky steganographic watermarks, one
could always use the same steganographic techniques to "hide" /dev/urandom
in each image? Encoding white noise over all the "spare" bits in the image?

Of course, converting from bitmap to lossy-compressed jpeg would also
probably eliminate steganographic watermarking.
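
A sketch of the "white noise over the spare bits" idea, assuming the image is already available as raw, uncompressed 8-bit samples. `randomize_low_bits` is a made-up name, and `os.urandom` stands in for reading /dev/urandom directly. This would only disturb marks hidden in the least-significant bits; a more robust embedding scheme could survive it.

```python
import os


def randomize_low_bits(pixels: bytes, bits: int = 1) -> bytes:
    """Overwrite the lowest `bits` bits of every sample with random noise."""
    if not 1 <= bits <= 7:
        raise ValueError("bits must be between 1 and 7")
    low_mask = (1 << bits) - 1
    noise = os.urandom(len(pixels))
    # Keep the high bits of each sample, take the low bits from the noise.
    return bytes(
        (p & ~low_mask & 0xFF) | (n & low_mask)
        for p, n in zip(pixels, noise)
    )


if __name__ == "__main__":
    samples = bytes(range(16))  # stand-in for real pixel data
    print(randomize_low_bits(samples, bits=2))
```

With `bits=1` the change is at most one intensity level per sample, which should be invisible to the eye on most scanned pages.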

Cathal Garvey

Feb 6, 2013, 1:55:58 PM
to diy...@googlegroups.com
Bryan linked to the discussion history, which is on Science Liberation
Front's mailing list. The "sinister purpose" is republication of
scientific literature accessed through academic portals and the like,
with the removal of identifying metadata so the donating scientists
don't get in trouble.

On 06/02/13 18:23, leaking pen wrote:
> sinister purposes?

Bryan Bishop

Feb 7, 2013, 3:21:28 AM
to diy...@googlegroups.com, Bjonnh, Bryan Bishop
On Wed, Feb 6, 2013 at 8:29 AM, Bjonnh wrote:
> I think the best way to find which kind of watermark you have is to
> compare two files from two different providers if it's possible.

Everyone has been theorizing that there might be steganographic
watermarks, but so far I haven't found any evidence of a publisher
employing this level of sophistication in their pdf/http servers. I
have compared downloads from multiple sources through binary diffs and
checksums. So far I haven't seen any evidence of this. I would be very
interested in being notified in the event that anyone discovers a
steganographic watermark in any academic papers.
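
The checksum half of that comparison could look roughly like this. It is an illustration, not the script actually used; `group_by_checksum` and the source labels are invented. Downloads that land in the same group are byte-identical, so they cannot carry a per-download mark of any kind, hidden or visible.

```python
import hashlib
from collections import defaultdict


def group_by_checksum(downloads):
    """Map SHA-256 hex digest -> list of source labels with identical bytes.

    `downloads` is a dict of {source label: file contents as bytes}.
    """
    groups = defaultdict(list)
    for source, data in downloads.items():
        groups[hashlib.sha256(data).hexdigest()].append(source)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical copies of one paper fetched through three routes.
    copies = {
        "library-proxy": b"%PDF-1.4 same bytes",
        "home-connection": b"%PDF-1.4 same bytes",
        "other-campus": b"%PDF-1.4 different bytes",
    }
    for digest, sources in group_by_checksum(copies).items():
        print(digest[:12], sources)
```

Copies that end up in different groups are the ones worth running through a binary diff.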

Bryan Bishop

Feb 7, 2013, 3:26:09 AM
to diybio, Cathal Garvey, Bryan Bishop
On Wed, Feb 6, 2013 at 12:12 PM, Cathal Garvey wrote:
> For example, to remove a frontpage, you might need to "explode" the PDF
> into images, discard the first image, and recompress into a new PDF.

I don't recommend this method, because converting most pdfs into
images will cause loss of text. You can delete entire pages in the pdf
format by deleting the "stream" objects and modifying the xref table.

> To remove text/images embedded on the bottom of each PDF page, you could
> do the same except use imagemagick on each image before recompression.

Most text in a pdf document is "semantic", surrounded by pdf markup
that can be directly deleted. I can imagine there might be one or two
cases where publishers are adding an image to a pdf with your ip
address, in which case you can delete that single image. However, if
the page content is an image itself (no selectable text), then they
might have chosen to add the image into the page, in which case the
only way to remove the watermark would be to use imagemagick as you
say, and draw over the offending image. So far I haven't seen this yet
in any of the documents I have read over the years.
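
To illustrate what deleting the markup around a text watermark can mean in practice, here is a naive sketch that works on an already-uncompressed page content stream. It is not how pdfparanoia is implemented. In a content stream, `BT` and `ET` bracket one text object, so the sketch drops every `BT ... ET` block containing a marker phrase. The function name and sample stream are invented. Real streams are usually Flate-compressed, and after editing them the stream `/Length` and the xref offsets need fixing up, none of which is handled here. The pattern can also be fooled by a literal "ET" inside the text itself.

```python
import re

# BT ... ET brackets one text object in a PDF content stream.
TEXT_OBJECT = re.compile(rb"\bBT\b.*?\bET\b[ \t]*\r?\n?", re.DOTALL)


def drop_text_objects(stream: bytes, marker: bytes) -> bytes:
    """Remove every BT...ET text object whose body contains `marker`."""

    def keep_or_drop(match):
        block = match.group(0)
        return b"" if marker in block else block

    return TEXT_OBJECT.sub(keep_or_drop, stream)


if __name__ == "__main__":
    # Hypothetical content stream: one watermark line, one line of the paper.
    page = (
        b"BT /F1 9 Tf (Downloaded by 192.0.2.7) Tj ET\n"
        b"BT /F1 12 Tf (Abstract) Tj ET\n"
    )
    print(drop_text_objects(page, b"Downloaded by"))
```

The selectable text of the paper itself is left untouched, which is the advantage over rasterising the page.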

> Major disadvantage to this route is that it would convert a text +
> images PDF (high compression ratio, easy to extract text for re-use)
> into an images-only PDF (large file size, poor compression, impossible
> to extract text without OCR).

right..

> If you can extract text of course, you could try extracting text +
> images and perhaps script the creation of an entirely new PDF file. This
> is the opposite approach; instead of blacklisting content ("This bit
> contains IP address info"), you're whitelisting content ("These bits are
> the text and images that form the actual paper").

How would you whitelist content you've never seen before?

Bryan Bishop

Feb 7, 2013, 3:30:18 AM
to diy...@googlegroups.com, Cathal Garvey, Bryan Bishop
On Wed, Feb 6, 2013 at 12:24 PM, Cathal Garvey wrote:
> One reason I suggested exploding/recompressing is that by doing so, you
> will naturally destroy lots of metadata that you might not have realised
> was there, otherwise.

One of the advantages of using pdfparanoia is that you can directly
remove watermarks based on what we know about what publishers are
doing, instead of blindly guessing. If there is metadata about ip
addresses, write a plugin for pdfparanoia to detect it and remove it.
(Also write a unit test, so that future contributors can make sure
your code doesn't break). So far, I haven't seen evidence of metadata
being used like this. Really, they are all extremely simple pdf servers like
itext that are serving up http requests for unsuspecting scholars. My
guess is that the most "advanced" watermarking infrastructure is just
some LaTeX template that is being applied for each incoming http
request.
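
As a sketch of the "write a plugin, and a unit test with it" workflow: the class below is hypothetical and does not use pdfparanoia's real plugin interface, which should be taken from the project's source. It blanks anything that looks like an IPv4 address, padding with spaces so the byte length (and therefore every offset elsewhere in the file) stays the same. It only sees uncompressed bytes, and it would also blank dotted numbers that merely look like addresses.

```python
import re
import unittest

IPV4 = re.compile(rb"\b(?:\d{1,3}\.){3}\d{1,3}\b")


class IPAddressScrubber:
    """Hypothetical plugin: blanks anything that looks like an IPv4 address."""

    @staticmethod
    def scrub(content: bytes) -> bytes:
        # Same-length padding keeps byte offsets elsewhere in the file valid.
        return IPV4.sub(lambda m: b" " * len(m.group(0)), content)


class IPAddressScrubberTest(unittest.TestCase):
    def test_address_is_blanked(self):
        before = b"(Licensed to 192.0.2.7) Tj"
        after = IPAddressScrubber.scrub(before)
        self.assertNotIn(b"192.0.2.7", after)
        self.assertEqual(len(after), len(before))

    def test_other_text_is_untouched(self):
        text = b"(Abstract) Tj"
        self.assertEqual(IPAddressScrubber.scrub(text), text)
```

Saved to a file, the tests run with `python -m unittest <module name>`; a future contributor who changes the pattern finds out straight away if the scrubber stops working.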

Bryan Bishop

Feb 7, 2013, 3:32:07 AM
to diy...@googlegroups.com, Cathal Garvey, Bryan Bishop
On Wed, Feb 6, 2013 at 12:55 PM, Cathal Garvey wrote:
> Bryan linked to the discussion history, which is on Science Liberation
> Front's mailing list. The "sinister purpose" is republication of
> scientific literature accessed through academic portals and the like,
> with the removal of identifying metadata so the donating scientists
> don't get in trouble.

Actually, the "sinister purposes" that I was referring to were
different. In context, what I said was:

"How about removing those pesky watermarks from pdfs? Sometimes they
completely obfuscate the contents of a paper we're trying to read, or
sometimes they have more sinister purposes."

The "more sinister purposes" are things like... tracking who reads
science. That's sinister.