Removing watermarks from pdfs


Bryan Bishop

Jan 15, 2013, 7:34:33 PM
to science-libe...@googlegroups.com, Bryan Bishop
How about getting rid of those pesky watermarks in pdfs?

As far as I can tell, there are only visible watermarks. Invisible
watermarks can be detected by comparing the same pdf retrieved through
two different gateways (like from two different libraries). I have
checked Nature Publishing Group and Elsevier (specifically
ScienceDirect) and found no checksum differences.
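
A minimal sketch of that comparison, assuming the same article has already been saved twice through two different gateways. The function names and file paths are placeholders for illustration, not part of any existing tool.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large pdfs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_download(path_a, path_b):
    """True if the two downloaded files are byte-for-byte identical."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If `same_download("library_a.pdf", "library_b.pdf")` returns False, the two copies differ somewhere. That is only a hint that per-download marking may be present, since timestamps or regenerated metadata would also change the checksum.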

But there are some culprits out there that do some nasty things to documents:

* lines of text added to the document containing an ip address,
timestamp, university name, etc. (IEEE Xplore)

* entire pages added to documents with tracking information (Wiley? I
can't remember exactly.)

* possibly some might be using CVE-2010-0188 to phone home to
publishers. PDF supports javascript and flash and other terrible
things, so it would be interesting to check if any publishers have
attempted to use these vulnerabilities to their advantage.

* there might be "hidden" information inside a pdf that changes when
you download a document, but so far no evidence of this has been found
(so I don't believe it's likely, but it's worth keeping in mind).

I think it would be useful to work on some ways to remove watermarks
from pdfs. I am aware of two main types of pdfs that publishers
distribute. One is the feared "collection of images", which may or may
not have extra images slapped on with ip address information. The
second is a pdf with actual selectable text. The first type, with just
images everywhere, can be de-watermarked by just drawing images over
the offending text. The second type requires some more creative
thinking, maybe just a collection of regular expressions.

For instance, here's a line that IEEE Xplore once added to a paper
that I was reading:

"Authorized licensed use limited to: University of Getting Schooled.
Downloaded on July 39, 2009 at 15:10 from IEEE Xplore. Restrictions
apply."
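
As a rough sketch of the regular-expression idea, the pattern below matches the IEEE Xplore line quoted above in text that has already been extracted from a pdf. It is written from that single sample only, so the exact wording, the `IEEE_WATERMARK` name, and the `strip_ieee_watermark` helper are assumptions. It does not operate on raw pdf bytes.

```python
import re

# Built from the one sample line above; other papers may vary in wording.
# \s+ tolerates line wrapping, and DOTALL lets .*? span a line break.
IEEE_WATERMARK = re.compile(
    r"Authorized\s+licensed\s+use\s+limited\s+to:.*?"
    r"Downloaded\s+on\s+.*?from\s+IEEE\s+Xplore\.\s+"
    r"Restrictions\s+apply\.",
    re.DOTALL,
)


def strip_ieee_watermark(text):
    """Remove every occurrence of the IEEE Xplore notice from extracted text."""
    return IEEE_WATERMARK.sub("", text)
```

Because the institution name and timestamp are matched lazily between fixed phrases, the same pattern should cover downloads from different universities, as long as the surrounding wording stays the same.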

In fact, you can see this line appearing in other (4,000) papers that
other people have been reading:

http://scholar.google.com/scholar?q=%22Authorized+licensed+use+limited+to%22

Here's another example. AAAS/Science is of particular interest. They
attach an entire front page and add text in the margins everywhere:

"Downloaded from www.sciencemag.org on November 30, 1912"

So I think a good first step would be to collect examples of text
added to documents that need to be detected by any eraser we write. In
fact, maybe all identifying information for an article should be
removed and replaced with an easy-to-copy-down text code (like
"blue-apple-oranges" to refer to a specific document in an index).

Does anyone else have some samples to share of nasty watermarks worth
removing? Also, any favorite ways to manipulate pdfs?

- Bryan
http://heybryan.org/
1 512 203 0507

Nathan McCorkle

Jan 15, 2013, 10:20:39 PM
to science-libe...@googlegroups.com, Bryan Bishop


On Tuesday, January 15, 2013 4:34:33 PM UTC-8, Bryan Bishop wrote:
> I think it would be useful to work on some ways to remove watermarks
> from pdfs.
>
> maybe just a collection of regular expressions.
>
> So I think a good first step would be to collect examples of text
> added to documents that need to be detected by any eraser we write. In
> fact, maybe all identifying information for an article should be
> removed and replaced with an easy-to-copy-down text code (like
> "blue-apple-oranges" to refer to a specific document in an index).

Wouldn't the publishers just follow our discussion here to combat against this?

Bryan Bishop

Jan 15, 2013, 10:22:05 PM
to Nathan McCorkle, Bryan Bishop, science-libe...@googlegroups.com
On Tue, Jan 15, 2013 at 9:20 PM, Nathan McCorkle <nmz...@gmail.com> wrote:
> Wouldn't the publishers just follow our discussion here to combat against
> this?

No, those pdfs are already downloaded.

Bryan Bishop

Jan 15, 2013, 10:23:16 PM
to Nathan McCorkle, Bryan Bishop, science-libe...@googlegroups.com
On Tue, Jan 15, 2013 at 9:20 PM, Nathan McCorkle <nmz...@gmail.com> wrote:
> Wouldn't the publishers just follow our discussion here to combat against
> this?

I thought for a few more moments, and I'd like to add to my reply that
publishers have access to zotero/translators.git as well, and zotero
is able to stay ahead of 100s of publishers with a small team. Pretty
impressive.

Nathan McCorkle

Jan 15, 2013, 10:28:01 PM
to science-libe...@googlegroups.com, Nathan McCorkle, Bryan Bishop


On Tuesday, January 15, 2013 7:23:16 PM UTC-8, Bryan Bishop wrote:
> On Tue, Jan 15, 2013 at 9:20 PM, Nathan McCorkle <nmz...@gmail.com> wrote:
> > Wouldn't the publishers just follow our discussion here to combat against
> > this?
>
> I thought for a few more moments, and I'd like to add to my reply that
> publishers have access to zotero/translators.git as well, and zotero
> is able to stay ahead of 100s of publishers with a small team. Pretty
> impressive.

So are the publishers watching the zotero/translators and causing work for that team?

Bryan Bishop

Jan 15, 2013, 10:40:47 PM
to Nathan McCorkle, Bryan Bishop, science-libe...@googlegroups.com
On Tue, Jan 15, 2013 at 9:28 PM, Nathan McCorkle <nmz...@gmail.com> wrote:
> So are the publishers watching the zotero/translators and causing work for
> that team?

I've asked your question to one of the Zotero devs (Simon Kornblith,
MIT). I think he will be able to make a much more informed statement
than I can on that. I would assume that publishers don't want users
using Zotero's scrapers, but I have been known to be wrong before.

Simon Kornblith

Jan 15, 2013, 11:04:46 PM
to science-libe...@googlegroups.com, Nathan McCorkle, Bryan Bishop

The functionality that Zotero provides isn't substantively different from saving PDFs and typing out citations by hand and falls within the realm of acceptable use. As such, the relationship between Zotero and publishers is hardly adversarial. As far as I am aware, no publisher has ever intentionally broken a Zotero translator, and some publishers have even contributed translators of their own.

Simon

Eugen Leitl

Jan 16, 2013, 9:05:16 AM
to science-libe...@googlegroups.com, kan...@gmail.com
----- Forwarded message from Maxim Kammerer <m...@dee.su> -----

From: Maxim Kammerer <m...@dee.su>
Date: Wed, 16 Jan 2013 15:07:54 +0200
To: liberationtech <liberat...@lists.stanford.edu>
Subject: Re: [liberationtech] Removing watermarks from pdfs
Reply-To: liberationtech <liberat...@lists.stanford.edu>

On Wed, Jan 16, 2013 at 2:43 PM, Eugen Leitl <eu...@leitl.org> wrote:
> For instance, here's a line that IEEE Xplore once added to a paper
> that I was reading:
>
> "Authorized licensed use limited to: University of Getting Schooled.
> Downloaded on July 39, 2009 at 15:10 from IEEE Xplore. Restrictions
> apply."

I have removed such lines in the past via a simple “pdftk uncompress |
sed | pdftk compress” filter. IIRC, file size needs to stay the same.
I guess this approach applies to all added extra text. Added pages can
be removed using pdftk just the same.
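
A minimal Python sketch of the substitution step in that filter. It assumes the pdf has already been uncompressed (for example with pdftk) and that the watermark appears as one literal byte string in the file. The `blank_out` helper is hypothetical, and padding with spaces is one way to keep the file size unchanged, as suggested above.

```python
def blank_out(pdf_bytes, needle):
    """Overwrite every occurrence of `needle` with spaces of equal length.

    Keeping the replacement the same length as the original text leaves
    every byte offset in the file unchanged.
    """
    return pdf_bytes.replace(needle, b" " * len(needle))
```

Real documents may split the watermark across several string fragments or encode it differently, in which case a literal match like this will not find it.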

--
Maxim Kammerer
Liberté Linux: http://dee.su/liberte
--
Unsubscribe, change to digest, or change password at: https://mailman.stanford.edu/mailman/listinfo/liberationtech

----- End forwarded message -----
--
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE

Bryan Bishop

Feb 5, 2013, 5:58:50 AM
to science-libe...@googlegroups.com, Bryan Bishop
On Tue, Jan 15, 2013 at 6:34 PM, Bryan Bishop wrote:
> How about getting rid of those pesky watermarks in pdfs?

Working proof of concept:

https://github.com/kanzure/pdfparanoia
https://pypi.python.org/pypi/pdfparanoia

To install:


or:

sudo pip install pdfparanoia

or:

sudo easy_install pdfparanoia

Right now there's IEEE and AIP support. I need more samples to work with.