Re: [getarticles:150] My scripts to download all of Nature journals' PDF's

Bryan Bishop

Aug 28, 2009, 3:57:12 PM
to getar...@googlegroups.com, kan...@gmail.com, papers, diybio
On Fri, Aug 28, 2009 at 2:51 PM, JonathanCline wrote:
> The following scripts use minimal unix programs so they can be run natively
> on linux/freebsd/mac & can be run on windows if cygwin is installed.
> They are simple and of course could be made into super duper GUI
> whatever, however they're designed to run on any internet machine with
> minimal other software installed, to re-start simply if they are
> stopped, and to be simple to change (because journals change web
> formatting arbitrarily).  Simple is better.  The design of the scripts
> follows a batch approach.  (1) downloads all the issue tables of
> contents, in chronological (or reverse) order, starting from a first
> issue URL -- this takes some time however isn't normally subject to
> bandwidth limits;  (2) after (1) finishes, parses the tables of
> contents locally to extract all the papers' URLs, then (3) slowly
> downloads each paper's PDF URL (sometimes a supplementary URL too).  Some
> slight enhancement could be added so that it's possible to run the
> scripts periodically (e.g. weekly) to automatically download new
> issues as they become available; I haven't needed to do that yet.
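The three batch steps above can be sketched roughly like this in Python. This is not the actual scripts: the PDF-link regex, the `toc/` and `pdf/` directory names, and the 30-second delay are all placeholder assumptions, and step (1) is assumed to have already saved the TOC pages to disk.

```python
import re
import time
import urllib.request
from pathlib import Path

TOC_DIR = Path("toc")   # step (1) saves each issue's TOC html here (not shown)
PDF_DIR = Path("pdf")   # step (3) writes papers here

def fetch(url):
    """Fetch one URL; both the TOC pass and the PDF pass use this."""
    with urllib.request.urlopen(url) as r:
        return r.read()

def extract_pdf_links(toc_html):
    """Step (2): pull paper PDF URLs out of a saved TOC page.
    The href pattern is a guess; journals change their markup
    arbitrarily, which is why this stays one easily-edited regex."""
    return re.findall(r'href="([^"]+\.pdf)"', toc_html)

def download_all(base_url):
    """Step (3): fetch each PDF slowly, skipping files already on
    disk so an interrupted run can simply be re-started."""
    PDF_DIR.mkdir(exist_ok=True)
    for toc in sorted(TOC_DIR.glob("*.html")):
        for link in extract_pdf_links(toc.read_text()):
            out = PDF_DIR / link.rsplit("/", 1)[-1]
            if out.exists():
                continue              # restartable: keep what we have
            out.write_bytes(fetch(base_url + link))
            time.sleep(30)            # slow on purpose, per the design
```

Keeping each step a separate pass over files on disk is what makes the pipeline restartable: killing it mid-run loses at most one download.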

Thanks Jonathan. I wrote some similar scripts last year for nature and
sciencedirect and would be glad to share them. I also threw up a
repository on my github account page for something called "pyscholar",
which is an attempt at using zotero scrapers in python by using xpaths
and beautifulsoup. Anyway, when I wrote the nature scraper, I made
this terrible mistake: I just downloaded all of the papers and none of
the metadata. Consequently I now have over 122,000 PDF files lying
around with only a title given to each PDF file: no information about
authors, no abstract, no DOI, etc. So, don't repeat my mistake
and do things right. Something like a folder per journal, and a folder
per issue, and then either lots of symlinks or lots of generated TOCs,
would do the trick.
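One minimal way to do it right, as a sketch: write a small metadata file next to each PDF inside the journal/issue tree. The field names and example values below are illustrative assumptions, not a required schema.

```python
import json
from pathlib import Path

def save_paper(root, journal, issue, pdf_name, pdf_bytes, meta):
    """Store a PDF under root/journal/issue/ and write the scraped
    metadata (title, authors, DOI, abstract, ...) beside it, so the
    archive is never just a pile of bare PDFs."""
    issue_dir = Path(root) / journal / issue
    issue_dir.mkdir(parents=True, exist_ok=True)
    (issue_dir / pdf_name).write_bytes(pdf_bytes)
    meta_path = issue_dir / (pdf_name + ".json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path

# usage (all values hypothetical)
p = save_paper("archive", "nature", "v460n7252", "paper1.pdf", b"%PDF-...",
               {"title": "Example", "doi": "10.1038/example",
                "authors": ["A. Author"]})
```

Generated per-issue TOCs or symlinks can then be rebuilt from the JSON files at any time, since all the structure lives on disk.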

Good luck.

- Bryan
http://heybryan.org/
1 512 203 0507

Nathan McCorkle

Aug 28, 2009, 4:40:16 PM
to diy...@googlegroups.com

Whoa, whoa, whoa... So what is the "right" way to download this? And I'm assuming you need to be on a network/proxy with a subscription, right? There have been some Nature sub-journals that I have wanted in the past, but my school doesn't have a subscription to them, and I may have been too lazy to request them from inter-library loan...

Could you explain how to use the scripts a bit more, and the concept behind them? I am tired and waiting to get on a 24-hour flight right now, so I'm not at my best wits; maybe you already explained this.

--
Nathan McCorkle
Rochester Institute of Technology
College of Science, Biotechnology/Bioinformatics

Cory Tobin

Aug 28, 2009, 6:02:28 PM
to diy...@googlegroups.com
Assuming you have access to these journals via your university, you
may want to check with your librarian regarding the type of contract
they have with the publishers. While some universities pay a flat
rate for unlimited access to the journal, other universities have
"pay per click" contracts. If your university is paying, say, 10
bucks per Nature article, downloading every article in the archive
could end up costing your university a ton of cash.

Anyway, on a more technical note, regarding your archive of
metadata-less PDFs, NCBI provides an API to PubMed, so it may not
be that difficult to retrieve metadata, assuming you have the titles
of the articles.
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
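A sketch of that lookup using the E-utilities ESearch endpoint: search PubMed's Title field for the article title to recover a PMID, then fetch full metadata for that PMID with EFetch (not shown). The regex-based XML handling here is a shortcut for brevity; a real client would use a proper XML parser.

```python
import re
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(title):
    """Build an ESearch query that looks a title up in PubMed's
    [Title] field; fetching it returns XML containing PMIDs."""
    term = urllib.parse.quote('"%s"[Title]' % title)
    return "%s/esearch.fcgi?db=pubmed&term=%s" % (EUTILS, term)

def pmids_from_xml(xml):
    """Pull the <Id> elements out of an ESearch result."""
    return re.findall(r"<Id>(\d+)</Id>", xml)

def lookup(title):
    """Network step: fetch the ESearch URL, return candidate PMIDs."""
    with urllib.request.urlopen(esearch_url(title)) as r:
        return pmids_from_xml(r.read().decode())
```

Titles extracted from filenames won't always match exactly, so expect some queries to return zero or several PMIDs and plan for manual review of those.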

-Cory

Bryan Bishop

Aug 28, 2009, 6:05:38 PM
to diybio, kan...@gmail.com
On Fri, Aug 28, 2009 at 5:02 PM, Cory Tobin wrote:
> Assuming you have access to these journals via your university, you
> may want to check with your librarian regarding the type of contract
> they have with the publishers.  While some universities pay a flat
> rate for unlimited access to the journal, other universities have
> "pay per click" contracts.  If your university is paying, say, 10
> bucks per Nature article, downloading every article in the archive
> could end up costing your university a ton of cash.

That sounds totally ridiculous, ludicrous and stupid. Citation needed :-).

Bryan Bishop

Aug 28, 2009, 7:40:10 PM
to diy...@googlegroups.com, kan...@gmail.com
On Fri, Aug 28, 2009 at 3:40 PM, Nathan McCorkle wrote:

Jonathan's method is essentially the same as mine. By "right" I mean,
just make sure you spend some extra effort when you're writing the
scripts to put the right data where it should go. Figure out some
xpaths to extract common information from each page.
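For the "common information" part, many journal pages embed Highwire-style `citation_*` meta tags in their `<head>`, which are easier to target than ad-hoc xpaths. A stdlib-only sketch of that idea (whether a given journal emits these tags is an assumption to verify per site):

```python
from html.parser import HTMLParser

class CitationMeta(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags, the
    machine-readable metadata many journal pages put in <head>."""
    def __init__(self):
        super().__init__()
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            name = a.get("name", "")
            if name.startswith("citation_"):
                # authors repeat, so every field is kept as a list
                self.meta.setdefault(name, []).append(a.get("content", ""))

def scrape_citation_meta(html):
    p = CitationMeta()
    p.feed(html)
    return p.meta
```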

> assuming you need to be on a network/proxy with a subscription, right? There

Not necessarily a proxy. But yes, a network.

> Could you explain how to use them (scripts) a bit more, and the concept

You need to run them through the perl interpreter on the command line.

$ perl blah.pl

> behind it? I am tired and waiting to get on a 24 hour flight right now, so I

The concept is just a web scraper or "spider". The script downloads
web pages and then parses the text to extract various links and other
pieces of data.
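One spider iteration, as a minimal stdlib sketch: fetch a page, collect every link on it, and hand the links back to the caller to filter and follow. (This is the generic pattern, not Jonathan's perl.)

```python
from html.parser import HTMLParser
import urllib.request

class LinkCollector(HTMLParser):
    """The parsing half of a spider: record every href on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def spider_step(url):
    """One iteration: download a page, return the links to follow."""
    with urllib.request.urlopen(url) as r:
        html = r.read().decode("utf-8", "replace")
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```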

Tom Knight

Aug 28, 2009, 8:18:02 PM
to diy...@googlegroups.com, Tom Knight, kan...@gmail.com
Almost all contracts with university libraries prohibit mass
downloading of journal articles. They also prohibit making them
available to others, except as a personal favor or in a professional
relationship, a few at a time. Your librarian will almost certainly
hear from Nature if you run this script. I don't like this, of
course, and don't recommend publishing in these journals, but you
should at least be aware that those rules are in effect (even if you
choose to ignore them).

Cory Tobin

Aug 28, 2009, 8:26:48 PM
to diy...@googlegroups.com
On Fri, Aug 28, 2009 at 3:05 PM, Bryan Bishop<kan...@gmail.com> wrote:
> That sounds totally ridiculous, ludicrous and stupid. Citation needed :-).

The best I can do without forwarding confidential emails is direct you
to this website:
http://www.journalprices.com/

Since many states have laws saying that public institutions have to
make details of their contracts available, the creators of this site
have requested some contracts and published various stats based on
these data. If you search for "Nature" it will show that the average
price paid per article is $14.63. That sounds pretty steep, until you
see the average price for, say, Nature Materials - $54.51. Ouch!
Although, keep in mind this is only an average from 36 universities.
Some universities are probably better at bargaining than others.

I recently approached my librarian about this, trying to figure out
the criteria for which journals they were cutting (due to the sagging
endowment) and was told that they were simply lowering the
price-per-click threshold. Any journals with per-click prices above
this threshold were being cut. I can't give out exact numbers, but I
will say the numbers shown on journalprices.com are fairly typical.

Also, if you want to request contract information from your
university, use this "State Open Records Law Request Letter Generator"
http://www.splc.org/foiletter.asp

-Cory

Bryan Bishop

Aug 28, 2009, 8:30:19 PM
to Tom Knight, kan...@gmail.com, diy...@googlegroups.com
On Fri, Aug 28, 2009 at 7:18 PM, Tom Knight wrote:
> Almost all contracts with university libraries prohibit mass downloading of

Yes, Tom, but I was asking about the pay-per-article contracts. Do
these actually exist? Can anyone show these to me?

Tom Knight

Aug 28, 2009, 11:13:28 PM
to diy...@googlegroups.com, Tom Knight

On Aug 28, 2009, at 8:26 PM, Cory Tobin wrote:

>
> On Fri, Aug 28, 2009 at 3:05 PM, Bryan Bishop<kan...@gmail.com>
> wrote:
>> That sounds totally ridiculous, ludicrous and stupid. Citation
>> needed :-).
>
> The best I can do without forwarding confidential emails is direct you
> to this website:
> http://www.journalprices.com/

I'm guessing (but don't know) that this is the price derived by taking
the contract price and dividing by the number of downloads.
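If that guess is right, the published figure is just a ratio; the numbers below are made up purely to show how a $14.63-per-article figure could arise.

```python
def price_per_article(contract_price, downloads):
    """journalprices.com-style figure, assuming it is simply the
    annual contract price divided by the number of downloads."""
    return contract_price / downloads

# e.g. a hypothetical $29,260 contract with 2,000 downloads
cost = price_per_article(29260, 2000)   # 14.63 per article
```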
