Thanks Jonathan. I wrote some similar scripts last year for Nature and
ScienceDirect and would be glad to share them. I also threw up a
repository on my GitHub account for something called "pyscholar",
which is an attempt at using Zotero scrapers in Python via XPaths
and BeautifulSoup. Anyway, when I wrote the Nature scraper, I made
this terrible mistake: I just downloaded all of the papers and none of
the metadata. Consequently I now have over 122,000 PDF files lying
around with nothing but a title in each filename: no information about
authors, no abstract, no DOI, and so on. So, don't repeat my mistake
and do things right. Something like a folder per journal, and a folder
per issue, and then either lots of symlinks or lots of generated TOCs,
would do the trick.
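To make that concrete, here is a rough sketch of the kind of layout I
mean, in Python (purely illustrative: it drops a JSON metadata record
next to each PDF instead of symlinks or TOCs, and the function and
field names are made up):

import json
from pathlib import Path

def store_article(root, journal, issue, pdf_bytes, meta):
    # meta: whatever the scraper found (title, authors, doi, abstract, ...)
    issue_dir = Path(root) / journal / issue
    issue_dir.mkdir(parents=True, exist_ok=True)
    stem = meta["doi"].replace("/", "_") if meta.get("doi") else meta["title"][:80]
    (issue_dir / (stem + ".pdf")).write_bytes(pdf_bytes)    # the paper itself
    (issue_dir / (stem + ".json")).write_text(json.dumps(meta, indent=2))  # its metadata

so a call would look like
store_article("archive", "nature", "vol460_issue7252", pdf_bytes, metadata).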
Good luck.
- Bryan
http://heybryan.org/
1 512 203 0507
Anyway, on a more technical note: regarding your archive of
metadata-less PDFs, NCBI provides an API to PubMed (the E-utilities),
so it may not be that difficult to retrieve metadata as long as you
have the titles of the articles.
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
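For example, something like this (an untested sketch: it assumes the
current eutils endpoints and JSON output, which may differ a bit from
what that help page describes, and the title string is a placeholder):

import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def pubmed_metadata_for_title(title):
    # esearch: find PMIDs whose title matches
    q = urllib.parse.urlencode({"db": "pubmed", "term": title + "[Title]",
                                "retmode": "json"})
    with urllib.request.urlopen(EUTILS + "/esearch.fcgi?" + q) as resp:
        ids = json.load(resp)["esearchresult"]["idlist"]
    if not ids:
        return None
    # esummary: pull the docsum (authors, journal, DOI, ...) for the first hit
    q = urllib.parse.urlencode({"db": "pubmed", "id": ids[0], "retmode": "json"})
    with urllib.request.urlopen(EUTILS + "/esummary.fcgi?" + q) as resp:
        return json.load(resp)["result"][ids[0]]

print(pubmed_metadata_for_title("exact article title goes here"))

Keep in mind PubMed mostly covers the biomedical literature, so this
won't help for everything in a Nature/ScienceDirect dump.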
-Cory
That sounds totally ridiculous, ludicrous and stupid. Citation needed :-).
Jonathan's method is essentially the same as mine. By "right" I mean,
just make sure you spend some extra effort when you're writing the
scripts to put the right data where it should go. Figure out some
xpaths to extract common information from each page.
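Something along these lines, for instance (the meta-tag names below are
just the common Highwire/Google Scholar citation_* tags; each
publisher's pages will need their own expressions):

from lxml import html

def extract_metadata(page_source):
    tree = html.fromstring(page_source)
    return {
        "title":   tree.xpath('string(//meta[@name="citation_title"]/@content)'),
        "doi":     tree.xpath('string(//meta[@name="citation_doi"]/@content)'),
        "authors": tree.xpath('//meta[@name="citation_author"]/@content'),
    }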
> assuming you need to be on a network/proxy with a subscription, right? There
Not necessarily a proxy. But yes, a network.
> Could you explain how to use them (scripts) a bit more, and the concept
You need to run them through the perl interpreter on the command line.
$ perl blah.pl
> behind it? I am tired and waiting to get on a 24 hour flight right now, so I
The concept is just a web scraper or "spider". The script downloads
web pages and then parses the text to extract various links and other
pieces of data.
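A bare-bones version in Python, just to show the shape of it (the real
scripts are perl and do quite a bit more):

import urllib.request
from bs4 import BeautifulSoup

def crawl(url):
    page = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(page, "html.parser")
    links = [a["href"] for a in soup.find_all("a", href=True)]  # follow these next
    title = soup.title.string if soup.title else None           # scraped data
    return title, links

title, links = crawl("http://www.nature.com/")  # any journal index page works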
The best I can do without forwarding confidential emails is direct you
to this website:
http://www.journalprices.com/
Since many states have laws saying that public institutions have to
make details of their contracts available, the creators of this site
have requested some contracts and published various stats based on
these data. If you search for "Nature" it will show that the average
price paid per article is $14.63. That sounds pretty steep, until you
see the average price for, say, Nature Materials - $54.51. Ouch!
Keep in mind, though, that this is only an average across 36 universities.
Some universities are probably better at bargaining than others.
I recently approached my librarian about this, trying to figure out
the criteria for which journals they were cutting (due to the sagging
endowment) and was told that they were simply lowering the
price-per-click threshold. Any journals with per-click prices above
this threshold were being cut. I can't give out exact numbers, but I
will say the numbers shown on journalprices.com are fairly typical.
Also, if you want to request contract information from your
university, use this "State Open Records Law Request Letter Generator"
http://www.splc.org/foiletter.asp
-Cory
Yes, Tom, but I was asking about the pay-per-article contracts. Do
these actually exist? Can anyone show these to me?
>
> On Fri, Aug 28, 2009 at 3:05 PM, Bryan Bishop<kan...@gmail.com>
> wrote:
>> That sounds totally ridiculous, ludicrous and stupid. Citation
>> needed :-).
>
> The best I can do without forwarding confidential emails is direct you
> to this website:
> http://www.journalprices.com/
I'm guessing (but don't know) that this is the price derived by taking
the contract price and dividing by the number of downloads.
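To illustrate with made-up numbers: a $100,000/year contract against
roughly 6,835 full-text downloads would work out to about $14.63 per
download.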