I wanted to float an idea for a new project to gauge interest.
I'm back on the academic circuit these days and have discovered
something that particularly irks me. When writing for an academic
publication (or just reading for personal interest) you typically want
to maintain a database of the things you've read in order to reference
them later. Surprisingly there appears to be no global, free database
of such references with a useful API. As a result, I and all the
researchers I've spoken to have a workflow that goes something like
this:
1. Find interesting paper.
2. Grab a bibtex/endnote reference from ACM/IEEE/CiteSeer/Google Scholar.
3. Spend 10 minutes fixing all the typos and incomplete information
in the reference.
There's no getting around it: all these sources have crap data. The
main reason for this is that the data comes from scripts/bots that
index the paper itself and attempt to automatically pull out this
information. On top of that, I haven't found nice APIs for any of
these either. (There is more to say here about things like Zotero,
Mendeley, Papers, etc., but I'm keeping this brief.)
What I want to propose is building a reference database where all
information is submitted by the paper authors, conference organisers
or some other "trusted" source. Some things that would be cool:
- Free. Totally free.
- REST API.
- Federation with other sources, so you could host a local instance
that would have other trusted instances it could automatically
synchronise with.
- Automatic bibtex and endnote generation.
- Support for layering. E.g. if I reference an upstream database
record, I also want to add a field for recording a local path to the
PDF of the paper on my hard drive, but this information is obviously
useless to others and shouldn't be pushed back to the server.
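To make the layering and bibtex bullets a bit more concrete, here is a rough Python sketch. It assumes a record is just a dictionary of fields and that local-only fields are marked with a "local_" prefix. Both of those are my assumptions for illustration, as are the function names and the sample record (which is placeholder data, not a real paper). None of this reflects an existing service or a settled design.

```python
# Hypothetical sketch: the field names, the "local_" prefix convention and
# the function names are assumptions made for illustration only.

LOCAL_PREFIX = "local_"


def layer(upstream, local):
    """Return a copy of the upstream record with local fields layered on top."""
    merged = dict(upstream)
    merged.update(local)
    return merged


def strip_local(record):
    """Drop local-only fields so they are never pushed back to a server."""
    return {k: v for k, v in record.items() if not k.startswith(LOCAL_PREFIX)}


def to_bibtex(record):
    """Render a record as a bibtex entry, leaving out local-only fields."""
    fields = strip_local(record)
    entry_type = fields.pop("type")
    key = fields.pop("key")
    body = ",\n".join(
        f"  {name} = {{{value}}}" for name, value in sorted(fields.items())
    )
    return f"@{entry_type}{{{key},\n{body}\n}}"


# Placeholder data, not a real reference.
upstream = {
    "type": "article",
    "key": "doe2011example",
    "author": "Jane Doe",
    "title": "An Example Paper",
    "year": "2011",
}
local = {"local_pdf": "/home/me/papers/doe2011example.pdf"}

record = layer(upstream, local)
print(to_bibtex(record))
```

The point of the prefix convention is that a client can decide what is safe to synchronise without knowing about every local field in advance. A real design might well keep local data in a separate store instead.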
I like this as a project for CPB for several reasons. It is something
I would actively use every single day. This project is easily broken
down into a lot of small tasks at a granularity where you could start
and finish a single task over a weekend. The project has lots of room
to move and many directions we could extend it in.
Please let me know what you think. I'd love to get some traction on
this one, as it would be a godsend for me and other researchers.
Thanks,
Matt
That scraping site is definitely something worth keeping an eye on.
Interestingly they don't seem to take a strong stance on legal issues:
https://scraperwiki.com/docs/python/faq/#data_types. I imagine many
sites that have an explicit policy against scraping would be more than
a little annoyed if they found you doing this. I would be surprised if
the big guns like Facebook and Google didn't have monitors on their
web logs that automatically flag bot-like behaviour from clients.
Anyone noticed this crop up recently: http://www.commoncrawl.org/ ?
Not sure what to do with it, but I'm sure we can come up with a cool
use if we scratch our collective heads.
I use bibtex a lot these days, so I can definitely provide some
guidance on what the requirements are there. I've never used EBSCO,
but it looks like it's pretty focused on English literature as opposed to
scientific publications.
I got side-tracked with moving house, but went to make a start on this
today and discovered something odd. I appear to have lost my "New
Repository" button on both my personal profile and the CPB profile.
Can anyone else create new repos? I can still push to my existing
repos, but can't seem to create new ones.
On 17 December 2011 15:53, Matthew Fernandez