[RFC] New project: reference database

Matthew Fernandez

Nov 3, 2011, 7:08:32 PM

to code-p...@googlegroups.com

Morning all,

I wanted to float an idea for a new project to gauge interest.

I'm back on the academic circuit these days and have discovered
something that particularly irks me. When writing for an academic
publication (or just reading for personal interest) you typically want
to maintain a database of the things you've read in order to reference
them later. Surprisingly there appears to be no global, free database
of such references with a useful API. As a result, I and all the
researchers I've spoken to have a workflow that goes something like
this:
1. Find interesting paper.
2. Grab a bibtex/endnote reference from ACM/IEEE/CiteSeer/Google Scholar.
3. Spend 10 minutes fixing all the typos and incomplete information
in the reference.
There's no getting around it: all these sources have crap data. The
main reason for this is that the data comes from scripts/bots that
index the paper itself and attempt to automatically pull out this
information. On top of that, I haven't found nice APIs for any of
these either. (There is more to say here about things like Zotero,
Mendeley, Papers, etc., but I'm keeping this brief.)

What I want to propose is building a reference database where all
information is submitted by the paper authors, conference organisers
or some other "trusted" source. Some things that would be cool:
- Free. Totally free.
- REST API.
- Federation with other sources, so you could host a local instance
that would have other trusted instances it could automatically
synchronise with.
- Automatic bibtex and endnote generation.
- Support for layering. E.g. if I reference an upstream database
record, I also want to add a field for recording a local path to the
PDF of the paper on my hard drive, but this information is obviously
useless to others and shouldn't be pushed back to the server.
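
The layering idea in the last bullet could look something like the following minimal Python sketch. Everything in it is an assumption made for illustration: the record fields, the `local_pdf` and `notes` keys, and the function names are hypothetical and do not describe any existing API.

```python
# Hypothetical sketch of "layering": local-only fields sit on top of an
# upstream record and are stripped before anything is pushed back.

# Assumption: fields named here never leave the local machine.
LOCAL_ONLY_FIELDS = {"local_pdf", "notes"}


def layer(upstream, local):
    """Return a merged view of a record; local values win over upstream."""
    merged = dict(upstream)
    merged.update(local)
    return merged


def pushable(record):
    """Return only the fields that are safe to send back to the server."""
    return {k: v for k, v in record.items() if k not in LOCAL_ONLY_FIELDS}


# Placeholder data, not a real reference.
upstream = {"title": "An Example Paper", "year": "2011"}
local = {"local_pdf": "/home/me/papers/example.pdf"}

view = layer(upstream, local)   # what the local user sees
outgoing = pushable(view)       # what would be synchronised upstream
```

Keeping the local layer as a separate dictionary means the upstream record is never modified in place, so a later sync can replace it without losing the local additions.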

I like this as a project for CPB for several reasons. It is something
I would actively use every single day. This project is easily broken
down into a lot of small tasks at a granularity where you could start
and finish a single task over a weekend. The project has lots of room
to move and many directions we could extend it in.

Please let me know what you think. I'd love to get some traction on
this one, as it would be a godsend for me and other researchers.

Thanks,
Matt

Tom Allen

Nov 13, 2011, 11:30:45 PM

to code-p...@googlegroups.com

There's a lot of merit to this project, but it's big. Probably too big for CPB, but I could be wrong.

I don't have much more to say other than that, but something vaguely relevant that I found recently is https://scraperwiki.com/ (more relevant to the previous project Polilist actually) which has a bunch of open source code to scrape webpages for various details. There's heaps for government address details, etc, but also a few for references.

Cheers,

Tom

Matthew Fernandez

Nov 14, 2011, 10:50:35 PM

to code-p...@googlegroups.com

Yes, I agree it's big. I was hoping it could sort of expand or
contract depending on how much time we have to contribute.

That scraping site is definitely something worth keeping an eye on.
Interestingly they don't seem to take a strong stance on legal issues:
https://scraperwiki.com/docs/python/faq/#data_types. I imagine many
sites that have an explicit policy against scraping would be more than
a little annoyed if they found you doing this. I would be surprised if
the big guns like Facebook and Google didn't have monitors on their
web logs that automatically flag bot-like behaviour from clients.

Anyone noticed this crop up recently: http://www.commoncrawl.org/ ?
Not sure what to do with it, but I'm sure we can come up with a cool
use if we scratch our collective heads.

Patrick Coleman

Nov 14, 2011, 11:02:37 PM

to code-p...@googlegroups.com

I'd be up for it - it's nice to see a clear definition of what's required, and I'm sure it'd be easy to find enough students/researchers to get real user feedback.

It does seem quite big, although it's possible to get working versions up without every feature - e.g. while federation/layering is cool, it'd still be useful without those. I can think of other possibly useful additions (e.g. "people who reference this paper also reference...") that fall into this category too, all stuff that could be parallelised nicely once the basic reference DB is working.
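
As a rough illustration of the "people who reference this paper also reference..." idea, here is a minimal Python sketch based on simple co-citation counting. The `citations` mapping, the paper identifiers, and the function name are all hypothetical placeholders; a real version would read from the reference DB.

```python
from collections import Counter


def also_referenced(citations, target, top=3):
    """Rank papers that are cited alongside `target` most often.

    `citations` maps a citing paper to the set of papers it references.
    """
    counts = Counter()
    for refs in citations.values():
        if target in refs:
            # Count every other paper cited together with the target.
            counts.update(r for r in refs if r != target)
    return [paper for paper, _ in counts.most_common(top)]


# Placeholder data: three citing papers and what each one references.
citations = {
    "paper1": {"A", "B", "C"},
    "paper2": {"A", "B"},
    "paper3": {"B", "D"},
}

suggestions = also_referenced(citations, "A")
```

With the placeholder data above, "B" is cited alongside "A" twice and "C" once, so "B" ranks first.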

That said, it's been a while since I've seen any referencing like that - it'd be great to hear from others as to what sort of things should be stored per-reference (e.g. title, author, publisher, etc.), plus inputs we can use to generate these, and output types (bibtex, endnote, standard reference formats, ...).
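
To make the question of stored fields and output types concrete, here is a minimal Python sketch that renders one record as a BibTeX entry. The record contents are placeholders, the function name and citation key are hypothetical, and the sketch assumes field values need no escaping, which real data would not guarantee.

```python
def to_bibtex(key, entry_type, fields):
    """Render one reference as a BibTeX entry string.

    Assumption: values contain no characters that need BibTeX escaping.
    """
    lines = ["@%s{%s," % (entry_type, key)]
    for name, value in fields.items():
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


# Placeholder record, not a real publication.
record = {
    "author": "A. Author and B. Author",
    "title": "An Example Paper",
    "booktitle": "Proceedings of an Example Conference",
    "year": "2011",
}

entry = to_bibtex("author2011example", "inproceedings", record)
print(entry)
```

Storing each reference as a flat mapping of field names to values keeps the output side simple: an EndNote or plain-text formatter would be another small function over the same record.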

I couldn't find a huge amount of similar things by searching - does anyone have much experience with EBSCO? http://www.ebscohost.com/public/literary-reference-center

http://www.refworks.com/ looks rather outdated, and http://en.wikipedia.org/wiki/List_of_academic_databases_and_search_engines might be a useful source.

- Pat

Matthew Fernandez

Nov 23, 2011, 9:05:00 PM

to code-p...@googlegroups.com

Cool! I'll set up a repository early next week with some initial
thoughts and hackery, then we can go from there.

I use bibtex a lot these days, so I can definitely provide some
guidance on what the requirements are there. I've never used EBSCO,
but it looks like it's pretty focused on English literature as opposed
to scientific publications.

Patrick Coleman

Dec 1, 2011, 4:26:38 AM

to code-p...@googlegroups.com

Hey - I was talking to a PhD student friend of mine, and he said Zotero is apparently pretty popular at Adelaide Uni: http://www.zotero.org/

Sounds similar-ish; there's even a collection API and an Android app that gets a reference by scanning a book barcode.

The other news is that it's an open source project, so if we want to avoid writing a new app but still have something to work on, I'm sure they'd be happy to get contributions.

- Pat

Billy Huang

Dec 1, 2011, 4:39:02 AM

to Patrick Coleman, code-p...@googlegroups.com

Started using that 2 years ago, but have since moved to Chrome.


Matthew Fernandez

Dec 16, 2011, 11:53:30 PM

to code-p...@googlegroups.com

On 24 November 2011 13:05, Matthew Fernandez <matthew....@gmail.com> wrote:
> Cool! I'll set up a repository early next week with some initial
> thoughts and hackery, then we can go from there.

I got side-tracked with moving house, but went to make a start on this
today and discovered something odd. I appear to have lost my "New
Repository" button on both my personal profile and the CPB profile.
Can anyone else create new repos? I can still push to my existing
repos, but can't seem to create new ones.

Matthew Fernandez

Dec 20, 2011, 1:05:40 AM

to code-p...@googlegroups.com

Not sure what changed, but my buttons are back. I've pushed something
very preliminary to https://github.com/CodeProBono/canonical. Let me
know if you have thoughts, otherwise I'll use the Christmas break to
flesh out the database structure a bit.
