arxiv api and bibtex / json metadata

Fred Howell

May 6, 2009, 11:30:37 AM
to arXiv api
A question for the arxiv api developers....

I'm looking to code a search interface for http://publicationslist.org
which uses the arxiv api - providing a gui for researchers to compile
their own publications list like the one we did for PubMed [
http://publicationslist.org/pubmed.html ] (search for own
publications, pull in metadata, make a personal web publications list
- screencast of the pubmed version on:
http://publicationslist.org/em-loader/emloader-report-workflow.html )

The catch is that the vanilla Atom feed from the arxiv api doesn't
seem to include all the bibtex-style metadata, and extracting journal
names etc from the plain text is hard.

Is there a version of the API which returns an atom feed with embedded
bibtex info?
We used Atom + JSON/Bibtex in the entries for another project which
worked quite well - http://bit.ly/wnmFk

Or would it need extra HTTP requests for each entry to extract
detailed metadata?

Thanks,

Fred Howell

http://publicationslist.org
( & http://a.nnotate.com )

julius.lucks

May 6, 2009, 11:40:56 AM
to arxi...@googlegroups.com
Hi Fred,

Can you give some examples of what you mean? Are there metadata
fields that are not represented in the API ATOM feeds, or is the ATOM
just not in the right format? Please give as much detail as possible
so we can help think of a solution.

Thanks,

Julius

---------------------------------------------------------------------------------------
http://www.openwetware.org/wiki/User:Julius_B._Lucks
----------------------------------------------------------------------------------------

Fred Howell

May 6, 2009, 12:28:58 PM
to arXiv api

> Can you give some examples of what you mean?  Are there metadata
> fields that are not represented in the API ATOM feeds, or is the ATOM  
> just not in the right format?  Please give as much detail as possible  
> so we can help think of a solution.


... the citation info looks to come back as something like:
<arxiv:journal_ref xmlns:arxiv="http://arxiv.org/schemas/atom">Eur.Phys.J. C31 (2003) 17-29</arxiv:journal_ref>

It would be good to get the year, journal, pages etc. in separate
fields to save parsing the string... is this info stored separately
within arxiv?
Or would a simple parse of the string to get the components work
reliably? If not, it would be useful to get separate fields like:
<year>2003</year>
<journal>Eur.Phys.J.</journal>
<pages>17-29</pages>

Thanks,
Fred.



PS I've used a JSON encoding with bibtex field names, embedded in the
<content> of Atom entries before (to get lossless exchange of bibtex
metadata) ... most tools already have some bibtex import / export
support, so there's some advantage in using its schema.

I appended a sample of JSON/Bibtex embedded within Atom - it's a bit
unusual to include json within xml, but you sidestep all the clunky
XML namespaces by using json, and it's easy to process.
Not sure if JSON output could be an alternative for your arxiv api...

<feed>
<entry>
<content>
{
"type":"article",
"title":"...",
"year":"2006",
"author":"...",
"journal":"Concurrency Computat.: Pract. Exper.",
"volume":"19",
"number":"",
"pages":"207-221",
"month":"",

"doi":"10.1002\/cpe.1044",
"pdflink":"",
"urllink":"http:\/\/www3.interscience.wiley.com\/...",
"abstract":"The integrative ambitions of systems biology - (...)",
"note":"",
"keywords":"XML, Semantic Web, e-Science"
}
</content>
</entry>
</feed>
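
As a rough illustration of why this encoding is easy to process, here is a minimal Python sketch that pulls the JSON out of each <content> element and writes it back out as a BibTeX entry. The function names (records_from_feed, entry_to_bibtex) and the citation key are hypothetical, the sample is a trimmed copy of the feed above, and a real Atom feed would declare the Atom namespace, which this sketch does not handle.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above; raw string so the JSON "\/" escapes survive.
SAMPLE_FEED = r"""<feed>
<entry>
<content>
{
"type":"article",
"title":"...",
"year":"2006",
"author":"...",
"journal":"Concurrency Computat.: Pract. Exper.",
"volume":"19",
"number":"",
"pages":"207-221",
"doi":"10.1002\/cpe.1044"
}
</content>
</entry>
</feed>"""


def records_from_feed(feed_xml):
    """Return one dict per <content> element, decoded from its JSON text."""
    root = ET.fromstring(feed_xml)
    return [json.loads(content.text) for content in root.iter("content")]


def entry_to_bibtex(record, key):
    """Format a record as a BibTeX entry, skipping empty fields."""
    entry_type = record.get("type", "misc")
    fields = [(k, v) for k, v in record.items() if k != "type" and v]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@{entry_type}{{{key},\n{body}\n}}"


if __name__ == "__main__":
    for i, record in enumerate(records_from_feed(SAMPLE_FEED)):
        # "entry0", "entry1", ... are placeholder citation keys.
        print(entry_to_bibtex(record, f"entry{i}"))
```

Because the field names already follow bibtex, the conversion is just a loop over the decoded dict; no per-field mapping table is needed.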


Matt Leifer

May 7, 2009, 6:57:23 AM
to arXiv api
I think the problem is that the arXiv simply doesn't store this
metadata in separate fields. It is not a problem with the API, but
with the arXiv itself. When an author adds a journal reference to an
arXiv article it is just a single field rather than separate fields
for journal name, date of publication, page numbers, etc.

Personally, I would support a move for the arXiv to fine-grain the
journal reference data, not least because it would be useful for
extracting publication lists and bibtex files from the API.

Matt

Thorsten S

May 8, 2009, 2:48:00 PM
to arxi...@googlegroups.com
It is correct that the journal reference field at arXiv is basically
free form and we don't have more fine-grained tokenization of it.

The problem is not only our data entry form, however, since we receive
the majority of journal refs from external services (e.g. SLAC/SPIRES,
etc.) and publisher feeds as simple text strings. This augments
author-provided citation information.

For the purpose of automatic linking, resource identification and
resolution, the DOI is much more suitable than the journal ref, not
least because there is no unique established standard for journal refs
and even widely adopted conventions have changed over time.


http://export.arxiv.org/api_help/docs/user-manual.html#extension_elements
...
If the author has provided a DOI for the article, then there will be a
<arxiv:doi> element with this information:

<arxiv:doi xmlns:arxiv="http://arxiv.org/schemas/atom">
10.1529/biophysj.104.047340
</arxiv:doi>
...
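
A minimal Python sketch of reading these extension elements from a feed might look like the following. It is not taken from the arXiv documentation: the function name citation_hints and the two-entry sample feed (including its titles) are made up for illustration, reusing the DOI above and the journal ref quoted earlier in this thread. Only the two namespace URIs are assumed to match what the API returns.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the lookups below.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hypothetical two-entry feed, for illustration only.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Entry with a DOI</title>
    <arxiv:doi xmlns:arxiv="http://arxiv.org/schemas/atom">
10.1529/biophysj.104.047340
</arxiv:doi>
  </entry>
  <entry>
    <title>Entry with only a journal reference</title>
    <arxiv:journal_ref xmlns:arxiv="http://arxiv.org/schemas/atom">Eur.Phys.J. C31 (2003) 17-29</arxiv:journal_ref>
  </entry>
</feed>"""


def citation_hints(feed_xml):
    """Return the DOI and journal ref (or None) for each entry in the feed."""
    root = ET.fromstring(feed_xml)
    hints = []
    for entry in root.findall("atom:entry", NS):
        doi = entry.findtext("arxiv:doi", default="", namespaces=NS).strip()
        ref = entry.findtext("arxiv:journal_ref", default="", namespaces=NS).strip()
        hints.append({"doi": doi or None, "journal_ref": ref or None})
    return hints


if __name__ == "__main__":
    for hint in citation_hints(SAMPLE):
        print(hint)
```

Both values come from the same Atom response, so no extra HTTP request per entry is needed to find out whether a DOI is available.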


For entries that have a journal ref field but no DOI, your
best option is to try to parse the journal ref. Since a majority of
the journal refs come from a few automated sources, they follow
common conventions and are reasonably well structured.

Cheers
Thorsten
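
For the fallback case, a journal ref parser could start from something like the Python sketch below. The regular expression is an assumption that covers only the "Journal Volume (Year) Pages" layout of the example quoted earlier in this thread (Eur.Phys.J. C31 (2003) 17-29); other conventions would need their own patterns, and the function name parse_journal_ref is hypothetical. Anything that does not match is returned as None so the caller can keep the raw string.

```python
import re

# Assumed layout: "<journal> <volume> (<year>) <pages>", e.g. the example from
# this thread. The volume may carry a single leading letter such as "C31".
JOURNAL_REF = re.compile(
    r"^(?P<journal>.+?)\s+"
    r"(?P<volume>[A-Z]?\d+)\s*"
    r"\((?P<year>\d{4})\)\s*"
    r"(?P<pages>\d+(?:\s*-\s*\d+)?)$"
)


def parse_journal_ref(text):
    """Split a journal ref into bibtex-style fields, or return None."""
    match = JOURNAL_REF.match(text.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_journal_ref("Eur.Phys.J. C31 (2003) 17-29"))
    print(parse_journal_ref("to appear"))
```

For the example string this yields journal "Eur.Phys.J.", volume "C31", year "2003" and pages "17-29". Returning None rather than guessing keeps unparseable refs visible, which matters given that conventions have changed over time.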