How to get affiliations for *all* authors?

Mohammed Afif

Mar 14, 2017, 8:52:28 PM

to arXiv api

Hi everyone,

I'm trying to build a visualisation that displays new submissions on a world map.

When I created my arXiv account, affiliation was a required field, so I assumed that this information would be accessible for all authors.

However, I've been reading through the API docs and previous topics here, and several places state that the affiliation is optional.

Perhaps arXiv stores the affiliations entered at registration but does not expose them. If so, I'm not sure why.

It cannot be for privacy, because for any arbitrary paper I can just look at the PDF and the author's affiliation is always there.

Since this data is easily human-readable, why not make it machine-readable?

This is a little frustrating, because I can't come up with a different way to get authors' affiliations reliably. There are packages that let you parse PDFs (https://github.com/metachris/pdfx), but then I would have to resort to downloading ~100 PDFs a day and parsing them, which arXiv doesn't seem to like, and I don't want to pay for that bandwidth.

Will the folks developing the arxiv API expose the affiliation fields we are required to fill when we register?

Best,

Mohammed

Thorsten S

Mar 14, 2017, 9:20:03 PM

to arXiv api

Dear Mohammed,

arXiv makes no attempt at collecting affiliation information of all authors of all papers in a comprehensive and canonical form. The affiliation is an optional free-form text field on submission of each paper. Similarly, the affiliation of the submitter (not necessarily an author) is not normalized or verified.

You are underestimating the difficulty of providing accurate affiliation information for (all) authors of a given paper in a canonical and authoritative form. There are various commercial databases that attempt to address the need for institutional information, and they have different conventions, granularity and coverage. There is also the issue that authors move and change affiliation, and keeping track of the complete timeline and affiliation history of authors is not something arXiv attempts to do. That an author once had a given affiliation does not mean it applies to a subsequent paper submitted by a coauthor or collaborator, even if the author can be uniquely identified by name (which is not a given) or subsequently claims the paper for their author record (which is optional). Similarly, the affiliation(s) for older papers of an author may not be known.

arXiv simply makes accessible the optional free-form text affiliations as entered by the submitter in the metadata for a particular paper.
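For readers who want to see what that looks like in practice, here is a minimal sketch of reading the optional affiliations from an API response. It assumes the Atom feed carries them as `<arxiv:affiliation>` children of `<author>` in the `http://arxiv.org/schemas/atom` namespace, as described in the API manual. The sample feed, the paper id and the function name are hypothetical and only serve to make the example self-contained.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the arXiv Atom feed.
ATOM = "{http://www.w3.org/2005/Atom}"
ARXIV = "{http://arxiv.org/schemas/atom}"

# Hypothetical, hand-written feed fragment (not real arXiv data).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00001v1</id>
    <title>Hypothetical paper</title>
    <author>
      <name>A. Author</name>
      <arxiv:affiliation>Example University</arxiv:affiliation>
    </author>
    <author>
      <name>B. Author</name>
    </author>
  </entry>
</feed>"""


def author_affiliations(feed_xml):
    """Map each entry id to a list of (author name, [affiliations])."""
    root = ET.fromstring(feed_xml)
    result = {}
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        authors = []
        for author in entry.findall(ATOM + "author"):
            name = author.findtext(ATOM + "name")
            # The element is optional, so this list is often empty.
            affils = [
                a.text.strip()
                for a in author.findall(ARXIV + "affiliation")
                if a.text
            ]
            authors.append((name, affils))
        result[entry_id] = authors
    return result


if __name__ == "__main__":
    for paper, authors in author_affiliations(SAMPLE_FEED).items():
        print(paper, authors)
```

Any consumer has to treat an empty list as the normal case, since the field is free-form and optional.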

We expect that the increasing use of ORCIDs for author identification will facilitate exposing much richer information.

Cheers

T.

--
You received this message because you are subscribed to the Google Groups "arXiv api" group.
To unsubscribe from this group and stop receiving emails from it, send an email to arxiv-api+unsubscribe@googlegroups.com.
To post to this group, send email to arxi...@googlegroups.com.
Visit this group at https://groups.google.com/group/arxiv-api.
For more options, visit https://groups.google.com/d/optout.

Mohammed Afif

Mar 15, 2017, 12:29:08 PM

to arXiv api

Hi Thorsten,

I appreciate the difficulty in getting complete and correct information for all authors and for all papers, but for the purpose of visualising new submissions, I don't need that much.

What I'm essentially doing is visiting the 'new' page of an archive (like https://arxiv.org/list/astro-ph/new) and plotting those papers on a map based on the affiliation of the first author.
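As a rough illustration of the API-based alternative to scraping the listing page, the sketch below builds a query URL for the most recent submissions in one category. The `search_query`, `sortBy`, `sortOrder`, `start` and `max_results` parameters come from the arXiv API manual; the category `astro-ph.GA` and the helper name `build_recent_query` are assumptions chosen for the example. Results sorted by submission date are not guaranteed to match the daily 'new' listing exactly.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://export.arxiv.org/api/query"


def build_recent_query(category, max_results=100):
    """Return an API URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",  # e.g. "astro-ph.GA" (assumed)
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": 0,
        "max_results": max_results,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_recent_query("astro-ph.GA", max_results=50))
```

The returned feed can then be inspected for the first `<author>` of each entry and whatever affiliation, if any, the submitter supplied.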

Looking down that list of new submissions, the vast majority of the papers were submitted by the first author. If the first author submitted the paper, then they have an arXiv account and they have filled in the required "organization" field that arXiv makes new users fill out when creating an account. That's the information I'd like to have access to through the API.

I'm happy to trust that the majority of people will update this field if they are active on arXiv, and I'm okay with there being a few errors or outdated affiliations. The purpose of my visualisation is to offer an alternative way to browse new arXiv submissions, and I don't need absolutely correct information.

Unfortunately, when I check authors manually against ORCID, the majority of them do not have ORCIDs.

For now I think exposing the "organization" field from users' accounts would work best. It would also save me having to parse PDFs for author affiliations, even though most (if not all) papers submitted to arXiv have affiliation information.

The only other option I could see is parsing the source TeX files. If there's a way to get just the TeX (without the associated images) then that should be efficient with the bandwidth.
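A minimal sketch of what that TeX parsing could look like, assuming the source has already been obtained as a string. The macro names in `AFFIL_MACROS` are a guess at common conventions (they vary by document class and are not an exhaustive list), and the function names and sample source are hypothetical. The brace matching is deliberately simple: it ignores TeX comments and escaped braces.

```python
import re

# Assumed set of macros that commonly hold affiliations; not exhaustive.
AFFIL_MACROS = ("affiliation", "institute", "affil", "address")

# Matches e.g. \affiliation{ or \affil[2]{ up to the opening brace.
MACRO_RE = re.compile(
    r"\\(%s)\s*(?:\[[^\]]*\])?\s*\{" % "|".join(AFFIL_MACROS)
)

SAMPLE_TEX = r"""
\author{A. Author}
\affiliation{Department of Physics, Example University,
  \textbf{Somewhere}}
\author{B. Author}
\affil[2]{Example Institute}
"""


def extract_braced(text, start):
    """Return the contents of the brace group opening at text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i]
    return None  # unbalanced braces


def find_affiliations(tex):
    """Collect the arguments of affiliation-like macros, in order."""
    found = []
    for match in MACRO_RE.finditer(tex):
        body = extract_braced(tex, match.end() - 1)
        if body is not None:
            found.append(" ".join(body.split()))  # collapse whitespace
    return found


if __name__ == "__main__":
    for affiliation in find_affiliations(SAMPLE_TEX):
        print(affiliation)
```

The extracted strings are still raw TeX, so any markup inside them would need further cleaning before geocoding.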

Cheers,

Mohammed

Thorsten S

Mar 15, 2017, 1:19:26 PM

to arXiv api

Hi Mohammed,

Well, you are making a few assumptions and cutting a few corners, which is fine for your application. However, the information offered by arXiv has to adhere to a more rigorous standard, and there currently isn't a system in place to comprehensively manage affiliation information for individual papers.

If you want curated affiliation information, you should look at what ADS or INSPIRE offer.

Cheers

T.

Mohammed Afif

Mar 15, 2017, 1:46:53 PM

to arXiv api

Thanks Thorsten, I'll check those out.

Mohammed
