Name splitting in arXiv OAI-PMH


Isabel Beckenbach

Aug 9, 2024, 6:48:32 AM
to arXiv API Disucssion, Isabel Beckenbach
Hi,

I have a question about the name splitting in the arXiv OAI-PMH interface. In the "arXiv" metadata format, author names are split into a "keyname" and "forenames". It seems that most of the time the name string is split at whitespace (" ") and the last part becomes the keyname. This is not always correct, for example for Spanish names.
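For readers who want to inspect these fields themselves, here is a minimal sketch of reading "keyname" and "forenames" out of a record in the "arXiv" metadata format. The record fragment is hand-written for illustration, and the namespace URI and the exact element nesting are assumptions rather than something confirmed in this thread; only the two field names and the example author come from the question.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the "arXiv" metadata format (not confirmed in this thread).
ARXIV_NS = "http://arxiv.org/OAI/arXiv/"

# Hand-written fragment for illustration only, not a real server response.
SAMPLE = f"""
<arXiv xmlns="{ARXIV_NS}">
  <id>2407.19933</id>
  <authors>
    <author>
      <keyname>di Dio</keyname>
      <forenames>Philipp J.</forenames>
    </author>
  </authors>
</arXiv>
"""


def extract_authors(xml_text):
    """Return a list of (forenames, keyname) pairs found in one record."""
    root = ET.fromstring(xml_text.strip())
    ns = {"a": ARXIV_NS}
    authors = []
    for author in root.findall(".//a:author", ns):
        # A missing element is treated as an empty string.
        keyname = author.findtext("a:keyname", default="", namespaces=ns)
        forenames = author.findtext("a:forenames", default="", namespaces=ns)
        authors.append((forenames.strip(), keyname.strip()))
    return authors


print(extract_authors(SAMPLE))
```

Working from a local string keeps the sketch independent of any particular endpoint URL or harvesting library.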

However, I found an example where the "keyname" is not just the last part of the name: for the author name "Philipp J. di Dio" (2407.19933), the keyname is "di Dio" and the forenames are "Philipp J.", which seems to be correct.
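To make the two behaviours concrete, here is a rough sketch that contrasts a naive "last token is the keyname" split with a variant that also pulls lowercase surname particles such as "di" into the keyname. This is only a guess at the kind of rule that could produce the result above, not arXiv's actual code; the function names and the particle list are hypothetical.

```python
# Hypothetical particle list, for illustration only (not arXiv's actual list).
PARTICLES = {"da", "de", "del", "della", "den", "der", "di", "la", "le", "van", "von"}


def naive_split(name):
    """Split at whitespace and treat the last token as the keyname."""
    parts = name.split()
    if not parts:
        return ("", "")
    return (" ".join(parts[:-1]), parts[-1])


def particle_aware_split(name):
    """Like naive_split, but extend the keyname leftwards over known particles."""
    parts = name.split()
    if not parts:
        return ("", "")
    start = len(parts) - 1  # index of the first keyname token
    while start > 0 and parts[start - 1].lower() in PARTICLES:
        start -= 1
    return (" ".join(parts[:start]), " ".join(parts[start:]))


for full_name in ["Philipp J. di Dio", "Isabel Beckenbach"]:
    print(full_name, naive_split(full_name), particle_aware_split(full_name))
```

Note that a particle list does nothing for names with two surnames and no particle, which is the Spanish case mentioned above: "Gabriel García Márquez" still ends up with "Márquez" alone as the keyname.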

Could you explain how the name splitting at arXiv works?

Best,

Isabel Beckenbach

P.S. I tried to post this question yesterday, but it did not work. I hope that it is not posted twice now.

Brian Maltzan

Aug 9, 2024, 8:29:13 AM
to a...@arxiv.org
Hi Isabel,

You may be interested in this, which is very close to the oai code:

Cheers,
Brian


--
You received this message because you are subscribed to the Google Groups "arXiv API Disucssion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to api+uns...@arxiv.org.
To view this discussion on the web visit https://groups.google.com/a/arxiv.org/d/msgid/api/b33f0957-a6b4-4e73-9e8a-8271e654e21en%40arxiv.org.

Isabel Beckenbach

Aug 12, 2024, 3:55:37 AM
to arXiv API Discussion, bmal...@arxiv.org
Thank you, Brian. That helps a lot.

In the comments it says: "This routine should just go away when a better metadata structure is adopted that deals with names and affiliations properly."

Are there any plans to change the metadata structure at arXiv in the near future?

Best

Isabel

Jake Weiskoff

Aug 12, 2024, 4:13:35 PM
to a...@arxiv.org

Hi Isabel,

Yes, there are plans to update the metadata schema for arXiv, as well as all of the read-only tools (such as the search API), over the coming months and year(s).

In particular, the issue you’ve described is well known. We intend to correctly identify individual authors within our corpus (e.g. in addition to the known “multi-word surnames issue”, we have issues disambiguating the anglicization/romanization of some Asian names). The work will be ongoing over the course of the next year or so.

 

-- 

Jake Weiskoff

Project Manager, arXiv.org

Cornell Tech

ja...@arxiv.org

 

 
