API issues and data quality

Mi Tar

Apr 1, 2013, 3:16:51 AM

to arxi...@googlegroups.com

Hi!

I have some issues with the arXiv format of the API data (compared to arXivRaw).

It seems there is no way to see in the arXiv format that an article has been removed/withdrawn. If it is completely deleted, there is an empty entry, but if it is just withdrawn, as in:

http://arxiv.org/abs/0704.0213
http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:0704.0213&metadataPrefix=arXiv

This is not visible in the arXiv format. It seems that the only way to detect it is to fetch the raw format and check the size of the latest version?

http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:0704.0213&metadataPrefix=arXivRaw
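To make the workaround concrete, here is a rough sketch of the size check. It assumes that an arXivRaw record lists each version as a `<version version="vN">` element with a `<size>` child such as `0kb`, and that a size of zero on the latest version indicates a withdrawal; both are my assumptions from looking at the output, not documented behaviour. The sample record and the function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical arXivRaw-like fragment, for illustration only.
SAMPLE_RECORD = """<arXivRaw xmlns="http://arxiv.org/OAI/arXivRaw/">
  <id>0000.0000</id>
  <version version="v1"><size>120kb</size></version>
  <version version="v2"><size>0kb</size></version>
</arXivRaw>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def version_number(element):
    """Return N for version='vN', or 0 if the attribute is missing."""
    match = re.search(r"\d+", element.get("version", ""))
    return int(match.group()) if match else 0


def latest_version_size_kb(record_xml):
    """Return the size in kb of the highest-numbered version, or None."""
    root = ET.fromstring(record_xml)
    versions = [el for el in root.iter() if local_name(el.tag) == "version"]
    if not versions:
        return None
    latest = max(versions, key=version_number)
    for child in latest:
        if local_name(child.tag) == "size":
            match = re.match(r"\s*(\d+)\s*kb", child.text or "", re.IGNORECASE)
            return int(match.group(1)) if match else None
    return None


def looks_withdrawn(record_xml):
    """Assumption: a zero-size latest version means the article was withdrawn."""
    return latest_version_size_kb(record_xml) == 0
```

This is only a heuristic; an explicit withdrawn flag in the metadata would be much better.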

What does <source_type>I</source_type> mean?

BTW, http://arxiv.org/pdf/0704.0213.pdf could really return 404. If it did, a HEAD request would be a way to check whether the PDF exists. Is there a better way in the API to know whether a PDF exists, so that we can display a link to it? Or, in general, a way to know which PS/PDF/source links are available for a given article?
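For illustration, this is the kind of HEAD check I have in mind. It only helps if the server answers 404 for a missing PDF, which is exactly what it does not seem to do at the moment. The function name `pdf_exists` and the injectable `opener` parameter are my own, used here so the check can be tried without hitting the network.

```python
import urllib.error
import urllib.request


def pdf_exists(url, opener=urllib.request.urlopen, timeout=10):
    """Send a HEAD request; True on a 2xx response, False on HTTP 404.

    Other HTTP errors are re-raised, since they say nothing about
    whether the PDF exists.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise
```

Usage would be something like `pdf_exists("http://arxiv.org/pdf/0704.0213.pdf")` before rendering the link.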

If I understand correctly, the main advantage of the arXiv format over the arXivRaw format is that the authors are parsed? Are the other things the same (except that versions are removed and only the latest one is shown)?

But the parsing of authors is not correct either. The issue is with authors who put their middle names or nicknames in parentheses:

http://arxiv.org/abs/0705.2274

The arXiv parser sees this as an affiliation (per the format definition), but of course it is not. The parsing gets completely lost:

http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:0705.2274&metadataPrefix=arXiv

Is there no input checking when authors submit this data? Such cases would be really easy to spot. Probably the best thing would have been to require affiliations to be specified in [], but now it is what it is, so some input checking should be done instead. If () appears in the middle of the string, this is an error: an affiliation should come at the end, or before a comma. If you want, I can provide you with a list of broken author strings our parser found, and you could inform the authors or fix them? (For example, by using "" around the name instead of ().)
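A minimal sketch of the check I am describing: flag any top-level (...) group that is followed by more name text. Besides "end of string" and "comma", I also allow a following "and" (author separator) or another "(" (a trailing affiliation list), which are my assumptions about common usage. The function name and the author strings in the usage line are made up.

```python
import re

# After a closing parenthesis we accept: a comma, the word "and",
# or another opening parenthesis. The last two are assumptions.
ALLOWED_AFTER_GROUP = re.compile(r",|and\b|\(")


def misplaced_parentheses(authors):
    """Return top-level (...) groups that are followed by more name text."""
    suspicious = []
    depth = 0
    start = None
    for index, char in enumerate(authors):
        if char == "(":
            if depth == 0:
                start = index
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                rest = authors[index + 1:].lstrip()
                if rest and not ALLOWED_AFTER_GROUP.match(rest):
                    suspicious.append(authors[start:index + 1])
    return suspicious
```

For example, `misplaced_parentheses("John (Jack) Smith, Jane Doe")` reports `(Jack)`, while `"Jane Doe (Example University), John Roe (Example Institute)"` passes.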

But as a consequence, we will probably just have to reparse the authors data ourselves. So the question is how useful the arXiv format really is. arXivRaw has both the version history and, at least, no wrongly parsed authors. Is anybody really using () for listing affiliations? I have seen many more cases of middle names than of affiliations.

Lastly, it seems that no input checking was done for the MSC classification either:

http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:1002.0007&metadataPrefix=arXiv

Despite the defined format, one can find quite a range of wrong formats. I think the data would be much more valuable if some input checking were done. Even some basic checks would catch most of the problematic entries.
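As an example of such a basic check, here is a sketch that validates the individual codes in an MSC field. The pattern is my assumption about what a well-formed code looks like (two digits, then either a hyphen with `XX` or two digits, or an uppercase letter with `xx` or two digits); it is not arXiv's official definition. The function name is hypothetical.

```python
import re

# Assumed shape of an MSC code: 14J60, 14Jxx, 14-XX, 14-01.
MSC_CODE = re.compile(r"\d{2}(?:-(?:XX|\d{2})|[A-Z](?:xx|\d{2}))")

# Markers that may follow groups of codes and are not codes themselves.
MARKERS = ("(primary)", "(secondary)")


def invalid_msc_codes(msc_class):
    """Return the tokens of an MSC field that do not look like MSC codes."""
    tokens = re.split(r"[,;\s]+", msc_class)
    codes = [t for t in tokens if t and t.lower() not in MARKERS]
    return [t for t in codes if not MSC_CODE.fullmatch(t)]
```

With this, `invalid_msc_codes("14J60 (Primary) 14F05, 14J26 (Secondary)")` returns an empty list, and anything that does not match gets reported for manual review.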

Would you be interested in us providing you with a list of problematic entries? Are there any plans, or even motivation, to make the data cleaner? Otherwise we will just use the raw data and try to get the best we can out of it, but I believe it would be best to fix things at the data source, so that others can have clean data as well.

Mitar
