Addressing Misrepresentation of Publication Dates

Dror Shvadron

Apr 16, 2024, 9:12:33 PM
to OpenAlex Community

Hi OpenAlex Community,

I'd like to discuss an issue regarding the representation of publication dates in OpenAlex, specifically for papers published before 2002. OpenAlex currently assigns the "earliest available date of electronic publication" as the publication date. This date, however, may not accurately reflect the actual publication date for older papers that were digitized years later.

For example, publication W2113695611, originally from 1991 but uploaded online in 2002, appears in OpenAlex with a 2002 publication date. In contrast, MAG distinguished between the paper's journal publication date and its online date. The attached figure illustrates instances where the discrepancy between these dates is three years or more, highlighting significant spikes in 2002 and 2010. According to my calculations, this issue affects approximately 1.7 million publications.

In my work, I have combined MAG and OpenAlex data, using the earliest of the two dates to represent the actual publication date. I'm curious to hear if others have encountered similar issues or have devised alternative solutions. Additionally, any feedback from the OpenAlex team on potential adjustments to this dating approach would be greatly appreciated.
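
As a minimal sketch of the earliest-of-two-dates rule described above: the function name and the example dates are illustrative assumptions, not part of the MAG or OpenAlex tooling, and the inputs are assumed to be ISO-format date strings.

```python
from datetime import date
from typing import Optional


def earliest_publication_date(mag_date: Optional[str], oa_date: Optional[str]) -> Optional[date]:
    """Return the earlier of two ISO-format dates, ignoring missing values."""
    candidates = [date.fromisoformat(d) for d in (mag_date, oa_date) if d]
    return min(candidates) if candidates else None


# Hypothetical values for illustration only (not the real record's dates).
print(earliest_publication_date("1991-06-01", "2002-08-06"))  # 1991-06-01
```

Taking the minimum assumes that the earlier date is always the more trustworthy one, which may not hold if either source contains a wrongly early date.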

Best regards,

Dror

oa_mag_pubdates.png

Samuel Mok

Apr 17, 2024, 5:53:59 AM
to Dror Shvadron, OpenAlex Community
I recognize this issue! I handle it by also retrieving the item's data from Crossref's API and using some heuristics to determine the 'actual' publication date. In most cases this works out fine: in general I only encounter the issue with journal articles, and most of them have a DOI registered with Crossref.
However, this doesn't work for your specific example, as the Crossref data also includes only the 2002 date. Obviously the item didn't have a DOI before being digitized, and the DOI registration doesn't include any other dates. In fact, this entry carries very little information. I also noticed that, according to this API response, the DOI was created only a couple of months ago, which seems strange. I'm not entirely sure what's going on here! I'm curious what the MAG entry for this item (2113695611) contains; I suspect the data in MAG was scraped or indexed from IEEE Xplore, which would give the proper dates. That would also explain the discrepancies: OpenAlex probably prefers the current Crossref API response over older MAG data when the two do not match, but this is just a guess on my part.
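
To give an idea of the kind of heuristic I mean, here is a rough sketch that picks the earliest date among a few date fields of a Crossref REST API work record. The choice of fields, the padding rule, and the sample record are assumptions for illustration, not my exact production logic.

```python
from typing import Optional, Tuple

# Date fields that can appear in a Crossref REST API work record.
# Which ones to consult is an assumption made for this sketch.
DATE_FIELDS = ("published-print", "published-online", "issued")


def _as_tuple(date_parts: list) -> Tuple[int, ...]:
    """Pad a partial [year] or [year, month] list to (year, month, day).

    Missing parts are filled with 1, so a year-only date sorts as January 1.
    """
    padded = list(date_parts) + [1] * (3 - len(date_parts))
    return tuple(padded[:3])


def earliest_crossref_date(message: dict) -> Optional[Tuple[int, ...]]:
    """Return the earliest date found across DATE_FIELDS, or None."""
    found = []
    for field in DATE_FIELDS:
        parts = message.get(field, {}).get("date-parts", [[]])[0]
        # Skip empty entries and entries whose year is null.
        if parts and parts[0] is not None:
            found.append(_as_tuple(parts))
    return min(found) if found else None


# Hypothetical record, not the actual API response for the example work.
record = {
    "issued": {"date-parts": [[2002]]},
    "published-print": {"date-parts": [[2002, 11, 8]]},
}
print(earliest_crossref_date(record))  # (2002, 1, 1)
```

As with the example in this thread, a heuristic like this cannot recover a 1991 date if every field in the record says 2002.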

Cheers,
Samuel

Eck, N.J.P. van (Nees Jan)

Apr 17, 2024, 8:40:13 AM
to Samuel Mok, Dror Shvadron, OpenAlex Community

In the case of Crossref, I have found that it also helps to look at the publication date of the source in which a work is published. Unfortunately, this date is not available in the JSON data of a work, only in the XML data. For the given example, the 1991 publication date can indeed be found under the proceedings metadata in the XML. It would be great if OpenAlex could make this date available as well.
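
A minimal sketch of reading that source-level date from the XML is shown below. The embedded sample is a heavily simplified, hypothetical fragment (element names follow my understanding of the Crossref schema, namespaces are omitted), so real responses will need checking against the actual structure.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Simplified, hypothetical fragment; a real Crossref XML response is
# namespaced and contains many more elements.
SAMPLE_XML = """
<conference>
  <proceedings_metadata>
    <proceedings_title>Hypothetical Proceedings</proceedings_title>
    <publication_date><year>1991</year></publication_date>
  </proceedings_metadata>
  <conference_paper>
    <publication_date media_type="online"><year>2002</year></publication_date>
  </conference_paper>
</conference>
"""


def _local(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def proceedings_year(xml_text: str) -> Optional[int]:
    """Return the first year found under proceedings_metadata, if any."""
    root = ET.fromstring(xml_text.strip())
    for elem in root.iter():
        if _local(elem.tag) == "proceedings_metadata":
            for child in elem.iter():
                if _local(child.tag) == "year" and child.text:
                    return int(child.text.strip())
    return None


print(proceedings_year(SAMPLE_XML))  # 1991
```

Matching on local tag names keeps the sketch independent of whichever namespace the response uses.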

 

Best,

Nees

Samuel Mok

Apr 17, 2024, 9:06:52 AM
to Eck, N.J.P. van (Nees Jan), Dror Shvadron, OpenAlex Community
I did not know that Crossref's XML responses include more entity data than the JSON responses. Besides the conference event data, the XML also includes the ISBN and pages, which are missing from this JSON.
Thank you Nees, this is very helpful for me as well! 

Cheers,
Samuel

Robert Chen

Apr 18, 2024, 2:11:30 PM
to OpenAlex Community
Can we get some clarity from the devs on whether the "publication_date" field in the OpenAlex API is the actual publication date or the earliest available date of electronic publication (e.g., ePub date)?

I know the documentation says earliest available date of electronic publication, but I think one of the webinars said it was the actual date, which I had confirmed by emailing OpenAlex support.

Jason Portenoy

Apr 18, 2024, 4:45:43 PM
to OpenAlex Community
Hi folks,

The publication date usually comes from Crossref. If you want to know exactly how it is derived from the Crossref API response, you can look at the code here: https://github.com/ourresearch/oadoi/blob/41f3f7004ec8764b0d5a4c11977cb0e4eb023ae0/pub.py#L276 and here: https://github.com/ourresearch/oadoi/blob/cd2a6c2f338dfd95c33a3786f74d89de892b638c/recordthresher/record_maker/crossref_record_maker.py#L68, and you might learn more by doing a search for "publication_date" in that repository ("oadoi").

We currently offer a single "publication_date," which usually corresponds (roughly) to the earliest date we can find in the work's metadata, and that is usually the earliest date of electronic publication. We are thinking about how we might change this in the future, such as by offering separate dates for electronic and print publication. The difficulty is that this gets complicated and hard to implement, as you can see if you work through the code linked above that extracts publication dates from the various incoming data sources.

We welcome the discussion happening here, and we want to see it continue so we can keep making our way toward having better data and data models. We can't necessarily offer satisfying answers to all of the questions posed in these discussions. We know that there are some discrepancies in the dates, but given the difficulty in addressing the numerous cases and edge cases, and also the fact that these discrepancies affect 1% or less of works and are generally less of an issue for newer works, I'd say it will probably be some time before we're able to implement a comprehensive solution.

Please, continue to discuss and probe! Nees's observation that Crossref's XML metadata can have information that the JSON lacks was certainly eye opening, and something we'll look into more.

Cheers,
Jason Portenoy

Robert Chen

Apr 18, 2024, 4:59:52 PM
to OpenAlex Community
Thanks, Jason!

I have generally wanted the earliest publication date just because it is easier to define than the multitudes of possible dates that a publisher sometimes can come up with. However, I have never thought of Dror's issue up until now.

I second having separate fields for the actual publication date and earliest electronic publication date, even though I realize it is a bit of a challenge.

Rob

Robert Chen

Apr 19, 2024, 9:35:16 AM
to OpenAlex Community
I just did a quick comparison of 36 works using EuropePMC's Earliest Publication Date field in their API and OpenAlex's date.

22 (~61%) of the entry dates matched and 14 (~39%) did not. I have attached the data to this post for people's reference.

I realize from Jason's post that the OpenAlex dates come from Crossref, so much of this is out of OpenAlex's control, but I thought it would be good to share.
Dates - EuroPMC vs OpenAlex.xlsx

Dror Shvadron

Apr 19, 2024, 10:37:30 AM
to OpenAlex Community
Hi Jason and all,

Thank you for the insightful response and the links to the source code. I appreciate all the work that has gone into constructing OA and realize there are complexities involved due to the nature of Crossref’s data. Also, thank you Nees for pointing out the additional fields in the Crossref XML source.

I would like to point out that about 9 million publications have a MAG date that predates the OA date (4.5% of the 193m matched articles). Of these, 3.4m have a MAG year that predates the OA year (1.7%). The 1.7m articles I wrote about earlier are the cases where the difference is 3 or more years. I do think this is an issue that could affect some users, especially those working on earlier years or on other specific subsets of the data. I have personally observed an instance where researchers were misled by the anomalous publication spikes in 2002 and 2010.
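
To make the three groups concrete, here is a rough sketch of how a matched pair of dates could be labeled. The function name, the labels, and the example dates are illustrative assumptions. Note that the counts above are nested (each group is a subset of the previous one), whereas the labels below are mutually exclusive.

```python
from datetime import date


def classify_gap(mag: date, oa: date) -> str:
    """Label how far the MAG date precedes the OpenAlex date."""
    if mag >= oa:
        return "mag_not_earlier"
    year_gap = oa.year - mag.year
    if year_gap >= 3:
        return "year_gap_3_plus"
    if year_gap >= 1:
        return "earlier_year"
    return "earlier_date_same_year"


# Hypothetical dates for illustration only.
print(classify_gap(date(1991, 6, 1), date(2002, 8, 6)))  # year_gap_3_plus
```

Summing the last three labels would give the first count above, the last two would give the second, and "year_gap_3_plus" alone corresponds to the third.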

Looking forward to further discussion and potential solutions.

Best regards,
Dror