Large number of datasets in 2022

Jason Portenoy

Jun 30, 2023, 9:24:55 AM
to OpenAlex users

Recently, a member of this group, ibragi...@gmail.com (thank you!), pointed out a few data issues that they had found. One of them involved the University of Michigan in Ann Arbor, which shows a large spike in works in the year 2022. Here is a screenshot of the OpenAlex data for that institution:

There is certainly something odd going on there. The number of works in 2022 is more than an order of magnitude higher than in any other year. If we look at the types of this university’s works, we see a very large number of dataset works:

And if we look at the top authors, we see two researchers (probably the same person) with an extraordinary number of works:

Indeed, if we turn off datasets, the chart of works by year for this university looks much more like what we might expect:

So what’s going on here, and how do we fix it? Well, unfortunately, there’s not an easy answer, as it is tied up in the idea of what we consider a “work,” and how that definition may be changing over time. Each of those nearly 400,000 datasets is indeed an individual dataset work registered with its own DOI in Crossref, created by a researcher affiliated with the University of Michigan Ann Arbor, as part of the ENCODE project.


OpenAlex strives to be a comprehensive and inclusive source of information, so it is not in our nature to throw away these data points. However, this is not in line with what people expect to see when they ask the question: “How many publications did this university publish in this year?”, especially in this extreme case. We have certainly been discussing this internally, and we look forward to the broader discussion in the community. For now, we will stay the course and keep the data as is. Users of the data can always use filters to exclude works they are not interested in. We may at some point start introducing some sensible default filters as a starting point. And certainly, this will be an interesting data point as we (all of us) continually reevaluate the landscape of bibliometric data.
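
For anyone who wants to try the filtering approach mentioned above, here is a minimal sketch in Python of a works-per-year query that leaves out datasets. It assumes the `/works` endpoint with the `institutions.id` and `type` filters, `!` negation, and `group_by=publication_year`. The institution ID `I0000000000` is a made-up placeholder (look up the real one for your institution), and the function names are only illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def works_by_year_url(institution_id, exclude_types=()):
    """Build a works-per-year query URL for one institution.

    institution_id: an OpenAlex institution ID (placeholder in this sketch).
    exclude_types: work types to leave out, e.g. ("dataset",).
    """
    filters = [f"institutions.id:{institution_id}"]
    # A leading "!" negates a filter value; commas combine filters with AND.
    filters += [f"type:!{work_type}" for work_type in exclude_types]
    params = {"filter": ",".join(filters), "group_by": "publication_year"}
    # Keep ":", "," and "!" unescaped so the URL stays readable.
    return f"{OPENALEX_WORKS_URL}?{urlencode(params, safe=':,!')}"


def fetch_counts_by_year(url):
    """Fetch a grouped query and return {year: count}.

    Assumes the response carries a "group_by" list of objects
    with "key" and "count" fields. Needs network access.
    """
    with urlopen(url) as response:
        payload = json.load(response)
    return {group["key"]: group["count"] for group in payload["group_by"]}


# Hypothetical placeholder ID, not a real institution.
url_all = works_by_year_url("I0000000000")
url_no_datasets = works_by_year_url("I0000000000", exclude_types=("dataset",))
print(url_no_datasets)
```

Passing each URL to `fetch_counts_by_year` and comparing the two results for 2022 should show how much of a spike comes from datasets alone.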


Cheers,

Jason Portenoy
