Hi All,
Indeed, the dataset coverage in OpenAlex is still at times a bit muddy, but as Rainer mentions, I think this is often not the fault of OpenAlex but of upstream sources.
For example, searching for datasets from authors affiliated with my institution, I can find this dataset, 10.48527/kiper0 (OpenAlex, repository page), which is composed of 6 data files and 1 README file. Because of the configuration of this particular Dataverse, each file gets its own unique PID (created by appending a few characters to the end of the dataset-level PID). What that means, however, is that these file-level PIDs are also being ingested into OpenAlex and each treated as a fully fledged dataset (for example: 10.48527/kiper0/vsmrgo [OpenAlex, repository page]). The result is that what should constitute a single dataset entry in DataCite (and OpenAlex, and other downstream targets) ends up appearing as 8 discrete entries: the dataset-level record plus its 7 file-level records. That means your dataset counts may be artificially inflated by more than just versioning.
I will say that the DataCite entry for the above dataset-level PID contains HasPart relatedIdentifiers for its 7 child datafile records (and for completeness, the children have IsPartOf relatedIdentifiers pointing to their parent [example]). Similarly, for the record that Rainer linked, the DataCite JSON uses the IsVersionOf relatedIdentifier to link the versions of the dataset to one another. So it would seem that a relevant first step would be for OpenAlex to preserve some of these relatedIdentifier fields, allowing easier downstream identification of versions and parent-child relationships.
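As a rough illustration of how preserved relatedIdentifiers could be used downstream, here is a minimal Python sketch that collapses file-level and version-level records under their parent before counting. The record shape (a `doi` key plus a `relatedIdentifiers` list with `relationType`, `relatedIdentifierType`, and `relatedIdentifier`) is a simplification I am assuming, loosely modeled on the DataCite metadata; the DOIs under 10.1234 are made-up placeholders, and `collapse` is a hypothetical helper, not anything OpenAlex or DataCite provides.

```python
from collections import defaultdict

# Relation types that mark a record as a component or a version of another record.
CHILD_RELATIONS = {"IsPartOf", "IsVersionOf"}


def normalize(doi):
    """DOIs are case-insensitive, so compare them in lowercase."""
    return doi.strip().lower()


def collapse(records):
    """Group records under the top-most record they point to.

    Returns a dict mapping each root DOI to the list of DOIs grouped under it
    (including the root itself when it appears in `records`).
    """
    parent_of = {}
    for rec in records:
        doi = normalize(rec["doi"])
        for rel in rec.get("relatedIdentifiers", []):
            if (rel.get("relationType") in CHILD_RELATIONS
                    and rel.get("relatedIdentifierType") == "DOI"):
                # Keep the first parent-like link; a real pipeline might need
                # a rule for records that carry several.
                parent_of[doi] = normalize(rel["relatedIdentifier"])
                break

    def root(doi):
        # Follow parent links upward, guarding against accidental cycles.
        seen = set()
        while doi in parent_of and doi not in seen:
            seen.add(doi)
            doi = parent_of[doi]
        return doi

    groups = defaultdict(list)
    for rec in records:
        doi = normalize(rec["doi"])
        groups[root(doi)].append(doi)
    return dict(groups)


# Placeholder data: one dataset-level record and 7 file-level children.
records = [{"doi": "10.1234/parent", "relatedIdentifiers": []}]
for i in range(1, 8):
    records.append({
        "doi": f"10.1234/parent/file{i}",
        "relatedIdentifiers": [{
            "relationType": "IsPartOf",
            "relatedIdentifierType": "DOI",
            "relatedIdentifier": "10.1234/PARENT",
        }],
    })

groups = collapse(records)
print(len(records), "records collapse to", len(groups), "dataset(s)")
```

This only works if the relatedIdentifier fields survive ingestion, which is the point above: without them, a downstream consumer is left guessing at parent-child links from the shape of the DOI suffix.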
Best,
Kevin