Feedback after loading latest snapshot


Jamie

Jun 2, 2023, 10:34:53 AM
to OpenAlex users
Hi OpenAlex Team

Thanks for making such a useful dataset! I have some feedback after loading the latest snapshot in case it is useful.

Institutions
There are at least two institutions with an undocumented relationship field in the root of the document in addition to associated_institutions.relationship. I think this might be a mistake. Examples include https://openalex.org/I4210139938 and https://openalex.org/I4210090354 in institutions/updated_date=2023-05-03.

Works: inconsistent APC data
The APC data seems to be in an inconsistent state with some entities using apc_paid and apc_paid_usd data structures and others using the apc_payment data structure (the documented field).

For example, there are 16,307 entities using apc_paid and apc_paid_usd in works/updated_date=2023-04-29/part_026.gz rather than apc_payment. Because the file is gzipped, you can decompress it and filter with jq to see them: gunzip -c filename | jq -c 'select(.apc_paid_usd != null)'
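The jq check above can also be sketched in Python, which may be handy inside a loading pipeline. This is only a rough sketch: the function name count_legacy_apc and the LEGACY_APC_FIELDS constant are my own, it assumes the file is gzipped JSON Lines with one work per line, and unlike the jq filter it counts a work if either apc_paid or apc_paid_usd is non-null.

```python
import gzip
import json

# Undocumented field names seen in the snapshot (apc_payment is the documented one).
LEGACY_APC_FIELDS = ("apc_paid", "apc_paid_usd")


def count_legacy_apc(path):
    """Count works in a gzipped JSON Lines file with a non-null legacy APC field."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            work = json.loads(line)
            if any(work.get(field) is not None for field in LEGACY_APC_FIELDS):
                count += 1
    return count
```

Calling count_legacy_apc("part_026.gz") on a locally downloaded part file would then give the number of affected works in that file.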

Perhaps apc_paid and apc_paid_usd are older data structures that haven't been migrated to apc_payment for all works yet?

Works: null values in arrays
The corresponding_institution_ids arrays sometimes contain null values, e.g. [null]. It would be great if these null values were removed, because databases such as BigQuery do not accept null values in arrays, which means that they have to be removed during pre-processing.

An example is https://openalex.org/W2136548344 in works/updated_date=2023-04-15/part_000.gz
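For anyone hitting the same problem, the pre-processing step described above can be sketched roughly as follows. The helper name strip_nulls is my own, and it assumes each work has already been parsed into a Python dict; the default field name is the one from the snapshot.

```python
def strip_nulls(work, field="corresponding_institution_ids"):
    """Remove None entries from a list-valued field of a work, in place.

    Works where the field is missing or is not a list are left untouched.
    """
    values = work.get(field)
    if isinstance(values, list):
        work[field] = [v for v in values if v is not None]
    return work
```

The same helper can be pointed at other array fields by passing a different field name, if nulls turn up elsewhere.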

merged_ids: empty CSV file
There is an empty CSV file in merged_ids/sources:
https://openalex.s3.amazonaws.com/browse.html#data/merged_ids/sources/.csv

merged_ids: no manifest?
Would it be possible to supply a Redshift manifest for the merged_ids? Without one, two different methods are required to list the available files: reading the manifest for each entity type, and using boto3 or the AWS API to iterate over the merged_ids objects in the bucket.
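To illustrate the manifest-based method, here is a minimal sketch. It assumes the manifest follows the usual Redshift layout, a JSON object with an "entries" list whose items each carry a "url"; the function name manifest_urls is my own. For merged_ids there is currently nothing to feed into it, so the objects would have to be listed from the bucket instead (for example with a boto3 list_objects_v2 paginator).

```python
import json


def manifest_urls(manifest_text):
    """Return the file URLs listed in a Redshift-style manifest.

    Assumes the layout {"entries": [{"url": "s3://..."}, ...]}.
    """
    manifest = json.loads(manifest_text)
    return [entry["url"] for entry in manifest.get("entries", [])]
```

If merged_ids had a manifest in the same layout, one function like this would cover every part of the snapshot.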

Kind regards

Jamie Diprose

Casey Meyer

Jun 2, 2023, 2:24:37 PM
to Jamie, OpenAlex users
Hi Jamie,

Thanks for the great feedback! For the first two, we sometimes try out different fields before documenting them and making them live in the API. We don't have an easy way to remove those from the snapshot (as of now). So Institution.relationship is not valid, and the valid object for APC data in works is Work.apc_payment.

Those outdated fields will get removed as we update records. Someday we should have a cleaner way to remove those that we're testing out.

We'll check on the [null] values that are making it into is_corresponding. That shouldn't be happening. We'll also try to create a manifest for merged_ids. I added that to our backlog! Thanks again.

Casey





--
Casey Meyer
Developer - OpenAlex, Unpaywall
OurResearch
We build tools to make scholarly research more open, connected, and reusable, for everyone.