Thanks for making such a useful dataset! I have some feedback after loading the latest snapshot in case it is useful.

Institutions
There are at least two institutions with an undocumented relationship field in the root of the document, in addition to associated_institutions.relationship. I think this might be a mistake. Examples include
https://openalex.org/I4210139938 and
https://openalex.org/I4210090354 in institutions/updated_date=2023-05-03.
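
In case it helps to find other affected records, here is a rough Python sketch of the check I have in mind. It assumes the snapshot part files are gzip-compressed JSON Lines; the function names are my own and purely illustrative.

```python
import gzip
import json


def ids_with_root_relationship(lines):
    """Yield the id of every record that has a top-level 'relationship' key."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Only the root of the document is checked here; the nested
        # associated_institutions[].relationship field is the documented one.
        if "relationship" in record:
            yield record.get("id")


def scan_file(path):
    """Run the check over one gzip-compressed JSON Lines part file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(ids_with_root_relationship(handle))
```

Calling scan_file on each part file under institutions/ should give the full list of affected institutions.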

Works: inconsistent APC data
The APC data seems to be in an inconsistent state, with some entities using the apc_paid and apc_paid_usd data structures and others using the apc_payment data structure (the documented field).
For example, there are 16,307 entities using apc_paid and apc_paid_usd in works/updated_date=2023-04-29/part_026.gz rather than apc_payment. You can see them by running gunzip -c filename | jq -c 'select(.apc_paid_usd != null)' on the file.
Perhaps apc_paid and apc_paid_usd are older data structures that haven't been migrated to apc_payment for all works yet?
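
To compare how many works use each structure, something like the following Python sketch could be used. It assumes the part file is gzip-compressed JSON Lines and, like the jq filter, treats a field as "used" when it is present and not null. The helper names are hypothetical.

```python
import gzip
import json
from collections import Counter

# The documented field plus the two undocumented ones mentioned above.
APC_FIELDS = ("apc_payment", "apc_paid", "apc_paid_usd")


def count_apc_fields(lines):
    """Count how many works carry a non-null value for each APC field."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        work = json.loads(line)
        for field in APC_FIELDS:
            if work.get(field) is not None:
                counts[field] += 1
    return counts


def count_apc_fields_in_file(path):
    """Apply the count to one gzip-compressed JSON Lines part file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_apc_fields(handle)
```

Running count_apc_fields_in_file over each part file would show whether the older structures are limited to certain update dates.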

Works: null values in arrays
The corresponding_institution_ids arrays sometimes contain null values, e.g. [null]. It would be great if these null values were removed, because databases such as BigQuery do not accept null values in arrays, which means that they have to be removed during pre-processing.
An example is
https://openalex.org/W2136548344 in works/updated_date=2023-04-15/part_000.gz
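
For reference, the pre-processing step this currently requires looks roughly like the Python sketch below. It is only an illustration of the workaround, and drop_null_ids is a name I made up.

```python
def drop_null_ids(work, field="corresponding_institution_ids"):
    """Remove null entries from one array field of a work record, in place."""
    values = work.get(field)
    if isinstance(values, list):
        # [null] becomes [], and real IDs are kept in their original order.
        work[field] = [value for value in values if value is not None]
    return work
```

A record with corresponding_institution_ids set to [null] comes out with an empty array, which loads without complaint.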

merged_ids: empty CSV file
There is an empty CSV file in merged_ids/sources:
https://openalex.s3.amazonaws.com/browse.html#data/merged_ids/sources/.csv

merged_ids: no manifest?
Would it be possible to supply a Redshift manifest for the merged_ids? Without one, two different methods are required to list the available files: one that reads each entity's manifest, and another that uses boto3 or the AWS API to iterate over the merged_ids objects in the bucket.
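
To illustrate what I mean, here is a small Python sketch of the manifest-based listing. It assumes the usual Redshift manifest layout (a top-level "entries" list whose items each have a "url"); the function names and the example URL are hypothetical.

```python
import json


def manifest_urls(manifest_text):
    """Return the file URLs listed in a Redshift-style manifest document."""
    manifest = json.loads(manifest_text)
    return [entry["url"] for entry in manifest.get("entries", [])]


def build_manifest(urls):
    """Build a minimal Redshift-style manifest from a list of file URLs."""
    return json.dumps({"entries": [{"url": url} for url in urls]})


# Hypothetical merged_ids file name, used only to show the round trip.
example = build_manifest(["s3://openalex/data/merged_ids/sources/example.csv.gz"])
print(manifest_urls(example))
```

With a manifest like this for merged_ids, the same manifest_urls call would work for the entities and the merged_ids alike, with no need to list bucket objects separately.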