TL;DR: the API returns quite a few undocumented fields, as well as fields whose structure differs from the docs. I made a quick notebook with Python dataclasses to test these issues; you can run it yourself from your browser, either as an app (as shown in the screenshot) or as a notebook with editable source code.
Now the actual message:
While developing a Python client for the OpenAlex API, I encountered several discrepancies between the official documentation and the actual API responses. I am documenting these issues here in the hope that they can be addressed to improve the API's reliability and usability.
The Python file defining these dataclasses can be found in my GitHub repo: https://github.com/utsmok/aletheca/blob/main/src/aletheca/entities.py. In short, I created a lot of nested dataclasses to represent the various entities and objects in the OpenAlex API, and used the dacite library in strict mode to parse the API's JSON responses into these dataclasses. Parsing then throws an error if there are any extra undeclared fields, if non-optional fields are missing, or if the data does not match the expected type(s).
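To make the idea of a strict check concrete, here is a minimal standard-library sketch. It is not the dacite code from the repo and it only compares keys, not types. The `DehydratedInstitution` class below is a hypothetical, trimmed-down stand-in for the real definitions in entities.py, and the sample record is hand-written with a placeholder ID.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Optional


@dataclass
class DehydratedInstitution:
    # Hypothetical, trimmed-down entity for illustration only.
    id: str
    display_name: str
    ror: Optional[str] = None


def strict_key_report(cls, data: dict) -> dict:
    """Report keys that are undeclared on the dataclass or required but absent."""
    declared = {f.name for f in fields(cls)}
    # A field is required when it has neither a default nor a default factory.
    required = {
        f.name
        for f in fields(cls)
        if f.default is MISSING and f.default_factory is MISSING
    }
    return {
        "undeclared": sorted(set(data) - declared),
        "missing": sorted(required - set(data)),
    }


# Hand-written record with a placeholder ID, not a real API response.
sample = {
    "id": "https://openalex.org/I0000000000",
    "display_name": "Example University",
    "type_id": "placeholder",
}
report = strict_key_report(DehydratedInstitution, sample)
print(report)
```

Running a report like this over a sample set gives a quick list of candidate undocumented fields before any type checking is involved.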
The data was retrieved from 8 endpoints (works, authors, institutions, sources, publishers, concepts, funders, and topics) using the ?sample=n parameter, and no other filters or parameters. Most sample sets were of size 50. A marimo notebook was made to run the queries and analyze the results, which can be found in the repo as well, here: https://github.com/utsmok/aletheca/blob/main/marimo_checks/check_entity_dataclasses.py, or run directly in the cloud using marimo's services: https://marimo.app/gh/utsmok/aletheca/main?entrypoint=marimo_checks%2Fcheck_entity_dataclasses.py
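For anyone who wants to reproduce the sampling without the notebook, the request URLs can be built along these lines. This is only a sketch: `sample_url` is a helper name made up for this example, and it does not perform the request itself, so any HTTP client can be used for that.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org"

# The eight endpoints that were sampled.
ENDPOINTS = [
    "works",
    "authors",
    "institutions",
    "sources",
    "publishers",
    "concepts",
    "funders",
    "topics",
]


def sample_url(endpoint: str, n: int = 50) -> str:
    """Build a URL that requests a random sample of n entities, with no other parameters."""
    return f"{BASE_URL}/{endpoint}?{urlencode({'sample': n})}"


urls = [sample_url(endpoint) for endpoint in ENDPOINTS]
for url in urls:
    print(url)
```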
Now, on to the issues:
Issues shared between multiple entities
id: I encountered dehydrated entities which had a null value for the id field, something that should not happen according to the documentation. Example: Found in authorships for https://openalex.org/W7104996979
score and relevance_score: These fields appear in API responses where they should not: e.g., when using ?sample=10. The values were very high, close to 1.0. Elastic backend error?
created_date and updated_date: One or both of these fields were sometimes missing for core entities. The created date should of course always be present, and the updated date can always use the created date as a fallback -- so neither should ever be empty. Example: https://openalex.org/P4361727468
topics, topic_share: These fields were found in multiple entities (Source, Institution) but are not documented anywhere for those endpoints.
nullable and/or missing fields of type list or bool: Several fields documented as lists or booleans were found to be null or missing instead of being empty lists or false. It would be very helpful if the API was consistent about this: either always return the field with a default value ([] or false, for example), or always omit it when there is no value. Right now it's a crapshoot -- we not only have to check whether the value is empty or null before e.g. iterating over a list, but also whether the field is present at all!
Examples:
funders: The works.funders field is not documented. It appears to be a list of DehydratedFunder objects, which aren't documented either.
institutions: A top-level institutions field (list of DehydratedInstitution?) is present but not documented.
has_content: The has_content object ({pdf: bool, grobid_xml: bool}) is not documented.
institution_assertions, is_xpac (bool), awards are also present and undocumented, but I haven't properly investigated them yet.
has_fulltext: was found to be null instead of false. Example: https://openalex.org/W3028709719
indexed_in: The documentation lists only a few valid values; datacite is not among them, but is returned by the API.
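Until the null-versus-missing behaviour described above is consistent, a client-side workaround could look roughly like this. It is a sketch, not a recommendation: `normalize` is a made-up helper name, the record is hand-written, and coercing null to false throws away the difference between "unknown" and "false", which may matter for some fields.

```python
def normalize(record: dict, list_fields=(), bool_fields=()) -> dict:
    """Return a copy in which absent or null list/bool fields get a default value."""
    out = dict(record)
    for name in list_fields:
        # dict.get returns None both for a missing key and for an explicit null.
        if out.get(name) is None:
            out[name] = []
    for name in bool_fields:
        if out.get(name) is None:
            out[name] = False
    return out


# Hand-written record: has_fulltext is null and funders is absent entirely.
work = {"id": "https://openalex.org/W3028709719", "has_fulltext": None}
clean = normalize(work, list_fields=("funders",), bool_fields=("has_fulltext",))

for funder in clean["funders"]:  # safe to iterate, no presence or null checks
    print(funder)
print(clean["has_fulltext"])
```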
Work-Location Object (e.g. best_oa_location, primary_location, locations)
I know these will be deprecated 'soon' -- but as they are still in the API after the refresh and the 'soon' message has been there for a long time, I figured it's still worth reporting these issues. I did notice a lot of problems with x_concepts and other concept-related fields in the responses, so maybe it's a good time to drop them now?
Interesting post, Samuel! While I see that some of the issues you found have been solved in the meantime, I also found that some fields disappeared from the Works endpoint after the launch of Walden.
This is what I saw in the Works endpoint, and it would be nice to know when it is a good moment to change/adapt our software to be future-proof. Is this how the JSON responses will look from now on?
Cheers,
Ivo
primary_location: ID is new field
primary_location: is_indexed_in_scopus is gone
primary_location: raw_type is new
primary_location: raw_source_name is new
type_crossref is gone
institution_assertions has become institution
has_fulltext is gone
fulltext_origin is gone
datasets is gone
versions is gone
awards is new field
funders is new field
has_content is new field with pdf and grobid_xml
abstract_inverted_index_v3 ?
cited_by_api_url is gone
On Saturday 15 November 2025 at 11:26:48 UTC+1, sam...@gmail.com wrote:
--
You received this message because you are subscribed to the Google Groups "OpenAlex Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openalex-commun...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/openalex-community/fc3df49e-8c0a-4e94-9970-00ed4252920bn%40googlegroups.com.
| Field Name | Description / Notes | Link | Status |
|---|---|---|---|
| institution_assertions | | Work Object | ⚠️ Missing |
| institutions | List of dehydrated institutions. | Work Object | ⚠️ Missing |
| datasets | | Work Object | ⚠️ Missing |
| versions | List of version URLs. | Work Object | ⚠️ Missing |
| has_fulltext | Boolean | Work Object | ⚠️ Missing |
| cited_by_percentile_year | | Work Object | ✅ Documented |
| is_xpac | Boolean flag. | Work Object | ✅ Documented |
| license_id | Distinct from license string. | Location Object | ✅ Documented |
| raw_source_name | Raw string used to match source. | Location Object | ✅ Documented |
| id | Location object ID. | Location Object | ✅ Documented |
| raw_type | Raw type string. | Location Object | ✅ Documented |
| topics | | Author Object | ⚠️ Missing |
| topic_share | | Author Object | ⚠️ Missing |
| topics | | Source Object | ⚠️ Missing |
| topic_share | | Source Object | ⚠️ Missing |
| is_indexed_in_scopus | Boolean flag. | Source Object | ⚠️ Missing |
| oa_flip_year | Integer year. | Source Object | ⚠️ Missing |
| is_high_oa_rate | Boolean flag. | Source Object | ⚠️ Missing |
| is_ojs | Boolean flag. | Source Object | ⚠️ Missing |
| is_in_scielo | Boolean flag. | Source Object | ⚠️ Missing |
| is_high_oa_rate_since_year | Integer year. | Source Object | ⚠️ Missing |
| is_in_doaj_since_year | Integer year. | Source Object | ⚠️ Missing |
| oa_works_count | Integer count. | Source Object | ⚠️ Missing |
| last_publication_year | Integer year. | Source Object | ⚠️ Missing |
| first_publication_year | Integer year. | Source Object | ⚠️ Missing |
| host_organization_lineage_names | List of names. | Source Object | ⚠️ Missing |
| raw_type | Raw string type. | Source Object | ⚠️ Missing |
| topics | | Institution Object | ⚠️ Missing |
| topic_share | | Institution Object | ⚠️ Missing |
| type_id | String ID. | Institution Object | ⚠️ Missing |
| homepage_url | URL string. | Publisher Object | ⚠️ Missing |
| oa_works_count | Found in counts_by_year for non-Work entities. | e.g. Author Object#counts_by_year | ⚠️ Missing |
| description | Localized description dict (additional top-level key after display_name). | e.g. Concept Object#international | ⚠️ Missing |

| Field Name | Issue Details | Link | Status |
|---|---|---|---|
| indexed_in | API returns "datacite"; docs do not list this value. | Work Object#indexed_in | ⚠️ Missing value |
| type | API returns "funder"; docs do not list this value. | Institution Object#type | ⚠️ Missing value |
| associated_institutions | relationship returns "successor"; docs do not list this value. | Institution Object#associated_institutions | ⚠️ Missing value |
| parent_publisher | API returns an object (dict); docs say it is a String (ID). | Publisher Object#parent_publisher | ⚠️ Mismatched type |
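
For the parent_publisher type mismatch, a client can accept both shapes until the docs and the API agree. The sketch below assumes the returned object carries its identifier under an "id" key, which is an assumption rather than something confirmed by the docs. `parent_publisher_id` is a made-up helper name and the values are hand-written.

```python
from typing import Optional, Union


def parent_publisher_id(value: Union[str, dict, None]) -> Optional[str]:
    """Return the parent publisher's ID whether the API sent a string or an object."""
    if value is None:
        return None
    if isinstance(value, str):
        # Documented shape: a plain ID string.
        return value
    if isinstance(value, dict):
        # Observed shape: an object; the "id" key is an assumption.
        return value.get("id")
    raise TypeError(f"unexpected parent_publisher type: {type(value).__name__}")


# Hand-written values, not real API responses.
as_string = parent_publisher_id("https://openalex.org/P4361727468")
as_object = parent_publisher_id(
    {"id": "https://openalex.org/P4361727468", "display_name": "Example Publisher"}
)
print(as_string == as_object)
```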