Hi OpenAlex Team and Community,
I have been conducting a comparative audit of University of Arizona (UA) institutional data across two OpenAlex data pulls (November 2025 vs. May 2026) for publication year 2024. While looking at the variance, I had a few questions about whether I am interpreting correctly how OpenAlex handles ingestion, deduplication, and ID retention.
The Metrics:
Total UA Works (Publication Year 2024): 9,492 (May 2026 pull) vs. 7,951 (Nov 2025 pull)
Total Citations (cited_by_count): 377,643 vs. 305,670
Net Ingestion Changes: +87,398 newly indexed citation links
Dropped IDs: −15,425 works decreased/removed from the original dataset pool (~5%)
Targeted Recovery: ~2,387 works (~15% of the dropped IDs) were successfully recovered by re-pulling those exact missing OpenAlex IDs using the same API.
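For anyone checking the arithmetic, here is a minimal Python sketch that reconciles the figures above. One assumption to flag: the gained and dropped counts only balance against the cited_by_count totals (not the works totals), so the ~5% is computed here against the November citation count. The variable names are mine.

```python
# Figures copied from the metrics above.
works_nov, works_may = 7_951, 9_492
cites_nov, cites_may = 305_670, 377_643
gained, dropped, recovered = 87_398, 15_425, 2_387

# Growth in 2024 works between the two pulls.
works_growth = (works_may - works_nov) / works_nov

# Net change in cited_by_count, and the same net derived from gained/dropped.
net_cites = cites_may - cites_nov
net_from_flows = gained - dropped

# Assumption: the drop rate is measured against the November citation total.
drop_rate = dropped / cites_nov
recovery_rate = recovered / dropped
unrecovered = dropped - recovered

print(f"works growth:   {works_growth:.1%}")
print(f"net citations:  {net_cites:,} (from flows: {net_from_flows:,})")
print(f"drop rate:      {drop_rate:.1%}")
print(f"recovery rate:  {recovery_rate:.1%} ({unrecovered:,} unrecovered)")
```

The two net figures agree, which is why I read the gained and dropped counts as being on the same scale as cited_by_count.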
1. Growth vs. Ingestion Timeline
My Interpretation: The substantial growth in 2024 publication counts reflects active, ongoing backfilling and ingestion pipelines targeting the most recent two-year window.
Question: Is a ~19% increase for a publication year that is less than two years old typical over the 6 months between snapshots, and at what point does a publication year's volume generally reach a "steady state" in OpenAlex?
2. The 5% Drop in IDs
My Interpretation: The drop of over 15,000 works is expected database behavior due to aggressive deduplication, metadata corrections, or record merges.
Question: When an ID disappears from an institutional filter but remains valid in the database, does this usually mean the work was re-classified (e.g., its institutional affiliation string was stripped/corrected), or is there another common backend reason for this?
Question: Does the OpenAlex team consider a ~5% baseline variance in IDs normal between semi-annual snapshots, or should institutional data trackers expect higher or lower margins of volatility?
3. The 15% Re-Pulled Recovery Rate
My Interpretation: Because I was able to recover ~2,387 records (~15% of the dropped IDs) by querying the missing IDs directly, I assumed the initial omission was due to transient server/network lag, timeout constraints, or API pagination gaps during bulk queries.
Question: Are transient omissions common when running large API filter calls, or could these IDs have been temporarily "de-indexed" and then reinstated between my snapshot intervals?
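To rule out pagination gaps on my side, this is roughly how a bulk pull with cursor paging could look. It is a sketch, not my exact script: the ROR value is a placeholder, the filter string is my assumption of the right filter for this audit, and `fetch_page` and `collect_ids` are names I made up. The `cursor=*` start value and `meta.next_cursor` field are the cursor-paging mechanism described in the OpenAlex API docs.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://api.openalex.org/works"
# Placeholder ROR; substitute the institution's real ROR ID.
FILTER = "authorships.institutions.ror:https://ror.org/XXXXXXXXX,publication_year:2024"


def fetch_page(cursor):
    """Fetch one page of work IDs for the given cursor value."""
    query = urllib.parse.urlencode(
        {"filter": FILTER, "select": "id", "per-page": 200, "cursor": cursor}
    )
    with urllib.request.urlopen(f"{BASE}?{query}", timeout=60) as resp:
        return json.load(resp)


def collect_ids(fetch=fetch_page):
    """Walk every page until next_cursor is empty and return the set of IDs."""
    ids, cursor = set(), "*"
    while cursor:
        page = fetch(cursor)
        ids.update(work["id"] for work in page["results"])
        cursor = page["meta"].get("next_cursor")
    return ids
```

Comparing `len(collect_ids())` with the `meta.count` value from the first page would show whether a pull silently lost records.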
My Interpretation: The remaining ~85% (unrecoverable) have been completely deprecated or deleted from the live database. This points to permanent entity merging (deduplication), where these IDs were absorbed into a "master" record, or to works being purged entirely during source data cleanup.
Question: When an exact OpenAlex ID returns a 404 or fails to resolve in a direct query, is there an endpoint or a specific field (like a merged_into or replaced_by property) we can check to see which new ID inherited its citations and metadata?
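In case it helps frame the question above, here is a hedged sketch of how the dropped IDs could be sorted by direct lookup. It assumes that a merged ID, if it resolves at all, comes back with a different `id` in the response body; I have not confirmed that this is how OpenAlex handles every merge. `fetch_work` and `classify` are my own helper names.

```python
import json
import urllib.error
import urllib.request

API = "https://api.openalex.org/works/"


def fetch_work(short_id):
    """Return (HTTP status, parsed JSON or None) for an ID such as 'W123'."""
    try:
        # urlopen follows redirects, so a redirected ID returns the final record.
        with urllib.request.urlopen(API + short_id, timeout=30) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        return err.code, None


def classify(short_id, fetch=fetch_work):
    """Label one ID as 'live', 'merged', 'missing', or 'error'."""
    status, body = fetch(short_id)
    if status == 404:
        return "missing", None
    if body is None:
        return "error", None  # e.g. rate limiting; worth retrying later
    resolved = body["id"].rsplit("/", 1)[-1]
    if resolved != short_id:
        return "merged", resolved  # assumption: differing id means a merge
    return "live", resolved
```

Running `classify` over the 15,425 dropped IDs would split them into live, merged, and missing groups, and any "merged" result would carry the ID of the surviving record.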
4. Baseline Snapshot Variance
My Interpretation: Overall, these shifts and fluctuations fall within a normal, expected range of variation between snapshots for a large research institution.
I would love to get your insights on whether my logic holds up, or if there are nuances to the OpenAlex ingestion and entity-merging architecture that I am misinterpreting.
Thank you for your time and for building such an incredible open resource!
Best regards,
Yan Han
The University of Arizona Libraries