Questions on snapshot variance and ID drops: Institutional data audit for Year 2024 data

Y Han

Jun 13, 2026, 10:07:24 PM
to OpenAlex Community

Hi OpenAlex Team and Community,

I have been conducting a comparative audit of University of Arizona (UA) institutional data across two OpenAlex data pulls (November 2025 vs. May 2026) for publication year 2024. While looking at the variance, I had a few questions about my interpretation of how OpenAlex handles ingestion, deduplication, and ID retention.

The Metrics:
  • Total UA Works (Publication Year 2024): 9,492 (May 2026 pull) vs. 7,951 (Nov 2025 pull)

  • Total Citations (cited_by_count): 377,643 (May 2026 pull) vs. 305,670 (Nov 2025 pull)

  • Net Ingestion Changes: +87,398 newly indexed citation links

  • Dropped IDs: −15,425 works removed from the original dataset pool (~5%)

  • Targeted Recovery: ~2,387 works (~15% of the dropped IDs) were successfully recovered by re-pulling those exact missing OpenAlex IDs using the same API. 
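
For anyone who wants to check the percentages, here is a minimal sketch of the arithmetic behind the figures above. The variable names and the `pct` helper are mine. One assumption to flag: the ~5% figure reproduces only when the dropped count is divided by the Nov 2025 citation total (305,670), so that is the denominator used here.

```python
# Figures copied from the metrics list above.
works_may, works_nov = 9_492, 7_951
cites_may, cites_nov = 377_643, 305_670
added, dropped, recovered = 87_398, 15_425, 2_387


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


works_growth = pct(works_may - works_nov, works_nov)  # growth in 2024 works
cite_delta = cites_may - cites_nov                    # change in total citations
net_change = added - dropped                          # ingestion minus drops
# Assumption: the ~5% uses the Nov 2025 citation total as its denominator.
dropped_share = pct(dropped, cites_nov)
recovered_share = pct(recovered, dropped)

print(works_growth, cite_delta, net_change, dropped_share, recovered_share)
```

The net change (added minus dropped) equals the difference between the two citation totals, so the four numbers are at least internally consistent.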

Questions on My Interpretation:

1. Growth vs. Ingestion Timeline

  • My Interpretation: The substantial growth in 2024 publication counts reflects active, ongoing backfilling and ingestion pipelines targeting the most recent two-year window.

  • Question: Is a ~19% increase over a 6-month snapshot interval typical for a recent publication year (less than two years old), and at what point does a publication year's volume generally reach a "steady state" in OpenAlex?

2. The 5% Drop in IDs

  • My Interpretation: The drop of over 15,000 works is expected database behavior due to aggressive deduplication, metadata corrections, or record merges.

  • Question: When an ID disappears from an institutional filter but remains valid in the database, does this usually mean the work was re-classified (e.g., its institutional affiliation string was stripped/corrected), or is there another common backend reason for this?

  • Question: Does the OpenAlex team consider a ~5% baseline variance in IDs normal between semi-annual snapshots, or should institutional data trackers expect higher or lower margins of volatility?
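
To make the comparison concrete, this is roughly how the dropped set can be isolated from two pulls. It is only a sketch: `normalize` and `diff_pools` are hypothetical helper names, the IDs are toy values rather than real UA records, and it assumes each pull has been saved as a plain list of OpenAlex work IDs (either bare keys or full `https://openalex.org/...` URLs).

```python
def normalize(work_id):
    """Reduce 'https://openalex.org/W123' or 'w123' to the bare key 'W123'."""
    return work_id.strip().rsplit("/", 1)[-1].upper()


def diff_pools(old_ids, new_ids):
    """Split two ID pools into dropped, added, and kept sets (sorted lists)."""
    old = {normalize(i) for i in old_ids}
    new = {normalize(i) for i in new_ids}
    return {
        "dropped": sorted(old - new),  # in the old pull only
        "added": sorted(new - old),    # in the new pull only
        "kept": sorted(old & new),     # present in both pulls
    }


# Toy IDs for illustration only.
nov_2025 = ["https://openalex.org/W1", "W2", "W3"]
may_2026 = ["W2", "https://openalex.org/W3", "W4"]
result = diff_pools(nov_2025, may_2026)
print(result)
```

Normalizing first matters because the same work can appear as a URL in one export and a bare key in another, which would otherwise inflate both the dropped and added counts.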

3. The 15% Re-Pulled Recovery Rate

  • My Interpretation: Because I was able to recover ~2,387 records (~15% of the dropped IDs) by querying the missing IDs directly, I assumed the initial omission was due to transient server/network lag, timeout constraints, or API pagination gaps during bulk queries.

  • Question: Are transient omissions common when running large API filter calls, or could these IDs have been temporarily "de-indexed" and then reinstated between my snapshot intervals?

  • My Interpretation (the unrecoverable 85%): These IDs appear to have been deprecated or deleted from the live database. This points to permanent entity merging (deduplication), where these IDs were absorbed by a "master" record, or to works being purged entirely during source data cleanup.

  • Question: When an exact OpenAlex ID returns a 404 or fails to resolve in a direct query, is there an endpoint or a specific field (like a merged_into or replaced_by property) we can check to see which new ID inherited its citations and metadata?
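
For reference, the re-pull step can be sketched as below. The function names (`fetch`, `classify`, `recheck`) are hypothetical, and the bucket labels are my own. The single-work endpoint `https://api.openalex.org/works/{id}` is the public one. The "redirected" bucket rests on an assumption I have not confirmed: that a direct lookup of a merged ID may resolve to a record carrying a different ID.

```python
import json
import urllib.error
import urllib.request

API = "https://api.openalex.org/works/"  # public single-work endpoint


def fetch(work_id):
    """Return (http_status, resolved_id) for one OpenAlex work ID."""
    try:
        with urllib.request.urlopen(API + work_id, timeout=30) as resp:
            body = json.load(resp)
            # "id" comes back as a full URL such as https://openalex.org/W123
            return resp.status, body["id"].rsplit("/", 1)[-1]
    except urllib.error.HTTPError as err:
        return err.code, None


def classify(work_id, status, resolved_id):
    """Bucket a lookup result; the labels are illustrative, not OpenAlex terms."""
    if status == 200 and resolved_id.upper() == work_id.upper():
        return "live"
    if status == 200:
        # Assumption: a different ID in the response suggests a merge.
        return "redirected"
    if status == 404:
        return "gone"
    return "retry"  # e.g. rate limiting or a server error; try again later


def recheck(ids, fetcher=fetch):
    """Classify every ID, using an injectable fetcher so it can be tested offline."""
    return {wid: classify(wid, *fetcher(wid)) for wid in ids}


# Offline demonstration with canned responses (toy IDs, no network call).
canned = {"W1": (200, "W1"), "W2": (200, "W9"), "W3": (404, None), "W4": (503, None)}
demo = recheck(canned, fetcher=lambda wid: canned[wid])
print(demo)
```

Separating "retry" from "gone" keeps transient failures out of the unrecoverable count, which bears directly on the 15% vs. 85% split.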

4. Baseline Snapshot Variance

  • My Interpretation: Overall, these shifts and fluctuations fall within a normal, expected range of variation between snapshots for a large research institution.

I would love to get your insights on whether my logic holds up, or if there are nuances to the OpenAlex ingestion and entity-merging architecture that I am misinterpreting.

Thank you for your time and for building such an incredible open resource!

Best regards,
Yan Han
The University of Arizona Libraries

Christos Petrou

Jun 14, 2026, 9:24:47 PM
to Y Han, OpenAlex Community
Might be able to partly answer your first question. When it comes to peer-reviewed journals of the leading publishers (think of a WoS or Scopus journal universe), records are nearly fully up-to-date with approximately a two-month lag. That means that the total monthly works should already be nearly stable up to March or April 2026. Exceptions are a couple of publishers like IEEE and APS that seem to have a slightly longer lag.

Your +19% in works for 2024 should not be driven by late indexing of works in peer-reviewed journals. It could be the result of late indexing of other types of works (e.g., books) and/or better retrospective institutional attribution of works in peer-reviewed journals. I think institutional attribution was a bit patchy up until a few months ago, but it seems to be comprehensive now.

 