Affiliation loss in 100+ authors works


S Fattori

Feb 16, 2026, 3:44:42 AM (12 days ago) Feb 16
to OpenAlex Community

Good morning everyone,

Context: I'm doing a cross-institutional benchmarking analysis to compare a specific research department at multiple universities. My workflow is: 

  • Step 1: Define core research topics for the department at my institution using internal publication data
  • Step 2: Use the topics I find to query publications by affiliation in OpenAlex for both my institution and other institutions

This approach is necessary because I only have detailed internal data for my own institution, but need to benchmark against other universities where I must rely on OpenAlex affiliation-based queries.

The problem is that, using the API, OpenAlex retains detailed authorship information only for the first 100 authors of a publication. For publications with >100 authors, affiliations beyond this cutoff are not indexed. This creates a significant data loss issue for certain research fields.

More critically: I cannot determine whether authors from Institution A tend to appear in positions 1-100 more frequently than authors from Institution B in the same large-collaboration publications. This creates an unmeasurable bias when comparing institutions.

Is there a way to query OpenAlex (not using the snapshot) that captures all institutional affiliations for 100+ author publications, even if individual author details are truncated? Are there alternative approaches to retrieve institutional affiliation information for large-collaboration publications that I'm missing?

I'm aware I could use DOI-matching for my own institution, but this doesn't solve the core problem: I need a consistent methodology that works equally for all institutions involved.

Any thoughts on this would be greatly appreciated!

Silvia Fattori

Samuel Mok

Feb 16, 2026, 4:20:18 AM (12 days ago) Feb 16
to S Fattori, OpenAlex Community
You could look up the affiliations of the authors separately, e.g.:
1. Gather all Works that you're interested in
2. Select the Works with author count > 100
3. Grab all unique Author-ids for those works
4. Use the Author-API endpoint to gather the affiliations for those authors in batches of 50 at a time

Optional: also include the publication year of the works for better affiliation matching if required.
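A minimal sketch of steps 3 and 4, assuming the work records already contain their full author lists. It only builds the request URLs and does not call the API. The `openalex` filter key, the pipe-separated OR syntax and the `per-page` parameter are assumptions to check against the current API documentation; the helper names are hypothetical.

```python
from typing import Iterable, Iterator, List

AUTHORS_ENDPOINT = "https://api.openalex.org/authors"
BATCH_SIZE = 50  # batch size suggested in the steps above


def short_id(openalex_id: str) -> str:
    """Reduce 'https://openalex.org/A123' to 'A123'."""
    return openalex_id.rstrip("/").rsplit("/", 1)[-1]


def unique_author_ids(works: Iterable[dict]) -> List[str]:
    """Collect unique author IDs from work records, keeping first-seen order."""
    seen = {}
    for work in works:
        for authorship in work.get("authorships", []):
            author_id = (authorship.get("author") or {}).get("id")
            if author_id:
                seen.setdefault(short_id(author_id), None)
    return list(seen)


def batch_urls(author_ids: List[str], size: int = BATCH_SIZE) -> Iterator[str]:
    """Yield one Authors-endpoint URL per batch of author IDs."""
    for start in range(0, len(author_ids), size):
        chunk = author_ids[start:start + size]
        # Assumption: the 'openalex' filter accepts pipe-separated OR values.
        yield f"{AUTHORS_ENDPOINT}?filter=openalex:{'|'.join(chunk)}&per-page={size}"
```

Each URL can then be fetched with any HTTP client, and the affiliations in the responses matched to the works by publication year where needed.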

Alternatively, depending on what exactly you're benchmarking, it could be better to drop these works from your dataset completely or handle them separately, especially considering the relatively large number of errors in affiliation data / author attribution in general.

Cheers,
Samuel

--
You received this message because you are subscribed to the Google Groups "OpenAlex Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openalex-commun...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/openalex-community/0f5f553b-8fdf-4b6e-a123-6f880ace9392n%40googlegroups.com.

Ivo Bleylevens

Feb 17, 2026, 6:02:25 AM (10 days ago) Feb 17
to OpenAlex Community
Hi Silvia,

I ran across the same feature of OpenAlex: when searching in the API with a certain filtering, only the first 100 authors are given in the search result.
For each item with >=100 authors I fetch its JSON response separately. Then you do get more than 100 authors.
That solved it in my case.

For example:
Filtering gives you only 100 authors in the search result: https://api.openalex.org/works?filter=doi:10.1038/nbt.2957
But once you process the search result and find an item with >=100 authors, fetching that item by its ID gives you all 183 authors: https://api.openalex.org/works/W2064397275
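A minimal sketch of that re-fetch step, assuming list results are plain dicts shaped like OpenAlex work records. It only decides which works need a second request and builds the single-work URLs. The `is_authors_truncated` field is an assumption to check against the current documentation, and the function names are hypothetical.

```python
from typing import Iterable, List

WORKS_ENDPOINT = "https://api.openalex.org/works"
LIST_AUTHOR_CAP = 100  # list results show at most this many authors


def may_be_truncated(work: dict) -> bool:
    """True if a list-result record may be missing authors."""
    # Assumption: list results carry an 'is_authors_truncated' flag.
    if work.get("is_authors_truncated"):
        return True
    # Fallback: an author list at the cap is treated as suspicious.
    return len(work.get("authorships", [])) >= LIST_AUTHOR_CAP


def refetch_urls(works: Iterable[dict]) -> List[str]:
    """Build single-work URLs for every record that may be truncated."""
    urls = []
    for work in works:
        if may_be_truncated(work):
            work_id = work["id"].rstrip("/").rsplit("/", 1)[-1]
            urls.append(f"{WORKS_ENDPOINT}/{work_id}")
    return urls
```

A work with exactly 100 authors is re-fetched unnecessarily under the fallback rule, which costs one extra request but avoids missing a truncated record.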

Good luck.

Ivo
Maastricht University


S Fattori

Feb 17, 2026, 8:35:13 AM (10 days ago) Feb 17
to OpenAlex Community
Hi, 

Thank you so much for taking the time to reply!

What I perhaps didn't explain well enough is that my concern goes a bit beyond my own institution's data. Let me try to clarify: the deeper issue I'm running into is about cross-institutional comparisons. 
When querying OpenAlex by institution affiliation, the API only indexes authorship information for the first 100 authors of each publication. For large-collaboration papers (which are very common in the field I'm focusing on right now), this means that any institution whose authors happen to appear beyond position 100 will simply not be returned by the query.

The tricky part is that this affects different institutions differently. Some institutions may systematically appear earlier or later in author lists. This creates an asymmetric bias: Institution A might look more productive than Institution B in OpenAlex queries, not because they actually published more, but simply because their researchers tend to appear earlier in the author list.

For my own institution I can detect and correct for this bias using internal data that allow me to know exactly what we published. But for the other universities I'm benchmarking against, I have no way of knowing which papers are missing or how large the bias is. This makes a fair comparison really difficult.

Best,
Silvia

Ivo Bleylevens

Feb 17, 2026, 12:10:46 PM (10 days ago) Feb 17
to OpenAlex Community
Hi Silvia,

I thought that in the past OpenAlex searched all affiliations of all authors of a publication during a search query, and only showed you the first 100 authors. So the search was complete across all authors, but the search result was limited. But it turns out not to be like that (anymore?)?

For example: publication W2064397275 with title "A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium" has as 114th author James C. Willy affiliated to https://openalex.org/I4210096362. Only this author is affiliated to this institute; it occurs only once. And indeed, when searching for this title and this affiliation, the result is: zero. This title is not found ... because OpenAlex nowadays only searches in the first 100 authors, and this author is not among them?

This is indeed somewhat odd behaviour ... and it could also affect my own analyses if this is indeed the case.

Kind regards,
Ivo






S Fattori

Feb 18, 2026, 3:08:13 AM (10 days ago) Feb 18
to OpenAlex Community
Hi Ivo,

That's exactly the issue.
To give you an example from my side: the work https://openalex.org/works/W4205805429 from 2021, is co-authored by an author affiliated to Vrije Universiteit Amsterdam, and belongs to the topic "Particle physics theoretical and experimental studies". This work has 337 authors, and the author from the VU is not in the first 100 positions.

So, theoretically, I expect this work to be listed in the results of this search in OpenAlex: https://openalex.org/works?filter=authorships.institutions.lineage:i865915315,primary_topic.id:t10048,publication_year:2021, i.e., when I put a filter on the year, plus the institution, plus the topic. However, the work is not present in the result list. That's because the author from my institution is not in the first 100 positions.
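The check above can be reproduced programmatically. A minimal sketch, assuming the same filter string is accepted by the API works endpoint at api.openalex.org; it builds the query URL and tests whether a given work appears in one page of results, without calling the API itself. The function names and the `per-page` value are illustrative assumptions.

```python
from urllib.parse import urlencode


def works_filter_url(filters: dict, per_page: int = 100) -> str:
    """Build a works query URL from filter key/value pairs."""
    filter_value = ",".join(f"{key}:{value}" for key, value in filters.items())
    # Keep ':' and ',' unescaped so the filter stays readable.
    query = urlencode({"filter": filter_value, "per-page": per_page}, safe=":,")
    return "https://api.openalex.org/works?" + query


def page_contains(page: dict, work_id: str) -> bool:
    """True if a decoded response page lists the given work ID."""
    wanted = work_id.rstrip("/").rsplit("/", 1)[-1].upper()
    return any(
        item.get("id", "").rstrip("/").rsplit("/", 1)[-1].upper() == wanted
        for item in page.get("results", [])
    )


url = works_filter_url({
    "authorships.institutions.lineage": "i865915315",
    "primary_topic.id": "t10048",
    "publication_year": 2021,
})
```

A complete check would page through all results for `url` and call `page_contains(page, "W4205805429")` on each page.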

The issue probably lies in the way OpenAlex builds the filters.
I agree that this is definitely a problem when dealing with big author consortia. If authors are listed alphabetically, then someone who often publishes in big consortia and has a last name starting with Z might end up with most of their papers not showing up.

Best,
Silvia

Ivo Bleylevens

Feb 18, 2026, 3:41:08 AM (10 days ago) Feb 18
to S Fattori, OpenAlex Community
Yes I see.

I did this for another project of ours and it could be a solution for you too: download all publications of the relevant topics and of the relevant period, make sure you download all authors when the author count is >= 100, and then do the filtering and sorting on your side in a database instead of relying on the API. It can be a long-running download though, but for me it was worth the wait. That’s a compromise between using the API with its limitations and hosting and paying for a data snapshot yourself…
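A minimal sketch of the local filtering step, assuming each downloaded record holds the complete author list and that every institution entry carries an `id` and, where available, a `lineage` list as in OpenAlex work records. The function names are hypothetical.

```python
from typing import List


def _key(openalex_id: str) -> str:
    """Normalise 'https://openalex.org/I123' or 'i123' to 'I123'."""
    return openalex_id.rstrip("/").rsplit("/", 1)[-1].upper()


def author_positions(work: dict, institution_id: str) -> List[int]:
    """1-based author positions affiliated with the institution or its children."""
    target = _key(institution_id)
    positions = []
    for position, authorship in enumerate(work.get("authorships", []), start=1):
        for inst in authorship.get("institutions", []):
            # Fall back to the institution's own ID when no lineage is present.
            lineage = inst.get("lineage") or [inst.get("id")]
            if any(entry and _key(entry) == target for entry in lineage):
                positions.append(position)
                break
    return positions


def has_institution(work: dict, institution_id: str) -> bool:
    """True if any author of the work is affiliated with the institution."""
    return bool(author_positions(work, institution_id))
```

Because `author_positions` returns where each institution's authors sit in the list, the same data can show how many of an institution's works would be missed by a query that only considers the first 100 positions.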

Let me know if this can be a solution for you.

Bianca Kramer

Feb 18, 2026, 4:05:43 AM (10 days ago) Feb 18
to Ivo Bleylevens, S Fattori, OpenAlex Community
Hi all,

To build on Ivo's last remark: another way of making use of the full data snapshot without hosting and paying for it yourself could be to use a publicly available version of OpenAlex in Google BigQuery. In that case, you only have querying costs (not costs for ingest and storage), and there is a free tier which includes 1TB of querying (or use the Google BigQuery sandbox, which has a similar allowance). It does use SQL, so consider whether that would work for you.

Najko Jahn at SUB Göttingen maintains a copy of the full OpenAlex snapshot (minus abstracts) which is updated monthly, see here: https://orion-dbs.community/collections/subugoe/#openalex

With a small group of people, we actually started an initiative to bring together multiple open datasets available via Google BigQuery - the benefit being that hosting can be distributed and the full datasets are available for everyone for querying (including combining multiple datasets). More information about the thinking behind this approach (including the Google-shaped elephant in the room!) can be read in this blogpost. The initiative's website, with an overview of the datasets that are currently available, is here: https://orion-dbs.community/

kind regards,
Bianca Kramer 

(with apologies for cross-posting -  my message about the ORION initiative yesterday was partly prompted by this discussion as well!) 







Ivan Sterligov

Feb 19, 2026, 4:25:46 AM (9 days ago) Feb 19
to Bianca Kramer, Ivo Bleylevens, S Fattori, OpenAlex Community
Hi all, 

I guess this is an intentional limitation acknowledged by OpenAlex: https://docs.openalex.org/api-entities/authors/limitations 'We plan to change this in the future, so that filtering works as expected.'

OpenAlex is currently not the best option for consequential organization-level bibliometrics, for a variety of reasons: lack of affiliation data curation (even the WorksMagnet curation service is deprecated :( ), missing raw affiliations, a messy ML/rules-based affiliation model, inconsistencies in paper counts at various endpoints, messy author profiles that are seemingly also not curated, etc. This is hopefully going to change in 2026, as many of these issues were mentioned during recent open calls.

Also, regarding areas with large facilities or hospitals (and high average authors per paper) it is important to note that OpenAlex relies on ROR in treating org relationships. I.e. a large physics facility jointly run by two universities could be listed not as their 'child', but instead just 'related' to them, meaning they won't get papers assigned to this facility when queried via authorships.institutions.lineage.  
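A minimal sketch of one way to account for this, assuming an institution record exposes an `associated_institutions` list whose entries have `id` and `relationship` fields (to be checked against the current Institution schema). It extends a lineage filter with the institutions marked as 'related'. The function names and the IDs in the usage lines are hypothetical.

```python
from typing import List


def _short(openalex_id: str) -> str:
    """Reduce 'https://openalex.org/I123' to 'I123'."""
    return openalex_id.rstrip("/").rsplit("/", 1)[-1]


def related_institution_ids(institution: dict) -> List[str]:
    """IDs of institutions listed as 'related' (not parent or child)."""
    return [
        _short(assoc["id"])
        for assoc in institution.get("associated_institutions", [])
        if assoc.get("relationship") == "related" and assoc.get("id")
    ]


def lineage_filter(institution: dict, include_related: bool = False) -> str:
    """Build a lineage filter value, optionally OR-ing in related institutions."""
    ids = [_short(institution["id"])]
    if include_related:
        ids += related_institution_ids(institution)
    return "authorships.institutions.lineage:" + "|".join(ids)


# Hypothetical record: a university with one child and one related facility.
university = {
    "id": "https://openalex.org/I111",
    "associated_institutions": [
        {"id": "https://openalex.org/I222", "relationship": "child"},
        {"id": "https://openalex.org/I333", "relationship": "related"},
    ],
}
```

Whether a related facility's papers should count toward a university is a methodological choice; when the facility is shared, including it credits the same papers to every related institution.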

Best regards,
Ivan

PS This new ORION initiative looks great, thanks a lot for sharing! 








--
Best wishes,

Ivan Sterligov

Eric Jeangirard

Feb 19, 2026, 5:14:47 AM (9 days ago) Feb 19
to Ivan Sterligov, Bianca Kramer, Ivo Bleylevens, S Fattori, OpenAlex Community
Hi Ivan,

Regarding the works-magnet, we have chosen to disable the submission of feedback through the works-magnet, as OpenAlex has announced the upcoming release of an integrated curation tool (see the 2026 roadmap: https://blog.openalex.org/openalex-2026-roadmap). The works-magnet interface remains usable, but correction submissions are no longer possible.
This decision addresses a very practical operational constraint: feedback submitted today is handled manually, with delays that can exceed two months between submission and actual processing. It became necessary to freeze the flow of requests to ensure that all previously submitted corrections are fully processed before transitioning to the new system.
OpenAlex states that the future curation tool will allow for rapid processing of corrections (approximately 24 to 48 hours). As part of the development of OpenAlex’s economic model, this tool will be reserved for institutions that financially support OpenAlex. This approach aligns with the sustainability model of open infrastructures like OpenAlex: open data that remains freely reusable, combined with paid services (in this case, supported and delegated curation) that contribute to the project’s long-term viability.

Best
Eric

Ivan Sterligov

Feb 19, 2026, 6:01:13 AM (8 days ago) Feb 19
to Eric Jeangirard, Bianca Kramer, Ivo Bleylevens, S Fattori, OpenAlex Community
Hi Eric, and thanks for the clarification! Works-magnet is a great tool.

Regarding the decision to limit the availability of curation to those who pay:

I get that it is 'a part of the development of OpenAlex’s economic model' but I don't fully agree that it 'aligns with the sustainability model of open infrastructures'.

This basically means that the data itself (not the services) for a particular institution depends on whether it is able and willing to pay the $5000/year member fee.

So, a 'member' university essentially gets more paper counts - and possibly higher rank in Leiden Open Ranking and other OpenAlex-based projects.  

This is hardly a desired outcome from many perspectives; afaik even Elsevier allows non-subscribers to curate Scopus org profiles for free. For example, what about low-income countries and numerous struggling universities from the regions that were underrepresented in WoS/Scopus?

Best regards,
Ivan
