Total works and cites don't match sums in counts by year


James Dumlao

Mar 18, 2025, 12:37:21 PM
to openalex-...@googlegroups.com, Misha Teplitskiy, ayo...@essec.edu
Hello!

I gathered productivity metrics for thousands of authors back in January 2025. When trying to compare the total works and cites to the sums of the counts by year, I noticed that the counts by year are much larger for many authors (hundreds of works and thousands of cites more than the totals). Does anyone have insight into why this might be and which numbers are more reliable?

Sincerely,
PhD Candidate
School of Information
University of Michigan

Gabor Schubert

Mar 18, 2025, 2:41:37 PM
to OpenAlex Community
Hi James,

Can you describe how you got your numbers? It is hard to understand your problem without any context.

Best regards,
Gabor Schubert

Jack Young

Mar 18, 2025, 4:50:24 PM
to James Dumlao, openalex-...@googlegroups.com, Misha Teplitskiy, ayo...@essec.edu
Hi James,

I ran into a very similar issue with the counts_by_year data last year.  At first, I thought it might have something to do with the lag described here: https://help.openalex.org/hc/en-us/articles/27891614701207-Why-are-the-counts-by-year-numbers-different-than-what-I-see-in-the-user-interface.  But, the counts_by_year citations were actually coming back consistently higher (often double) than the citation counts available through the user interface.  

OpenAlex had suggested that there may be some issue with the code aggregating the counts_by_year, but couldn't give an exact timeline for resolution.

From my experience, I believe the counts available through the user interface will be most accurate, as they are updated more frequently.  To get around this issue, we created some code that extracted citation data directly from the individual papers (instead of relying on the aggregated counts_by_year).  We were only doing it for 200ish researchers, so I'm not entirely sure how it'd handle thousands of authors.  But if you're interested in learning more, just let me know.  I'd be happy to share our work.

Best,

Jack

Jack Young, MLIS
(He/Him)

Research Impact Librarian

University Library

1280 Main Street West
Hamilton, Ontario L8S 4L6

location: Mills Memorial Library, Sherman Centre for Digital Scholarship
phone: (905) 525-9140
email: jky...@mcmaster.ca


--
You received this message because you are subscribed to the Google Groups "OpenAlex Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openalex-commun...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/openalex-community/CAHf%2BC7YYi0WGnDcDrN_gOcO%2BkfgxAaiAHE39aAPGEQQJc8BHkQ%40mail.gmail.com.

James Dumlao

Mar 19, 2025, 12:49:37 AM
to Gabor Schubert, OpenAlex Community
Hi Gabor,

Yes, I made a series of API calls capturing Author objects, specifying works_count, cited_by_count, and counts_by_year. My question is about discrepancies between the sums of values in counts_by_year and the values in works_count and cited_by_count.
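
For anyone who wants to reproduce the comparison, here is a minimal sketch of the check described above. It assumes the standard `https://api.openalex.org/authors/<id>` endpoint with a `select` parameter, and that each `counts_by_year` entry carries `works_count` and `cited_by_count` keys; the helper names and the sample record are made up for illustration and are not from my actual pipeline.

```python
import json
import urllib.request

API_ROOT = "https://api.openalex.org/authors/"  # assumed endpoint form


def fetch_author(author_id: str) -> dict:
    """Fetch one Author object, keeping only the three fields compared here."""
    url = f"{API_ROOT}{author_id}?select=works_count,cited_by_count,counts_by_year"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def compare_totals(author: dict) -> dict:
    """Sum counts_by_year and report how far the sums exceed the totals."""
    yearly = author.get("counts_by_year", [])
    works_sum = sum(row.get("works_count", 0) for row in yearly)
    cites_sum = sum(row.get("cited_by_count", 0) for row in yearly)
    return {
        "works_sum": works_sum,
        "cites_sum": cites_sum,
        # Positive values mean the yearly sums are larger than the totals.
        "works_excess": works_sum - author["works_count"],
        "cites_excess": cites_sum - author["cited_by_count"],
    }


# Made-up record (not a real author) showing the overcounting pattern.
sample = {
    "works_count": 10,
    "cited_by_count": 50,
    "counts_by_year": [
        {"year": 2024, "works_count": 7, "cited_by_count": 40},
        {"year": 2023, "works_count": 6, "cited_by_count": 30},
    ],
}
report = compare_totals(sample)
print(report)
```

Since `counts_by_year` only covers a limited window of years, one would expect the excess values to be zero or negative; a positive excess is the discrepancy in question.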

Sincerely,
PhD Candidate
School of Information
University of Michigan



Gabor Schubert

Mar 19, 2025, 4:21:35 AM
to OpenAlex Community
Hi James,

Thanks for describing the problem in more detail. I have never used this feature, but I noticed something when I looked at the default API reply for an author object (for instance my own data): https://api.openalex.org/people/A5057232826
It seems that the counts_by_year citation numbers are given only for a limited number of years, in this case 2012-2025. I definitely have older articles with citations from before 2012, so my cited_by_count is 1311, but the sum of the counts_by_year citations for 2012-2025 is smaller: 1178. I am not sure whether you can get all the years with some extra API parameter.

Gabor

Gabor Schubert

Mar 19, 2025, 4:27:37 AM
to OpenAlex Community
Actually the following can be found in the API documentation: (https://docs.openalex.org/api-entities/authors/author-object#counts_by_year): "Any works or citations older than ten years old aren't included."
But in this case the sums of the years would be smaller than the totals, and you are describing the opposite, if I understand it correctly.

Gabor

James Dumlao

Mar 19, 2025, 1:46:48 PM
to Gabor Schubert, OpenAlex Community
Hi Gabor,

Correct, I noticed many objects were truncated at 2012, but am actually noticing overcounting rather than undercounting! I'm hoping that the reason for the overcounting is systematic across Author objects, so that they're at least relatively correct to each other.

Sincerely,
PhD Candidate
School of Information
University of Michigan

Gabor Schubert

Mar 19, 2025, 3:03:06 PM
to OpenAlex Community
Thanks. Can you give a real-life example where the overcounting is really large? It would be interesting to dig deeper.
Gabor

Jack Young

Mar 20, 2025, 4:03:32 PM
to Gabor Schubert, OpenAlex Community
Hi Gabor,

Just jumping in here because I've experienced the same issues as James.  

For a real world example: The following API call (https://api.openalex.org/people/A5056993317?select=display_name,counts_by_year) displays citations by year for OpenAlex ID A5056993317.  When these citation counts are tallied up, they equal 1,467.  

However, when I search for the same OpenAlex ID using the UI, the total number of citations for this author is only 958 (https://openalex.org/works?page=1&filter=authorships.author.id%3Aa5056993317&group_by=publication_year,open_access.is_oa,primary_topic.id,authorships.institutions.lineage,type,cited_by_count_sum).

Thanks,

Jack

Gabor Schubert

Mar 20, 2025, 6:55:56 PM
to OpenAlex Community
Hi Jack,

Thanks for the example. I see the difference now, although it's unclear what the cause could be. It is quite probable that the cited_by_count values in the API are wrong and the per-work citation numbers are closer to reality. I see that someone already mentioned this problem in 2023 in the openalexR repository on GitHub: https://github.com/ropensci/openalexR/issues/115

One odd thing I noticed in this particular case is that the most cited article by this author is one from 2009: https://openalex.org/works/w2155741368 with 155 citations. According to OpenAlex the "source" of the article is PubMed (https://pubmed.ncbi.nlm.nih.gov/19239742), although it is an article from the Journal of the Canadian Dental Association: https://jcda.ca/dental-surgery-patients-anticoagulant-therapy-warfarin-systematic-review-and-meta-analysis, and it has no DOI.

I recreated the cited-by numbers for each year from all 74 works of the author. I started with this query:
https://api.openalex.org/works?filter=cites:W2155741368&group-by=publication_year&per-page=100
then repeated it for the 73 other work IDs and summed the numbers for each year.
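
In case it helps others repeat this, the procedure can be sketched roughly as follows. This is only an illustration, not the exact script I used: the helper names are made up, and it assumes each grouped response contains a `group_by` list whose entries have `key` (the year) and `count` fields.

```python
from collections import Counter
from urllib.parse import urlencode


def cites_by_year_url(work_id: str) -> str:
    """Build the query that groups the works citing work_id by publication year."""
    query = urlencode(
        {"filter": f"cites:{work_id}", "group-by": "publication_year", "per-page": 100},
        safe=":",  # keep the colon readable, as in the hand-written URL
    )
    return f"https://api.openalex.org/works?{query}"


def merge_yearly_counts(responses: list) -> dict:
    """Add up the per-year counts from one grouped response per work."""
    totals = Counter()
    for response in responses:
        for group in response.get("group_by", []):
            # The key may arrive as a string or a number; int() handles both.
            totals[int(group["key"])] += group["count"]
    return dict(sorted(totals.items()))


url = cites_by_year_url("W2155741368")
print(url)

# Made-up responses for two works, standing in for the real API replies.
fake_responses = [
    {"group_by": [{"key": "2020", "count": 3}, {"key": "2021", "count": 5}]},
    {"group_by": [{"key": "2021", "count": 2}, {"key": "2022", "count": 1}]},
]
merged = merge_yearly_counts(fake_responses)
print(merged)
```

In a real run one would fetch `cites_by_year_url(work_id)` for every work ID of the author and pass the decoded JSON replies to `merge_yearly_counts`.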

Here are the numbers of citations per year. For the years returned by the API call, I included the cited_by_count values from counts_by_year in parentheses.

2001: 1
2009: 3
2010: 11
2011: 13
2012: 20 (21)
2013: 18 (18)
2014: 22 (22)
2015: 23 (23)
2016: 25 (26)
2017: 32 (34)
2018: 36 (37)
2019: 38 (48)
2020: 73 (118)
2021: 115 (186)
2022: 139 (234)
2023: 174 (304)
2024: 177 (328)
2025: 36 (68)

There is one citing article from 2001 that cites an article from 2012 (https://openalex.org/works/W1517988088), which is obviously an error.
It seems that up to 2018 the two sets of values are almost identical, but from 2019 onward the counts_by_year values are much higher.
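
As a quick check on the list above, the two columns can be totalled and compared year by year. The numbers below are copied from the list; the 1.2 ratio used to flag a "diverging" year is an arbitrary threshold chosen for this sketch, not anything defined by OpenAlex.

```python
# Citations per year recreated from the individual works.
recreated = {
    2001: 1, 2009: 3, 2010: 11, 2011: 13, 2012: 20, 2013: 18, 2014: 22,
    2015: 23, 2016: 25, 2017: 32, 2018: 36, 2019: 38, 2020: 73, 2021: 115,
    2022: 139, 2023: 174, 2024: 177, 2025: 36,
}
# Values in parentheses: what the API call returned for 2012-2025.
api_counts = {
    2012: 21, 2013: 18, 2014: 22, 2015: 23, 2016: 26, 2017: 34, 2018: 37,
    2019: 48, 2020: 118, 2021: 186, 2022: 234, 2023: 304, 2024: 328, 2025: 68,
}

recreated_total = sum(recreated.values())
api_total = sum(api_counts.values())
print(recreated_total)  # 956
print(api_total)        # 1467

# First year in which the API value exceeds the recreated value by more
# than 20 percent (arbitrary threshold, for illustration only).
THRESHOLD = 1.2
first_divergent = next(
    year for year in sorted(api_counts)
    if api_counts[year] / recreated[year] > THRESHOLD
)
print(first_divergent)  # 2019
```

The recreated total (956) is close to the 958 shown in the UI, and the API total (1467) matches the tally Jack reported.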

It's not much, but maybe it can help a bit in understanding where to look for the source of the problem.

Gabor

James Dumlao

Mar 20, 2025, 7:42:43 PM
to Gabor Schubert, OpenAlex Community
I appreciate you looking into this and finding at least a partial explanation! Thank you all.

Sincerely,
PhD Candidate
School of Information
University of Michigan

Gabor Schubert

Mar 21, 2025, 4:10:09 AM
to OpenAlex Community
One more thing that might complicate things even further: when we talk about summing "cited by counts", it is not obvious what OpenAlex does when aggregating the citation numbers. Suppose we have article_A, article_B and article_C, and article_D has references to all three of them. In this case article_A, article_B and article_C each have a cited_by value of 1, and the sum of these is 3, although there is only 1 article (article_D) that has cited the three articles. So there could be large differences depending on whether one calculates the sum of citations or the number of citing articles.
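
To make the distinction concrete, here is a toy sketch using the four hypothetical articles from the example. The data structure and function name are made up for illustration and say nothing about how OpenAlex actually aggregates.

```python
def citation_totals(author_works: set, references: dict) -> tuple:
    """Return (sum of per-work citations, number of distinct citing works)."""
    per_work = {work: 0 for work in author_works}
    citing_works = set()
    for citing, cited in references.items():
        hits = cited & author_works  # author works referenced by this citing work
        for work in hits:
            per_work[work] += 1
        if hits:
            citing_works.add(citing)
    return sum(per_work.values()), len(citing_works)


author_works = {"article_A", "article_B", "article_C"}
# article_D cites all three of the author's works.
references = {"article_D": {"article_A", "article_B", "article_C"}}

summed, distinct = citation_totals(author_works, references)
print(summed, distinct)  # 3 1
```

The first number is what you get by adding up each work's cited_by value; the second counts each citing article once, no matter how many of the author's works it references.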

The following is stated in the OpenAlex documentation:

"Integer: The total number [of] "Works" that cite a work this author has created."
"List: Author.works_count and Author.cited_by_count for each of the last ten years, binned by year. To put it another way: each year, you can see how many works this author published, and how many times they got cited."

"Integer: The number of citations to this work. These are the times that other works have cited this work: Other works ➞ This work."

Judging from the wording of these definitions, they might be calculating different things. I'm not sure whether this is the case or whether it has anything to do with the large discrepancies, but it might also be a possible explanation for differences in the numbers.

Gabor