Help - references OpenAlex

Silva

Oct 4, 2024, 11:04:24 AM
to OpenAlex Community

Dear all,

I hope this message finds you well.

I am using OpenAlex as an indispensable tool in my research, allowing me to access a vast amount of academic data efficiently and for free. Its interface and functionality have been extremely useful in finding and connecting relevant information, which has greatly contributed to the progress of my work. However, I have encountered a specific issue that has hindered the experience: I am noticing inconsistencies in the reference counts for some works in OpenAlex compared to the number of references in the original PDFs and the journal websites. Here are a few examples:

The work https://openalex.org/works/W3036130366 indicates 53 references, but both the original PDF and the journal's website show a different count.

For the work https://openalex.org/works/W3094482919, OpenAlex reports 43 references, but the PLOS ONE website and the PDF list 38.

Similarly, https://openalex.org/works/W3022039950 has 52 references in OpenAlex, but 49 in the original PDF and on the journal's website.

Finally, https://openalex.org/works/W3096640152 has 30 references recorded in OpenAlex, while the PDF and the journal's site show 29.
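For anyone who wants to reproduce this comparison, here is a minimal sketch. It assumes the public OpenAlex API endpoint `https://api.openalex.org/works/{id}` and the `referenced_works` / `referenced_works_count` fields of the returned record; the helper names (`fetch_work`, `reference_count`, `report`) are made up for illustration. The counts in `REPORTED` are simply the ones listed above, not fresh API results, and the first example is left out because no source count was given for it.

```python
import json
import urllib.request

# Counts as reported in this post: work ID -> (OpenAlex count, PDF/journal count).
REPORTED = {
    "W3094482919": (43, 38),
    "W3022039950": (52, 49),
    "W3096640152": (30, 29),
}


def fetch_work(work_id: str) -> dict:
    """Fetch one work record from the OpenAlex API (requires network access)."""
    url = f"https://api.openalex.org/works/{work_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def reference_count(work: dict) -> int:
    """Number of references listed in an OpenAlex work record."""
    # Prefer the explicit counter; fall back to the length of the list.
    count = work.get("referenced_works_count")
    if count is None:
        count = len(work.get("referenced_works", []))
    return count


def report(counts: dict) -> dict:
    """Map each work ID to (OpenAlex count - source count)."""
    return {wid: oa - src for wid, (oa, src) in counts.items()}


if __name__ == "__main__":
    for wid, extra in report(REPORTED).items():
        print(f"{wid}: {extra} more reference(s) in OpenAlex than in the source")
```

To check against live data, `reference_count(fetch_work("W3094482919"))` can be compared with the number in the PDF; the live value may differ from the one quoted here if the record has since been updated.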

I have noticed this discrepancy occurring in thousands of cases. I would like to request help in understanding the reasons for these inconsistencies. Additionally, how might this impact the citation counts of the referenced works? For example, if there are five extra references, could these five works potentially receive citations, even though they were not cited in the original references of the citing papers?

Thank you for your attention, and I look forward to your response.

Sincerely, Silva


Samuel Mok

Oct 4, 2024, 3:15:06 PM
to Silva, OpenAlex Community
Hello Silva,

The code that builds the OpenAlex database can be viewed on GitHub, but it doesn't include detailed documentation and it's understandably complex, so it'll take some effort to unravel all the details if you want to know exactly what's going on.

In general though, OpenAlex first creates a list of references for each Work, and then uses those to also create the reverse relationships -- citations. So for a citation to be added to a Work, the Work doing the citing first needs to be in the database, and of course the reference needs to be detected. I'm not 100% certain, but afaik references are extracted directly from metadata if available (e.g. from OpenAIRE, Crossref, or otherwise), and also by extracting DOIs from the bibliography of the paper itself.
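To make the "references first, citations as the reverse relationship" idea concrete, here is a rough sketch. It is an illustration only and not the OpenAlex code; the function name `build_citations` and the work IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_citations(references: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert a {citing work: [referenced works]} map into
    {referenced work: {citing works}}."""
    cited_by = defaultdict(set)
    for citing, refs in references.items():
        for cited in refs:
            # A set means a duplicate entry in one reference list
            # still counts as a single citation from that work.
            cited_by[cited].add(citing)
    return dict(cited_by)


# Hypothetical IDs: W_B's reference list contains one wrongly matched record.
EXAMPLE_REFERENCES = {
    "W_A": ["W_X", "W_Y"],
    "W_B": ["W_X", "W_WRONG_MATCH"],
}

if __name__ == "__main__":
    for cited, citing in sorted(build_citations(EXAMPLE_REFERENCES).items()):
        print(cited, "is cited by", sorted(citing))
```

Under this model, any extra or mismatched entry in a reference list (`W_WRONG_MATCH` here) automatically shows up as a citation for the matched work, which is why errors in reference matching can carry over into citation counts.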

This is also briefly mentioned in the OpenAlex docs for Work filters:
[attached screenshot: image.png]

If you want to dig a bit deeper, here are various pieces of code that relate to your question that might get you started:
- There are various SQL queries that create various views related to citations & references;
- the function add_references gives some insight into how references are ingested and how they relate to citations (from the main python script that creates & updates Works);
- and in the initialization script you can see how references are loaded in from the database as 'reverse citations'.

Cheers,
Samuel


Bianca Kramer

Oct 5, 2024, 2:58:01 AM
to Samuel Mok, Silva, OpenAlex Community
Hi Silva, 

In addition to Samuel's remarks, looking closer at the first two of your examples gives some clues on the cause(s) of the discrepancies:

> The work https://openalex.org/works/W3036130366 indicates 53 references, but both the original PDF and the journal's website show a different count.

This is a medRxiv preprint with 2 versions (v1 and v2), with 12 and 44 references, respectively. The two sets are largely non-overlapping, and the OpenAlex record combines the references from both versions. In addition, there is one set of duplicate works in OpenAlex for the same DOI that occurs as a reference (https://openalex.org/W3042270788 and https://openalex.org/W4210642183).
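Duplicates of this kind (several OpenAlex IDs sharing one DOI) can be spotted by grouping the referenced works by DOI. Below is a minimal sketch, assuming each referenced work is available as a dict with `id` and `doi` keys, as in OpenAlex work records. The function names are made up for illustration, and the DOIs in the example are placeholders, not the real DOIs of these records.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip a leading resolver prefix, if any."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def duplicate_dois(works: Iterable[dict]) -> Dict[str, List[str]]:
    """Return {doi: [work IDs]} for every DOI shared by more than one record."""
    by_doi = defaultdict(list)
    for work in works:
        doi = normalize_doi(work.get("doi"))
        if doi is not None:
            by_doi[doi].append(work["id"])
    return {doi: ids for doi, ids in by_doi.items() if len(ids) > 1}


# Placeholder DOIs (10.1234/...) for illustration only.
EXAMPLE_REFS = [
    {"id": "W3042270788", "doi": "https://doi.org/10.1234/example"},
    {"id": "W4210642183", "doi": "https://doi.org/10.1234/EXAMPLE"},
    {"id": "W_NO_DOI", "doi": None},
    {"id": "W_OTHER", "doi": "https://doi.org/10.1234/other"},
]

if __name__ == "__main__":
    print(duplicate_dois(EXAMPLE_REFS))
```

This only catches records that share a DOI. Duplicates with different DOIs, or with no DOI at all, would need title or author matching instead.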

> For the work https://openalex.org/works/W3094482919, OpenAlex reports 43 references, but the PLOS ONE website and the PDF list 38.

Here, there are a number of cases with multiple OpenAlex IDs for the same referenced work, for several reasons:

- different editions of the same book (from different years)
(https://openalex.org/W1587026990 and https://openalex.org/W4254687493)
- different records for the same software package on CRAN (one with DOI, one without)
(https://openalex.org/W2171216257 and https://openalex.org/W4399542805)
- subsequent versions of the same F1000Research publication (which each get a different DOI)
(e.g. https://openalex.org/W2609499779 and https://openalex.org/W4235332646)
- records for a preprint and a review of that preprint (on preLights), while the original reference is to the preprint.
In this case, the second record has some of the metadata of the preprint itself
(https://openalex.org/W2986870495 and https://openalex.org/W4234222970)

In addition, some of the original references were not matched with an OpenAlex record (e.g. references to websites, rather than scholarly outputs), and the OpenAlex reference list includes one generic record for 'Deleted Work' (https://openalex.org/W4285719527).

Lots going on! 

Some of these are known challenges, e.g. how to handle different preprint versions and citations to each version vs. all versions together.
Some of them are more straightforward errors in matching the OpenAlex record to the actual reference. 

It should be noted, too, that discrepancies between reference lists are common when comparing databases, especially where not all references from e.g. the PDF are included. Sometimes this occurs because they are not part of the database itself, e.g. with references to websites, reports, etc., or with references that (rightly or wrongly) could not be matched to a DOI or PMID when the database only has records with these identifiers (as in Dimensions).

For instance, the first example has the expected 44 references in Crossref, of which 42 have a DOI included in the metadata, and only (these) 42 in Dimensions. The second example has the expected 38 references in Crossref, of which 29 have a DOI included in the metadata, and 33 in Dimensions.
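The "total references vs. references with a DOI" figures can be counted from a Crossref record along these lines. This is a sketch that assumes the `message` object of a Crossref REST API works response, where deposited references sit in a `reference` list and matched ones carry a `DOI` key. The function name `doi_coverage` and the synthetic record are illustrative; the record only mirrors the 44/42 split mentioned above and is not the actual metadata.

```python
from typing import Tuple


def doi_coverage(message: dict) -> Tuple[int, int]:
    """Return (total references, references with a DOI) for a Crossref record."""
    refs = message.get("reference", [])
    with_doi = sum(1 for ref in refs if ref.get("DOI"))
    return len(refs), with_doi


# Synthetic record: 44 references, 42 of them with a placeholder DOI.
EXAMPLE_MESSAGE = {
    "reference": (
        [{"key": f"ref{i}", "DOI": f"10.1234/ref{i}"} for i in range(42)]
        + [{"key": f"ref{i}", "unstructured": "A website"} for i in range(42, 44)]
    )
}

if __name__ == "__main__":
    total, with_doi = doi_coverage(EXAMPLE_MESSAGE)
    print(f"{total} references, {with_doi} with DOI, {total - with_doi} without")
```

A database that only indexes references with identifiers would, at best, show the second number rather than the first.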

So overall, there's definitely room for improvement on many fronts. On the other hand, there are inherent limitations, which are one reason not to rely too heavily on any citation count as a 'ground truth'.

Hope this helps. This was fun to untangle, so thanks for the 'puzzle'!

kind regards, Bianca 

Bianca Kramer
Sesame Open Science 


Pedro Henrique

Oct 7, 2024, 11:07:00 AM
to Bianca Kramer, Samuel Mok, OpenAlex Community
Bianca, you were brilliant in your response! Thank you!

I still have some questions regarding the OpenAlex metrics, and I would like the community's help to clarify them:

How does OpenAlex handle citations for both a preprint and the published article when both are available? Crossref treats the preprint and its corresponding article as distinct citable objects, each with its own DOI. As a result, the citations are counted separately, with no merging or consolidation between the preprint and the peer-reviewed article. Here are some clarifications from Crossref on the matter:  

https://community.crossref.org/t/avoiding-duplicate-doi/11965

https://archive.org/details/gmail-crossref-re-citations


Does OpenAlex transfer the citations from a preprint to the article published in a peer-reviewed journal? Or does it treat them as separate citable documents due to their distinct DOIs? Furthermore, is it correct to attribute a citation to something that wasn’t explicitly cited but has a related earlier version?

Thank you in advance to anyone who can respond.


Pedro Henrique

Oct 7, 2024, 11:07:00 AM
to Samuel Mok, OpenAlex Community
Hi, thanks for the comment, Samuel.
The number of references shown by OpenAlex also differs from the number of references in Crossref, so I believe there is some internal issue in OpenAlex causing inconsistencies in the data.
