Sometimes wrong oa_status and license?


Ivo Bleylevens

Jun 11, 2024, 12:08:56 PM
to OpenAlex Community
Dear readers, or OpenAlex team,

Sometimes I discover items that have a GREEN oa_status, but when I follow the DOI link I end up at the publisher's version of the publication, which carries a Creative Commons license. I therefore think these items should be GOLD open access instead of GREEN open access. I often see this for certain publishers, such as those with the DOI prefixes 10.1016 and 10.1111 (ScienceDirect and Wiley), but probably for others as well. I haven't dived into it deeply yet. Does anyone know how this is possible and how often it occurs (what is the error rate on this field)?

I would love to share thoughts with other interested people!

Below is an API query with 8 example DOIs that illustrate what I'm seeing:

[Attached image: Capture.JPG]
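
The attached screenshot is not reproduced in this text. A query of this kind could be built roughly as follows. This is only a sketch: it assumes the OpenAlex `/works` endpoint with a pipe-separated `doi` filter and a `select` parameter, the helper name `build_works_query` is made up for illustration, and the DOIs are placeholders, not the eight from the original post.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_works_query(dois, fields=("id", "doi", "open_access", "primary_location")):
    """Return an OpenAlex works URL that asks for several DOIs in one request."""
    # Inside a single filter, the pipe character is used as a logical OR.
    doi_filter = "doi:" + "|".join(dois)
    params = {
        "filter": doi_filter,
        "select": ",".join(fields),   # only return the fields needed for the check
        "per-page": len(dois),
    }
    # Keep ':', '|', ',' and '/' readable instead of percent-encoding them.
    return OPENALEX_WORKS + "?" + urlencode(params, safe=":|,/")


# Placeholder DOIs (hypothetical), using the two prefixes mentioned in the post.
placeholder_dois = ["10.1016/placeholder-1", "10.1111/placeholder-2"]
url = build_works_query(placeholder_dois)
print(url)
```

Opening the printed URL in a browser, with real DOIs substituted, should show `open_access.oa_status` next to the primary location for each work, which is enough to spot-check cases like the ones described above.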

Yusuf Ali Ozkan

Jun 11, 2024, 1:59:50 PM
to OpenAlex Community
Hi Ivo,

Thanks for raising this. I believe the main data source for OA status in OpenAlex is Unpaywall. I checked one of the DOIs you mentioned with Unpaywall's Simple Query Tool, and Unpaywall also identifies it as 'green', so it's worth contacting them. Unpaywall is also part of OurResearch, so I assume the OpenAlex team would already know about the issue if it has been raised before.
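
The same cross-check can be scripted. The sketch below assumes Unpaywall's v2 REST endpoint (DOI in the path, an `email` query parameter) and that both services expose an `oa_status` field; the function names are hypothetical, and the records in the example are hand-written stand-ins rather than real API responses.

```python
from urllib.parse import quote


def unpaywall_url(doi, email):
    """Build the Unpaywall v2 lookup URL for one DOI."""
    # The DOI goes in the path; Unpaywall asks callers to identify themselves by email.
    return f"https://api.unpaywall.org/v2/{quote(doi, safe='/')}?email={quote(email)}"


def compare_oa_status(openalex_work, unpaywall_record):
    """Report whether two already-fetched JSON records agree on OA status."""
    openalex_status = (openalex_work.get("open_access") or {}).get("oa_status")
    unpaywall_status = unpaywall_record.get("oa_status")
    return {
        "openalex": openalex_status,
        "unpaywall": unpaywall_status,
        "agree": openalex_status == unpaywall_status,
    }


# Hand-written records standing in for real responses (assumption, not real data).
example = compare_oa_status(
    {"open_access": {"oa_status": "green"}},
    {"oa_status": "green"},
)
print(example)
```

If both sources agree on 'green', as in the case checked here, the classification most likely originates upstream of OpenAlex.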

Thanks,
Yusuf

Krugs.de

Jun 11, 2024, 2:38:54 PM
to Yusuf Ali Ozkan, OpenAlex Community
Hi,

Please correct me if I am wrong, but as far as I know, the publishing model has nothing to do with the final license the paper is under. So they are publishing models for the journals, and the license is the license given to the paper. There are correlations, but they are not that clear. So I do not see a contradiction between these two at all.


Hope this helps,

Rainer
 


--
You received this message because you are subscribed to the Google Groups "OpenAlex Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openalex-commun...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openalex-community/648eda86-cfcd-4d49-bd27-458c61332a1cn%40googlegroups.com.

Samuel Mok

Jun 11, 2024, 3:27:23 PM
to Ivo Bleylevens, OpenAlex Community
OpenAlex determines the OA status using this pretty straightforward heuristic:
  • gold: Published in an OA journal that is indexed by the DOAJ.

  • green: Toll-access on the publisher landing page, but there is a free copy in an OA repository.

  • hybrid: Free under an open license in a toll-access journal.

  • bronze: Free to read on the publisher landing page, but without any identifiable license.

  • closed: All other articles.
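
Written out as code, the documented heuristic looks roughly like the sketch below. It is an illustration of the five definitions above, not the OpenAlex implementation: the `WorkFacts` fields are invented for the example, and the order of the checks is an assumption (gold first, then the publisher copy, then repositories).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkFacts:
    """Hypothetical summary of what is known about one work."""
    in_doaj_journal: bool             # published in an OA journal indexed by the DOAJ
    free_at_publisher: bool           # free to read on the publisher landing page
    publisher_license: Optional[str]  # e.g. "cc-by", or None if no license was identified
    free_in_repository: bool          # a free copy exists in an OA repository


def classify_oa_status(w: WorkFacts) -> str:
    """Apply the documented definitions in an assumed order of precedence."""
    if w.in_doaj_journal:
        return "gold"
    if w.free_at_publisher:
        # Open license in a toll-access journal is hybrid; no license is bronze.
        return "hybrid" if w.publisher_license else "bronze"
    if w.free_in_repository:
        return "green"
    return "closed"


# A CC-licensed article in a journal that is not in the DOAJ.
print(classify_oa_status(WorkFacts(False, True, "cc-by", True)))
```

Under these definitions a Creative Commons license on the publisher's copy does not by itself make a work gold: if the journal is not in the DOAJ, the expected status would be hybrid.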


Following this classification, most items in your results should have been marked as gold, as they're published in a DOAJ journal -- e.g. your second work is published in this journal: https://doaj.org/toc/1748-717X. As it's a DOAJ journal, by definition all items published there are gold -- so it should never be anything else.
However, as you can also see in the API results, OpenAlex for some reason doesn't have any PDF URLs for the primary publication location -- so it defaults to 'closed', or to 'green' if there's a version available in a repository somewhere.
I don't know why OpenAlex hasn't picked up the PDFs for these gold items -- but this is definitely an error in some part of the importing process. I also have a suggestion to the OpenAlex team: either update your written heuristic in the docs (https://docs.openalex.org/api-entities/works/work-object#the-openaccess-object) with more details to clearly show how cases like this come to pass; or change things on the processing side to fit the heuristic as-is. In this case, you could label the work as 'gold' as it's published in a DOAJ journal -- even though there's no known gold PDF link in the OpenAlex dataset.

For applications where knowing the exact OA status is highly important, I suggest adding your own heuristics on top of OpenAlex's data -- for instance, you can apply my suggestion above by checking if one of the sources of the work's locations is a DOAJ journal, and if so, mark it as 'gold'. I personally do even more work and combine data from other sources (like crossref, OpenAIRE, etc), and then use that to figure out the open access status according to specific definitions that make sense for my institution. 
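
A minimal sketch of that DOAJ override is shown below. It assumes the work JSON has a `locations` list whose entries carry a `source` object with an `is_in_doaj` flag, and an `open_access.oa_status` field; the function name and the example record are hypothetical.

```python
def corrected_oa_status(work: dict) -> str:
    """Return 'gold' if any location's source is a DOAJ journal, else the reported status."""
    reported = (work.get("open_access") or {}).get("oa_status") or "closed"
    for location in work.get("locations") or []:
        source = location.get("source") or {}  # source can be missing or null
        if source.get("is_in_doaj"):
            return "gold"
    return reported


# Hand-written record mimicking the case in this thread: a DOAJ journal with no
# PDF URL at the publisher location, plus a repository copy, reported as 'green'.
work = {
    "open_access": {"oa_status": "green"},
    "locations": [
        {"pdf_url": None,
         "source": {"display_name": "Hypothetical DOAJ Journal", "is_in_doaj": True}},
        {"pdf_url": "https://repository.example.org/123.pdf",
         "source": {"display_name": "Hypothetical Repository", "is_in_doaj": False}},
    ],
}
print(corrected_oa_status(work))  # prints "gold"
```

The override deliberately ignores whether a PDF link is known, since the missing link is exactly what seems to push these works to 'green'.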


Cheers,
Samuel

Losia Lagisz

Jun 11, 2024, 9:07:44 PM
to OpenAlex Community
Hi,

I have a related question about this definition:

"green: Toll-access on the publisher landing page, but there is a free copy in an OA repository."

I wonder how this is actually determined, especially for preprints?

Given that published articles usually do not link back to their preprints, and preprint records may not link forward to the published versions, there may be no information (e.g. a DOI) in the metadata about the existence of another copy.

In such a case, does the algorithm conduct an online search for other copies? If so, is it based only on an exact title match? Title-based search might be quite unreliable, given that many preprints have a different title from the published version (as we noticed when evaluating preprints from EcoEvoRxiv).

Cheers,
Losia

Samuel Mok

Jun 12, 2024, 5:31:36 AM
to Ivo Bleylevens, OpenAlex Community
Addendum: the source code of OpenAlex is of course available; here you can find the complete script for creating/updating a Work entry: https://github.com/ourresearch/openalex-guts/blob/main/models/work.py

Here is the main function for determining OA status:
[Attached image: image.png]
Which mostly matches the heuristic, with some small details added. You'll need to dig deeper into the code, starting with checking out how open access status of locations is determined. 

Cheers,
Samuel

Samuel Mok

Jun 12, 2024, 5:42:10 AM
to Losia Lagisz, OpenAlex Community
Hi Losia,

The code for matching Works is also available and can be found here: https://github.com/ourresearch/openalex-guts/blob/main/models/__init__.py -- Items are first ingested as 'Records' before being parsed to Work objects. 
Here are the subqueries that are used for matching works (the code is shortened by me to show only the relevant part):
[Attached image: image.png]

These joins are run when a record is being processed, and you can find that in https://github.com/ourresearch/openalex-guts/blob/main/models/record.py in the function get_or_mint_work.
They try to find a match in this order, and stop if a match is found:
1. doi
2. pmid
3. arxiv_id
4. match arxiv preprint from datacite 'related records' field
5. title

For the title they apply some limits, such as a minimum and maximum number of characters, and they normalize the titles using this Python function embedded in the SQL database: https://github.com/ourresearch/openalex-guts/blob/main/sql/db_udfs/f_unpaywall_normalize_title.sql
They have some safeguards in place to prevent errors in title matching, which you can see spread around the various bits of code.
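
The matching order can be pictured as a simple cascade. The sketch below is not taken from openalex-guts: the in-memory `indexes`, the record field names, the length threshold, and `normalize_title` are all assumptions made for illustration, and the normalizer is not a port of `f_unpaywall_normalize_title`.

```python
import re
import unicodedata
from typing import Optional

# (record field, index to search), tried in the order listed above.
MATCH_ORDER = (
    ("doi", "doi"),
    ("pmid", "pmid"),
    ("arxiv_id", "arxiv_id"),
    ("datacite_related_arxiv_id", "arxiv_id"),  # arXiv ID taken from DataCite related records
    ("normalized_title", "normalized_title"),
)


def normalize_title(title: str) -> str:
    """Illustrative normalizer: lowercase and keep only ASCII letters and digits."""
    text = unicodedata.normalize("NFKD", title).lower()
    return re.sub(r"[^a-z0-9]", "", text)


def find_matching_work(record: dict, indexes: dict, min_title_len: int = 20) -> Optional[str]:
    """Return the ID of the first matching work, or None if nothing matches."""
    for field, index_name in MATCH_ORDER:
        value = record.get(field)
        if not value:
            continue
        # Hypothetical safeguard: very short titles are too ambiguous to match on.
        if field == "normalized_title" and len(value) < min_title_len:
            continue
        work_id = indexes.get(index_name, {}).get(value)
        if work_id:
            return work_id  # stop at the first identifier that matches
    return None  # stands in for minting a new work


indexes = {
    "doi": {"10.1000/example": "W1"},
    "arxiv_id": {"2401.00001": "W2"},
    "normalized_title": {normalize_title("A Long Enough Example Title for Matching"): "W3"},
}
title_only = {"normalized_title": normalize_title("A long enough example title, for matching!")}
print(find_matching_work(title_only, indexes))
```

A record that carries only a title reaches the last step, which is why a preprint whose title differs from the published version would fail to match in a scheme like this.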

Cheers,
Samuel



Losia Lagisz

Jun 12, 2024, 6:59:13 AM
to Samuel Mok, OpenAlex Community
Thank you, Samuel - that's a lot of detail!
It seems that only arXiv is searched at the moment, not a broad range of preprint repositories.
Are there any plans to include more?
Kind regards,
Losia

Ana Enriquez

Jun 12, 2024, 7:11:50 AM
to OpenAlex Community
Hi Losia and all,

It looks to me like that code is matching on arXiv ID, as well as a few others (DOI, PMID...). But the full list of "sources" (including preprint repositories) for Unpaywall is much bigger. You can view the list and learn how sources can be added on Unpaywall's Sources page. I suspect the initial problem you raised, that many preprints don't have the necessary metadata for a match, is an issue.

All the best,
Ana

Esther M. Jackson

Aug 20, 2024, 5:04:00 PM
to OpenAlex Community
This is a fascinating discussion that my colleague shared with me as we work with some data of our own. 

Our group is using OpenAlex data to estimate APC spending and was puzzling over the coexistence of APCs for Green or Bronze journals. We reviewed this article which helped explain the OA type logic: https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-

It may be appropriate to open an issue in the GitHub project (https://github.com/ourresearch/OpenAlex/issues) to document further discussion. What do folks think? I'm happy to create an issue or leave that to someone else, such as Samuel, who has already articulated a specific request:

...either update your written heuristic in the docs (https://docs.openalex.org/api-entities/works/work-object#the-openaccess-object) with more details to clearly show how cases like this come to pass; or change things on the processing side to fit the heuristic as-is. In this case, you could label the work as 'gold' as it's published in a DOAJ journal -- even though there's no known gold PDF link in the OpenAlex dataset.

Thank you,
Esther