multiple doi assignments, incomplete concepts tagging in Works

Tuan Chien

Aug 8, 2022, 11:42:43 PM
to OpenAlex users
Hi OpenAlex team

I've been taking a look at the Work entity using the recent snapshot (last update 2022-07-09). I want to draw attention to two issues in case they are not already known or are unintended behaviour.

doi

Some DOIs are assigned to multiple different works. One example is

Another example with invalid doi:

From a quick look, it appears that the DOIs in question are assigned to genuinely different works, rather than to duplicate entries for the same work.

There appear to be 899,471 DOIs with multiple assignments like this. This affects 2,171,912 works, or about 0.9% of all works.
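
A check along these lines can be sketched in Python. This is only a minimal sketch, not the query used to produce the numbers above: it assumes each work record is a dict with `id` and `doi` keys, and the function name `find_shared_dois` and the sample records are made up for illustration.

```python
from collections import defaultdict


def find_shared_dois(works):
    """Map each DOI that appears on more than one work to the set of work ids."""
    ids_by_doi = defaultdict(set)
    for work in works:
        doi = work.get("doi")
        if doi:  # skip works without a DOI
            # Assumption: DOIs are compared case-insensitively.
            ids_by_doi[doi.strip().lower()].add(work["id"])
    return {doi: ids for doi, ids in ids_by_doi.items() if len(ids) > 1}


# Made-up records, only to show the shape of the input.
sample_works = [
    {"id": "W1", "doi": "https://doi.org/10.1000/example.1"},
    {"id": "W2", "doi": "https://doi.org/10.1000/EXAMPLE.1"},
    {"id": "W3", "doi": "https://doi.org/10.1000/example.2"},
    {"id": "W4", "doi": None},
]

shared = find_shared_dois(sample_works)
# Number of works touched by a shared DOI.
affected_works = sum(len(ids) for ids in shared.values())
print(len(shared), affected_works)
```

Grouping by DOI and keeping the set of ids (rather than a plain counter) makes it possible to report both the number of shared DOIs and the number of affected works from one pass.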

Concepts

The description in https://docs.openalex.org/about-the-data/concept seems to indicate that concepts are hierarchical.

There exist Work entries that only have higher-level (more specific) concepts listed under concepts, but not the lower-level (broader) concepts above them in the hierarchy.

For example, the work with doi
has two tags, Rhinoplasty (level 3) and Nasal dorsum (level 4), but it is missing the level 0, 1, and 2 concept tags.

For a given work, it would be good if all the concept tags in the hierarchy were provided for that entry, similar to MAG.

Thanks!


Best,
Tuan

Thomas Klebel

Aug 10, 2022, 3:32:29 AM
to OpenAlex users
Hi Tuan,

just jumping in on the question of concepts in MAG: as far as I can recall, the current behaviour in OpenAlex is similar to that of MAG. In MAG there were quite a few cases (can't say how many exactly) like the one you describe.
I agree, though, that it would be good if every work also had top-level concepts. Obviously, this should only be done with a reasonable degree of certainty, which might simply explain why it is not currently the case for all works.

Best,
Thomas

Casey Meyer

Aug 10, 2022, 9:31:44 AM
to Thomas Klebel, OpenAlex users, Jason Priem
Hi Tuan,

re: DOIs

Thanks for the feedback! We are aware of the DOI issue and made changes to fix that a couple weeks ago. The next snapshot will be out soon and it will have those corrections. Looking in the API (latest version of the data) I can see that the DOI you mentioned is assigned to one record now, so it looks like we are on the right track.

If you download the next snapshot we would appreciate it if you check again to see if the problem is resolved. This type of QA is very helpful so if there are still DOIs assigned to multiple works we would like to hear about it.

re: Concepts

The issue with concepts (like Thomas mentioned) is that the tree is hierarchical but broader concepts still need to be matched. You can see the broader concepts for rhinoplasty by looking at the ancestors field in the concepts portion of the API here.
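
As a rough illustration of reading that field, the sketch below pulls the broader concepts out of a concept record. It assumes the record is a dict whose `ancestors` entries carry `display_name` and `level`; the helper name `broader_concepts`, the ids, and the placeholder ancestor names are invented and are not the real ancestors of Rhinoplasty.

```python
def broader_concepts(concept):
    """Return (level, display_name) pairs for a concept's ancestors, broadest first."""
    ancestors = concept.get("ancestors") or []
    return sorted((a["level"], a["display_name"]) for a in ancestors)


# Hand-written stand-in for an API response; ids and ancestor names are placeholders.
sample_concept = {
    "id": "https://openalex.org/C0000000003",
    "display_name": "Rhinoplasty",
    "level": 3,
    "ancestors": [
        {"id": "https://openalex.org/C0000000002", "display_name": "Topic", "level": 2},
        {"id": "https://openalex.org/C0000000000", "display_name": "Broad field", "level": 0},
        {"id": "https://openalex.org/C0000000001", "display_name": "Subfield", "level": 1},
    ],
}

for level, name in broader_concepts(sample_concept):
    print(level, name)
```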

So while the goal with the tagger is to have a full tree of concepts it's not always possible. We're always looking to improve, so if you have a couple examples where it would appear the tagger should have applied concepts and it did not, we will take a look and try to improve it in the next version. The key components used to match tags are work title, journal (venue) title, abstract inverted index, and document type.

Thanks,
Casey

--
Casey Meyer
Developer - OpenAlex, Unpaywall
OurResearch: We build tools to make scholarly research more open, connected, and reusable, for everyone.

--
You received this message because you are subscribed to the Google Groups "OpenAlex users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openalex-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openalex-users/245dd762-67b9-4d43-9d8f-bf30116e0680n%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tuan Chien

Aug 11, 2022, 3:39:10 AM
to OpenAlex users
Thanks for the explanations everyone.  They are insightful.

@Casey Good to know the doi issue has been picked up.  I will run the same check again next snapshot and let you know if multiple assignments still persist, or if it's resolved entirely.

I will make note of the possibility that one needs to fill in the missing tags. It makes more sense why it's doing that now that you've provided some detail on the tagger.

My two cents:
If you do want to impose a hierarchy of concepts on the existing tagger, one modification you could make to the tagging pipeline is something like:
- Construct the hierarchy of concepts as a graph.
- Run tagging classifier.
- For each tag found from the classifier, traverse the concept graph back to level 0.
- Use the set of all traversed concept nodes as your final list of tags.
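
The steps above could be sketched roughly as follows. This is an illustration only, not how the OpenAlex tagger works: the hierarchy is assumed to be available as a child-to-parents mapping, the classifier output is a plain list of names, and the toy hierarchy (including which concept sits above which) is invented.

```python
def expand_with_ancestors(tags, parents):
    """Return the tags plus every concept reachable by following parent links."""
    expanded = set()
    stack = list(tags)
    while stack:
        concept = stack.pop()
        if concept in expanded:
            continue  # already visited, e.g. shared ancestor of two tags
        expanded.add(concept)
        stack.extend(parents.get(concept, ()))
    return expanded


# Toy hierarchy: child -> list of parents (a concept may have several).
parents = {
    "Nasal dorsum": ["Rhinoplasty"],
    "Rhinoplasty": ["level-2 concept"],
    "level-2 concept": ["level-1 concept"],
    "level-1 concept": ["level-0 concept"],
}

classifier_tags = ["Rhinoplasty", "Nasal dorsum"]
final_tags = expand_with_ancestors(classifier_tags, parents)
print(sorted(final_tags))
```

Because the traversal follows every parent link, a concept with several parents pulls in all of their ancestors, so the result can span more than one level-0 concept.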

Philosophically it feels like it should be ok to do this because:
(1) If we accept the hierarchy as valid, then each higher level concept is by definition a specialisation of the lower level concept above it.
(2) If a tagger tags a work with a higher level concept, then by (1) the work should also inherit the corresponding lower level concept. If it shouldn't inherit the lower level concept in that case, then either the validity of the hierarchy is put in question, or the higher level tag is erroneous.

Best,
Tuan

klebel...@gmail.com

Aug 11, 2022, 12:18:57 PM
to Jason Priem, Casey Meyer, OpenAlex users

Hi Jason,

thanks for your thoughts!

With MAG, "aggregating" concepts to the top-most level was problematic in my view, since lower-level concepts had multiple ancestors (if tracing the full hierarchy), which resulted in very vague tagging. An example I remember was “alien”, which can be understood in terms of biology, immigration law, and astronomy or film, and thus was at least traceable to biology, sociology, and potentially another top-level field. I gather you have done some work on the concepts, and in this particular case, the situation is much better now (https://api.openalex.org/concepts?filter=display_name.search:alien has three well-defined concepts).

If works could always be (unambiguously, and with some certainty) tagged with top-level concepts, this would definitely be useful. For the community, it would mean that we would have one solution that can be tested and applied consistently, instead of everyone re-inventing the wheel. It seems to me that retrieving the top-level concepts, or vice-versa, retrieving all works that relate to a concept, are both fairly common use-cases.

Best,

Thomas

p.s.: not sure why, but your message showed up only in my inbox, but not in the related thread (https://groups.google.com/g/openalex-users/c/wyFD6svC0Qo)

From: Jason Priem <ja...@ourresearch.org>
Sent: Thursday, 11 August 2022 00:15
To: Casey Meyer <ca...@ourresearch.org>
Cc: Thomas Klebel <klebel...@gmail.com>; OpenAlex users <openale...@googlegroups.com>
Subject: Re: multiple doi assignments, incomplete concepts tagging in Works

Hi, I just wanted to expand a bit on what Casey already said about the concepts, to share the philosophy behind it. 

Our goal from early on was to create something that was mostly compatible with MAG, and so that's the initial reason for the behavior you observe.

As Thomas notes above, the rather counterintuitive tagging behavior you've observed was actually quite rampant in MAG. Their approach created concept (aka "field-of-study," aka "topic") links on a concept-by-concept basis that completely ignored hierarchy. So that way, when you see a high-level concept like "Biology" applied to an article, it means that the tagger made a direct match between that article and that concept. This direct match is not "polluted" by any inference based on the tag hierarchy. This is why there were tons of MAG articles that matched on "Computational Biology"  (for example) but not on its ancestor "Biology." 

I'm not exactly sure why MAG opted for this approach, but it does have one very nice advantage: you can easily see the strength (as measured by the assignment algo) of each concept-to-work mapping.  And then if you want to also include tags that can be logically assigned based on the hierarchy, you can do it yourself, by looking up ancestors in the published tag tree. So MAG's approach is a bit more explicit, and supports both use cases with a bit of work on the user's part ("show me the directly assigned concepts" and also "show me both the directly assigned concepts, and the logically-assigned ancestor concepts.")
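
To make the two use cases concrete, here is a minimal sketch of keeping both kinds of tag side by side. It is only an illustration under assumptions: the tagger output is taken to be a `{concept: score}` mapping, the tree is taken to be a `{concept: [all ancestors]}` lookup, and the function name and the score are invented.

```python
def tags_with_provenance(direct_scores, ancestors):
    """Label each tag as 'direct' (with its score) or 'inferred' (no score).

    direct_scores: {concept: score} as assigned by the tagger.
    ancestors: {concept: iterable of all its ancestor concepts}.
    """
    result = {c: {"source": "direct", "score": s} for c, s in direct_scores.items()}
    for concept in direct_scores:
        for ancestor in ancestors.get(concept, ()):
            # Never overwrite a direct match with an inferred one.
            result.setdefault(ancestor, {"source": "inferred", "score": None})
    return result


# Toy data; the score is invented.
direct = {"Computational biology": 0.62}
tree = {"Computational biology": ["Biology"]}

tags = tags_with_provenance(direct, tree)
# "Show me the directly assigned concepts":
directly_assigned = [c for c, t in tags.items() if t["source"] == "direct"]
# "Show me both the direct and the logically-assigned ancestor concepts":
all_concepts = sorted(tags)
print(directly_assigned, all_concepts)
```

Keeping a `source` marker means the tagger's own scores stay visible, while downstream users can still filter on the full set.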

But all that said, I agree MAG's approach (and now ours) is pretty confusing, and I think we may change it in future data dumps, depending on user feedback. It's not hard for us to assign tags in both ways (directly, and logically-from-the-tree), and that saves downstream users from having to do it.  So if folks have a preference, please let us know, and we'll consider that carefully.

Best,
Jason

--

Jason Priem, CEO

OurResearch: We make software to help open science.

follow at @jasonpriem and @OurResearch_org

Tuan Chien

Aug 31, 2022, 4:26:13 AM
to OpenAlex users
Thanks for the explanations.

@Casey I have now had a chance to look at the multiple DOI assignments in the August snapshot. There does appear to be a massive reduction in multiple assignments compared to previous snapshots. There is still a tail of about 16k DOIs exhibiting this. I have a CSV I can share (~700kB) of the DOIs in question, including a count of the ids assigned to each DOI. If you're interested in seeing this, let me know what the best way is to share it.

Best,
Tuan

Richard Orr

Aug 31, 2022, 4:04:00 PM
to Tuan Chien, OpenAlex users
Hi Tuan,

Thanks for checking our work! After we cleaned up the bulk of the (mostly inherited) duplicated DOIs we found a bug where we were occasionally creating our own duplicates - there were 18,042 Works sharing 8,936 distinct DOIs. The bug has been fixed and the duplicate works merged away. This will be reflected in the next snapshot update.

It would be great to have your list for comparison. There's no need to keep it secret; at 700kB you should be able to attach it here or send it to sup...@openalex.org.

Thanks,
Richard

--
Richard Orr
Lead Developer - Unpaywall, OpenAlex

Tuan Chien

Sep 1, 2022, 3:24:44 AM
to OpenAlex users
The csv is uploaded as an attachment.  The second column "c" is the number of ids with that doi assignment.
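
For anyone wanting to summarise a file of this shape, a minimal sketch follows. It assumes a header row and a count column named "c" as described above; the name of the first column (`doi` here), the helper name, and the sample rows are assumptions made up for illustration and are not taken from the attachment.

```python
import csv
import io


def summarize_counts(csv_file, count_column="c"):
    """Return (number of DOIs, total ids) over rows with more than one id."""
    reader = csv.DictReader(csv_file)
    counts = [int(row[count_column]) for row in reader]
    multi = [n for n in counts if n > 1]
    return len(multi), sum(multi)


# Invented rows standing in for the attachment.
sample = io.StringIO("doi,c\n10.1000/example.1,2\n10.1000/example.2,3\n")
n_dois, n_ids = summarize_counts(sample)
print(n_dois, n_ids)
```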

The level of community engagement and responsiveness of the OpenAlex team is much appreciated. Thank you!

Best,
Tuan
bquxjob_7388bade_182edc89dfe.csv

Casey Meyer

Nov 2, 2022, 10:50:16 AM
to OpenAlex users
Hi Tuan and Thomas,

We implemented the fix for concepts so it is a hierarchical tree. Can you please check it out and see if it's working as you expected?
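
One quick way to spot-check this is sketched below. It is a rough heuristic under assumptions, not an official test: it only assumes each entry in a work's concepts list has a `level` field, and it flags levels above the deepest tag that have no concept at all. It does not verify that the tags present are the correct ancestors.

```python
def missing_levels(work_concepts):
    """Return the levels below the deepest tagged level that have no concept."""
    levels = {c["level"] for c in work_concepts}
    if not levels:
        return []
    return [lvl for lvl in range(max(levels)) if lvl not in levels]


# The example from the start of this thread, before the fix.
before = [
    {"display_name": "Rhinoplasty", "level": 3},
    {"display_name": "Nasal dorsum", "level": 4},
]
print(missing_levels(before))
```

An empty result means every level from 0 down to the deepest tag has at least one concept.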

Thanks,
Casey
