Concept translations from wikidata

Tom Demeranville

Dec 13, 2022, 10:24:09 AM

to OpenAlex users

Hi,

I'm wondering if anyone has pulled all the concept translations out of wikidata and matched them up with the concepts in OpenAlex?

Looking at it, it seems theoretically possible to generate some flat CSV files containing the available translations for each concept. However, I'm not very familiar with wikidata or how to get the information out, so I'm hoping someone else has already done it!

All the best,

Tom

Sandra M

Dec 15, 2022, 3:14:39 PM

to OpenAlex users

Hello Tom,

I'm a bit familiar with both APIs, so I set up an example of how you could pull all translations for a concept from Wikidata here: https://github.com/smierz/openalex-utils/blob/main/notebooks/translate_concept_using_wikidata.ipynb

If you combine this approach for one concept with cursor paging through the entire OpenAlex 'concepts' endpoint, as described in this notebook (https://github.com/ourresearch/openalex-api-tutorials/blob/main/notebooks/getting-started/paging.ipynb), you can pull all concepts and their translations.

Words of caution:

If you store all data in one CSV, you end up with a very large file that probably won't fit into Excel:
OpenAlex stores 65,073 concepts, and Excel has a limit of 1,048,576 rows (that's a maximum of ~16 translations per concept before you exceed the limit).
You may want to set per_page to its maximum of 200 to lower the number of requests to OpenAlex.
(I don't know if and how you could reduce requests to Wikidata, though.)
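
For anyone who wants to try this without opening the notebooks, here is a minimal sketch of the two pieces combined: cursor paging through the OpenAlex 'concepts' endpoint, and fetching labels and aliases from Wikidata. It is not the code from either notebook. The function names are made up for illustration, and it assumes each concept record carries a `wikidata` URL ending in the Q-identifier. On reducing Wikidata requests: the `wbgetentities` action accepts several IDs per call (up to 50 for ordinary clients), so batching IDs is one possible way to cut the request count.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_CONCEPTS = "https://api.openalex.org/concepts"
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def fetch_json(url):
    """Default fetcher; swap in a stub for testing or a rate-limited client."""
    headers = {"User-Agent": "concept-translations-sketch/0.1"}  # placeholder agent string
    with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as resp:
        return json.load(resp)


def iter_concepts(fetch=fetch_json, per_page=200):
    """Yield every concept by following OpenAlex cursor paging until no cursor is left."""
    cursor = "*"
    while cursor:
        query = urllib.parse.urlencode({"per-page": per_page, "cursor": cursor})
        page = fetch(f"{OPENALEX_CONCEPTS}?{query}")
        yield from page["results"]
        cursor = page["meta"].get("next_cursor")


def qid_of(concept):
    """Extract 'Q123' from the concept's Wikidata URL (assumed field), or None."""
    url = concept.get("wikidata")
    return url.rsplit("/", 1)[-1] if url else None


def batches(items, size=50):
    """Split a list into chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def wikidata_url(qids):
    """Build one wbgetentities request for the labels and aliases of several items."""
    query = urllib.parse.urlencode({
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "props": "labels|aliases",
        "format": "json",
    })
    return f"{WIKIDATA_API}?{query}"


def rows(entities):
    """Flatten the 'entities' mapping into (qid, language, kind, text) rows."""
    for qid, entity in entities.items():
        for lang, label in entity.get("labels", {}).items():
            yield (qid, lang, "label", label["value"])
        for lang, aliases in entity.get("aliases", {}).items():
            for alias in aliases:
                yield (qid, lang, "alias", alias["value"])
```

A full run would collect `qid_of(c)` for every concept from `iter_concepts()`, then call `fetch_json(wikidata_url(batch))["entities"]` for each batch and write `rows(...)` out with `csv.writer`. One row per (concept, language, text) keeps the file flat, though it will still be too long for Excel.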

And now for something completely different: Congratulations on taking part in the Great Pottery Throw Down!!

Cheers,

Sandra

Demeranville, Tom

Dec 16, 2022, 8:55:32 AM

to Sandra M, OpenAlex users

Wow, that's great. Thank you very much Sandra.

Tom Demeranville

Product Director

ORCID Inc

https://orcid.org/0000-0003-0902-4386


Demeranville, Tom

Dec 20, 2022, 6:49:12 AM

to Sandra M, OpenAlex users

Thanks to the code you supplied, I've done a little bit of coding and mapped all the translations and aliases to the OpenAlex concepts. I then matched them with the keywords field in ORCID that researchers use to describe themselves.

Some discoveries:

- There are 3m+ aliases and translations for the ~65k concepts.

- Education is the top self-reported keyword concept in the ORCID registry after mapping to aliases and translations. Previously, the top keyword was the English-language string "Machine Learning".

- Researchers have used 41 different ways to refer to 'data mining' in the ORCID registry

- 790k ORCID records are linked to at least one OpenAlex concept via the keywords field

- There are some popular concepts in the ORCID registry that don't match well to OpenAlex concepts. For example, "Hydrology" in ORCID does not match the "Hydrology (agriculture)" concept in OpenAlex/Wikidata.

A couple of things I need to work on to get more useful insights and accurate numbers:

- OpenAlex can have multiple concepts per Wikidata concept. For example, the "Formal Education", "Formal Learning" and "Education" OpenAlex concepts all map to "Education" in Wikidata.

- The same alias/translation can appear in multiple Wikidata concepts. For example, the Hindi translation of "Informal Education" in Wikidata is "education", which causes me issues.

- A surprisingly small number of keywords in ORCID match the 3m+ translations and aliases (~80K). There are a lot of keywords (~1m) in ORCID that do not map to any of the 3m aliases/translations. This could be something to do with the way I'm matching non-roman characters in my relational database.
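
To illustrate the three issues above, here is a rough sketch of matching outside the database. This is not the code used for the numbers in this post, and it is only a guess that normalization explains the low match rate. It normalizes both sides with Unicode NFKC plus case folding before comparing, and it keeps a set of concepts per alias so that an ambiguous string such as "education" is recorded against every concept it belongs to, rather than silently overwriting. The concept IDs and function names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalize(text):
    """NFKC-normalize, casefold, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def build_index(alias_rows):
    """Map each normalized alias/translation to the set of concept IDs that use it."""
    index = defaultdict(set)
    for concept_id, text in alias_rows:
        index[normalize(text)].add(concept_id)
    return index


def match(keyword, index):
    """Return all concept IDs whose alias equals the keyword after normalization."""
    return index.get(normalize(keyword), set())


# Hypothetical rows: (concept ID, alias or translation text)
alias_rows = [
    ("C_education", "Education"),
    ("C_informal_education", "education"),
    ("C_data_mining", "Data Mining"),
]
index = build_index(alias_rows)
ambiguous = {alias for alias, ids in index.items() if len(ids) > 1}
```

Here `match("  data   MINING ", index)` finds the data mining concept despite the case and spacing differences, and `ambiguous` lists the strings that would need a tie-breaking rule. Whether the database collation applies a comparable normalization to non-Roman scripts would need checking separately.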

This is a first exploratory step, but it's very helpful to start to make sense of the various ways researchers self-describe.

And thanks for the mention of Throwdown. Filming it was a lot of fun. I can't wait to watch the new series!

Tom Demeranville

Product Director

ORCID Inc

https://orcid.org/0000-0003-0902-4386
