Language detection for works

Jason Portenoy

May 11, 2023, 2:21:49 PM
to OpenAlex users
Hello everyone,

OpenAlex works now have a "language" field. This is the language of the work in ISO 639-1 format. The language is automatically detected using the information we have about the work. We use the langdetect software library on the words in the work's abstract, or the title if we do not have the abstract. Keep in mind that this method is not perfect, and that in some cases the language of the title or abstract could be different from the body of the work.
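For readers who want to reproduce the idea locally, here is a minimal sketch of the abstract-then-title fallback described above. It is not the OpenAlex pipeline code; the function name `detect_work_language` and the choice to pass the detector in as a parameter are assumptions made for illustration.

```python
from typing import Callable, Optional


def detect_work_language(
    title: Optional[str],
    abstract: Optional[str],
    detect: Callable[[str], str],
) -> Optional[str]:
    """Return a language code for a work, preferring the abstract over the title.

    `detect` is any callable mapping text to a language code, for example
    `langdetect.detect`. This helper is hypothetical, not OpenAlex code.
    """
    for text in (abstract, title):
        # Use the first non-empty text: the abstract if we have it, else the title.
        if text and text.strip():
            try:
                return detect(text)
            except Exception:
                # langdetect raises an exception when the text has no usable features.
                return None
    return None


# Example usage, assuming the langdetect package is installed:
#   from langdetect import DetectorFactory, detect
#   DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
#   detect_work_language("Un titre en français", None, detect)
```

Passing the detector in as an argument keeps the fallback logic independent of any one library, so a different detector can be swapped in later.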

You can filter and group by language code.
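As a rough illustration, the filter and group-by requests could be built like this. The sketch assumes the public endpoint `https://api.openalex.org/works` with the `filter=language:<code>` and `group_by=language` query parameters; the helper names are made up for this example, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed base URL of the OpenAlex works endpoint.
BASE_URL = "https://api.openalex.org/works"


def works_by_language_url(language_code: str) -> str:
    """Build a URL that filters works by an ISO 639-1 language code."""
    query = urlencode({"filter": f"language:{language_code}"}, safe=":")
    return f"{BASE_URL}?{query}"


def works_grouped_by_language_url() -> str:
    """Build a URL that groups works by their detected language."""
    return f"{BASE_URL}?{urlencode({'group_by': 'language'})}"


if __name__ == "__main__":
    print(works_by_language_url("fr"))
    print(works_grouped_by_language_url())
```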


Cheers,
Jason Portenoy

Eric Jeangirard

May 12, 2023, 8:59:49 AM
to Jason Portenoy, OpenAlex users
Hi Jason
Thanks for this new feature!
Just to understand, is there a rationale for preferring the langdetect package over alternatives such as https://pypi.org/project/fasttext-langdetect/?
Thanks!
Eric

Jamie

Jun 2, 2023, 10:33:56 AM
to OpenAlex users
Hi Eric and Jason

According to this benchmark, fasttext is both faster and more accurate than langdetect: https://modelpredict.com/language-identification-survey

That is why we chose it for this dataset (although that was only a one-off dataset): https://openknowledge.community/language-diversity/. fasttext was easy to set up and use; you can see how to use the model here and how to download the model in the README.
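For anyone who cannot follow the links, here is a rough sketch of what calling a fasttext language-identification model can look like. It is not the code from our repository: the helper names are hypothetical, and the model file name `lid.176.bin` (the pretrained language-identification model published by the fastText project) is an assumption about which model you would download.

```python
from typing import Tuple


def label_to_code(label: str) -> str:
    """Turn a fastText label such as '__label__fr' into 'fr'."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def predict_language(model, text: str) -> Tuple[str, float]:
    """Return (language code, confidence) for the top prediction.

    `model` is expected to behave like a loaded fastText model, i.e. to
    expose predict(text, k=1) returning (labels, probabilities).
    """
    # fastText predicts one line at a time, so newlines must be removed first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return label_to_code(labels[0]), float(probs[0])


# Example usage, assuming fasttext is installed and the model is downloaded:
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   predict_language(model, "Ceci est un résumé en français.")
```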

The titles and abstracts from Crossref Metadata and MAG (I'm not sure whether these made their way into OpenAlex's abstract inverted indexes) required cleaning: some were mistakenly set to URLs or DOIs, and others contained HTML, non-UTF-8 characters, etc.
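To give a feel for the kind of cleaning involved, here is a minimal sketch using only the Python standard library. It is not the exact code we ran; the function name `clean_text` and the regular expressions are illustrative assumptions, and real data will likely need more rules.

```python
import html
import re
from typing import Optional, Union

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML/XML tag matcher
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
DOI_RE = re.compile(r"^(doi:\s*)?10\.\d{4,9}/\S+$", re.IGNORECASE)


def clean_text(raw: Union[str, bytes, None]) -> Optional[str]:
    """Return cleaned text, or None if nothing usable is left for detection."""
    if raw is None:
        return None
    if isinstance(raw, bytes):
        # Bytes that are not valid UTF-8 become U+FFFD instead of raising.
        raw = raw.decode("utf-8", errors="replace")
    # Strip tags first, then decode entities such as &amp;.
    text = html.unescape(TAG_RE.sub(" ", raw))
    # Drop replacement characters and collapse runs of whitespace.
    text = " ".join(text.replace("\ufffd", " ").split())
    # A field holding only a URL or DOI carries no language signal.
    if not text or URL_RE.match(text) or DOI_RE.match(text):
        return None
    return text
```

Returning None for unusable fields makes it easy to fall back from the abstract to the title before running language detection.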

Thanks

Jamie