Stemming is nice... But


Rainer M Krug

May 16, 2024, 8:39:53 AM
to OpenAlex Community, OpenAlex Support
Hi

I have a question and / or problem.

I was doing a Title and Abstract search for “Researcher” (https://openalex.org/works?page=1&filter=title_and_abstract.search%3A%22Researcher%22) and was wondering: why do nearly 20 million works have “Researcher” in their title or abstract? But then I tried “Research” (https://openalex.org/works?page=1&filter=title_and_abstract.search%3A%22Research%22) and got the same result.
So what is happening (as I understand it) is that stemming treats these two terms as equal, although Researcher and Research are two different things, particularly in searches. By the way, the same is true for “Influencer” and “Influence” and quite a few others I tried, but not for “Report” and “Reporter”.

We were looking to identify papers which have certain actors (Influencer, Researcher) in their abstract or title, but NOT influence or research. So it seems that this is not possible with stemming.

Is there a way of disabling stemming? I thought I read somewhere that terms in inverted commas (“…”) are not stemmed, but we put these words in inverted commas and they are still stemmed.

Is there any way around this? How can we deal with this?

Thanks,

Rainer



--
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)

Orcid ID: 0000-0002-7490-0066

Department of Evolutionary Biology and Environmental Studies
University of Zürich
Office Y19-M-72
Winterthurerstrasse 190
8075 Zürich
Switzerland

Office: +41 (0)44 635 47 64
Cell: +41 (0)78 630 66 57
email: Raine...@uzh.ch
Rai...@krugs.de

PGP: 0x0F52F982



OpenAlex Support

May 16, 2024, 8:39:54 AM
to Rainer M Krug, OpenAlex Community

Hello!

We received your request (ID #1831).

Unfortunately we cannot reply to every support request we receive. But we will try our best.

We hold weekly virtual open houses, which is another way you can get your questions answered, by engaging with us directly! Click here to sign up for an upcoming open house.

For priority support, please have a look at OpenAlex Premium.

Thanks,
OpenAlex Team



Samuel Mok

May 16, 2024, 10:34:40 AM
to Rainer M Krug, OpenAlex Community
Hi Rainer,

The search option for OpenAlex is explained more in-depth here:

As you can read there, it is not possible to disable the stemming algorithm in a search. I'd guess that disabling stemming would either put a lot more stress on the servers or result in sub-par search results, depending on the implementation.

The easiest solution I can think of is to retrieve the title, n-grams, abstract, etc. for the works you get from the API with your current search, and then filter out unwanted articles locally. This should work because the results you want are a strict subset of the results you currently get, if I understand correctly.
In this way you also have full control over how exactly the filtering is done instead of depending on the implementation of OpenAlex.
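
A minimal sketch of that local filtering step, assuming each retrieved work is a dict with a `title` string and an `abstract_inverted_index` mapping of word to positions. The helper names (`abstract_from_inverted_index`, `mentions_exact`) and the two sample records are made up for illustration, and whole-word regex matching is only one possible definition of "exact":

```python
import re


def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inverted_index:
        return ""
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    return " ".join(by_position[pos] for pos in sorted(by_position))


def mentions_exact(work, terms):
    """True if title or abstract contains any term as a whole word (case-insensitive)."""
    text = " ".join([
        work.get("title") or "",
        abstract_from_inverted_index(work.get("abstract_inverted_index")),
    ])
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Hypothetical records standing in for works returned by the stemmed search.
works = [
    {"title": "Research methods in ecology",
     "abstract_inverted_index": {"We": [0], "review": [1], "research": [2], "designs.": [3]}},
    {"title": "A survey",
     "abstract_inverted_index": {"Each": [0], "researcher": [1], "was": [2], "interviewed.": [3]}},
]

kept = [w for w in works if mentions_exact(w, ["researcher", "researchers"])]
print([w["title"] for w in kept])
```

Listing the plural explicitly keeps the matching rule visible; any variant you care about ("Researcher's", hyphenated forms) has to be added to `terms` or handled in the pattern.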

Cheers,
Samuel


Rainer M Krug

May 16, 2024, 11:02:01 AM
to Samuel Mok, OpenAlex Community
Hi Samuel

OK - so I have to work with this.

I think in the longer run, they have to deal with this issue - stemming is nice, but in some cases (as in this one) it limits the usability.

Concerning your suggestion:

Yes - definitely worth investigating. And I have a local snapshot of the data so I could do that. 

But this leads me to an earlier question: how best to implement Abstract and Title full text search on a local snapshot of the data?

I tried DuckDB on a subset (4.5 million works), but the search is really too slow to do anything useful.

I plan on converting the local snapshot to parquet datasets, but I am still struggling with the implementation and at the moment do not really have time to look into this in more detail.

Cheers,

Rainer



Samuel Mok

May 17, 2024, 2:10:19 PM
to Rainer M Krug, OpenAlex Community
That's quite outside the scope of this community, I'd say. Although I'm not familiar with your setup, I'd bet that it's definitely possible to set up a locally running DB like duckdb, sqlite, or PostgreSQL that can do a performant full text search on millions of rows. You'd need to at least build the correct indices for your use case and of course use the appropriate query, but sub-1-second query times are very attainable. To get started you could follow the duckdb manual on setting up full text search: https://duckdb.org/2021/01/25/full-text-search.html, although more established non-embedded DBs like PostgreSQL will have a lot more possibilities and documentation for implementing and fine-tuning a full text search, see here for example:

Cheers,
Samuel
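
To illustrate the point about building an index, here is a small sketch using SQLite's FTS5 extension from Python's standard library. It assumes the SQLite build behind your Python includes FTS5 (most do, but not all). The table name `works_fts`, the `search` helper, and the three sample rows are invented for the example, and it says nothing about performance at snapshot scale. FTS5's default `unicode61` tokenizer does not stem, so "researcher" and "research" stay distinct:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Full-text index over title and abstract. work_id is stored but not indexed.
# tokenize='unicode61' is the FTS5 default and applies no stemming;
# tokenize='porter' would stem instead.
con.execute(
    "CREATE VIRTUAL TABLE works_fts "
    "USING fts5(work_id UNINDEXED, title, abstract, tokenize='unicode61')"
)

# Hypothetical sample rows standing in for a local snapshot.
rows = [
    ("W1", "Research methods in ecology", "We review research designs."),
    ("W2", "A survey", "Each researcher was interviewed."),
    ("W3", "Social media", "How an influencer shapes opinion."),
]
con.executemany("INSERT INTO works_fts VALUES (?, ?, ?)", rows)


def search(query):
    """Return work ids whose title or abstract matches the FTS5 query."""
    cur = con.execute(
        "SELECT work_id FROM works_fts WHERE works_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in cur]


print(search("researcher"))
print(search("research"))
print(search("researcher OR influencer"))
```

The same idea carries over to DuckDB's full text search extension or PostgreSQL's `tsvector` indexes; the key step in each case is building the index once rather than scanning the text on every query.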

Casey Meyer

May 29, 2024, 11:57:09 AM
to OpenAlex Community
Hi Rainer,

Great news! I think you made a great point about this and we can see others wanting to disable stemming. So you can now search title and abstract without stemming and without stop words removed. This is enabled in the API for now, but we'll likely add it in the UI in the near future. You can use this by adding "no_stem" to the end of the search param for these four parameters, like this:


Hope this is helpful for you and others that are interested in this capability. We might add it to fulltext later, but are starting with title and abstract for now.
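
A minimal sketch of how such a request URL could be built. It assumes the suffix is appended as `.no_stem` to the `title_and_abstract.search` filter key and that the API lives at `https://api.openalex.org/works`; both are assumptions, not taken from this message, so check them against the OpenAlex API documentation. The helper `works_url` is made up for illustration and does not send any request:

```python
from urllib.parse import urlencode

# Assumed API endpoint; the thread itself only shows openalex.org web UI URLs.
BASE = "https://api.openalex.org/works"


def works_url(term, no_stem=True):
    """Build a title-and-abstract search URL, optionally with stemming disabled."""
    key = "title_and_abstract.search"
    if no_stem:
        key += ".no_stem"  # assumed spelling of the suffix described above
    return BASE + "?" + urlencode({"filter": f"{key}:{term}"})


print(works_url("researcher"))
print(works_url("researcher", no_stem=False))
```

Comparing the result counts of the two URLs for a term like "researcher" is a quick way to see whether the unstemmed search behaves as expected.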

Thanks,
Casey

--
Casey Meyer, CTO
OurResearch
We build tools to make scholarly research more open, connected, and reusable—for everyone.



Krugs.de

May 29, 2024, 12:03:52 PM
to Casey Meyer, OpenAlex Community
Dear Casey

This is great news. Perfect - love it. I will definitely look into this and use it.

This is why OpenAlex is so great!

Cheers,

Rainer

---
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)

Orcid ID: 0000-0002-7490-0066

Department of Evolutionary Biology and Environmental Studies
University of Zürich
Office Y34-J-74
Skype:     RMkrug

PGP: 0x0F52F982
