Semantic (vector) search questions and clarifications needed


Rainer M Krug

Mar 13, 2026, 4:31:21 AM
to OpenAlex Community
Hi

I am posting here as I think the answers will be of interest to everybody.

I am trying to get a better picture of the semantic search (https://developers.openalex.org/guides/searching#semantic-search-beta), which is effectively a vector search, only with the embeddings generated by OpenAlex.

My questions are:

- What is vectorised? Title and Abstract? Fulltext if available?
- Which embedding model does OpenAlex use, and how stable is it, i.e. how often will it be updated? Along the same line: if it is updated, can I select the old embedding model (for repeatability of searches and tracking change over time)?
- Which similarity metric is used? I assume cosine similarity?
- I assume that this is deterministic, i.e. I get the same works back each time?

It would be great if the following extensions could be added:
- Vectorise a document locally and supply the vector to the API. Use case: I have a handful of articles and want to find additional papers similar to these articles.
- Instead of getting the top 50 back, get everything above a certain similarity. This would make an explorative systematic literature search possible using semantic search.

Thanks a lot,

Rainer
---
Dr. Rainer M. Krug (PhD Conservation Ecology, SUN; MSc Conservation Biology, UCT; Dipl. Phys. Germany)

Senior Data Specialist
Environmental Bioinformatics,
SIB Swiss Institute of Bioinformatics
Zurich

Senckenberg Biodiversity and Climate Research Centre,
Senckenberg Society for Nature Research
Frankfurt Main



Ivan Sterligov

Mar 13, 2026, 7:08:35 AM
to Rainer M Krug, OpenAlex Community
Hello Rainer,

I guess this is the retrieval part:

https://github.com/ourresearch/openalex-elastic-api/blob/master/vector_search/views.py

It uses Databricks Mosaic AI Vector Search (https://docs.databricks.com/aws/en/vector-search/vector-search) and the databricks-gte-large-en model, which is mainstream and production-grade. Importantly, it is built for English but can possibly work for other languages too.

I assume this was the initial ingestion pipeline:

https://github.com/ourresearch/openalex-walden/blob/main/notebooks/vector_search/BatchEmbeddings.ipynb
it uses titles + abstracts ("Truncates title to 500 chars and abstract to 5500 chars"),
but it says it uses a different embedding model (openai-embedding-3-small).
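The quoted truncation rule would amount to something like this (my reconstruction of the input construction, not the actual notebook code; the exact joining and cleanup may differ):

```python
def build_embedding_input(title, abstract):
    """Concatenate title and abstract, applying the limits quoted from
    the notebook: 500 chars for the title, 5500 for the abstract.
    (A sketch -- the real pipeline may join or clean the text differently.)"""
    title = (title or "")[:500]
    abstract = (abstract or "")[:5500]
    return f"{title}\n\n{abstract}".strip()

text = build_embedding_input("A" * 600, "B" * 6000)
# at most 500 title chars + 5500 abstract chars survive truncation
```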

There is a more recent one that handles only titles:
https://github.com/ourresearch/openalex-walden/blob/main/notebooks/vector_search/TitleOnlyEmbeddings.py
it "Generates embeddings for ~197M works that have titles but no abstracts. Uses `ai_query()` with `databricks-gte-large-en`"

They also seemingly tried another model (databricks-bge-large-en) here:
https://github.com/ourresearch/openalex-walden/blob/main/notebooks/vector_search/CreateWorkEmbeddings_v2.ipynb

I would assume that for now they mostly use databricks-gte-large-en for titles only. I also suspect that vectorizing abstracts is more problematic from publishers' point of view.

I did not find any mentions of chunking and vectorizing full texts.

Overall, a bit confusing, but very interesting, and huge respect to OurResearch for being so open; they even include the time and $ costs for embedding all those millions of docs in some notebooks.

Best regards,
Ivan



--
All the best,

Ivan Sterligov

Rainer M Krug

Mar 13, 2026, 8:20:12 AM
to Ivan Sterligov, OpenAlex Community
Hi Ivan

Thanks for digging into the source code for that - much appreciated.

Lots of interesting info, but it highlights the need for an official OpenAlex document that describes the embeddings as well as the search in more detail, so that one can actually use it.

Thanks a lot,

Rainer

Samuel Mok

Mar 13, 2026, 8:31:40 AM
to Ivan Sterligov, Rainer M Krug, OpenAlex Community
Hi Rainer and Ivan, 

To add to this, the "jobs" folder in the walden repo shows the actual tasks as run by the OA team. The embedding creation task is defined here, indeed pointing to one of the notebooks Ivan linked, and "confirming" that they're using OpenAI's embedding API, and that the input is title + abstract, truncated to 0.5k and 5.5k chars respectively.

This is a pretty straightforward and "simple" approach, and I feel like there is a lot to gain here: for example, they already do some sort of structured embedding with their topics analysis -- after all, that's using an LLM to match the title + abstract to a predefined set of topics. It seems logical to me to reuse this result for item matching as well! Or to include a hybrid-search pattern, combining BM25 with the embeddings to enable both keyword-based and similarity-based searches in one query.
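The hybrid pattern I mean could be sketched like this (my illustration, not anything OpenAlex runs; the min-max normalisation and the 0.6 weight are arbitrary choices):

```python
def hybrid_rank(bm25_scores, vector_scores, alpha=0.5):
    """Blend keyword (BM25) and semantic (cosine) scores per document id.

    Each dict maps document id -> raw score. Both sides are min-max
    normalised so the scales are comparable, then mixed with weight
    `alpha` on the BM25 side.
    """
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    b, v = norm(bm25_scores), norm(vector_scores)
    blended = {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0)
               for d in set(b) | set(v)}
    return sorted(blended, key=blended.get, reverse=True)

# A work with a middling keyword score but a high semantic score can
# outrank a pure keyword hit:
ranking = hybrid_rank({"W1": 12.0, "W2": 3.0, "W3": 8.0},
                      {"W1": 0.2, "W2": 0.9, "W3": 0.8},
                      alpha=0.6)
# W3 ranks first here
```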

On another note, I read that it takes 8-10 days to calculate the embeddings, but that doesn't seem right to me, as OpenAI also provides a batch API that should be way quicker and costs 50% of the price! The downside is that it does not respond to queries immediately, instead returning the results within 24h -- but that's fine for this purpose. See https://developers.openai.com/api/docs/guides/batch/. Also, the estimate of 8-10 days of processing corresponds to roughly ~500 tokens per item on average, so I used that to calculate a quick comparison of how you could approach creating these embeddings, see below. Renting a cloud server would be very economical in this case, I think:

Option                     TFLOPS/hour              Time taken   Cost
Gaming PC (RTX 4090)       414,000                  ~5 days      ~$10 (electricity only)
Cloud GPU (1x H100 80GB)   1,440,000                ~36 hours    ~$110 ($3.00/hr rental cost, runpod)
OpenAI (Normal Endpoint)   180,000 (5M TPM limit)   ~12 days     $1,736 ($0.02 per 1M tokens)
OpenAI (Batch Endpoint)    723,000                  ~3 days      $868 ($0.01 per 1M tokens)
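For transparency, here is the back-of-envelope arithmetic behind the two OpenAI rows, using the assumed average of ~500 tokens per work and the listed prices (the item count is implied by the $1,736 figure, not an official number):

```python
# Assumed average input length per work (my estimate, see above).
TOKENS_PER_ITEM = 500

def embedding_cost(n_items, usd_per_million_tokens):
    """Total embedding cost in USD for n_items works."""
    return n_items * TOKENS_PER_ITEM / 1e6 * usd_per_million_tokens

# The $1,736 normal-endpoint figure at $0.02 / 1M tokens implies:
implied_items = 1736 / 0.02 * 1e6 / TOKENS_PER_ITEM  # 173.6M works
# The batch endpoint at half price then comes to:
batch_cost = embedding_cost(implied_items, 0.01)     # $868
```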

Cheers,
Samuel

Rainer M Krug

Mar 13, 2026, 8:45:10 AM
to Samuel Mok, Ivan Sterligov, OpenAlex Community
Hi Samuel

Interesting points. So it seems to be openai-embedding-3-small. But if I understand this correctly, you cannot run it locally, so you have to use the OpenAI endpoint for that?

I would be interested in knowing what the reasoning was to pick a proprietary embedding model and none of the open ones. Simplicity of processing?
This makes creating the reference embeddings yourself and submitting them to OpenAlex to search for them more difficult (if that option will exist at some point…).


Cheers

Rainer

Rainer M Krug

Mar 13, 2026, 9:04:28 AM
to Rainer M Krug, Samuel Mok, Ivan Sterligov, OpenAlex Community
I assume that OpenAlex is using Cosine similarity as the similarity measure?


Samuel Mok

Mar 13, 2026, 10:29:45 AM
to Rainer M Krug, Ivan Sterligov, OpenAlex Community
OpenAI embeddings are indeed proprietary models, only accessible through the API. There are many APIs/services out there that can do the same job, including many open models; I'm not sure why this choice was made. Also, you're misunderstanding the next step: the embeddings are retrieved and stored in the OpenAlex Databricks storage, which is set up to perform the search query. See their "option 2" for details on how to use "your own" embeddings as the source. Also, to clarify, embeddings only need to be computed once per item (unless the item or input data changes, or they switch to a different embedding model, of course).

Rainer M Krug

Mar 13, 2026, 10:34:07 AM
to Samuel Mok, Ivan Sterligov, OpenAlex Community
Makes sense.

My thought was to create my own embeddings from one (or multiple) key papers, combine them into a single reference vector, and then search using this self-created reference vector. Also, it would be great if one could download the embeddings via an API call.
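Combining a few key-paper embeddings into one reference vector and then keeping everything above a similarity threshold (the two extensions from my first mail) could look roughly like this locally, once embeddings were downloadable -- a pure-Python sketch, not anything OpenAlex offers today:

```python
from math import sqrt

def normalize(v):
    """Scale a vector to unit length (zero vectors are left as-is)."""
    n = sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def mean_vector(vectors):
    """Average several paper embeddings into one unit reference vector."""
    normed = [normalize(v) for v in vectors]
    dim = len(normed[0])
    return normalize([sum(v[i] for v in normed) / len(normed)
                      for i in range(dim)])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def above_threshold(ref, candidates, threshold=0.8):
    """Return every candidate above a similarity cutoff (rather than a
    fixed top-k), sorted by similarity, as (doc_id, score) pairs."""
    scored = ((doc_id, cosine(ref, vec))
              for doc_id, vec in candidates.items())
    return sorted((t for t in scored if t[1] >= threshold),
                  key=lambda t: t[1], reverse=True)
```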

Thanks

Rainer


Bianca Kramer

Mar 13, 2026, 11:42:15 AM
to Samuel Mok, Ivan Sterligov, Rainer M Krug, OpenAlex Community
Hi all, 

Thanks Rainer for starting this discussion and Ivan and Samuel for surfacing the information. 

I'm branching this thread to raise a related point: now that search functionalities are expanded to better (though still not fully) meet the needs of systematic reviews, and with semantic search based on titles and abstracts, the implications of excluded abstracts for search and retrieval are arguably more relevant than ever.

I wonder whether this user community feels that an argument could effectively be made to publishers that by not having 'their' abstracts included in OpenAlex, this hurts the discoverability of 'their' content? 

Happy to hear viewpoints on this! 

kind regards, 
Bianca Kramer 
(Sesame Open Science and Barcelona Declaration on Open Research Information)




Rainer M Krug

Mar 13, 2026, 12:01:35 PM
to OpenAlex Community, Samuel Mok, Ivan Sterligov, Bianca Kramer
Hi Bianca

This is an important point you make, and one I have already thought about.

This point could definitely be made, and probably even more prominently when talking about the standard title-and-abstract search; with semantic search and vectorisation it very likely matters as well. But I think this would need further investigation, i.e. a comparison of embeddings built from the title only versus title and abstract. If the title is good, it should already outline the topic of the paper, but I guess the full scope is only available via the abstract.

So yes - I think that point can be made. But there is another question: financial incentives, and where publishers make their money - from paper access, or from the fees that OpenAI, WoS, etc. pay so that they can use the abstracts?
Furthermore, availability through something like OpenAlex and its data snapshots makes it extremely easy to use the content to train AIs - and I guess many publishers have, or plan, their own AI chatbots behind a paywall.

So yes - the findability will be lower, but do they actually care?

Cheers,

Rainer