Evaluating ebm backend / developer branch


Sven Sass

Feb 2, 2026, 5:17:20 AM
to Annif Users
Hello all,

I'm trying to evaluate the ebm backend, but I wanted to check beforehand:

1.) Is it a bad idea for a non-(Annif-)developer to try to evaluate that backend, since it is not stable yet?
2.) If it is not too bad an idea: what would the correct approach be?
  a.) check out the branch "deutsche-nationalbibliothek-issue855-add-ebm-backend-gh-hosted-large-runner"
  b.) check out https://github.com/deutsche-nationalbibliothek/ebm4subjects (and use it with the current main?)
  c.) something else?

Any information appreciated.

Kind regards
Sven

Maximilian Kähler

Feb 3, 2026, 6:29:56 AM
to Annif Users
Hi Sven,

as one of the co-developers here, my answer would be the following:

Right now there is an error with the logger that needs fixing, which you should probably wait for (give us a week or two).
Generally, we are happy about any feedback from you as a beta tester. So the answer is a cautious "yes, go ahead, but mind the gap..."

How to proceed:

  * the correct Annif branch to work with is on our DNB fork: deutsche-nationalbibliothek:issue855-add-ebm-backend
  * the ebm4subjects package can be installed from PyPI, unless you want to work with its source code; in that case, take the main branch of https://github.com/deutsche-nationalbibliothek/ebm4subjects
  * in our latest version of ebm4subjects, support for SentenceTransformers is an optional dependency that is installed when you install Annif with the extra "ebm-in-process" (see pyproject.toml)
  * To get started: there is a draft for a wiki page on ebm: https://github.com/NatLibFi/Annif/wiki/DRAFT-%E2%80%90-Backend:-EBM  It contains all the information on how to configure the backend. The actual embedding model from Hugging Face is probably the most important parameter.
  * To manage expectations: ebm is a method developed to improve performance in the long tail of large vocabularies. On its own, you can expect metric values in about the same range as MLLM, but the actual matches should be significantly distinct from MLLM suggestions (as similarities are based on embeddings, not string representations). For best results, you should use ebm along with e.g. Omikuji or another statistical approach.
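To make the configuration side concrete, here is a purely hypothetical sketch of what a projects.cfg entry might look like. Only device, duckdb_threads, max_chunk_length, max_chunk_count and encode_args_documents are parameter names quoted elsewhere in this thread; every other ebm-specific key, including the one for the embedding model, is a placeholder, so check the draft wiki page for the authoritative names:

```ini
# Hypothetical sketch only; ebm parameter names other than those quoted in
# this thread are placeholders. See the draft wiki page for the real keys.
[ebm-en]
name=EBM English
language=en
backend=ebm
vocab=my-vocab
; placeholder key for the Hugging Face embedding model:
embedding_model=BAAI/bge-m3
device=cuda:0
duckdb_threads=16
max_chunk_length=50
max_chunk_count=20
encode_args_documents={"batch_size": 32, "show_progress_bar": True}
limit=100
```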

Please feel free to send us feedback via GitHub, especially if you run into errors.

Best,
Maximilian

Sven Sass

Feb 4, 2026, 12:51:35 AM
to Annif Users
Hi Maximilian,

thank you for your prompt answer and the detailed information on how to proceed.

I'm happy to hear that it is worth a go and will surely provide feedback.

Kind regards,
Sven

Maximilian Kähler

Feb 5, 2026, 5:15:02 AM
to Annif Users
The error with the logger has been fixed, so you can give it a try now.
Best,
Maximilian

Sven Sass

Feb 6, 2026, 12:45:44 AM
to Annif Users
Hello Maximilian,

thanks for the information. Evaluation will probably start next week. Thanks so much for the support!

Best regards,
Sven

Sven Sass

Feb 10, 2026, 4:58:08 AM
to Annif Users
Hello all,

if someone else is thinking about evaluating the ebm backend: following Maximilian's instructions, it is quite easy to set up the project.

Best regards,
Sven

Sven Sass

Feb 19, 2026, 3:24:16 AM
to Annif Users
Hello Maximilian,

I tried to send you a personal message, but I'm not sure if it reached you, so just to be safe I'm posting it here again.

Currently I'm stuck with my evaluation, because I run into an error with basically any configuration (mainly varying the embedding). I also used your example configuration, but to no avail:

Traceback (most recent call last):
  File "/opt/annif/dev3/Annif/venv/bin/annif", line 6, in <module>
    sys.exit(cli())
             ~~~^^
  File "/opt/annif/dev3/Annif/venv/lib/python3.13/site-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/opt/annif/dev3/Annif/venv/lib/python3.13/site-packages/click/core.py", line 1363, in main
    rv = self.invoke(ctx)
  File "/opt/annif/dev3/Annif/venv/lib/python3.13/site-packages/click/core.py", line 1830, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/opt/annif/dev3/Annif/venv/lib/python3.13/site-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/annif/dev3/Annif/venv/lib/python3.13/site-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
  File "/opt/annif/dev3/Annif/venv/lib/python3.13/site-packages/click/decorators.py", line 34, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/annif/dev3/Annif/venv/lib/python3.13/site-packages/flask/cli.py", line 400, in decorator
    return ctx.invoke(f, *args, **kwargs)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/opt/annif/dev3/Annif/venv/lib/python3.13/site-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
  File "/opt/annif/dev3/Annif/annif/cli.py", line 504, in run_eval
    for hit_sets, subject_sets in pool.imap_unordered(
                                  ~~~~~~~~~~~~~~~~~~~^
        psmap.suggest_batch, corpus.doc_batches
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ):
    ^
  File "/home/dev/.local/share/uv/python/cpython-3.13.11-linux-x86_64-gnu/lib/python3.13/multiprocessing/pool.py", line 873, in next
    raise value
  File "/home/dev/.local/share/uv/python/cpython-3.13.11-linux-x86_64-gnu/lib/python3.13/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ~~~~^^^^^^^^^^^^^^^
  File "/opt/annif/dev3/Annif/annif/parallel.py", line 76, in suggest_batch
    suggestion_batch = project.suggest(batch, self.backend_params)
  File "/opt/annif/dev3/Annif/annif/project.py", line 272, in suggest
    return self._suggest_with_backend(transformed_docs, backend_params)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/annif/dev3/Annif/annif/project.py", line 151, in _suggest_with_backend
    return self.backend.suggest(docs, beparams)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/opt/annif/dev3/Annif/annif/backend/backend.py", line 143, in suggest
    return self._suggest_batch(documents, params=beparams)
           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/annif/dev3/Annif/annif/backend/ebm.py", line 188, in _suggest_batch
    candidates = self._model.generate_candidates_batch(
        texts=[doc.text for doc in documents],
        doc_ids=[i for i in range(len(documents))],
    )
  File "/opt/annif/dev3/Annif/venv/lib/python3.13/site-packages/ebm4subjects/ebm_model.py", line 567, in generate_candidates_batch
    chunk_index = pl.concat(chunk_index).with_row_index("query_id")
                  ~~~~~~~~~^^^^^^^^^^^^^
  File "/opt/annif/dev3/Annif/venv/lib/python3.13/site-packages/polars/functions/eager.py", line 234, in concat
    out = wrap_df(plr.concat_df(elems))
                  ~~~~~~~~~~~~~^^^^^^^
polars.exceptions.SchemaError: type Int64 is incompatible with expected type Null

I'm not sure how I can fix this.

Any insights appreciated.

Best regards,
Sven

Maximilian Kähler

Feb 19, 2026, 5:37:24 AM
to Annif Users
Dear Sven,

thank you for reporting this. It is actually quite difficult to figure out the root of this error remotely.
What we need is more information, ideally a minimal reproducible example that allows us to recreate the error in our setting.
Would you mind reporting this error in an issue here:

I would ask you to add the following information:

* your projects.cfg
* some test data (including a test vocab) that produces this error
* the client call that you used ("annif train [your options]")
* package versions in your Python environment

I know this is a lot of work, but it takes even more effort to dig into this without knowing the circumstances.

Thank you!

Best,
Maximilian

Sven Sass

Feb 20, 2026, 1:54:18 AM
to Annif Users
Hello Maximilian,

I posted an issue report here: https://github.com/NatLibFi/Annif/issues/936

Please let me know if I can be of any help while investigating this issue. I'm really happy to help.

Best regards,
Sven

Sven Sass

Feb 24, 2026, 2:21:17 AM
to Annif Users
Hello Maximilian,

I'm still in the process of evaluating EBM with different embeddings/ensembles etc. Once I'm finished, I'll post my observations here in case they might help someone else.

I did notice that if I train a project with a given setting for device or duckdb_threads and then want to change it when evaluating the project, it will still use the configuration it was trained with.

E.g.: if I train on "cuda:0" and then change the project configuration to "cuda:1", it will still evaluate on "cuda:0".

I have not double-checked with other backends whether this is the intended behavior. I think it would be nice to switch the GPU, or to use more or fewer GPUs, when required. For now I was training on one GPU, but I could imagine training on multiple GPUs while evaluating on only one.

Similar to this: if I copy the project folder (projects/[project_name]) to another folder and configure a project for that folder, it throws an error:
"Error: Cannot open file "<..>/data/projects/ebm-jina/ebm-duck.db": No such file or directory"
where "ebm-jina" is the original project's name, not the current project's name ("ebm-jina-50000").

And of course: I don’t mean to nitpick — I just want to help.

And one more question: the "jinaai/jina-embeddings-v5-text-small" embedding expects the parameter "task" to be set. This should be one of: retrieval, text-matching, clustering, classification. Is "classification" the right choice?
encode_args_documents={"device": "cuda:0", "batch_size": 300, "show_progress_bar": True, "task": "classification"}

Thank you so much and

best regards,
Sven


PS: While evaluating, I see a message like this:
"configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

for more details about differences between saving model and serializing."

As far as I understand the linked page (and ChatGPT), a trained model does not store GPU information, so it should be possible to run it on another GPU.

Sven Sass

Mar 10, 2026, 5:29:38 AM
to Annif Users
Hello Maximilian/all,

my evaluation is finished and here are my findings.


Prerequisites:
My dataset is quite large: I have around 700,000 short text documents, of which 600,000 are used for training and 100,000 for evaluation, with around 7,000 labels. I'm using 256 GB RAM (+256 GB swap) and have two ADA 6000 GPUs (48 GB VRAM each).

I had evaluated basically all available backends beforehand to find the best combination of backends for my case, "best" being defined here as the highest "F1 score (doc avg)". The champion is this ensemble (a basic ensemble, not nn):
- Omikuji-Attention*0.4624,
- Xtransformer*0.4206,
- MLLM*0.1170
With a limit of 15 and a threshold of 0.15, the F1 value is at 69.44%.
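As a side note, the way a basic (non-nn) ensemble combines such weighted sources can be sketched roughly as a weighted mean with limit/threshold filtering (a simplified model with invented subject IDs s1-s3 and scores, not Annif's actual implementation):

```python
# Simplified sketch of a weighted-mean ensemble using the champion's weights
# from above. Subject IDs and per-source scores are invented for illustration.
weights = {"omikuji": 0.4624, "xtransformer": 0.4206, "mllm": 0.1170}
scores = {
    "omikuji":      {"s1": 0.80, "s2": 0.10},
    "xtransformer": {"s1": 0.60, "s3": 0.20},
    "mllm":         {"s2": 0.90},
}
total = sum(weights.values())
combined = {}
for src, w in weights.items():
    for subj, score in scores[src].items():
        combined[subj] = combined.get(subj, 0.0) + w * score / total
# Keep at most limit=15 suggestions with score >= threshold=0.15:
suggestions = sorted(
    ((s, v) for s, v in combined.items() if v >= 0.15),
    key=lambda kv: -kv[1],
)[:15]
print([s for s, _ in suggestions])  # ['s1', 's2']; 's3' falls below 0.15
```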

The task was to evaluate whether adding the EBM backend could improve this result.


Installation
As this backend is not yet in the main version of Annif the installation steps can be found here: https://github.com/NatLibFi/Annif/issues/936.

Please note: Maximilian pointed out that using uv is easier:
uv sync --extra ebm-in-process # or --extra ebm-api


Training
The main parameter of the EBM-backend is the sentence-transformer. I did evaluate these:
- BAAI/bge-m3 (default)
- jinaai/jina-embeddings-v3
- google/embeddinggemma-300m
- intfloat/multilingual-e5-small
- nvidia/llama-embed-nemotron-8b
- jinaai/jina-embeddings-v5-text-small
- Qwen/Qwen3-Embedding-8B

With my hardware and the given training set I was not able to train with any of these embeddings. I reduced the training size to 50,000 documents and was then able to train
- BAAI/bge-m3 (default)
- jinaai/jina-embeddings-v3
- intfloat/multilingual-e5-small

For all other embeddings except "jinaai/jina-embeddings-v5-text-small" the training failed. For "jinaai/jina-embeddings-v5-text-small" the training did pass, but I was not able to evaluate 50,000 documents.

So my first observation is that the memory consumption is higher than that of the other tested backends. The training time was roughly the same as for the X-Transformer backend, but with only 1/12 of the training material.
The backend supports OpenAI and Hugging Face, so if you can send your data to the cloud, the hardware requirements may not be an issue for you.


Evaluation
I achieved these F1-scores
- 18.00% bge-m3
- 26.16% jina-v3
- 17.58% e5-small
- 39.82% mllm (this is even higher than training with 600,000 documents)
- 56.52% omikuji (this is ~10% worse than training with 600,000 documents)

As the results are not directly comparable to the current champion (because of the reduced training set), I evaluated whether I could improve an ensemble consisting of MLLM and Omikuji by adding the ebm backend to it.
For the champion, I had observed an improvement of roughly 2.5% when adding X-Transformer.

The baseline with omikuji:0.6583,mllm:0.3417 was an F1 score of 57.93%; when adding ebm with the jina-v3 embedding (omikuji:0.3491,mllm:0.3636,ebm-jinav3:0.2872), the F1 score dropped to 56.73%.

So for my use case I was unable to achieve an improvement.

The goal of this backend (to my understanding) is to give good results without needing many (or any) examples for any given label. In my case I have rather many examples, and this is the reason the Omikuji backend performs so well.

The jina-embeddings-v5-text-small came out during my evaluation. It ranks quite high in the Hugging Face MTEB leaderboard and may be worth a look if you are trying to evaluate the backend yourself.

I hope this might help someone - feel free to ask any questions.


Again: so many thanks for the support during the evaluation and the quick fix of my issue!

best regards,
Sven

Maximilian Kähler

Mar 11, 2026, 10:39:45 AM
to Annif Users
Dear Sven,

thank you for that detailed report. And also for your previous message:

Parameter change between training and evaluation
Your request to switch some configuration parameters (like cuda vs. cpu) between training and evaluation is very reasonable. We have already implemented this and released the ebm4subjects package in a new version. An update to the Annif backend will follow shortly. Thank you for putting that forward. You will then be able to override most parameters with the Annif client arguments. You will also be able to switch the deployment options (from in-process for training to API for production).

Correct usage of jinaai/jina-embeddings-v5-text-small
I haven't worked with the newest Jina AI model. An earlier version supported asymmetric embeddings for retrieval, e.g. task = "retrieval.passage" (for documents) and "retrieval.query" (for the vocab). This is best suited to EBM. I think this is now handled with the argument "prompt_name" and task = "retrieval". I think setting it to task "classification" would not be ideal.
 
Saving resources with EBM:
Indeed, processing time for EBM is quite slow. The bottleneck is primarily the embedding generation. EBM is not a typical supervised learning backend in the sense that quality scales with the amount of training data. What is trained is only the ranking model, which may be saturated by 1,000 documents or even fewer. So feeding it 600k documents for training is far more than needed. Processing time scales linearly with the number of documents for EBM.
To cut cost, consider:
  * reducing the number of training docs
  * restricting the number of chunks per document with `max_chunk_count` (especially for longer documents this can be expensive)
  * allowing for larger chunks: `max_chunk_length` is 50 characters by default, which means that any chunk longer than that will be split after the next sentence. So usually one sentence is one chunk. If you choose a higher number, chunking will be coarser, also resulting in fewer chunks in total
  * some models (like Jina AI's) also support "matryoshka embeddings", which allow you to choose a smaller embedding dimension. I haven't tested it myself, but this might also help to speed things up.
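The chunking rule described above can be illustrated with a toy chunker (an illustrative sketch only; the real ebm4subjects chunker will differ in detail):

```python
import re

# Toy chunker mirroring the rule described above: sentences are accumulated,
# and a chunk is closed as soon as it reaches max_chunk_length characters,
# so a larger max_chunk_length yields coarser and fewer chunks.
def chunk(text, max_chunk_length=50, max_chunk_count=None):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        current = f"{current} {sent}".strip()
        if len(current) >= max_chunk_length:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks[:max_chunk_count] if max_chunk_count else chunks

text = "One. Two sentences here. And a third, somewhat longer sentence."
print(len(chunk(text, max_chunk_length=10)))  # 2 chunks: fine-grained
print(len(chunk(text, max_chunk_length=50)))  # 1 chunk: coarser
```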

I was surprised that the training crashed with so many of the embedding models. I would expect 48 GB of VRAM to be well enough for most of these models, as long as you don't set the batch_size to unreasonably high values (start with 32, see if it works, then abort and double it; repeat until you reach your VRAM limit). What was the cause of "not being able to train" with the other models? Something like CUDA out of memory?
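The batch-size search described in the last paragraph can be sketched as a simple doubling loop (the capacity check below is a stand-in for "one encoding batch of this size fits in VRAM"):

```python
# Double the batch size until the next step would fail, as suggested above.
# encode_ok is a placeholder for "one encoding batch of this size succeeds
# without a CUDA out-of-memory error".
def find_batch_size(encode_ok, start=32):
    if not encode_ok(start):
        return None  # even the starting size does not fit
    size = start
    while encode_ok(size * 2):
        size *= 2
    return size

# Hypothetical GPU that copes with batches up to 256:
print(find_batch_size(lambda b: b <= 256))  # 256
```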

Including EBM in an Annif ensemble:
It is unfortunate that you could not improve your ensemble by adding EBM. From what I can tell about your setup, setting the weights of omikuji, mllm and ebm to nearly equal values puts too much emphasis on the weaker components (EBM and MLLM). Did you determine these weights manually or with annif hyperopt? If I had to guess parameters, I'd say omikuji:0.66 and splitting up the rest between EBM and MLLM.
We are not at this point with our own ensembles at DNB yet, so we still need to find the best way to integrate EBM into an ensemble. Maybe you can also achieve better performance with the nn ensemble, but perhaps you should wait until the (re-)development of the nn ensemble is finished.

Thank you again for your feedback; this is very valuable. Especially in the current phase before the first release, it is very helpful to have early testers! Please don't hesitate to report any other issues you might have.

Best,
Maximilian

Sven Sass

Mar 16, 2026, 2:26:01 AM
to Annif Users
Hello Maximilian,

great that you implemented parameter modification between training and evaluation. I think this is a very useful feature for your backend.

I did test jina-v5 with "retrieval", so I guess I was on the right track. Thank you for the clarification.

Training-set
As for the amount of training documents: I had the same question for Osma, and he also told me that 600k is far too much. As for Omikuji, my measurement is: more is better. As mentioned in my previous post, MLLM improved with fewer training documents. I'm still struggling to understand how 1,000 documents can be sufficient if I have around 7,000 labels. This would mean I would not have an example for every label (best case: only one example each for 1,000 labels).

If I find the time, I'll look into max_chunk_count and max_chunk_length. My gut feeling tells me this won't help with 600k documents anyway (disregarding whether this is a reasonable approach in the first place). Batch size was at 32 for most of the tests. I did increase it to 300 in some cases, but all of those worked in the end. I did monitor that GPU load is not at 100%; it is more at 50%-70%. However, it was at a constant 100% with the X-Transformer backend, if I remember correctly, and it surely was the case when I was generating embeddings with LlamaIndex. Maybe there is some room for performance optimization which I did not find in my evaluations. But I surely like how seamlessly it uses both GPUs at the same time; this was not the case with the X-Transformer.

Ensembles
I did use hyperopt to calculate the weights: "annif hyperopt -m "F1 score (doc avg)" ensemble-ebm-50000"
Our expectations of the weights match, though: I expected different weights, but had no indication that hyperopt did not produce valid results. I'll give it a shot with manual weighting to see if it improves the result.

As my previous tests did not show improvements for the nn ensemble, I did not test it. I agree that revisiting this after the new nn ensemble is finished makes sense.


Here is some more detailed information about my issues during training and evaluation.

google/embeddinggemma-300m
While training 600,000 articles it was obvious after 2% that it was using up GPU memory quickly and at a steady pace. Assuming the ratio stays the same, it would have required about 1,400 GB at 100%. In a later retest with 50,000 documents it stopped with "CUDA out of memory".


jinaai/jina-embeddings-v3
"trust_remote_code" is required. It basically takes 5 GB and stays that way. Jina also seems to be quite a bit faster (3-4 times?).

For the configuration, only "cuda" is required to use the GPU. It took me some time to install the required components (flash-attn/cuda-toolkit), but I have to admit I'm not a pro at Python (and I think that is out of scope for this backend or even Annif).


When training ~700k documents, the phase "Backend ebm: creating embeddings for text chunks and query dataframe" took roughly 18 hours. After that it crashed, because it needed >512 GB of CPU memory, which is too much for my system.

The CPU memory (second column) stays low for quite a while and then spikes within only a few minutes. My assumption is that this happens in the phase after the mentioned message. Actually, I thought the "hard part" was finished at that point.

22:14:53,137848,29318,2769
22:15:03,138237,29318,2769
22:15:14,140502,29318,2769
<...>
22:16:44,248053,29318,2769
22:16:54,253141,29318,2769
22:17:04,253960,29318,2769
22:17:14,253857,29318,2769
<...>
22:25:37,255168,29318,2769
22:25:48,255194,29318,2769
-> Crash (syslog out of memory)


intfloat/multilingual-e5-small
No issues with 50,000 documents; training took 18h.

I started the training process with 600k documents to see what happens. I'm not 100% sure whether I tested this before; at least it is not documented, so I'm training it again.


BAAI/bge-m3
No issues with 50,000 documents; training took 21h.

I will start the training process with 600k documents to see what happens. Again: I do think I've tested this, but I have not documented it.

nvidia/llama-embed-nemotron-8b
"CUDA out of memory" more or less at startup

Qwen/Qwen3-Embedding-8B
"CUDA out of memory" more or less at startup

jinaai/jina-embeddings-v5-text-small
Training with 50,000 documents works fine, but evaluating 100k documents crashes after ~18h at 190 GB CPU memory. The last message was
"Backend ebm: running vector search and creating candidates with query_jobs: 16"
There was barely any CPU/GPU load.


My assumption is that any model failing with "CUDA out of memory" right at the beginning will not work with any number of documents on my GPU. I did implement a RAG system in another context and successfully used the Qwen/Qwen3-Embedding-8B embedding for generating the embeddings, so this embedding does fit on my GPU (generally speaking).

Side note: I'm really curious how the new v5 embedding competes with other local embeddings and OpenAI's "text-embedding-3-large" (in the RAG context I'm allowed to use cloud services).


I'll send an update when the mentioned training tests are done - feel free to ask for any information!

Best regards,
Sven

Sven Sass

Mar 18, 2026, 8:25:52 AM
to Annif Users
Hello Maximilian,

tl;dr:
a.) I wanted to double-check that training with 600k documents fails, and I'm afraid this is indeed the case for both the e5 and m3 embeddings.
b.) I was able to improve the F1 score of the ebm ensemble by 0.77% with manual weights

More details:

intfloat/multilingual-e5-small
The initial GPU memory usage was ~5.7 GB, rising to ~8.1 GB during chunking. GPU load was at about 50%-70% on both GPUs. CPU memory was steadily rising; my monitoring only watches physical memory, so it stopped at 256 GB, but apparently there was enough swap space to finish the process. Batches took 3:20h on both GPUs.

Columns: time, CPU Memory, GPU Memory 1, GPU Memory 2
09:53:37,252908,18,1071
...
09:53:47,123941,18,1071
So it uses quite a lot of memory, but this phase is successful.

The following phase,

"Backend ebm: running vector search and creating candidates with query_jobs: 16"
uses only one CPU in the beginning (is there a way to use all of them?). It starts at around 124 GB of CPU RAM and finishes at about 184 GB, then uses all processors at around 55% total usage (maybe the 16 threads are in effect) and takes around 11 hours.
11:49:01,127710,18,1071
11:49:11,128709,18,1071
...
22:31:28,254251,18,1071
22:31:39,127916,18,1071

It stops with this message:
/home/dev/.local/share/uv/python/cpython-3.13.11-linux-x86_64-gnu/lib/python3.13/multiprocessing/resource_tracker.py:400: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown: {'/loky-1699436-04szyovz', '/loky-1698691-ojkorpx5', '/loky-1698769-4jytqyaa', '/loky-1699501-a_51pr46'}
  warnings.warn(
 

BAAI/bge-m3
The initial GPU memory usage was ~17 GB on both GPUs. Initial CPU memory usage was 64 GB; both GPUs were more or less at 100%.
At 20% / 3.5h it used 156 GB of CPU RAM, and GPU RAM was at about 34 GB.

17:34: RAM at 248 GB and GPUs at 34 and 46 GB
  0   N/A  N/A         2054456      C   ...nnif/dev3/venv/bin/python3.13      34534MiB |
  1   N/A  N/A         2054541      C   ...nnif/dev3/venv/bin/python3.13      43636MiB |

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.14 GiB. GPU 1 has a total capacity of 47.37 GiB of which 8.06 GiB is free. Process 2053589 has 2.55 GiB memory in use. Including non-PyTorch memory, this process has 36.58 GiB memory in use. Of the allocated memory 26.03 GiB is allocated by PyTorch, and 10.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

So this behaves similarly to the gemma embedding: it consumes more and more memory over time. While with gemma it seemed to rise at a steady rate, with this embedding it "jumps" (this might be normal, but I wanted to share as much information as possible).

Memory on GPU 1 jumps about 7 GB:
11:56:33,136722,27975,33761
11:56:43,136849,27975,33761
11:56:53,136833,27975,33761
11:57:03,136958,34553,33761
11:57:13,136939,34553,33761
11:57:23,136989,34553,33761

Memory on GPU 2 jumps about 12 GB:
14:40:51,196196,34557,33761
14:41:01,196283,34557,33761
14:41:11,196302,34557,33761
14:41:21,196350,34557,46437
14:41:31,196386,34557,46437
14:41:41,196451,34557,46437


Ensemble
I changed the hyperopt values to your suggestion: sources=omikuji-50000:0.66,mllm-50000:0.22,ebm-jina-50000:0.22 (well, yes, bear with me) and re-evaluated it, with the result of 58.70% for the F1 value. This is a 0.77% improvement compared to the ensemble without the ebm backend.

To me this raises the question of what I did wrong when hyperoptimizing it (annif hyperopt -m "F1 score (doc avg)").

Checking the log file of the hyperoptimization, I noticed that the F1 score is at 12% and more or less does not change across iterations.
Trial 0 finished with value: 0.12111927369952892 and parameters: {'omikuji-50000': 0.1702433045478593, 'mllm-50000': 0.5739916202122114, 'ebm-jina-50000': 0.8771946834004369}. Best is trial 0 with value: 0.12111927369952892.
Trial 1 finished with value: 0.12140212308788588 and parameters: {'omikuji-50000': 0.08956166577644809, 'mllm-50000': 0.4316058957749992, 'ebm-jina-50000': 0.21727099007099815}. Best is trial 1 with value: 0.12140212308788588.
Trial 2 finished with value: 0.12162202446879701 and parameters: {'omikuji-50000': 0.5352265942786502, 'mllm-50000': 0.5575117250833155, 'ebm-jina-50000': 0.44036639869444383}. Best is trial 2 with value: 0.12162202446879701.
..
Trial 9 finished with value: 0.12142314843806357 and parameters: {'omikuji-50000': 0.888701481126246, 'mllm-50000': 0.04349120357934089, 'ebm-jina-50000': 0.7406862202497109}. Best is trial 2 with value: 0.12162202446879701.
Got best F1 score (doc avg) score 0.1216 with:
---
sources=omikuji-50000:0.3491,mllm-50000:0.3636,ebm-jina-50000:0.2872
---


It would be nice to have some kind of overall progress information for training and evaluation. I guess this is currently not supported by Annif.

Best regards
Sven

Osma Suominen

Mar 20, 2026, 8:42:22 AM
to annif...@googlegroups.com
Dear Sven,

thank you for your thorough experimentation!

Regarding what went wrong with annif hyperopt when it seemed to arrive at a suboptimal solution:

You said you used the command:

annif hyperopt -m "F1 score (doc avg)"

There are two issues with this:

1. You tried to optimize for F1 score, which is not a very stable
metric. I recommend that you leave the metric at the default, which is NDCG.

2. The default number of hyperopt trials is only 10. With such a low
number of trials, you have to be lucky to find a good set of weights. I
suggest trying a much larger number of trials, at least 100 but 200-300
would be better (e.g. --trials 300). The initial 10 trials is just for
verifying that the process itself is working.

You can also set e.g. --jobs 8 to spread the load into multiple CPU
cores, which should speed up the process.
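Osma's point about trial counts can be illustrated with a toy random search (this is not Annif's actual Optuna-based hyperopt; the objective and its peak at weights (0.66, 0.17, 0.17) are invented for illustration):

```python
import random

# Toy random search over three normalized ensemble weights. The score peaks
# at an invented optimum; with a fixed seed, a longer run is a superset of a
# shorter one, so more trials can only improve the best score found.
def best_of(n_trials):
    random.seed(0)
    best = 0.0
    for _ in range(n_trials):
        w = [random.random() for _ in range(3)]
        total = sum(w)
        w = [x / total for x in w]
        score = 1.0 - abs(w[0] - 0.66) - abs(w[1] - 0.17) - abs(w[2] - 0.17)
        best = max(best, score)
    return best

print(best_of(10) <= best_of(300))  # True: 300 trials search strictly more
```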

Best,
Osma


'Sven Sass' via Annif Users kirjoitti 18.3.2026 klo 14.25:
> Hello Maximilian,
>
> tl;dr:
> a.) I wanted to double check that training with 600k document fails and
> I'm afraid this is indeed the case for both e5 and m3 embeddings.
> b.) I was able to improve F1-score with manual weights on ebm-ensemble
> by 0.77%
>
> More details:
>
> *_intfloat/multilingual-e5-small_*
> *_BAAI/bge-m3_*
> *_Ensemble_*
> *_Training-set_*
> As for the amount of training documents - I had the same question
> for Osma and he also told me that 600k is far too much. As for
> omikuji my measurement is: more is better. As mentioned in my
> previous post the mllm improved with fewer training documents. I'm
> still struggling to understand how 1.000 documents can be sufficient
> if I have around 7.000 labels? This would mean I would not have an
> example for every label (best case: only one example for 1.000 labels).
>
> If I do find the time I'll look into the max_chunk_count and
> max_chunk_length. My gut feeling tells me, this won't help with 600k
> documents anyway (disregarding if this is a reasonable approach in
> the first place). Batch size was at 32 for most of the tests. I did
> increase it to 300 in some cases - but all of those worked in the
> end. I did monitor that GPU-load is not at 100% - it is more at
> 50%-70%. However it was at constant 100% with the x-transformer
> backend if I remember correctly and it surely was the case when I
> was generating embeddings with llamaindex. Maybe there is some room
> for performance optimization which I did not find in my evaluations.
> But I surely like how seemless it uses both GPUs at the same time,
> this was not the case with the x-transformer.
>
> *_Ensembles_*
> I did use hyperopt to calculate weights: "annif hyperopt -m "F1
> score (doc avg)" ensemble-ebm-50000"
> Our expectations of the weights match though. I expected other
> weights, but had no indication that hyperopt did not produce valid
> results. I'll give it a shot with "manual weighting" to see if it
> improves the result.
>
> As my previous tests did not show improvements for nn-ensemble I did
> not test it. I agree, revisiting this after the new nn-ensemble is
> finished does make sense.
>
>
> Here is some more detailed information about my issues during
> training an devaluation.
>
> *google/embeddinggemma-300m*
> While training 600.000 articles it was obvious after 2% that it was
> using up the GPU memory fast and at a steady pace. Assuming the
> ratio remains the same it would have taken about 1.400 GB for 100%
> so. In a later retest with 50.000 documents it stop with "cuda out
> of memory".
>
>
> *_jinaai/jina-embeddings-v3_*
> *_intfloat/multilingual-e5-small_*
> No issues with 50,000, it took 18h for training
>
> Started training process with 600k documents to see what happens.
> I'm not 100% sure if I tested this, at least it is not documented so
> I'm training it again.
>
>
> *_BAAI/bge-m3_*
> No issues with 50,000, it took 21h for training
>
> Will start training process with 600k documents to see what happens.
> Again: I do think I've tested this, but have not documented it.
>
> *_nvidia/llama-embed-nemotron-8b_*
> "Out of cuda memory" more or less at startup
>
> *_Qwen/Qwen3-Embedding-8B_*
> "Out of cuda memory" more or less at startup
>
> *_jinaai/jina-embeddings-v5-text-small_*
> Training with 50,000 documents works fine, but evaluating 100k
> documents crashes after ~18h at 190GB of CPU memory. The last
> message was
> "Backend ebm: running vector search and creating candidates with
> query_jobs: 16"
> There was barely any CPU/GPU load.
>
>
> My assumption is that any model failing with "out of CUDA memory"
> right at the beginning will not work with any amount of documents
> on my GPU. I did implement a RAG system in another context and
> successfully used Qwen/Qwen3-Embedding-8B for generating the
> embeddings - so this model does fit in my GPU (generally speaking).
>
> Side note: I'm really curious how the new V5 embedding compares to
> other local embeddings and OpenAI's "text-embedding-3-large" (in
> the RAG context I'm allowed to use cloud services).
>
>
> I'll send an update when the mentioned training tests are done -
> feel free to ask for any information!
>
> Best regards,
> Sven
>
>
> mfaka...@gmail.com wrote on Wednesday, March 11, 2026 at 3:39:45 PM UTC+1:
>
> Dear Sven,
>
> thank you for that detailed report. And also for your previous
> message:
>
> *Parameter changes between training and evaluation*
> Your request to switch some configuration parameters (like cuda
> vs cpu) between training and evaluation is very reasonable. We
> have already implemented this and released the ebm4subjects
> package in a new version. An update to the Annif backend will
> follow shortly. Thank you for putting that forward. You will
> then be able to overwrite most parameters with the Annif
> command-line arguments. You will also be able to switch the
> deployment options (from in-process for training to API for
> production).
>
> *Correct usage of jinaai/jina-embeddings-v5-text-small*
> I haven't worked with the newest Jina AI model. An earlier
> version supported asymmetric embeddings for retrieval, e.g.
> task = "retrieval.passage" (for documents) and
> "retrieval.query" (for the vocab). This fits EBM best. I
> think this is now handled with the argument "prompt_name" and
> task = "retrieval". I think setting it to task "classification"
> would not be ideal.
> *Saving resources with EBM:*
> Indeed, processing time for EBM is quite slow. The bottleneck
> is primarily the embedding generation. EBM is not a typical
> supervised learning backend in the sense that its quality
> scales with the amount of training data. What is trained is only
> the ranking model, which may be saturated by 1,000 documents or
> even less. So feeding it 600k documents for training is far more
> than needed. Processing time scales linearly with the number of
> documents for EBM.
> To *cut cost*, consider:
>   * reducing the number of training docs
>   * restricting the number of chunks per document with
> `max_chunk_count` (especially for longer documents this can be
> expensive)
>   * allowing for larger chunks: `max_chunk_length` is 50
> characters by default, which means that any chunk larger than
> that will be split after the next sentence. So usually one
> sentence is one chunk. If you choose a higher number, chunking
> will be coarser, also resulting in fewer chunks in total
>   * some models (like Jina AI's) also support "matryoshka
> embeddings", which allow you to choose a smaller embedding
> dimension. I haven't tested it myself, but this might also help
> to speed things up.
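To make the interaction of `max_chunk_length` and `max_chunk_count` concrete, here is a minimal, hypothetical sketch of greedy sentence-based chunking. This is not the actual ebm4subjects implementation (which may differ in details); it only illustrates why a larger `max_chunk_length` yields fewer, coarser chunks:

```python
import re

def chunk_text(text, max_chunk_length=50, max_chunk_count=None):
    """Greedy sentence-based chunking sketch: sentences are appended
    to the current chunk until it reaches max_chunk_length characters,
    then a new chunk is started at the next sentence boundary."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        current = f"{current} {sentence}".strip() if current else sentence
        if len(current) >= max_chunk_length:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    if max_chunk_count is not None:
        # cap the number of chunks taken from one document
        chunks = chunks[:max_chunk_count]
    return chunks

text = "Short one. Another short sentence. A third sentence follows here."
fine_chunks = chunk_text(text, max_chunk_length=10)    # one sentence per chunk
coarse_chunks = chunk_text(text, max_chunk_length=50)  # sentences merged together
```

With a low threshold every sentence becomes its own chunk; with a higher one, several sentences are merged, so fewer embeddings have to be computed per document.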
>
> I was surprised that the training crashed with so many of the
> embedding models. I would expect 48GB of VRAM to be easily enough
> for most of these models, as long as you don't set the
> batch_size to unreasonably high values (start with 32, see if it
> works, then abort and double it. Repeat until you reach your
> VRAM limit). What was the cause for "not being able to train"
> with the other models? Something like CUDA out of memory?
>
> *Including EBM in an Annif ensemble:*
> It is unfortunate that you could not improve your ensemble by
> adding EBM. From what I can tell from your setup, setting the
> weights of omikuji, mllm and ebm to values so close to equal
> puts too much emphasis on the weaker components (EBM and MLLM).
> Did you determine these weights manually or with annif optimize?
> If I had to guess parameters, I'd say omikuji:0.66 and splitting
> up the rest between EBM and MLLM.
> We are not at this point with our own ensembles at DNB. So we
> still need to find the best way to integrate EBM into an
> ensemble. Maybe you can also achieve better performance with nn-
> ensemble. But maybe you should wait until the (re-)developments
> of the nn-ensemble are finished.
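As a sketch of the weighting logic under discussion: a basic (non-nn) ensemble essentially combines the per-label scores of its sources as a weighted average, so near-equal weights let weaker backends pull strong suggestions down. The following toy code is illustrative only (not Annif's actual ensemble implementation; backend names and scores are made up):

```python
def combine_scores(suggestions, weights):
    """Weighted average of per-label scores from several backends.
    suggestions: dict backend -> dict label -> score in [0, 1].
    weights: dict backend -> weight (normalized below)."""
    total = sum(weights.values())
    combined = {}
    for backend, scores in suggestions.items():
        w = weights[backend] / total
        for label, score in scores.items():
            combined[label] = combined.get(label, 0.0) + w * score
    return combined

suggestions = {
    "omikuji": {"economy": 0.9, "trade": 0.4},
    "mllm":    {"economy": 0.5},
    "ebm":     {"trade": 0.8, "finance": 0.6},
}
# heavy weight on the strongest backend, rest split between EBM and MLLM
weights = {"omikuji": 0.66, "mllm": 0.17, "ebm": 0.17}
combined = combine_scores(suggestions, weights)
```

A label suggested only by a low-weight backend (here "finance") ends up with a small combined score, which is exactly the intended effect of down-weighting the weaker components.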
>
> Thank you again for your feedback. This is very valuable.
> Especially in the current phase before the first release, it is
> very helpful to have early testers! Please don't hesitate to
> report any other issues you might have.
>
> Best,
> Maximilian
>
> j3s...@googlemail.com wrote on Tuesday, March 10, 2026 at
> 10:29:38 AM UTC+1:
>
> Hello Maximilian/all,
>
> my evaluation is finished and here are my findings.
>
>
> *_Prerequisites_*:
> My dataset is quite large: I do have around 700.000 short
> text documents of which 600.000 are used for training and
> 100.000 for evaluation with around 7.000 labels. I'm using
> 256GB RAM (+256 swap) and have two ADA 6000 (48GB VRAM each).
>
> I did evaluate basically all available backends before to find
> out the best combination of backends in my case.
> "Best" is here defined as the best "F1 score (doc avg)". The
> champion is this ensemble (a basic ensemble, not nn):
> - Omikuji-Attention*0.4624,
> - Xtransformer*0.4206,
> - MLLM*0.1170
> With a limit of 15 and a threshold of 0.15 the F1 value is
> at 69.44%.
>
> The task was to evaluate if I can add the EBM-backend to
> improve this result.
>
>
> *_Installation_*
> As this backend is not yet in the main version of Annif, the
> installation steps can be found here:
> https://github.com/NatLibFi/Annif/issues/936
>
> Please note: Maximilian pointed out that using uv is easier:
> uv sync --extra ebm-in-process # or --extra ebm-api
>
>
> *_Training_*
> *_Evaluation_*
> See https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html
> "for more details about differences between saving model
> and serializing."
>
> As to my understanding of the linked page (and ChatGPT),
> a trained model does not store GPU information and it
> should be possible to run it on another GPU.
>
>
> Sven Sass wrote on Friday, February 20, 2026 at
> 07:54:18 UTC+1:
>
> Hello Maximilian,
>
> I posted an issue report here:
> https://github.com/NatLibFi/Annif/issues/936
>


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Maximilian Kähler

unread,
Mar 23, 2026, 4:23:06 AM
to Annif Users
Dear Sven,

thank you for your reports. First, let me try to explain a little bit how EBM is supposed to work:

Phase A: Chunking: A document is chunked into pieces. 
Phase B: Embedding: For each piece embeddings are computed (which is the long GPU process happening). 
Phase C: Vector Search: The embeddings are compared, chunk-wise, to embeddings of the vocab (this is the vector search phase that runs on CPU). For every chunk this will generate a list of candidate labels along with some statistics: e.g. cosine-similarity of matched labels and position of the chunk in the text. 
Phase D: Aggregation: Chunk-wise statistics are aggregated by label and document, producing statistics like frequency of occurrence, sum of cosine similarities, spread over the document, etc. 
Phase E: Training of the ranker model: A small ML model is trained to re-rank suggestions based on the aggregate statistics of Phase D. This model deals only with the very small set of features extracted in Phase D. There is no label-specific training. Phase E is the only "learning" step and should be saturated by ~1,000-10,000 documents. 
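A rough code sketch of the Phase C/D idea may help: given chunk-level candidate matches from the vector search, per-label features are aggregated for the ranker. All names and the exact feature set below are illustrative assumptions, not the actual ebm4subjects code:

```python
from collections import defaultdict

def aggregate_candidates(chunk_matches):
    """chunk_matches: list of (chunk_position, label, cosine_similarity)
    tuples from the vector search (Phase C). Returns per-label feature
    dicts such as occurrence frequency and summed similarity (Phase D),
    which a small ranker model can then score (Phase E)."""
    feats = defaultdict(lambda: {"freq": 0, "sim_sum": 0.0, "positions": []})
    for pos, label, sim in chunk_matches:
        f = feats[label]
        f["freq"] += 1          # how often the label was matched
        f["sim_sum"] += sim     # summed cosine similarity
        f["positions"].append(pos)
    for f in feats.values():
        # spread of the label's matches over the document
        f["spread"] = max(f["positions"]) - min(f["positions"])
    return dict(feats)

matches = [(0, "economy", 0.81), (2, "economy", 0.77), (1, "trade", 0.65)]
features = aggregate_candidates(matches)
```

Because the ranker only ever sees this handful of label-agnostic features, it saturates with few training documents, which is why 600k documents bring no benefit.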

Having said that, I can only repeat that feeding EBM with 600k documents is far too much and there is no benefit in training the ranker on so many documents. You may wonder how the backend can handle 7000 labels: generation of label candidates happens in Phase C and is mainly based on cosine similarity of your pre-trained embeddings. So the "magical" power to recognise a candidate label comes with the embedding similarity. The ranker model only helps in promoting matches that occur more frequently. 

I suspect that the high RAM usage in the first of your experiments (intfloat/multilingual-e5-small) is indeed caused by the large number of documents in your training batch making Phases C and D very time- and resource-consuming. Eventually the process may have run out of memory and crashed, leaving some of the parallel processes unattended. The solution really is not to train on such a large training corpus!

The second experiment with the BAAI/bge-m3 crashes in Phase B with a Cuda-Out-of-Memory error. The process allocates GPU-memory for the model itself as well as for the input chunks to be processed. If the model fits your GPU initially, this means that during the document processing there was a batch of documents with longer chunks, causing more memory consumption and eventually overstretching the limits. So I suspect this is a failure caused by specific input documents. Juho posted an example of a mal-formatted document causing a similar issue here: https://github.com/NatLibFi/Annif/pull/914 (one of the latest posts). 
This can be remedied by a smaller batch size and/or shorter chunks. Maybe we can come up with something to pre-filter such malformed chunks. 
If you have the means to deploy your embedding models via some (self-hosted) API (e.g. Ollama, vLLM, llama.cpp), you can also pass the API endpoint to the EBM backend as explained in the wiki draft. This would prevent CUDA-out-of-memory issues during batch processing, because inference engines like Ollama or vLLM have smarter handling of the workload than EBM does.

The possibility to overwrite training parameters is now implemented in the branch, thanks to my colleague Clemens. So you can now switch deployment configuration between training and inference time. 
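As an illustration (the parameter name `device` is an assumption here; see the wiki draft for the actual parameter names), overriding a backend parameter at evaluation time with Annif's `--backend-param` option could look like this:

```shell
# Hypothetical sketch: train on the GPU, then evaluate with the
# (assumed) device parameter overridden to CPU at evaluation time.
annif train my-ebm-project /path/to/train-docs/
annif eval --backend-param ebm.device=cpu my-ebm-project /path/to/eval-docs/
```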

Thanks again for reporting your problems. Please, feel free to report anything else. 

Best,

Maximilian 

Sven Sass

unread,
Mar 24, 2026, 7:39:09 AM
to Annif Users
Hello Maximilian and Osma,

thank your for your replies and support.

@Osma: I'm leaving out the -m parameter for future trials and will increase the number of trials. Thank you for the hint!

@Maximilian: Thank you for the insight into how your backend works. Is it correct that for the backend to work best, it is better to have an example for every label rather than as many examples as possible, so that in Phase C there is at least one example for each label? For my training set I should reduce it even further, to maybe 10,000 documents, and have all my 7,000 labels in there at least once, right? I assume this approach works best with long labels like "Bundesministerium des Innern und für Heimat" but maybe not so well with "A 10" (depending on how the embedding was trained).

And understood: won't train that many documents anymore ;)

I did check the training data for file size. The largest document was around 120k which I honestly did not expect. Will have a closer look at this if I stumble over memory issues again - for now I did not try to debug/change log level to find out the source of the issue.

Just to sum it up for others who may be reading this:
- I was able to improve an ensemble consisting of omikuji and mllm by adding ebm4subjects to it
- I did not properly hyperoptimize the ensemble (use more trials), so the improvement could be even bigger than the one I measured (I assume I had more trials when comparing the x-transformer backend)
- Memory and time issues can be overcome by choosing the right training set

Or in short: give it a shot with your data if you are trying to figure out the best ensemble for you

Unfortunately my project is finished for the moment and I no longer have access to the training data, but I will hopefully be able to pick this up later and re-measure the backend with an improved training set and a properly optimized (nn-?)ensemble. I guess by that time both ebm4subjects and the nn-ensemble implementation will be in main.


A little off-topic: In the early stage of this project I drafted a RAG-based labelling mechanism outside of Annif. In short: when suggesting labels for a document, the labels of the nearest documents in the vector DB are chosen (some weighting/normalization is done, but it is really quite straightforward). I observed an F1 score of around 43.57% back then - but I no longer know with what training set.
Disclaimer: I know this approach only has a chance if you have a lot of training data.
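The nearest-neighbour label voting described above can be sketched roughly like this (a toy version using plain lists instead of a vector DB; all names are illustrative and the real weighting/normalization may differ):

```python
import math
from collections import defaultdict

def knn_label_scores(query_vec, indexed_docs, k=3):
    """indexed_docs: list of (embedding, labels) pairs. Scores each
    label by the summed cosine similarity of the k nearest documents
    that carry it, then normalizes scores so the best label gets 1.0."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # in a real system the vector DB performs this nearest-neighbour search
    ranked = sorted(indexed_docs, key=lambda d: cosine(query_vec, d[0]),
                    reverse=True)[:k]
    scores = defaultdict(float)
    for vec, labels in ranked:
        sim = cosine(query_vec, vec)
        for label in labels:
            scores[label] += sim
    top = max(scores.values(), default=1.0)
    return {label: s / top for label, s in scores.items()}

docs = [
    ([1.0, 0.0], ["economy"]),
    ([0.9, 0.1], ["economy", "trade"]),
    ([0.0, 1.0], ["sports"]),
]
scores = knn_label_scores([1.0, 0.05], docs, k=2)
```

Labels of documents outside the k nearest neighbours never appear in the result, which is why this approach needs a large, well-covered training corpus.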

As I wanted to test the Jina V5 embedding anyway, I created a "prototype" Annif "rag-ebm" backend (read: quick'n'dirty hack). To get a jump start I took your ebm4subjects backend as a base, so I did not have to bother with choosing a vector DB and other things you had already added. I really do hope this is OK for you - I'm not planning to use this code - I just wanted to see how well this approach works and will probably delete it, as I did with my first approach.

To figure out which embedding works best I tested with 50,000 documents (F1 scores, doc average):
jina-embeddings-v5-text-small: 47.58%
multilingual-e5-large: 49.02%
Qwen3-Embedding-8B, llama-embed-nemotron-8b and embeddinggemma-300m all failed with CUDA out of memory.
For the same amount of documents, omikuji was at 56.52% and mllm at 39.82%.

With the full training set the F1 scores improved for omikuji to 63.59% and rag-ebm (with e5-large) to 58.14%. MLLM stayed more or less the same (39.97%).

The ensemble of omikuji and mllm (omikuji:0.9002,mllm:0.0998) was at 63.84% - adding rag-ebm (omikuji:0.2352,mllm,rag:0.7154) with e5-large embedding improved it to 64.49% (NDCG = 85.11%). I also tried manual weights of .42,.42 for omikuji/rag and .16 for mllm (just to be sure, because .72 seems a little bit high), but the result is not better.

So overall I had a decent result, but was not able to improve my current champion (omikuji, mllm, x-transformer). 

I guess using the approach of Phase C in ebm4subjects could improve the results further: currently all labels are used for every chunk of the document. Please also note that I experienced memory issues when writing the embeddings to the vector DB. The likely cause was the embeddings.tolist() call, which had to be refactored into a loop that inserts in batches. For testing I switched to the Milvus vector DB, which also led to a decent speed-up (GPU version).
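The fix mentioned above — replacing one huge `embeddings.tolist()` materialization with batched inserts — can be sketched like this (a hypothetical sketch, not the actual code; the `insert` callback stands in for whatever the vector DB client offers):

```python
def insert_in_batches(embeddings, insert, batch_size=1000):
    """Instead of converting the whole embedding matrix to one giant
    Python list, slice it and convert/insert one batch at a time so
    peak memory stays bounded by the batch size."""
    n = len(embeddings)
    inserted = 0
    for start in range(0, n, batch_size):
        batch = embeddings[start:start + batch_size]
        # .tolist() on a small slice only (covers numpy arrays;
        # plain nested lists are passed through unchanged)
        rows = batch.tolist() if hasattr(batch, "tolist") else list(batch)
        insert(rows)
        inserted += len(rows)
    return inserted

# toy usage: collect the batches instead of inserting into a real DB
received = []
count = insert_in_batches(list(range(2500)), received.append, batch_size=1000)
```

With a real client, `insert` would be the DB's bulk-insert call; only one batch worth of Python objects exists at any time.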

Thank you all again for your support!

Best regards,
Sven