X-Transformer backend in nn-ensemble


Sven Sass

Jul 5, 2025, 4:15:19 AM
to Annif Users
Hello all,

I'm experimenting with the X-Transformer backend (https://github.com/NatLibFi/Annif/pull/798) and got it working with good results (it even outperforms Omikuji on some metrics).

I was able to integrate this backend into an ensemble and I'm currently training an nn-ensemble with the Omikuji and X-Transformer backends. The training has been running for over three days now, which is a little surprising to me.

Although I have a rather large training set (~600,000 documents), training the X-Transformer backend took 2-8 hours depending on the configured transformer. For the nn-ensemble I'm using the transformer that took roughly 3 hours to train.
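
(For reference, a minimal sketch of how such an nn-ensemble might be declared in projects.cfg; the project IDs, vocab, and hyperparameter values below are placeholders, only the nn_ensemble backend name and the sources syntax follow the Annif documentation.)

    # hypothetical project IDs and values, for illustration only
    [nn-ensemble-en]
    name=NN ensemble of Omikuji and X-Transformer
    language=en
    backend=nn_ensemble
    vocab=my-vocab
    sources=omikuji-en,xtransformer-en
    nodes=100
    dropout_rate=0.2
    epochs=100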

I just wanted to check if I'm missing something, so two questions:

a.) does it even make sense to use an nn-ensemble with Omikuji/NN-Ensemble? (planning to evaluate Omikuji/NN-Ensemble/MLLM as well)

b.) For training I had to use --jobs 1, otherwise I get an "AssertionError: daemonic processes are not allowed to have children". As a result, both CPU and GPU are more or less idle. I assume that is the cause, or am I doing something wrong here?
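
(For context, the training call in question looks roughly like this; the project ID and corpus path are placeholders, while the train command and the --jobs option are regular Annif CLI usage.)

    # single worker process to avoid the daemonic-process error
    annif train nn-ensemble-en /path/to/train-corpus --jobs 1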

Thank you so much and

regards,
Sven

Sven Sass

Jul 5, 2025, 12:05:52 PM
to Annif Users
Text correction: I meant "a.) ... nn-ensemble with Omikuji/*X-Transformer*" :s

juho.i...@helsinki.fi

Jul 7, 2025, 4:55:28 AM
to Annif Users
Hi Sven,

Good to hear about your success with the Xtransformer!

We usually train NN ensemble projects with smaller datasets, a few thousand documents at most; then the training should take about an hour. 600k documents is too much for training an NN ensemble.

The effectiveness of NN ensemble and a good training set size can depend on your specific use case. You could monitor how increasing the dataset size affects the evaluation metrics, and when you see that the metrics are no longer improving, stop adding more data. There is an example script for this learning-curve analysis in the Annif tutorial.
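
(Such a learning-curve run can be scripted roughly along these lines, assuming a short-text TSV training corpus and placeholder project and file names; annif train and annif eval are the actual commands.)

    # train on increasingly large subsets and evaluate each resulting model
    for n in 1000 5000 20000 100000; do
        head -n $n train-full.tsv > train-subset.tsv
        annif train my-project train-subset.tsv
        annif eval my-project validate.tsv
    done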

Please also see this Annif wiki page for an overview of the backends and their properties (still partially WIP).

Regarding your questions:

a.) Again, it is best to experiment with this: I would start with a simple ensemble, run hyperopt to find the optimal weights for the base projects, and only then try NN ensemble with the same base projects (and weights). However (spoiler alert!), for our system in the LLMs4Subjects competition we found that the NN ensemble did not improve the results: a simple ensemble consisting of Omikuji Bonsai, MLLM, and Xtransformer projects was the best setup: https://arxiv.org/abs/2504.19675
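
(In command form, that workflow might look roughly like the sketch below; the project IDs, trial count, and weights are placeholders, while annif hyperopt and the weighted sources= syntax are documented Annif features.)

    # search for good source weights for a plain ensemble project
    annif hyperopt my-ensemble validate.tsv --trials 100

    # then reuse the reported weights in the ensemble's sources= line, e.g.
    # sources=omikuji-en:0.6,xtransformer-en:0.4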

b.) Yes, the --jobs 1 requirement with Xtransformer training is a known problem, but thanks for confirming it!

Best regards,
-Juho

Sven Sass

Jul 8, 2025, 1:43:39 PM
to Annif Users
Hi Juho,

thank you for your quick response!

As I went through the tutorial, the computation times it led me to expect were far higher than those I observed. So I gave it a shot with 600,000 documents right from the get-go, and it worked really fast for all backends in the current version of Annif. Interestingly, the better-performing backends (MLLM and Omikuji) also required less computation time.

I'm still amazed at how well Annif works. I love this project.

My current evaluation step is to figure out the best-performing backends/ensembles. So for now the resources are irrelevant, but they will be considered when choosing the right combination for production use. I'm not sure a 1% better metric justifies using backends with much higher hardware requirements/consumption.

The nn-ensemble with the Omikuji (.39) and X-Transformer (.61) backends was built after hyperopt-imizing the basic ensemble and evaluating it. As the current "champion" (an nn-ensemble of Omikuji (.86) and MLLM (.14)) outperforms the same (basic) ensemble by around 3%-5% across the board, I ended up evaluating the nn-ensemble before adding MLLM as a third backend. Today the training of the nn-ensemble with the Omikuji and X-Transformer backends finished. As you expected, the overall metrics did not improve: some were better, some worse.

A short follow-up for clarification: do more documents hurt the results, or does it just take longer to train?
In our data there are some subjects with only very few occurrences. My thought process is: the more examples the better.

Bear with me here, I'm not a scientist and not familiar with the algorithms: does balancing the subjects even matter? E.g.: I have 1,000 documents, 950 of them have subject "X" and 10 of them have subject "Y". Is it best to train with all 1,000, or do I pick 10 examples of "X" and 10 examples of "Y"? I've read that Annif is used for 300,000 subjects - so, how do you train with only 10,000-ish documents?
For my data: in the 600,000 documents there are some subjects with less than 

So an Omikuji, MLLM and X-Transformer ensemble is next on the menu. Your spoiler sure sounds promising :)

Regards,
Sven

PS: Please note that "document" would better be called "article" here, since they are quite short. I'm not talking about 500-page books but rather 1-2 pages.

juho.i...@helsinki.fi

Jul 9, 2025, 8:12:02 AM
to Annif Users
Hi Sven,
Please see inline/below.
-Juho


On Tuesday, 8 July 2025 at 20:43:39 UTC+3 j3s...@googlemail.com wrote:

A short follow-up for clarification: do more documents hurt the results, or does it just take longer to train?
In our data there are some subjects with only very few occurrences. My thought process is: the more examples the better.

My view is the same as yours: "the more examples the better", but with some additions and disclaimers:
  • more examples improve results only up to some point (naturally; beyond that, training just takes longer, as you said, and in the case of Omikuji or fastText the resulting model is also bigger on disk/in memory)
  • benefiting from more examples may require adjusting some model hyperparameters (the optimal value of the min_df parameter probably varies with the training set size; see the sketch after this list)
  • more examples need to be "similar enough" to the data used for evaluation (or in production); otherwise more examples lead to worse results
  • NN ensemble may be special in some way: in our old learning-curve analysis we got the best results when using fewer than 1,000 documents (but I suspect there was some problem with the hyperparameters)
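
(To illustrate where min_df lives: a hypothetical Omikuji project entry; all values are placeholders, only the backend name and the min_df parameter come from the Annif documentation.)

    # hypothetical project entry, for illustration only
    [omikuji-en]
    name=Omikuji Parabel English
    language=en
    backend=omikuji
    vocab=my-vocab
    # minimum document frequency for a term to be kept as a feature
    min_df=3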
 
Bear with me here, I'm not a scientist and not familiar with the algorithms: does balancing the subjects even matter? E.g.: I have 1,000 documents, 950 of them have subject "X" and 10 of them have subject "Y". Is it best to train with all 1,000, or do I pick 10 examples of "X" and 10 examples of "Y"?
 
I don't think I have a good answer to this. Anyway, imbalance of that degree sounds quite extreme. Instead of dropping most of the 950 documents with subject "X", could you gather more documents that have "Y" and other subjects? And lexical algorithms do not care about the imbalance.
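
(If you want to quantify the imbalance, something like the following works on a short-text TSV corpus where the second column holds the space-separated subject URIs; the file name and corpus layout are assumptions.)

    # count how often each subject URI occurs in the training corpus
    cut -f2 train-corpus.tsv | tr ' ' '\n' | sort | uniq -c | sort -rn | head -n 20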


I've read that Annif is used for 300,000 subjects - so, how do you train with only 10,000-ish documents?

Yes, that ratio is challenging, especially when some of the subjects are very common but the majority are rare. Fortunately, the lexical algorithms (e.g. MLLM) do not require the subjects to be assigned in the training examples, as they "pick the subjects whose labels appear in the document".

Sven Sass

Jul 15, 2025, 5:22:02 AM
to Annif Users
Hello Juho,

just a short follow-up that might be interesting for everyone.

As your spoiler suggested, the hyperopt-imized ensemble with Omikuji (.46), MLLM (.12) and X-Transformer (.42) did outperform the ensemble with just Omikuji and X-Transformer, and this is now my new "champion".

The training of the nn-ensemble finished as well (again it took around a week) - but overall it did not perform better than the basic ensemble. Again: some metrics are better and some worse.


As for the amount of training data, I read your answer as "it only hurts the training time, but the result will not be worse with more training data". And it might be a good idea to check whether reducing the training amount yields the same results, thus saving resources. Will do that in future evaluations.

Thank you for your support!

Regards,
sven