Running Annif on multiple graphics cards


Ball

Feb 4, 2025, 5:06:54 PM
to Annif Users
Hi,

We have been using CPUs to run the training, but it always takes very long (~15 days) to complete the job. We are wondering whether Annif is already written in a way that allows parallel processing on multiple GPUs to speed up the training.

BTW, how much memory should we allocate to the database file when running nn-ensemble? The training document corpus is 1.6 GB, with a 33 MB vocabulary file. We upped the allocation to 32 GB but are still getting an error. Annif didn't throw an error, but the tmp file didn't get cleared out. Our programmer thought it might have failed at the final stage, when writing the resulting model back. Any idea whether we need to allocate more memory, and by how much?

Thanks,
Lucas

Osma Suominen

Feb 5, 2025, 7:58:09 AM
to annif...@googlegroups.com
Hi Lucas,

15 days for training sounds like a lot. Typical training times are
much shorter, generally at most a few hours.

Could you tell us more about your setup? What vocabulary are you
using? What is your project configuration? What kind of training data
are you using, and how much? What train command did you use? What
hardware do you have?

Annif currently doesn't support GPU computing, because none of the
backends are written to take advantage of GPUs.

It sounds like you are training the NN ensemble with a very large
training set. This is typically not necessary. You can get good results
with the NN ensemble using just a few thousand training examples. This
is because the NN ensemble just refines the suggestions coming from the
source projects; you can think of it as a kind of fine-tuning approach
instead of a bare neural network that needs to be trained from scratch.

I suggest that you train the NN ensemble with, say, ten thousand
documents instead of using several gigabytes of training data.
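
For example, if your training corpus is a one-record-per-line TSV
file, you could draw a random sample and train on that (the file
paths and project name here are only placeholders):

shuf -n 10000 /path/to/full-corpus.tsv > /path/to/nn-sample.tsv
annif train nn-ensemble-en /path/to/nn-sample.tsv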

-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Ball

Feb 5, 2025, 4:03:43 PM
to Annif Users
Hi Osma,
The vocab is the topical facet of FAST, which has 486k entries, and the document corpus is in the short text format with 3.63 million entries. Each entry in the document corpus has a title, summary (when available), table of contents (when available), and FAST URI(s). The data were extracted from our library catalog.
We are running the training on a VM with two Intel Xeon Gold 6334 CPUs (16 physical cores plus another 16 virtual cores) and 120-140 GB of dynamic memory. We restarted the job this morning with 128 GB of allocated memory.
Here is a snippet of the project config related to the nn-ensemble:

[nn-ensemble-topic-en]
name=NN ensemble English
language=en
backend=nn_ensemble
sources=tfidf-topic-en:1,fasttext-topic-en:1,mllm-topic-en:1,omikuji-parabel-topic-en:2
limit=100
vocab=fast-topic
nodes=100
dropout_rate=0.2
epochs=10
learn-epochs=1
# 137438953472 bytes = 128 GiB LMDB map size
lmdb_map_size=137438953472

I think we used the following commands for the nn-ensemble:
load vocab file: /annif-projects$ annif load-vocab fast-topic [file path to the topical facet file] --language en
train: annif train nn-ensemble-topic-en [file path to the document corpus file]

If we use 10K documents for training the nn-ensemble, do they need to be new training data, or can we extract them from the original set used to train the source models?

Thanks,
Lucas

Osma Suominen

Feb 7, 2025, 2:42:04 AM
to annif...@googlegroups.com
Hi Lucas!

Thanks a lot for the details.

Do you have separate validation and/or test sets for evaluating the
quality? I'm asking because you seem to have already trained several
base projects (tfidf, fasttext, mllm, omikuji-parabel) and then used
these as sources for the NN ensemble, with somewhat different weights.
Did you evaluate the quality of each of these separately against your
test set?

In my experience, it's best to spend some time first trying to squeeze
the best possible results (in terms of F1 score and/or nDCG) from each
individual project before combining them into ensembles. This can
involve adjusting analyzers or trying different hyperparameters (e.g.
Omikuji Parabel vs. Bonsai, or enabling 2-grams).
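
For example, you could run "annif eval" for each base project against
the same held-out test set and compare the scores (the corpus path is
a placeholder):

annif eval tfidf-topic-en /path/to/test-corpus.tsv
annif eval fasttext-topic-en /path/to/test-corpus.tsv
annif eval mllm-topic-en /path/to/test-corpus.tsv
annif eval omikuji-parabel-topic-en /path/to/test-corpus.tsv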

Once you have the base projects set up and working well, you should
first build a basic ensemble and see whether it improves results over
the base projects (usually it does, but not always). One useful step
here is to use "annif hyperopt" to optimize the weights of the
ensemble; you have clearly not done this yet, since your source
weights are just 1 or 2. If the hyperopt result gives some source
project a very low weight, you can consider dropping it altogether
(often TF-IDF isn't very useful in practice, as it's more of a toy
model). Once you have configured the basic ensemble with the
optimized weights, evaluate once more. That should be your baseline
score before moving on to advanced ensembles.
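
For example, assuming you define a basic averaging ensemble project
(here called ensemble-topic-en as a placeholder) with the same four
sources, a hyperopt run could look roughly like this; the trial count
and corpus path are just examples:

annif hyperopt ensemble-topic-en /path/to/validation-corpus.tsv --trials 100

The output reports the best weights found, which you can then copy
into the ensemble's sources= setting.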

Now that you know the baseline, you can test whether an NN ensemble
(with the same source weights) improves scores or not, and by how
much. If you have set aside a separate validation set, I would
recommend using that to train the NN ensemble, so it's genuinely
"fresh" data rather than something the source backends have already
been exposed to. But if you don't have one, training on a sample of
the already-used training records could still be better than nothing.
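
Concretely, carrying the optimized weights over just means editing
the sources setting of your NN ensemble project; the numbers below
are made-up values purely to illustrate the shape:

[nn-ensemble-topic-en]
# rest of the project config unchanged, only the weights change
sources=tfidf-topic-en:0.3,fasttext-topic-en:0.9,mllm-topic-en:1.4,omikuji-parabel-topic-en:2.4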

In my experience, the NN ensemble is most useful for correcting bias
caused by using different kinds of training data for different backends.
For example, in our YSO models we've trained the Omikuji and fastText
projects using a large amount of short text metadata records from Finna,
while MLLM has been trained on a much smaller number of longer fulltext
documents. We mostly want to apply Annif to fulltext documents. So
we've trained the NN ensemble with fulltext documents as well, and that
helps to "adapt" the ensemble for fulltext even though the source
projects were trained mostly on metadata records. In your case, I
understood that you only had one type of data (records from your library
catalog) so I'm not sure if the NN ensemble will provide much
improvement over a basic averaging ensemble. But you should try it in
order to find out! You can even train it first with a small number of
records (maybe a thousand or two) and then evaluate. If it helped, try
training with more records. An incremental approach is usually much
better than charging headfirst in an unknown direction!
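
As a sketch of that incremental loop (with placeholder file paths):
train on a small random sample, evaluate, and only then scale up:

shuf -n 2000 /path/to/full-corpus.tsv > /path/to/nn-sample-2k.tsv
annif train nn-ensemble-topic-en /path/to/nn-sample-2k.tsv
annif eval nn-ensemble-topic-en /path/to/test-corpus.tsv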

Hope this helps,

Osma