Xtransformer backend - Out of memory error


Ball

Oct 20, 2025, 5:02:49 PM
to Annif Users
Hi all,
We have been trying to run Xtransformer on a VM but kept running into an out-of-memory error at the very last stage of training. We raised the VM spec to 48 CPU cores and 700 GB of system memory and finally got a model file created for a corpus in short text format with 465K entries (a 447 MB TSV file) using 1 epoch. We then tried 2 epochs and ran into the out-of-memory issue again. We monitored htop, and throughout training, up until the very last step, only about 70 GB was used. Then the memory usage crept up slowly until the kernel threw an out-of-memory error and killed the job.

Following is the log we got when running 3 epochs (the default setting):
| [   3/   3][ 87100/ 87243] | 28937/29081 batches | ms/batch 2467.5658 | train_loss 6.180879e-02 | lr 1.639100e-07
INFO:pecos.xmc.xtransformer.matcher:| [   3/   3][ 87100/ 87243] | 28937/29081 batches | ms/batch 2467.5658 | train_loss 6.180879e-02 | lr 1.639100e-07
| [   3/   3][ 87200/ 87243] | 29037/29081 batches | ms/batch 2484.6621 | train_loss 6.929326e-02 | lr 4.928762e-08
INFO:pecos.xmc.xtransformer.matcher:| [   3/   3][ 87200/ 87243] | 29037/29081 batches | ms/batch 2484.6621 | train_loss 6.929326e-02 | lr 4.928762e-08
Reload the best checkpoint from /tmp/tmp2plp2fym
INFO:pecos.xmc.xtransformer.matcher:Reload the best checkpoint from /tmp/tmp2plp2fym
Predict on input text tensors(torch.Size([465287, 256])) in OVA mode
INFO:pecos.xmc.xtransformer.matcher:Predict on input text tensors(torch.Size([465287, 256])) in OVA mode
./train-all-the-things-xtransformer.sh: line 26:   354 Killed                  annif train $projectid "${!CORPUSVAR

Out-of-memory error from the kernel log:
Sep 29 13:44:51 workhorse kernel: [422619.277102] annif invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
Sep 29 13:44:52 workhorse kernel: [422619.277152]  oom_kill_process.cold+0xb/0x10
Sep 29 13:44:52 workhorse kernel: [422619.277451] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Sep 29 13:44:52 workhorse kernel: [422619.277677] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-c9719ab0ee4d90d99910b4ef3ec4d76bd244681b359ee988ff93a69716b8fc00.scope,mems_allowed=0-1,global_oom,task_memcg=/system.slice/docker-c9719ab0ee4d90d99910b4ef3ec4d76bd244681b359ee988ff93a69716b8fc00.scope,task=annif,pid=81263,uid=998
Sep 29 13:44:52 workhorse kernel: [422619.278041] Out of memory: Killed process 81263 (annif) total-vm:220463416kB, anon-rss:205914796kB, file-rss:0kB, shmem-rss:108kB, UID:998 pgtables:408188kB oom_score_adj:0
Sep 29 13:45:00 workhorse kernel: [422629.294310] oom_reaper: reaped process 81263 (annif), now anon-rss:0kB, file-rss:0kB, shmem-rss:108kB

We are puzzled because 700 GB of RAM is already a lot of allocated resources, but it is still not enough. Did we do something wrong? Have others run into similar out-of-memory issues when running Xtransformer?

Thanks,
Lucas

Kähler, Maximilian

Oct 21, 2025, 7:37:06 AM
to Ball, Annif Users

Dear Lucas,

 

Can you share some details of what hyperparameters you used to train the backend? What was the size of your vocabulary? Did you use GPUs for training?

 

Generally, out-of-memory errors in deep learning can be caused by batch sizes that are too large. You may want to try decreasing yours.
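For illustration only, that is a single project setting in Annif; the section name and the value here are made up for the example, not a recommendation for your data:

[xtransformer-example]
backend=xtransformer
# example value only; smaller batches reduce peak memory per training step
batch_size=8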

 

In our work with X-Transformer we found it quite challenging to find hyperparameters that work well. Here is what we used for a large German vocabulary (~200K concepts) and a short-document corpus of 900K entries.

 

[x-transformer-roberta]
name="X-Transformer Roberta-XML"
language=de
backend=xtransformer
analyzer=spacy(de_core_news_lg)
batch_size=32
vocab=gnd
limit=100
min_df=2
ngram=2
max_leaf_size=400
nr_splits=256
Cn=0.52
Cp=5.33
cost_sensitive_ranker=true
threshold=0.015
max_active_matching_labels=500
post_processor=l3-hinge
negative_sampling=tfn+man
ensemble_method=concat-only
loss_function=weighted-squared-hinge
num_train_epochs=5
warmup_steps=200
logging_steps=500
save_steps=500
model_shortcut=FacebookAI/xlm-roberta-base

 

Maybe you can compare this with your settings. None of our experiments needed more than 128 GB of system (CPU-)RAM. But the neural matcher part of the script was running on the GPUs, and I have no record of the GPU-memory footprint.

 

Best of luck,

 

Maximilian

 

Maximilian Kähler
Deutsche Nationalbibliothek
Automatische Erschließungsverfahren, Netzpublikationen
Deutscher Platz 1
D-04103 Leipzig
Phone: +49 341 2271-133
mailto:m.ka...@dnb.de
https://www.dnb.de/ki-projekt


Ball

Oct 21, 2025, 9:57:44 AM
to Annif Users
Hi Maximilian,

We don't have a GPU on the VM, so it is training on CPU only. Also, we are training the event, geographic, and topical facets of the FAST vocabulary separately. There was no issue with the event facet, since both its vocabulary and corpus are small, but we started running into issues with the geographic facet. The vocabulary file for the geographic facet is 8.5 MB (131,719 concepts), and we are only using the TSV format, not SKOS.
Since we have just started the trial, we basically copied the parameters from one of the comments on the GitHub pull request https://github.com/NatLibFi/Annif/pull/798#issuecomment-2373754962. Here are our settings:
 
[xtransformer-geo-en]
name=XTransformer English
backend=xtransformer
analyzer=simplemma(en)
language=en
vocab=fast-geographic
batch_size=16
truncate_length=256
learning_rate=0.0001
num_train_epochs=3
max_leaf_size=18000
model_shortcut=distilbert/distilbert-base-uncased

Failures always happen at the very end of the job. Our programmers suspect the failure occurs when the process starts saving the model file. We see the following line in the log only in the successful 1-epoch job:
| **** saving model (avg_prec=0) to /tmp/tmpc7dvrz98 at global_step 29000 ****
INFO:pecos.xmc.xtransformer.matcher:| **** saving model (avg_prec=0) to /tmp/tmpc7dvrz98 at global_step 29000 ****

Lucas

Kähler, Maximilian

Oct 22, 2025, 2:36:30 AM
to Ball, Annif Users

Dear Lucas,

 

The parameters you copied are meant for a much smaller classification problem, I think. In particular, I suspect that max_leaf_size=18000 could be related to the increased model size, but I am not totally sure. Osma and his team have recently used X-Transformer with a larger vocabulary. You can find their parameter settings here:

https://github.com/NatLibFi/Annif-LLMs4Subjects-GermEval2025/blob/main/projects.toml

Maybe try these, or the ones I suggested in my earlier message.
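As a purely illustrative starting point, untested for your FAST geographic vocabulary, you could keep your project definition as it is and only bring the leaf size down to the value we used for GND:

[xtransformer-geo-en]
name=XTransformer English
backend=xtransformer
analyzer=simplemma(en)
language=en
vocab=fast-geographic
batch_size=16
truncate_length=256
learning_rate=0.0001
num_train_epochs=3
# illustrative value only, copied from our GND configuration; tune for your data
max_leaf_size=400
model_shortcut=distilbert/distilbert-base-uncased

The idea is that a smaller leaf size keeps each one-vs-all subproblem, and with it the saved model, smaller, though I have not verified this for your setup.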

 

Best,

 

Maximilian
