CPU speed optimisations


Transcrobes Project

May 25, 2021, 7:14:15 AM
to marian-nmt
I might be wanting something impossible, but I have VPSes with decent amounts of RAM, only CPU, and not particularly fast CPUs at that. I am playing around with the Tatoeba ZH -> EN model and have modified the Opus-MT Docker image to get it all running (as a marian-server) on one of the 16GB-RAM VPSes I have for testing. I have been trying to play around with --workspace and --cpu-threads, but neither seems to have any effect. I am getting about 8 Chinese characters per second, which means long sentences (60+ characters) take far too long.

The problematic use case is translating single sentences as quickly as possible. The batching use cases can afford to be a little slow, but this one would ideally be in the sub-second range per sentence.

Am I simply dreaming to think this could be done in the sub 1-2s range without throwing some beastie GPUs at it?

Cheers,
A

Marcin Junczys-Dowmunt

May 25, 2021, 11:31:07 AM
to maria...@googlegroups.com

Hi,

What models are those, in terms of size, embedding dimensions, layers, etc.? It will all depend on the model; if things are pre-trained and large, there may not be much that can be done.

Ah, but also remember to reduce the beam size; the default is 12. Set --beam-size 4 or smaller.
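For example, with the Opus-MT Docker setup that would be roughly (the config path is a placeholder):

    ./marian-server -p 8080 -c decoder.yml --beam-size 4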


Transcrobes Project

May 25, 2021, 8:57:21 PM
to marian-nmt

> What models are those, in terms of size, embedding dimensions, layers, etc.? It will all depend on the model; if things are pre-trained and large, there may not be much that can be done.

https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/zho-eng (https://object.pouta.csc.fi/Tatoeba-MT-models/zho-eng/opus-2020-07-17.zip). My understanding is that this means they are pre-trained, but unfortunately I have no idea whether the model is considered large or not, being a total n00b in this area. I'll try to work out whether the data used for training are public (I guess they are), in which case I might be able to train something a little lighter.
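For reference, Marian stores a model's hyperparameters inside the .npz itself under the key special:model.yml, so a one-liner along these lines should dump the embedding size and layer counts (the .npz filename is a placeholder for whatever the zip contains):

    python3 -c "import numpy as np; m = np.load('model.npz'); print(m['special:model.yml'].tobytes().decode('utf-8').strip('\x00'))"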

> Ah, but also remember to reduce the beam size; the default is 12. Set --beam-size 4 or smaller.

Actually, it looks like the default config for the Opus-MT example already reduces that to 6; the default params are:

--allow-unk -b 6 --mini-batch 64 --normalize 0.6 --maxi-batch-sort src --maxi-batch 100 

to which I added

--cpu-threads 8 --workspace 2000

which seemed to have no effect. Am I correct in assuming that the batch-related config is not going to (positively or negatively) affect my use case of single sentences (typically up to around a few dozen Chinese characters)?
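For reference, the full invocation I am testing is then roughly (the config path is a placeholder):

    ./marian-server -p 8080 -c decoder.yml \
        --allow-unk -b 6 --mini-batch 64 --normalize 0.6 \
        --maxi-batch-sort src --maxi-batch 100 \
        --cpu-threads 8 --workspace 2000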

I just realised there are two more recent models at the GitHub link above, so I will give those a try. The most recent is of type "transformer-align" rather than plain "transformer", so it might have different performance characteristics. I was hoping not to have to get into the details of NMT for my PhD project, but it looks like that was wishful thinking! Any other suggestions are welcome, and thanks for your help!

Transcrobes Project

May 25, 2021, 11:43:05 PM
to marian-nmt

> Ah, but also remember to reduce the beam size; the default is 12. Set --beam-size 4 or smaller.

Reducing --beam-size to 4 didn't seem to have much effect, but reducing it to 1 got my ~60-character test sentence down to 2.5s, and the translation for that sentence (at least with the Tatoeba 2021-04-30 transformer+bt model) didn't change. Any suggestions for further optimisations would be most appreciated, but that is actually getting pretty close to acceptable.

Thanks again.

Hieu Hoang

May 26, 2021, 12:09:17 AM
to maria...@googlegroups.com
With beam=1, you can use
    --skip-cost
You might also want to try
    --output-approx-knn 100 1024
This may help, but it can reduce translation quality.
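Combined with a small beam, that would look roughly like this (the config path is a placeholder):

    ./marian-server -c decoder.yml --cpu-threads 8 \
        --beam-size 1 --skip-cost \
        --output-approx-knn 100 1024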



Transcrobes Project

May 26, 2021, 1:48:00 AM
to marian-nmt
Thanks. --output-approx-knn 100 1024 had a pretty dramatic (negative) effect on accuracy, and the speed bump from --skip-cost seemed relatively modest. It seems I can only get single 60-character sentences down to around 2s.

In terms of compilation, I have no idea how this works, but the laptop I am building on (in a Hyper-V VM) is Intel and the server (in a KVM VM) is AMD. I saw a suggestion for OpenNMT that it is better *not* to use Intel MKL on AMD CPUs and that another library should be used instead. I saw somewhere else that there is a compile-time (or runtime?) option for Intel MKL that can make it perform better on AMD CPUs. Would the difference be negligible anyway? Am I talking rubbish, or is this worth looking at?

Thanks again, A

Transcrobes Project

May 26, 2021, 5:44:15 AM
to marian-nmt
OK, so my testing isn't particularly scientific, but I definitely seem to be getting a significant reduction in execution time (~30%, down to ~1.5s for my test sentence) when I set the runtime environment variable MKL_DEBUG_CPU_TYPE=5, which is the unofficial hack I had seen. I also tried the default Debian stable OpenBLAS (0.3.5), but it was significantly slower than Intel MKL, even though several sources seemed to suggest that OpenBLAS was better than Intel MKL on AMD.
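Concretely, that just means launching the server with the variable set, e.g.:

    MKL_DEBUG_CPU_TYPE=5 ./marian-server -c decoder.yml \
        --cpu-threads 8 --beam-size 1 --skip-cost

(Note that newer MKL releases have reportedly removed this override, so it may only work with older MKL versions.)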

Has anyone had any success compiling a recent OpenBLAS with AMD-specific configuration? Is it worth trying the latest (0.3.15), or is it almost certain to be worse than Intel MKL, even on AMD?
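For reference, the CPU-only build I am doing is roughly the standard Marian recipe, with MKL (or another BLAS) picked up automatically by CMake if installed:

    cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=off -DCMAKE_BUILD_TYPE=Release
    make -j8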

Thanks!

Roman Grundkiewicz

May 26, 2021, 8:10:47 AM
to marian-nmt
As a side note, if you are looking for fast CPU-optimized NMT models trained with Marian, some are available from https://github.com/browsermt/students. That repo also includes a training recipe if you have the data and resources.
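If you want a rough timing comparison, you can pipe a single sentence through marian-decoder, e.g. (file names are placeholders, and the input needs whatever SentencePiece preprocessing a given model expects):

    time echo "这是一个测试。" | ./marian-decoder -c decoder.yml \
        --cpu-threads 1 --beam-size 1 --skip-cost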