How to disable Sentencepiece integration when pre-compiled

Eleftherios Avramidis

May 4, 2021, 11:39:57 AM

to marian-nmt

One of the Marian examples shows how to compile Marian with Sentencepiece integration. [1] With it, tokenization and subword segmentation run by default. Is there a way to disable that when Sentencepiece has already been compiled in? It is not really obvious from the command-line parameters...

[1] https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece

Eleftherios Avramidis

May 4, 2021, 3:59:20 PM

to marian-nmt

And one more question. The link above suggests checking that Sentencepiece has been integrated by running the command "./marian --help |& grep sentencepiece". However, this command still finds matches even if the parameter -DUSE_SENTENCEPIECE=ON is not passed while compiling Marian. Is the check command wrong, or does Sentencepiece get integrated anyway?

Eleftherios Avramidis

Jun 21, 2021, 9:40:31 AM

to marian-nmt

Hi! May I bump this question? How can one train Marian without Sentencepiece? Do I need to compile without it, or is there some command-line option to switch it off?

And if I go for compiling without Sentencepiece, how can I check that Sentencepiece is indeed not integrated?

best
Lefteris

Roman Grundkiewicz

Jun 21, 2021, 1:04:46 PM

to marian-nmt

Hi Lefteris,

SentencePiece is now compiled by default, so it actually needs to be explicitly disabled via -DUSE_SENTENCEPIECE=OFF if one wants to compile without it. In that case there will be no `--sentencepiece-*` options in `--help`.
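As a rough sketch of that check (not taken from the thread), the helper below looks for `--sentencepiece` options in a binary's `--help` output. The function name `has_sentencepiece` is hypothetical, and the build steps in the trailing comments assume a standard out-of-source CMake build of Marian.

```shell
# Report whether a Marian binary exposes --sentencepiece-* options,
# which is how the reply above suggests telling the two builds apart.
# has_sentencepiece is a made-up helper name, not part of Marian.
has_sentencepiece() {
  marian_bin="$1"
  # 2>&1 captures help text on stdout and stderr alike
  # (a portable equivalent of the "|&" used in the example README).
  if "$marian_bin" --help 2>&1 | grep -q -- '--sentencepiece'; then
    echo "built with SentencePiece"
  else
    echo "built without SentencePiece"
  fi
}

# Assumed build steps, run from the Marian source tree:
#   mkdir -p build && cd build
#   cmake .. -DUSE_SENTENCEPIECE=OFF
#   make -j4
#   has_sentencepiece ./marian
```

If the helper reports "built with SentencePiece" after such a build, a stale build directory or cached CMake option is a plausible cause worth ruling out.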

There is no option to disable SentencePiece subword segmentation (it can be disabled on the output only via `--no-spm-decode`); however, it is used only with .spm vocabs. If .yml vocabs are provided, Marian does not trigger pre-/post-processing with SentencePiece.
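The extension rule described above can be sketched as a small helper. It only mirrors the rule as stated in this thread and does not call into Marian; the function name `vocab_kind` and all file names are hypothetical. The commented training command additionally assumes that the vocabulary type is decided per file, which this thread does not confirm.

```shell
# Classify a vocabulary file by its extension, following the rule above:
# .spm vocabs trigger SentencePiece processing, .yml vocabs do not.
# vocab_kind is a made-up helper name, not part of Marian.
vocab_kind() {
  case "$1" in
    *.spm) echo "sentencepiece" ;;
    *.yml) echo "plain" ;;
    *)     echo "other" ;;
  esac
}

# Placeholder file names for a source/target pair.
for v in vocab.src.yml vocab.de.spm; do
  echo "$v: $(vocab_kind "$v")"
done

# Hypothetical training call with a pre-segmented source side and a
# SentencePiece target side (untested assumption, see the note above):
#   ./marian --train-sets corpus.src corpus.de \
#            --vocabs vocab.src.yml vocab.de.spm \
#            --model model.npz
```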

Roman

Eleftherios Avramidis

Jun 22, 2021, 2:37:42 PM

to maria...@googlegroups.com, Roman Grundkiewicz

Hi Roman

thanks for the clarification.

So if I understand your instructions properly, does one have to create a .yml file manually, or is it created automatically by Marian when it sees a different filename suffix in the command-line parameters?

If we have to create the .yml vocabs ourselves, is there any documentation on how to create them? I cannot find any.

Would this solution work if one wants to skip Sentencepiece on one side of the translation only, e.g. disable it on the source side but keep it enabled on the target side?

best
Lefteris

--
You received this message because you are subscribed to the Google Groups "marian-nmt" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marian-nmt+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/marian-nmt/6249ee0e-0a4d-4f42-b8e0-fccac149ea3bn%40googlegroups.com.

Marcin Junczys-Dowmunt

Jun 22, 2021, 3:12:52 PM

to maria...@googlegroups.com, Roman Grundkiewicz

Hi Lefteris,

Can you describe what your exact scenario is? How does the tokenization happen? What does the input to Marian look like? Is there other segmentation happening on the outside?

Marcin

Eleftherios Avramidis

Jun 23, 2021, 4:26:36 AM

to marian-nmt

Hi Marcin,

Exactly. Our source is sign language glosses, a symbolic language with no inflection, so we want to test pretokenizing/segmenting that side ourselves. The target language is German, so normal subword segmentation should take place there.