Hello,
I would like to train a masked language model (BERT/RoBERTa) in Dutch using whole word masking.
I saw in this issue that this is possible with --mask-whole-words.
For this, I downloaded the Dutch Wikipedia dump: nlwiki-20200720-pages-meta-current.xml.bz2
After extracting it, I now have several subfolders of plain-text articles separated by blank lines.
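I was planning to flatten these into single train/valid .raw files, roughly like this (the extracted/ folder, the wiki_* file pattern and the 99/1 split are just my assumptions about what the extraction produced, so please correct me if this is not what fairseq expects):
mkdir -p Dutch_Corpus_raw
# concatenate all extracted article files into one plain-text file
find extracted/ -type f -name 'wiki_*' -exec cat {} + > Dutch_Corpus_raw/all.txt
# hold out roughly the last 1% of lines as a validation split
TOTAL=$(wc -l < Dutch_Corpus_raw/all.txt)
VALID=$((TOTAL / 100))
head -n $((TOTAL - VALID)) Dutch_Corpus_raw/all.txt > Dutch_Corpus_raw/nlwiki.train.raw
tail -n $VALID Dutch_Corpus_raw/all.txt > Dutch_Corpus_raw/nlwiki.valid.raw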
My question is to make sure I follow the right steps to obtain a whole-word-masking model, i.e. one that masks and predicts whole words rather than word pieces. The goal is then to do fill-mask with the model, all of this for Dutch.
Following "
Pretraining RoBERTa using your own data", I should preprocess my text articles by encoding and binarizing them with GPT-2 BPE. For this, it seems I need to pass my own encoder and vocabulary in the parameters : --encoder-json and --vocab-bpe (this can be produced with huggingface tokenizer). Would this be all that need to be changed or am I missing something ?
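Concretely, adapting the tutorial commands, I imagine the preprocessing would look roughly like this (nl_encoder.json and nl_vocab.bpe would be the files produced by the HuggingFace tokenizer; the paths, the worker count and leaving out --srcdict are my assumptions):
# BPE-encode the raw splits with my own Dutch encoder/vocab instead of the downloaded GPT-2 files
for SPLIT in train valid; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json nl_bpe/nl_encoder.json \
        --vocab-bpe nl_bpe/nl_vocab.bpe \
        --inputs Dutch_Corpus_raw/nlwiki.${SPLIT}.raw \
        --outputs Dutch_Corpus_raw/nlwiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 16; \
done
# binarize; I leave out --srcdict since I won't have the GPT-2 dict.txt and assume
# fairseq-preprocess builds the dictionary from the training data
# (the destdir matches the DATA_DIR used in the training command below)
fairseq-preprocess \
    --only-source \
    --trainpref Dutch_Corpus_raw/nlwiki.train.bpe \
    --validpref Dutch_Corpus_raw/nlwiki.valid.bpe \
    --destdir Dutch_Corpus \
    --workers 16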
Also, the documentation indicates the input as .raw files; does that simply correspond to plain-text files like the ones I extracted?
Finally, for the training, should I use the command as provided in the documentation with just --mask-whole-words added at the end, or is whole-word-masking training done differently?
This would give:
TOTAL_UPDATES=125000 # Total number of training steps
WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates
PEAK_LR=0.0005 # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512 # Max sequence length
MAX_POSITIONS=512 # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16 # Number of sequences per batch (batch size)
UPDATE_FREQ=16 # Increase the batch size 16x
DATA_DIR=Dutch_Corpus
fairseq-train --fp16 $DATA_DIR \
--task masked_lm --criterion masked_lm \
--arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
--max-update $TOTAL_UPDATES --log-format simple --log-interval 1 \
--mask-whole-words
Lastly, could --arch roberta_base be replaced by another pretrained model from Hugging Face? In that case, I could start from RobBERT, the Dutch RoBERTa.
Thank you for your help and suggestions!