Hello,
I would like to train a masked language model (BERT/RoBERTa) in Dutch using whole word masking.
I saw in this issue that this is possible with --mask-whole-words.
For this, I downloaded the Dutch Wikipedia dump: nlwiki-20200720-pages-meta-current.xml.bz2
After extracting it, I now have several subfolders of plain-text articles separated by blank lines.
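I was planning to flatten these into single train/valid .raw files, roughly like this (the extracted/ folder, the wiki_* file pattern and the 99/1 split are just my assumptions about what the extraction produced, so please correct me if this is not what fairseq expects):
mkdir -p Dutch_Corpus_raw
# concatenate all extracted article files into one plain-text file
find extracted/ -type f -name 'wiki_*' -exec cat {} + > Dutch_Corpus_raw/all.txt
# hold out roughly the last 1% of lines as a validation split
TOTAL=$(wc -l < Dutch_Corpus_raw/all.txt)
VALID=$((TOTAL / 100))
head -n $((TOTAL - VALID)) Dutch_Corpus_raw/all.txt > Dutch_Corpus_raw/nlwiki.train.raw
tail -n $VALID Dutch_Corpus_raw/all.txt > Dutch_Corpus_raw/nlwiki.valid.raw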
My question is to make sure I follow the right steps to obtain a whole-word-masking model, i.e. one that masks and predicts whole words rather than word pieces. The goal is then to do fill-mask with the model, all of this for Dutch.
Following "
Pretraining RoBERTa using your own data", I should preprocess my text articles by encoding and binarizing them with GPT-2 BPE. For this, it seems I need to pass my own encoder and vocabulary in the parameters : --encoder-json and --vocab-bpe (this can be produced with huggingface tokenizer). Would this be all that need to be changed or am I missing something ?
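Concretely, adapting the tutorial commands, I imagine the preprocessing would look roughly like this (nl_encoder.json and nl_vocab.bpe would be the files produced by the HuggingFace tokenizer; the paths, the worker count and leaving out --srcdict are my assumptions):
# BPE-encode the raw splits with my own Dutch encoder/vocab instead of the downloaded GPT-2 files
for SPLIT in train valid; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json nl_bpe/nl_encoder.json \
        --vocab-bpe nl_bpe/nl_vocab.bpe \
        --inputs Dutch_Corpus_raw/nlwiki.${SPLIT}.raw \
        --outputs Dutch_Corpus_raw/nlwiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 16; \
done
# binarize; I leave out --srcdict since I won't have the GPT-2 dict.txt and assume
# fairseq-preprocess builds the dictionary from the training data
# (the destdir matches the DATA_DIR used in the training command below)
fairseq-preprocess \
    --only-source \
    --trainpref Dutch_Corpus_raw/nlwiki.train.bpe \
    --validpref Dutch_Corpus_raw/nlwiki.valid.bpe \
    --destdir Dutch_Corpus \
    --workers 16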
Also, the documentation indicates the input as .raw files; does that simply correspond to plain-text files like the ones I extracted?
Finally, for the training, should I use the command as provided in the documentation with just --mask-whole-words added at the end, or is whole-word-masking training done differently?
This would give:
TOTAL_UPDATES=125000 # Total number of training steps
WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates
PEAK_LR=0.0005 # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512 # Max sequence length
MAX_POSITIONS=512 # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16 # Number of sequences per batch (batch size)
UPDATE_FREQ=16 # Increase the batch size 16x
DATA_DIR=Dutch_Corpus
fairseq-train --fp16 $DATA_DIR \
--task masked_lm --criterion masked_lm \
--arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
--max-update $TOTAL_UPDATES --log-format simple --log-interval 1 \
--mask-whole-words
Lastly, could --arch roberta_base be replaced by another pretrained model from Hugging Face? In that case, I could start from RobBERT, the Dutch RoBERTa.
Thank you for your help and suggestions!