Mask Whole Word Roberta training

emmavan...@gmail.com

Jul 31, 2020, 3:38:08 PM
to fairseq Users
Hello, 

I would like to train a masked language model (BERT/RoBERTa) for Dutch using whole word masking.
I saw in this issue that this is possible with --mask-whole-words.
For this, I downloaded the Dutch Wikipedia dump: nlwiki-20200720-pages-meta-current.xml.bz2.
Then I ran WikiExtractor.py to extract and clean the text from this dump, in the same fashion as "Pre-Training with Whole Word Masking for Chinese BERT".
I now have several subfolders containing plain-text articles separated by blank lines.
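
For reference, this is roughly the extraction command I used (the output directory name is just my own choice):

# Extract and clean plain text from the Dutch Wikipedia dump
python WikiExtractor.py nlwiki-20200720-pages-meta-current.xml.bz2 \
    -o nlwiki_extracted \
    --processes 8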
My question is whether I am following the right steps to get a whole-word-masking model, i.e. one that tokenizes and predicts whole words rather than word pieces. The goal is then to do fill-mask with the model, all of this for Dutch.
Following "Pretraining RoBERTa using your own data", I should preprocess my text articles by encoding and binarizing them with GPT-2 BPE. For this, it seems I need to pass my own encoder and vocabulary in the parameters : --encoder-json and --vocab-bpe (this can be produced with huggingface tokenizer). Would this be all that need to be changed or am I missing something ?
Also, in the documentation the input files are indicated as .raw; does this correspond to the plain-text files as I extracted them?
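
Concretely, adapting the commands from that tutorial, I imagine the preprocessing would look roughly like this. The dutch_bpe/ paths and the train/valid/test .raw split names are placeholders on my side (I would first concatenate the extracted articles into those splits), and I am not sure whether dict.txt here also needs to be built from my own vocabulary:

# Encode the plain-text splits with my own Dutch byte-level BPE
for SPLIT in train valid test; do
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json dutch_bpe/encoder.json \
        --vocab-bpe dutch_bpe/vocab.bpe \
        --inputs nlwiki_raw/${SPLIT}.raw \
        --outputs nlwiki_raw/${SPLIT}.bpe \
        --keep-empty \
        --workers 16
done

# Binarize the encoded splits for fairseq
fairseq-preprocess \
    --only-source \
    --srcdict dutch_bpe/dict.txt \
    --trainpref nlwiki_raw/train.bpe \
    --validpref nlwiki_raw/valid.bpe \
    --testpref nlwiki_raw/test.bpe \
    --destdir Dutch_Corpus \
    --workers 16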
Finally, for the training itself, should I use the command as provided in the documentation with just --mask-whole-words added at the end? Or is whole-word-masking training done differently?
This would give:

TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=Dutch_Corpus

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 \
    --mask-whole-words



Lastly, could --arch roberta_base be replaced by another pretrained model from Hugging Face? In that case, I could start from RobBERT, the Dutch RoBERTa.
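
I realize --arch only selects the architecture, so I guess actually starting from RobBERT's weights would mean restoring a checkpoint rather than changing --arch. If that is possible, I assume (but am not sure) it would need a fairseq-format checkpoint of that model and would look roughly like this, with the checkpoint path being a placeholder on my side:

ROBBERT_PATH=/path/to/robbert/model.pt   # hypothetical path to a fairseq-format checkpoint

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 \
    --restore-file $ROBBERT_PATH \
    --reset-optimizer --reset-dataloader --reset-meters \
    --mask-whole-words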

Thank you for your help and suggestions!
