Hello,
I've been working with Opus-Marian for the past few weeks, re-training en-fr base models on our datasets.
At this point, I am looking into tuning hyperparameters for re-training in Marian. I believe better hyperparameter settings could significantly improve translation quality at inference time.
Currently, the translations are not at an acceptable level of accuracy. The goal is to increase model capacity so that it can capture more complex relationships in our data.
I have been looking into frameworks like Optuna, which performs automated hyperparameter search and suggests promising parameter values for a model. I am wondering whether anyone else has set this up and could share advice on building a workflow for tuning hyperparameters when re-training base models.
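Here is a rough sketch of the kind of loop I have in mind, in case it helps frame the question. It assumes `marian` is on the PATH, that a `base.yml` holds our data, vocab, and validation paths, and that `read_best_bleu()` is a placeholder helper for pulling the best bleu-detok score out of the validation log; this is only the shape of the workflow, not a tested setup:

```python
# Rough sketch: Optuna driving marian training runs.
# Assumptions (not from the post): `marian` is on PATH, base.yml contains the
# data/vocab/validation paths, and read_best_bleu() is a placeholder that
# parses the best bleu-detok score from the --valid-log file of a trial.
import subprocess
from pathlib import Path

import optuna


def read_best_bleu(valid_log: Path) -> float:
    """Placeholder: extract the best bleu-detok score from a Marian valid log.
    Adapt the parsing to whatever log format your Marian version emits."""
    best = 0.0
    for line in valid_log.read_text().splitlines():
        if "bleu-detok" in line:
            try:
                score = float(line.split("bleu-detok")[1].split(":")[1])
                best = max(best, score)
            except (IndexError, ValueError):
                continue
    return best


def objective(trial: optuna.Trial) -> float:
    run_dir = Path(f"tune/trial_{trial.number}")
    run_dir.mkdir(parents=True, exist_ok=True)

    # Start with a few high-impact knobs; the names match the config keys below.
    learn_rate = trial.suggest_float("learn-rate", 1e-4, 1e-3, log=True)
    lr_warmup = trial.suggest_int("lr-warmup", 4000, 16000, step=4000)
    dropout = trial.suggest_float("transformer-dropout", 0.0, 0.3)

    cmd = [
        "marian",                                    # or marian-train, depending on the build
        "--config", "base.yml",                      # assumed base config with data/vocab/valid sets
        "--model", str(run_dir / "model.npz"),
        "--valid-log", str(run_dir / "valid.log"),
        "--learn-rate", str(learn_rate),
        "--lr-warmup", str(lr_warmup),
        "--transformer-dropout", str(dropout),
        "--after-epochs", "1",                       # keep trials short; lengthen once the search narrows
    ]
    subprocess.run(cmd, check=True)

    return read_best_bleu(run_dir / "valid.log")


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print("best params:", study.best_params, "best bleu-detok:", study.best_value)
```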
Any other advice on how to choose hyperparameters that help the model capture these more complex relationships would be very helpful. For reference, we are training a transformer-big model with the following hyperparameters (a sketch of how I would merge trial values back into this config follows after the dump):
after: 0e
after-batches: 0
after-epochs: 0
all-caps-every: 0
allow-unk: false
authors: false
beam-size: 8
bert-class-symbol: "[CLS]"
bert-mask-symbol: "[MASK]"
bert-masking-fraction: 0.15
bert-sep-symbol: "[SEP]"
bert-train-type-embeddings: true
bert-type-vocab-size: 2
build-info: ""
check-gradient-nan: false
check-nan: false
cite: false
clip-norm: 0
cost-scaling: []
cost-type: ce-mean-words
cpu-threads: 0
data-threads: 32
data-weighting: ""
data-weighting-type: sentence
dec-cell: gru
dec-cell-base-depth: 2
dec-cell-high-depth: 1
dec-depth: 6
devices:
- 0
- 1
- 2
- 3
dim-emb: 1024
dim-rnn: 1024
dim-vocabs:
- 32000
- 32000
disp-first: 0
disp-freq: 1000u
disp-label-counts: true
dropout-rnn: 0
dropout-src: 0
dropout-trg: 0
dump-config: ""
dynamic-gradient-scaling: []
early-stopping: 5
early-stopping-on: first
embedding-fix-src: false
embedding-fix-trg: false
embedding-normalization: false
embedding-vectors: []
enc-cell: gru
enc-cell-depth: 1
enc-depth: 6
enc-type: bidirectional
english-title-case-every: 0
exponential-smoothing: 0.0001
factor-weight: 1
factors-combine: sum
factors-dim-emb: 0
gradient-checkpointing: false
gradient-norm-average-window: 100
guided-alignment: none
guided-alignment-cost: ce
guided-alignment-weight: 0.1
ignore-model-config: false
input-types: []
interpolate-env-vars: false
keep-best: true
label-smoothing: 0.1
layer-normalization: false
learn-rate: 0.0002
lemma-dependency: ""
lemma-dim-emb: 0
log-level: info
log-time-zone: ""
logical-epoch:
- 1e
- 0
lr-decay: 0
lr-decay-freq: 50000
lr-decay-inv-sqrt:
- 8000
lr-decay-repeat-warmup: false
lr-decay-reset-optimizer: false
lr-decay-start:
- 10
- 1
lr-decay-strategy: epoch+stalled
lr-report: false
lr-warmup: 8000
lr-warmup-at-reload: false
lr-warmup-cycle: false
lr-warmup-start-rate: 0
max-length: 100
max-length-crop: false
max-length-factor: 3
maxi-batch: 1000
maxi-batch-sort: trg
mini-batch: 1000
mini-batch-fit: true
mini-batch-fit-step: 10
mini-batch-round-up: true
mini-batch-track-lr: false
mini-batch-warmup: 0
mini-batch-words: 0
mini-batch-words-ref: 0
multi-loss-type: sum
n-best: false
no-reload: false
no-restore-corpus: false
normalize: 1
normalize-gradient: false
num-devices: 0
optimizer: adam
optimizer-delay: 1
optimizer-params:
- 0.9
- 0.998
- 1e-09
output-omit-bias: false
overwrite: true
precision:
- float32
- float32
pretrained-model: ""
quantize-biases: false
quantize-bits: 0
quantize-log-based: false
quantize-optimization-steps: 0
quiet: false
quiet-translation: true
relative-paths: false
right-left: false
save-freq: 5000
seed: 0
sentencepiece-alphas: []
sentencepiece-max-lines: 2000000
sentencepiece-options: ""
shuffle: data
shuffle-in-ram: true
sigterm: save-and-exit
skip: false
sqlite: ""
sqlite-drop: false
sync-sgd: true
tempdir: /tmp
tied-embeddings: false
tied-embeddings-all: true
tied-embeddings-src: false
train-embedder-rank: []
transformer-aan-activation: swish
transformer-aan-depth: 2
transformer-aan-nogate: false
transformer-decoder-autoreg: self-attention
transformer-decoder-dim-ffn: 0
transformer-decoder-ffn-depth: 0
transformer-depth-scaling: false
transformer-dim-aan: 2048
transformer-dim-ffn: 4096
transformer-dropout: 0.1
transformer-dropout-attention: 0
transformer-dropout-ffn: 0
transformer-ffn-activation: relu
transformer-ffn-depth: 2
transformer-guided-alignment-layer: last
transformer-heads: 16
transformer-no-projection: false
transformer-pool: false
transformer-postprocess: dan
transformer-postprocess-emb: d
transformer-postprocess-top: ""
transformer-preprocess: ""
transformer-rnn-projection: false
transformer-tied-layers: []
transformer-train-position-embeddings: false
tsv: false
tsv-fields: 0
type: transformer
ulr: false
ulr-dim-emb: 0
ulr-dropout: 0
ulr-keys-vectors: ""
ulr-query-vectors: ""
ulr-softmax-temperature: 1
ulr-trainable-transformation: false
unlikelihood-loss: false
valid-freq: 2000
valid-max-length: 1000
valid-metrics:
- bleu-detok
- chrf
- ce-mean-words
valid-mini-batch: 8
valid-reset-all: false
valid-reset-stalled: false
valid-script-args: []
valid-script-path: ""
vocabs: []
word-penalty: 0
word-scores: false
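If it turns out to be cleaner to keep each trial fully reproducible from its own file, I was also thinking of merging the trial values into a copy of the config above before launching, rather than passing command-line overrides. A rough sketch using PyYAML (the file names are placeholders):

```python
# Sketch: merge trial-suggested values into a copy of the dumped config above,
# so every trial is fully described by its own YAML file.
# Assumes PyYAML and a base config named base.yml (placeholder name).
import yaml


def write_trial_config(base_path: str, out_path: str, overrides: dict) -> None:
    with open(base_path) as f:
        config = yaml.safe_load(f)
    config.update(overrides)          # keys use the same hyphenated names as in the dump above
    with open(out_path, "w") as f:
        yaml.safe_dump(config, f, default_flow_style=False)


write_trial_config(
    "base.yml",
    "trial_0.yml",
    {"learn-rate": 0.0003, "transformer-dropout": 0.2, "lr-warmup": 12000},
)
# then: marian --config trial_0.yml
```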