Hi Damir,
this is a great question, and it is not easy to answer briefly; could you share more details? Here are the classical approaches and how we count them (based on an earlier discussion we had), with a small counting sketch after the list. Importantly, please describe your setup in the paper and in our system description poll, so we can clearly mark it in the findings.
- Mixture-of-experts - the total number of parameters counts, not the number of active parameters (there is also an efficiency shared task at WMT for such models).
- Model ensemble - the sum of all the models' parameters counts.
- Smaller model trained on a larger model's outputs (distillation) - the number of parameters of the smaller (inference) model counts; the teacher does not count if it is used solely for training.
- Best-of-N answers or post-editing/polishing with the same model - the size of the model counts only once, no matter how many times it is queried per single translation.
- Usage of MT metrics at inference time (such as in MBR) - ignore the size of the MT metric models if and only if they are used solely for quality estimation/ranking and do not provide feedback on how to improve the translation. However, if a proprietary model (like GPT) is used for quality estimation, the submission automatically becomes unconstrained, since it breaks the requirements that your model be published and the translations be reproducible.
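To make the arithmetic concrete, here is a minimal Python sketch of the counting rules above. The function, model names, and parameter sizes are hypothetical illustrations of my own, not an official tool:

```python
# A minimal, hypothetical sketch of the counting rules above;
# the function and the model names/sizes are illustrations only,
# not an official WMT tool.

def counted_parameters(inference_models, qe_only_metrics=()):
    """Return the parameter count (in billions) toward the size limit.

    Per the rules above:
      - an MoE model counts by its TOTAL parameters, not the active ones;
      - an ensemble counts as the sum of all member models;
      - a teacher used solely for training is excluded entirely;
      - querying the same model N times per translation counts it once;
      - metrics used purely for QE/ranking (e.g. MBR scoring) are ignored,
        provided they never say how to improve a translation.
    """
    # qe_only_metrics is accepted only to document that it is ignored.
    return sum(size for _, size in inference_models)

# Example: an ensemble of a dense model (sampled best-of-5) and an MoE
# model, reranked by a QE metric that only scores candidates.
models = [
    ("dense_7b", 7.0),        # counts once, despite 5 samples per sentence
    ("moe_47b_total", 47.0),  # MoE: total parameters, not active
]
print(counted_parameters(models, qe_only_metrics=[("qe_metric", 0.6)]))
# -> 54.0 (the QE-only metric adds nothing)
```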
I hope this helps a bit. Please let us know if you are using some alternative setup that isn't easy to categorize into the cases above.
Have a lovely day,
Kocmi
(in Europe, [kotsmi], he/him)