General MT: deadlines, data donation

Tom Kocmi

Apr 5, 2023, 10:51:35 AM
to wmt-...@googlegroups.com

Hi,

 

We have updated our deadlines to align with EMNLP; this year's submission week runs from 13th July to 20th July.

http://www2.statmt.org/wmt23/translation-task.html

 

Secondly, with the recent rise of LLMs trained on unknown data, we want to create challenging test sets that can also evaluate LLMs' capabilities.

Thus, we are looking for unseen or fresh data, ideally created this year. Please contact us if you know of any source where we could collect fresh monolingual data under a research-permissive licence.

Or let us know if you can donate monolingual data to General MT. Even small quantities, such as a couple of hundred sentences, are welcome.

 

Thank you and have a lovely day,

Tom

(in Germany, he/him)

 

 

Adam Bittlingmayer

Apr 6, 2023, 7:15:21 AM
to wmt-...@googlegroups.com
It seems to me that a key promise of LLMs is the ability to use context.

So we need to make test sets where the task shape differs from our traditional abstract source:target paradigm.

For example, let the model see the reference translation for the preceding n lines, or at least see the source for the surrounding lines, during both training and inference.

This is tricky because it adds complexity and fragmentation.

But if we don't try, machine translation may end up even more like search or ads, where research is not relevant to the top production systems.

Adam



--
You received this message because you are subscribed to the Google Groups "Workshop on Statistical Machine Translation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to wmt-tasks+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/wmt-tasks/AS4PR83MB05235DA367C59EE998FAEB91FA909%40AS4PR83MB0523.EURPRD83.prod.outlook.com.


--
Adam Bittlingmayer
CEO
ModelFront
Translation quality prediction

Barry Haddow

Apr 6, 2023, 8:49:36 AM
to wmt-...@googlegroups.com
Hi Adam

On 06/04/2023 12:15, Adam Bittlingmayer wrote:
> For example, let the model see the reference translation for the
> preceding n lines, or at least see the source for the surrounding
> lines, during both training and inference.

At inference time, the surrounding source is normally available, but I
don't think it makes sense to give access to the surrounding reference.
At training time, we keep sentences in order whenever possible, so the
surrounding source and reference are available. This is difficult for
ParaCrawl because of the way it is extracted, but we do it for other
data sets.

I agree that any new test sets we add should also preserve context where
they can.
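
As a rough illustration of what "preserving context" could look like for a test set, here is a minimal sketch that pairs each source sentence with its neighbouring source lines from the same document. The function name, the field names, and the window sizes are assumptions made for this example; none of them come from the WMT data format or tooling. In line with the point above, only source-side context is attached, not the reference.

```python
from typing import Dict, List


def build_context_examples(
    source_lines: List[str], n_before: int = 2, n_after: int = 1
) -> List[Dict[str, object]]:
    """Attach surrounding source lines to each sentence of one document.

    Hypothetical helper: assumes `source_lines` holds the sentences of a
    single document in their original order.
    """
    examples = []
    for i, line in enumerate(source_lines):
        start = max(0, i - n_before)  # clamp the window at the document start
        examples.append(
            {
                "source": line,
                "context_before": source_lines[start:i],
                "context_after": source_lines[i + 1 : i + 1 + n_after],
            }
        )
    return examples


if __name__ == "__main__":
    doc = ["First sentence.", "Second sentence.", "Third sentence."]
    for example in build_context_examples(doc):
        print(example)
```

Keeping sentences in document order is what makes this possible; if a corpus is shuffled or deduplicated at sentence level, the neighbouring lines can no longer be recovered.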

best

Barry

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheann carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.

Aaron Poet

Apr 7, 2023, 5:02:28 PM
to wmt-...@googlegroups.com


Hi Tom and WMT organisers,

Please have a look at our multilingual parallel corpus AlphaMWE: https://github.com/aaronlifenghan/AlphaMWE

It contains 750 sentences overall, split into 5 files, covering English, Chinese, Polish, German, and Arabic (partial). It is mixed-domain data that also features multi-word expressions.


Kind regards,

Lifeng


--

News

CFPs and participants: HealTAC23 (Manchester June 14-16) |

CF-Participants & Sponsorship MWE23@EACL (joint w ClinicalNLP@ACL) |

our work:

ClinicalMT@WMT22_w_EMNLP | Meta-eval Tutorial / HumanEval (paper_w_tool) / TranslationUncertainty (paper) @LREC22 | ClinicalTextMinging (ML-Tools) @HealTAC2022 || Covid-Topic-Modeling (arXiv-2023) | Measuring_IRR(inter-rater reliability) | AlphaMWE-Arabic (corpus)

Serving as ACL2023 AC (area chair): resource and evaluation |

MWE-SIGLEX elected Standing Committee Board member (2022-2024) |

Ph.D. in Computer Application (Machine Translation, thesis), M.Sc. (Software Engineering, thesis excellent-award), B.Sc. (Math, GPA 80/100)

Google-Scholar , Presentation(ppt) Research-Gate

Google-site Linkedin, Writer(poetry)

Postdoctoral Research Associate at HECTA group, The University of Manchester, UK

https://www.research.manchester.ac.uk/portal/lifeng.han.html Office: 2.90 Kilburn, Oxford Road, Manchester 
