(Possibly Basic) Question on (Lack of) Alternative Translations in the Dev Set

Abhishek Pandit

Mar 22, 2020, 11:02:31 PM
to duolingo-sharedtask-2020
Hi Team,

Quick intro: I'm a grad student in linguistics at the University of Chicago who speaks 12 languages. I've recently plunged into the more computational side of language, and this challenge has been a great learning experience on that front. So here's a seemingly basic question:

Why does the dev set for each language not contain the alternative (not 'gold') translations for each prompt sentence? Without those other alternatives, the task seems to be one of checking standard translation quality. Also, there would be no weights for the estimation of the Macro F1 metric in the evaluation. So for all practical purposes, doesn't it become unweighted?
Please forgive my ignorance. I learn very quickly, pinky swear! Also, it was great getting to know the team behind Duolingo AI through your linked profiles. Except I don't know much about Bill yet; I'm too busy dancing to the RickRoll now! :)

-Abe

Stephen Mayhew

Mar 23, 2020, 11:00:01 AM
to Abhishek Pandit, duolingo-sharedtask-2020
Hi Abe,

Thanks for contacting us, and thanks for your interest in the shared task! Recall that the goal of the task is to predict alternative translations, which are then scored according to user weights. We've released the dev set as a blind set, which means that we have withheld the gold alternative translations. All scoring on the Codalab leaderboard is done against these withheld translations. The point of a blind set is for participants to get an idea of how well their systems work on unseen data, and to tune them (limited by the daily submission limit). Hopefully, then, in the final test phase (which will also be blind) there won't be surprises.
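
To make the role of the weights concrete, here is a minimal sketch of a weighted macro F1. It assumes that precision is unweighted, that recall is weighted by the user weights attached to the gold translations, and that the per-prompt F1 scores are then averaged. The function names and the data layout (a dict from prompt ID to a dict of translation weights) are hypothetical, and the official scorer may differ in its details.

```python
def weighted_f1(predicted, gold_weights):
    """Score one prompt.

    `predicted` is an iterable of predicted translations.
    `gold_weights` maps each accepted translation to its user weight
    (hypothetical layout, assumed for this sketch).
    """
    predicted = set(predicted)
    if not predicted or not gold_weights:
        return 0.0
    true_pos = {t for t in predicted if t in gold_weights}
    # Assumption: precision counts every prediction equally.
    precision = len(true_pos) / len(predicted)
    # Assumption: recall is the share of total gold weight that was recovered.
    recall = sum(gold_weights[t] for t in true_pos) / sum(gold_weights.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_weighted_f1(predictions, gold):
    """Average the per-prompt scores over every prompt in `gold`."""
    scores = [weighted_f1(predictions.get(pid, []), weights)
              for pid, weights in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions the weights enter only through the gold side, so a submission needs nothing more than a list of predicted translations per prompt.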

Stephen

--
You received this message because you are subscribed to the Google Groups "duolingo-sharedtask-2020" group.
To unsubscribe from this group and stop receiving emails from it, send an email to duolingo-sharedtas...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/duolingo-sharedtask-2020/4e83a8c5-5eae-4f0d-81ca-b52da43e2583%40googlegroups.com.

Abhishek Pandit

Mar 24, 2020, 10:18:50 AM
to duolingo-sharedtask-2020
Hi Stephen,

Thanks for the clarification! Time to hit the high gear now. I've applied for permission to submit results to CodaLab. Would that request be routed to Duolingo?

Will be sending you folks a few more questions in the coming days. And I promise they're going to move from 'as-easy-as-pie' to 'as-complex-as-π' pretty quickly. :)

Abhishek Pandit

Mar 24, 2020, 12:31:18 PM
to duolingo-sharedtask-2020
So here's the first set of questions:
1) a) The languages you've provided in the task are morphologically and syntactically very different. It might make more sense to build distinct models for, say, Portuguese and Japanese. How would you recommend that we preprocess only specific languages at a time, instead of the whole bunch at once? I was considering changing the code in variables.sh. Right at the top, we see this foreboding message:

# this doesn't change.
src=en
tgt=hu

I still tried changing it to tgt='ja' to process Japanese exclusively. I then ran variables.sh and preprocess.sh.
I got this message:
WARNING: No known abbreviations for language 'ja', attempting fall-back to English version.
Is this warning a cause for concern? And would you recommend some other way of focusing on one language at a time?

b) I don't fully understand what exact purpose the 'hu' label above is serving in your code. Is 'hu' short for 'human', as in all the human-supplied translations? In that sense, it would seem like 'hu' is a stand-in for 'all' languages and 'all' translations. Would that be a fair assumption?

Thanks!

Stephen Mayhew

Mar 24, 2020, 2:31:18 PM
to Abhishek Pandit, duolingo-sharedtask-2020
Hi Abe,

Unless you are using a massively multilingual representation like multilingual BERT, it probably makes sense to process each language separately. In fact, all of the code stubs do process each language separately. In the snippet you showed, "src" means "source" and "tgt" means "target" (maybe "trg" would be clearer). src=en means that English is the source language (true in all cases), and tgt=hu means that the target is Hungarian. To process other languages, you would change tgt to one of (pt, ja, ko, vi).

That warning, I believe, is coming from the Moses tokenizer. It may signify that tokenization is slightly weird, which is not usually a problem. However, in the specific case of Japanese, you need a segmenter. For that, we recommend KyTea: http://www.phontron.com/kytea/

Stephen

Abhishek Pandit

Mar 25, 2020, 2:31:45 PM
to duolingo-sharedtask-2020
Hi Stephen,

Thanks for the clarifications! Now for today, a couple new questions:

Q1) Train-Test-Validation Errors
I'm currently working on the data for Portuguese. Since we have only a full 'train' data set at the moment, I modified line 20 in preprocess.sh to just FOLDS=(train) instead of the original FOLDS=(train dev test).
However, there seem to be other places in your overall code base that loop over all three sets in FOLDS. For example, I get the following error messages:

Error 1:
While running preprocess.sh:
FileNotFoundError: [Errno 2] No such file or directory: 'data/courses/en-pt/dev-sents.clean.bpe.en'

Naturally, because I never did any work on the 'dev' sentences. 

Error 2: 
Then on train.sh:
FileNotFoundError: Dataset not found: valid (data/courses/en-pt/bin/)
We never used the term 'valid' here (we refer to it as the 'dev' set).

It seems like we'll need a long time to comb through your .py files and even fairseq modules to figure out what went wrong. But I figured this must be a fairly common concern among other contributors too. Would you have any suggestions? I'd just like to get this baseline fairseq model up and running ASAP so we can compare the performance of the other models I'm working on.

Q2) Which Portuguese?
Just to confirm, this is Brazilian Portuguese, right? I see lots of 'você' in the text, so that's a fairly strong indication.

Thanks again!

Stephen Mayhew

Mar 26, 2020, 10:46:03 AM
to Abhishek Pandit, duolingo-sharedtask-2020
Hi Abe,

Q1) Right -- I'm sorry for this confusion. The errors that you are seeing come from fairseq expecting train/valid/test files ("valid", short for "validation", is their term for the dev set). Probably the simplest way to get around this is to split the train set into 3 sets, perhaps using an 80/10/10% split or similar, and treat the large one as train and the two smaller ones as dev and test. Does that make sense?

Q2) Correct, it is Brazilian Portuguese!
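
For the split suggested in Q1, a minimal sketch could look like the following. It assumes the training data has already been loaded into a list with one item per prompt (so that all translations of a prompt stay in the same fold); the function name `split_items` and its parameters are hypothetical, and nothing here comes from the released starter code.

```python
import random


def split_items(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle `items` and return (train, dev, test) lists.

    Assumption: each item is one prompt together with its translations,
    so a prompt never ends up in more than one fold.
    """
    items = list(items)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test
```

The three lists would then be written out under the names the preprocessing scripts expect, with the dev fold playing the role of fairseq's "valid" set.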

Stephen

Abhishek Pandit

Mar 27, 2020, 10:20:08 PM
to duolingo-sharedtask-2020

Hi Stephen,
Thanks! I internally split the training data into training, test and validation sets. My code runs fine for Portuguese, but I now have strange characters popping up in the test and training set. Specifically, '@@'.
For example, 
Portuguese: quer@@ ido david , como vai você ?
English: what pa@@ ges do we have to read ?
The @ signs don't replace any other character. They just seem to pop up in the middle of correctly spelt words. Interestingly, I don't see them in the training data. Has anyone else dealt with this @nomaly before?

Stephen Mayhew

Mar 30, 2020, 11:09:06 AM
to Abhishek Pandit, duolingo-sharedtask-2020
Hi Abe,

I believe what you're seeing are markers for wordpieces (subword units), probably coming out of fairseq-generate: a token ending in '@@' continues into the next token. You can read more about wordpiece, and why it's used here.

Notice that in run_pretrained.sh, there's a line that removes all of these wordpiece markers. Maybe this sed command isn't working for you?
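
If the sed command is the problem, the same cleanup can be sketched in Python. This is an illustration only, not the command from run_pretrained.sh: it assumes that every '@@' at the end of a token is a continuation marker that should be deleted together with the space after it, and the names `SUBWORD_MARKER` and `remove_subword_markers` are hypothetical.

```python
import re

# Assumption: "@@" followed by a space (or by end of line) marks a token
# that continues into the next one.
SUBWORD_MARKER = re.compile(r"@@( |$)")


def remove_subword_markers(line):
    """Join subword pieces back into whole words."""
    return SUBWORD_MARKER.sub("", line)


if __name__ == "__main__":
    for line in ["quer@@ ido david , como vai você ?",
                 "what pa@@ ges do we have to read ?"]:
        print(remove_subword_markers(line))
```

Applied to the two example lines from the previous message, this restores "querido" and "pages" and leaves the rest of each line untouched.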

Stephen
