Data preparation (multiple sentences in pairs)


Konstantin Keller

Oct 25, 2018, 12:22:29 PM
to Google Cloud Translation API
Can my dataset contain pairs with multiple sentences?

George (Cloud Platform Support)

Oct 25, 2018, 3:37:42 PM
to Google Cloud Translation API
Hello Konstantin, 

The purpose of providing sentence pairs, each made of one source sentence and the corresponding sentence in the target language, is to facilitate the training of your model. Do you mean one source sentence with multiple possible translations? Why is this pattern important to you? If this is the case, you can simply provide a set of pairs in which the same source sentence is paired each time with a different translation variant. For reference, you may have a look at the "How AutoML Translation works" illustration on the "Cloud Translation" documentation page.

In fact, how do you see your multiple-sentence dataset? How would you structure this dataset, if not in simple pairs? More details are needed to investigate your issue properly.
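
A minimal sketch of the "same source, several variants" layout described above. It assumes a tab-separated file with one pair per line (source first, target second); the helper names `variant_pairs` and `to_tsv` and the placeholder variant strings are hypothetical and only serve the illustration.

```python
def variant_pairs(source, translations):
    """Pair one source sentence with each of its translation variants."""
    return [(source, target) for target in translations]


def to_tsv(pairs):
    """Render pairs as tab-separated lines, one pair per line."""
    for source, target in pairs:
        # A tab or newline inside a segment would break the two-column layout.
        if any(ch in segment for segment in (source, target) for ch in "\t\n"):
            raise ValueError("segments must not contain tabs or newlines")
    return "\n".join(f"{source}\t{target}" for source, target in pairs)


# Placeholder strings; substitute real translation variants here.
pairs = variant_pairs("Track damaged.", ["<variant A>", "<variant B>"])
print(to_tsv(pairs))
```

Each variant becomes its own line, so the source sentence is simply repeated once per translation.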

Konstantin Keller

Oct 25, 2018, 3:58:02 PM
to Google Cloud Translation API
Hello George,

We have a dataset that contains a few sentences in English (describing some aspects of the game) and a translated version in the target language.

For example: 
English: "Track damaged. Risk of breakage increased."
Japanese: "履帯損傷、耐久性低下"

As you can see, the English version has two sentences and the Japanese version has one. We are afraid to split such cases into sentences automatically.

Can we use cases like this in the training dataset, or do we have to avoid them?

George (Cloud Platform Support)

Oct 26, 2018, 4:32:08 PM
to Google Cloud Translation API
Hello Konstantin, 

AutoML treats each sentence pair as an independent training item, without assuming any correlation between separate pairs. This means that the period you placed between the two sentences should not worry you unduly. To keep the period from splitting your text in two, I suggest a semicolon instead: ";". This should help in evaluating your source text as one whole sentence. You may gather more insight from the "Preparing Training Data" documentation page.
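
As a rough illustration of the ";" suggestion, the sketch below replaces every sentence-internal period with a semicolon and leaves the final period alone. The function name `join_with_semicolons` is hypothetical, and the pattern is deliberately naive: it would also rewrite abbreviations such as "e.g. ", so the output should be reviewed by hand.

```python
import re


def join_with_semicolons(text):
    """Replace each period that is followed by more text with a semicolon."""
    # Match a period plus the whitespace after it, but only when another
    # non-space character follows, so the closing period is kept.
    return re.sub(r"\.\s+(?=\S)", "; ", text)


print(join_with_semicolons("Track damaged. Risk of breakage increased."))
# prints: Track damaged; Risk of breakage increased.
```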

Konstantin Keller

Oct 26, 2018, 5:05:01 PM
to Google Cloud Translation API
So, if I understood correctly: we can use sentence pairs that contain multiple sentences on both sides, even if the number of sentences does not match exactly.

For example, 6 sentences in the source and 7 sentences in the target:
Thank you for contacting customer support. Our developers are aware of this issue and are looking into solving it. I know this can be frustrating however we are working to rectify this problem as fast as we can.  We apologize for any inconvenience this can cause. If you have any more questions, please don't hesitate to ask. Best regards.\tお世話になります。 カスタマーサポートにお問い合わせいただきありがとうございます。 弊社の開発者はこの問題を認識しており、解決方法を調査中です。 苛立たしい状態ですが、できるだけ早くこの問題を解決するよう努めておりますので、もう少々お待ちください。  ご不便をおかけし、申し訳ございません。 さらにご質問がある場合は、お気軽にお問い合わせください。 宜しくお願いいたします。

George (Cloud Platform Support)

Oct 26, 2018, 5:42:31 PM
to Google Cloud Translation API
Hi Konstantin, 

Your first example was quite different, as it could still be viewed as a pair of one sentence to one sentence after replacing the inner period with a semicolon. By contrast, the long text in your last post cannot possibly pass as a single sentence in the source language, and probably not in the target language, Japanese, either. This text is divisible into sentences, so you should really pair each sentence with its corresponding sentence. The model will then get properly trained, as expected.
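
One possible way to follow this advice is sketched below: split both sides on sentence-ending punctuation and pair them positionally only when the counts agree, leaving everything else for manual alignment. The names `split_sentences` and `pair_sentences` are hypothetical, and the punctuation-based splitter is an assumption for illustration, not how AutoML segments text.

```python
import re

# Naive rule: a sentence is a run of text up to ., !, ? or their
# full-width Japanese counterparts. Abbreviations will be split wrongly.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text):
    """Split text into sentences, dropping empty fragments."""
    parts = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in parts if p]


def pair_sentences(source, target):
    """Return one-to-one pairs, or None when the counts differ."""
    src = split_sentences(source)
    tgt = split_sentences(target)
    if len(src) != len(tgt):
        return None  # needs manual alignment
    return list(zip(src, tgt))
```

With the 6-sentence source and 7-sentence target quoted earlier in the thread, `pair_sentences` returns `None`, which flags that row for a human to align rather than guessing which sentences correspond.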

Konstantin Keller

Oct 26, 2018, 5:48:29 PM
to Google Cloud Translation API
OK. Thank you so much for the swift response!