problem with commas and complex sentences

Senior Avr

Feb 3, 2021, 3:54:47 PM
to Google Cloud Translation API
I'm translating technical documents from Spanish to English. I'm using my own glossary, which mainly maps technical abbreviations in Spanish to their English abbreviations.

(in both examples, the glossary translates es:"temp"->en:"tmp")

Issue#1 example:
es: "usando 1  ,  5 temp"  --> en: "using 1.5 tmp" (translation result)
Note: the comma became a decimal point and both parts of the sentence joined :(
How can I prevent that "numbering" manipulation?

Issue#2 example:
es: "*2 temp en el primer temp, 1 temp en los sig" -> en: "* 2 in the first tmp tmp 1 tmp in following" (translation result)
Note: the comma is gone (??) and sentence structure is totally corrupted.
Any ideas?

This is killing the project.
Thanks in advance.

Adam Bittlingmayer

Feb 5, 2021, 1:23:36 PM
to Google Cloud Translation API
My instinct says that these are both pretty noisy - a lot of humans wouldn't know how to parse them either.

Are you trying to rely on raw machine translation for your project, or do you send it for human post-editing?

Is "temp" actually what ends up in the final string, or is it a placeholder that gets swapped out?

Adam

Konstantin Savenkov

Feb 5, 2021, 3:57:33 PM
to Google Cloud Translation API
Hi,

The problem with ad-hoc abbreviations is that they are not "normal" words and the MT language model does not know them. Therefore, the sentence gets corrupted and the glossary won't work, given the way it's implemented in most MT systems (by looking for the glossary entry in the translation results and replacing it with the right word).

We solve the abbreviations issue by expanding abbreviations before the translation and then using MT glossaries to translate them using a proper target language abbreviation. 
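
A minimal sketch of the expansion step described above, not the actual implementation behind it. The `EXPANSIONS` table and the function name are hypothetical, and the full forms are invented for illustration only (in this thread "temp" is just a placeholder). Whole-word matching means a fused token such as "2temp" is left untouched.

```python
import re

# Hypothetical table: source-language abbreviation -> full source-language form.
# These entries are invented for illustration; a real table comes from the domain.
EXPANSIONS = {"temp": "temporada", "sig": "siguientes"}

def expand_abbreviations(text, expansions):
    """Replace whole-word abbreviations with their full forms before translation."""
    if not expansions:
        return text
    # Longest keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(expansions, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: expansions[m.group(1)], text)

expanded = expand_abbreviations("1 temp en los sig", EXPANSIONS)
```

The expanded text is what would be sent for translation, with the glossary then mapping each full form to the desired target-language abbreviation.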

Drop me a line at k...@inten.to and I may show how it works on your data.

cheers,
Konstantin.

Senior Avr

Feb 8, 2021, 3:12:46 AM
to Google Cloud Translation API
Hi Adam,
This is meant to be raw machine translation of these instructions for a web page.
"temp", as you guessed, is a placeholder, but the actual terms are mostly abbreviations, which is the reason for creating a glossary.
In any case, a phrase such as ...1  ,  5 ... with lots of white space does not resemble a number.
Are there any tricks or API params to tell Google not to mess around with removing or adding white space?
I think this has a lot to do with the bad translation.
Avi

Kevin Islas

Feb 8, 2021, 3:12:52 AM
to Google Cloud Translation API
Hi, 

Issue #1 is a bit complicated since the input is a bit noisy. This is not related to the use of glossaries but to the '1  ,  5', as sending the request without a glossary

es: "usando 1  ,  5 temp"
results in "using 1.5 temp".
So the main problem is determining what "1  ,  5" gets translated into.

As Adam pointed out, I think even human translators wouldn't be sure how to parse that. However, if you want to keep the "1  ,  5" as it is in the original text, you can try wrapping it in "notranslate" spans. This will ensure that any text within the "notranslate" spans is kept as is (https://cloud.google.com/translate/troubleshooting).

As an example, this is a request using the "notranslate" with a glossary that translates es:"temp"->en:"tmp". 
I set up a translation request with the following attributes:
contents: 'usando <span class="notranslate">1  ,  5</span> temp',
mime_type: 'text/html',

And got "using <span class="notranslate">1 , 5</span> tmp" as a result.
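
A rough sketch of how the `contents` string above could be built automatically, assuming the runs to protect are digits separated by commas with optional white space. The regex and the function name are illustrative and not part of the Translation API; the result would still be sent with mime_type 'text/html' as in the request above.

```python
import html
import re

# Assumption: a "number list" is digits, then one or more ", digits" groups,
# with any amount of white space around each comma.
NUM_LIST = re.compile(r"\d+(?:\s*,\s*\d+)+")

def protect_number_lists(text):
    """Escape plain text for HTML and wrap number lists in notranslate spans."""
    parts = []
    last = 0
    for m in NUM_LIST.finditer(text):
        # Escape the text between matches so stray < or & stay valid HTML.
        parts.append(html.escape(text[last:m.start()]))
        parts.append('<span class="notranslate">' + m.group(0) + "</span>")
        last = m.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)

contents = protect_number_lists("usando 1  ,  5 temp")
```

Note that this pattern also matches a genuine decimal such as "1,5", so it would need tightening if real decimals should still be localized.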

As for issue#2, I tried replicating the request with a glossary that translates es:"temp"->en:"tmp" but got a different result:
contents: '*2 temp en el primer temp, 1 temp en los sig'

And got "* 2 tmp in the first tmp, 1 tmp in the next" as a result.

Could you provide more details on the request for issue #2 so we can replicate it?
The things that would be helpful are whether you are using a custom model, the default (nmt) model or the base (pbmt) model for translation and what client library you are using to send requests.

Best,
Kevin

Adam Bittlingmayer

Feb 8, 2021, 4:29:23 AM
to Google Cloud Translation API
Hi all,

Re #1, I think it's risky to put a span with notranslate around the numbers (even if it works in this specific example).  We want the clauses to be seen more separately, not less, and we have no guarantee what the order inside each clause will be in the target language.

This requires sequence-awareness.  So it just isn't a job for glossaries; this is a job for normal AutoML customisation from training examples.

If we're going to hack around with preproc, I'd put it around each number, or replace the comma with a less ambiguous control character (like semi-colon).  But again, I think it's better to put the time into creating training examples.
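
The two preprocessing variants mentioned above could look roughly like this. Both helpers are hypothetical, and the second one assumes that a comma with white space on both sides is a list separator while a tight "1,5" is a real decimal, which is a guess about the data rather than a rule.

```python
import re

def wrap_each_number(text):
    """Wrap every run of digits in its own notranslate span."""
    return re.sub(
        r"\d+",
        lambda m: f'<span class="notranslate">{m.group(0)}</span>',
        text,
    )

def comma_to_semicolon(text):
    """Turn a white-space-padded comma between two numbers into ' ; '."""
    # Lookarounds keep the digits themselves out of the replaced span.
    return re.sub(r"(?<=\d)\s+,\s+(?=\d)", " ; ", text)

per_number = wrap_each_number("usando 1  ,  5 temp")
with_semicolon = comma_to_semicolon("usando 1  ,  5 temp")
```

If the semicolon variant is used, the separator would have to be mapped back to a comma after translation if the original punctuation matters.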

Let us know how it goes

Adam

Senior Avr

Feb 8, 2021, 2:11:25 PM
to Google Cloud Translation API
Hi Kevin,
Thanks for the detailed testing.
Issue#1:  
I'm not sure when to wrap text in a "notranslate" span and when not to. As an example for both issues (#1, #2), I've noticed that when I wrote "2temp" (no space, which is a common way of writing in our field) it was not translated at all (result es: "2temp"), so I used a regex to separate such cases. That worked fine, but it also ended up separating 2nd, 3rd, which made things worse :(

In addition, I simply went to Google Translate's web interface (https://translate.google.com/?sl=auto&tl=es&text=1%2C%205&op=translate) and you can see that en:"1  ,  5" translates to "15" in Spanish. And, most horribly, also vice versa: https://translate.google.com/?sl=es&tl=en&text=1%2C%205&op=translate .  Any ideas??

Issue#2 is not consistent, and I found it depends again on the "*" and spaces.
I'm using the default (i.e. nmt) model and a PHP client lib.

Kevin Islas

Feb 17, 2021, 6:27:12 AM
to Google Cloud Translation API
Hi,

The "2temp" scenario is expected, given that glossary terms need to exactly match the word they are meant to replace.
In this case, I think it depends on the context. If "2temp" is a common way of writing in your field, you might want to consider explicitly adding "2temp" as a glossary term.

If other terms end with temp and you also want them all translated to tmp (say 3temp->3tmp, 4temp->4tmp, etc.), then a regex makes sense. I'm not sure I understand what you mean by separating 2nd and 3rd, but I think refining the regex might help so that it only matches the x-temp terms you want to process as "temp" (e.g. matching [0-9]temp). Ultimately, however, if the number of field-specific terms is relatively small, the most straightforward way is to add each one as a glossary term.
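
One way to refine the regex along these lines, as a sketch: split a number from what follows only when the suffix is a known glossary term, so ordinals like 2nd and 3rd are left alone. The function name and the `terms` list are assumptions for illustration.

```python
import re

def split_number_term(text, terms):
    """Insert a space between a number and a directly attached glossary term."""
    if not terms:
        return text
    # Longest terms first so a longer term wins over a shorter prefix.
    alternation = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    # \b on both sides: the token must be exactly <digits><term>, nothing more.
    pattern = re.compile(r"\b(\d+)(" + alternation + r")\b")
    return pattern.sub(r"\1 \2", text)

separated = split_number_term("2temp en el 2nd y 3rd", ["temp"])
```

Building the alternation from the glossary's own source terms keeps the regex and the glossary in sync, at the cost of regenerating the pattern whenever the glossary changes.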

As for the "1  ,  5" issue, this is an interesting case, thanks for sharing both examples! I'll check with the team to investigate this further and get back to you. However, this seems to be a complex case, since even for human translators it is not clear how to parse the meaning of "1  ,  5". I think that currently the only way to ensure that exactly that chunk of text is preserved would be wrapping it in notranslate spans.

Best,
Kevin

Senior Avr

Feb 24, 2021, 3:11:32 AM
to Google Cloud Translation API
I'm currently getting around the numbers issue by wrapping the text in a span (<span>1,   5</span>) pre-translation and removing the span post-translation. Same for "2temp", where I use a regex to separate it into "2 temp" pre-translation.
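
The post-translation cleanup could be sketched as below. It assumes the only spans in the response are the ones added before translation (either a bare <span> or one with class="notranslate") and that the HTML response may contain entities that need decoding; the function name is hypothetical.

```python
import html
import re

# Matches a bare <span> or <span class="notranslate"> and captures its content.
SPAN = re.compile(r'<span(?:\s+class="notranslate")?>(.*?)</span>', re.DOTALL)

def strip_protective_spans(translated_html):
    """Remove the protective spans and decode HTML entities in the result."""
    without_spans = SPAN.sub(r"\1", translated_html)
    return html.unescape(without_spans)

plain = strip_protective_spans('using <span class="notranslate">1 , 5</span> tmp')
```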
Two more issues I ran into:
1) I noticed in the API documentation that the POS column in the CSV glossary is not used by the algorithm. So is there any other way to distinguish between word meanings, such as "round" as a noun and "round" (as in round edge) as an adjective? Those two are translated totally differently (e.g. in my case to Spanish). Alternatively, how can I detect (pre-translation) whether a word is a noun or not?
2) Is there any way to use /regex/ in the glossary?

Thx,
Avi
