New machine translation article


clpe...@gmail.com

Dec 1, 2006, 7:49:41 PM
to Honyaku E<>J translation list
There's a new article about machine translation at Wired; the link is
<http://www.wired.com/wired/archive/14.12/translate.html>. The process
in the article uses an accurate, but slow, algorithm.

--
Chris Pearce
Kobe

Fred Uleman

Dec 1, 2006, 10:23:42 PM
to hon...@googlegroups.com
Thanks for the pointer to the article. Interesting.
A few shallow comments:

Carbonell says that "we acknowledge our responsibility" would be a
better translation than "we declare our responsibility." I disagree.
If they are proud of what they did, they do not simply acknowledge --
much less admit -- but proudly declare. Even proclaim.

The article notes that "algorithms do well only when applied to the
same type of text on which they've been trained." Same thing is true
of humans.

"The commercial translation market is now roughly $10 billion
annually." Wonder where that number came from. And wonder if there are
language-combination breakdowns. For example, what's the J<->E market?

The article quotes Language Weaver CEO Bryce Benjamin as saying his
company's system "is being used day in and day out to catch bad guys."
Name three. This strikes me as akin to the government's claim that
wiretapping and torture are okay because they catch bad guys every day
-- we just haven't caught any yet. Name three.

Jack Halpern is mentioned. Does this mean Japanese is not far behind?

Finally, it notes that some of the problems in the English derived
from the Spanish, which was a bad translation from Arabic. Yes, that's
the beauty of using a bad text to start with. You can blame
infelicities on the source text. So why didn't they use a good Spanish
original? Say, something from a Madrid newspaper about the
bullfighting business? Something that started off in natural Spanish?
Was this on purpose?

Interesting article. Interesting approach. Bears watching.

--
Fred Uleman

Jon Johanning

Dec 1, 2006, 11:10:11 PM
to hon...@googlegroups.com
Furthermore, to consider a possible application to the Japanese-
English pair, this algorithm proceeds by 8-word segments. But most
Japanese patents, etc., contain numerous sentences which one can't
make any sense out of until one has gone a dozen or even two dozen
words or more. I suppose the algorithm could be expanded in that
direction, but it seems that that would slow it down even further.
And the bigger the chunks it works with, the more likelihood that it
would end up with the usual MT word-salad, it seems to me.

Also, how would it divide Japanese texts up into words? That puzzles
me with other Japanese MT systems, too. I suppose there is some
algorithm for doing it. But there can't be a completely mechanical
way of dividing up strings of kanji and kana in exactly the right way
every time, it seems to me. And if the MT system doesn't do that, it
can easily be led completely astray.


Jon Johanning // jjoha...@igc.org
__________________________
Belinda: Ay, but you know we must return good for evil.
Lady Brute: That may be a mistake in the translation.
-- Sir John Vanbrugh: The Provok’d Wife (1697), I.i.

Mark Spahn

Dec 2, 2006, 1:51:49 AM
to hon...@googlegroups.com
 
Thank you, Chris, for finding this interesting article.
It mentions a "BLEU score" as a measure of translation accuracy.
is a not-very-clear explanation of this "Bilingual Evaluation Understudy"
method of scoring, which "works by measuring the n-gram co-occurrence
between a given translation and a set of reference translations".
It looks like the BLEU score was defined so as to be easy to
calculate by computer, with no subjective human intervention.
(What?  It talks about "n-grams"?  Shades of L. Ron Hubbard.)
 
-- Mark Spahn  (West Seneca, NY)
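
For readers wondering what "n-gram co-occurrence" amounts to in practice, here is a minimal sketch in Python. It computes only the clipped n-gram overlap between a candidate translation and its reference translations, which is one ingredient of BLEU; the full score also combines several n-gram lengths and penalizes overly short output. The function names and the whitespace tokenization are illustrative assumptions, not taken from the article or from any MT vendor's code.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Fraction of candidate n-grams that also occur in a reference.

    Each n-gram is credited at most as many times as it appears in
    any single reference ("clipping"), so repeating a matching word
    does not inflate the score.
    """
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    # Highest count of each n-gram seen in any one reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


refs = ["we acknowledge our responsibility"]
print(modified_precision("we declare our responsibility", refs, 1))  # 0.75
print(modified_precision("we declare our responsibility", refs, 2))
```

With a single reference, "declare" versus "acknowledge" costs one of four single words but two of three word pairs, which shows how mechanical the measure is: it counts matching strings and has no opinion on which verb is the better choice.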
 

Jim Breen

Dec 3, 2006, 4:23:41 AM
to hon...@googlegroups.com
"clpe...@gmail.com" wrote:
>> There's a new article about machine translation at Wired, the link is
>> <http://www.wired.com/wired/archive/14.12/translate.html>. The process
>> in the article uses an accurate, but slow, algorithm.

Quite a good article. Once you adjust for the journalese, and the
press-release puffery from Meaningful Machines (who run these things
whenever they need more venture capital), it's actually a reasonable
roundup of what's going on in statistical MT.

"Fred Uleman" added:

>> The article quotes Language Weaver CEO Bryce Benjamin as saying his
>> company's system "is being used day in and day out to catch bad guys."
>> Name three. This strikes me as akin to the government's claim that
>> wiretapping and torture are okay because they catch bad guys every day
>> -- we just haven't caught any yet. Name three.

Apparently the single biggest market for statistical MT is Arabic-English.
And the biggest customer is the US defence/intelligence community. 9/11
caught them with the trousers round their ankles and it's been a frantic
catch-up ever since. A heap of money has been thrown at MT in an attempt
to bridge the gap, and I suspect the bilingual texts, on which statistical
MT depends, are almost all classified documents from unmentionable sources.

>> Jack Halpern is mentioned. Does this mean Japanese is not far behind?

I doubt it. Jack's outfit out at 新座市 just happens to be one of the major
suppliers of industrial-strength lexicons. He was telling me over
lunch earlier this year that they do most of their work outside Japanese.

The problem with Japanese and statistical MT is that there are few good
sizeable parallel corpora available. The newspapers are useless for this;
a glance at the English versions of the Asahi and Yomiuri quickly shows
that the English articles are invariably paraphrases of the Japanese
originals with material added and dropped willy-nilly. Useless for MT
training.

>> Interesting article. Interesting approach. Bears watching.

Certainly. It's by far the most promising MT approach to date. All
you JE translators are lucky you are working for clients who for obvious
reasons aren't about to pool the fruits of your labouring. If they did,
.....

and Jon Johanning added:

>> Furthermore, to consider a possible application to the Japanese-
>> English pair, this algorithm proceeds by 8-word segments. But most
>> Japanese patents, etc., contain numerous sentences which one can't
>> make any sense out of until one has gone a dozen or even two dozen
>> words or more. I suppose the algorithm could be expanded in that
>> direction, but it seems that that would slow it down even further.
>> And the bigger the chunks it works with, the more likelihood that it
>> would end up with the usual MT word-salad, it seems to me.

I think the mention of the 8-word segments was in the post-processing
after a mass of possible translations are done for each chunk.

>> Also, how would it divide Japanese texts up into words? That puzzles
>> me with other Japanese MT systems, too. I suppose there is some
>> algorithm for doing it. But there can't be a completely mechanical
>> way of dividing up strings of kanji and kana in exactly the right way
>> every time, it seems to me.

This is not as hard as it seems, and there are several commercial and freeware
systems that do it very well. Google/Yahoo/etc. segment Japanese (and
Chinese, etc.) text before indexing it. If you want to play with a reasonable
freeware Japanese morphological analyzer, look up "Chasen".

Cheers

Jim

--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学
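
To give a feel for the segmentation question discussed above, here is a minimal sketch of dictionary-based longest-match segmentation in Python. This is a deliberately crude technique chosen for illustration; it is not how ChaSen or any search engine is claimed to work, and real morphological analyzers are considerably more sophisticated. The toy lexicon, the function name, and the `max_len` parameter are all assumptions made up for the example.

```python
def longest_match_segment(text, lexicon, max_len=8):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts
    there; if nothing matches, emit a single character and move on.
    """
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or piece in lexicon:
                words.append(piece)
                i += length
                break
    return words


# Hypothetical toy lexicon; a real one holds hundreds of thousands of entries.
TOY_LEXICON = {"機械", "翻訳", "機械翻訳", "記事", "読む"}

print(longest_match_segment("機械翻訳の記事を読む", TOY_LEXICON))
# ['機械翻訳', 'の', '記事', 'を', '読む']
```

The greedy rule also illustrates Jon's worry: whenever the longest match at some position is the wrong one, everything after it can be cut in the wrong place, which is why practical systems weigh alternative segmentations of the whole sentence instead of committing word by word.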

James Sparks

Dec 3, 2006, 1:16:55 PM
to hon...@googlegroups.com
Fred Uleman wrote:
> Carbonell says that "we acknowledge our responsibility" would be a
> better translation than "we declare our responsibility." I disagree.
> If they are proud of what they did, they do not simply acknowledge --
> much less admit -- but proudly declare. Even proclaim.

Maybe, but you might be thinking that the verb declarar is closer to
"to declare" than it really is. While the two verbs map perfectly in
many contexts, they diverge in others. For example, "declararse
inocente" means "to plead not guilty," and I get the feeling that there
is some of that nuance in the original here (although, again, if it was
translated from Arabic, all bets are off). I think "acknowledge" is a
very good choice here.

> So why didn't they use a good Spanish original?

Probably because (1) the reporter, who chose the text, wasn't a
translator and it didn't occur to him that it might be a translation
into Spanish, and (2) it is more dramatic (and therefore interesting to
the average reader) to use a text from Al Qaeda, because the importance
of the translation is not lost on anyone.

> Interesting article. Interesting approach. Bears watching.

Lions and tigers, too? <g> (That's not gratuitous; it is a perfect
example of a mistake a computer would make.)

I found the following to be the most relevant, and frightening, part of
the article:

"The results - according to the company, which hasn't released the data
publicly - sounded at first like a typical MT failure: The output from
the automated system required twice as many human hours to clean up. But
the experiment also showed that cleaning up errors takes only a small
fraction of the time required for the initial human translation. Thus,
even with slightly sloppier first drafts, replacing the initial
translator with a machine cuts the total human-hours of paid work in half."

We translators are now proudly playing our violins in the first chair,
but the day when we move back to second fiddle may not be that far off,
at which point there will be some serious natural selection occurring
among our ranks.
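
The arithmetic in the quoted passage can be checked with a small sketch. The article gives no figures, so the hours below are invented purely for illustration; the only things taken from the quote are that machine output needs twice the cleanup time and that the machine's first draft costs no paid human hours.

```python
def paid_hours(draft_hours, cleanup_hours, machine_draft=False):
    """Total paid human hours for one job.

    With a machine first draft, the human drafting time drops to zero
    but cleanup takes twice as long, per the quoted passage.
    """
    if machine_draft:
        return 2 * cleanup_hours
    return draft_hours + cleanup_hours


# Assumed figures: 9 hours to translate, 3 hours to clean up a human draft.
print(paid_hours(9, 3))                      # 12
print(paid_hours(9, 3, machine_draft=True))  # 6
```

Under these assumptions the total is halved exactly when cleanup of a human draft takes one third as long as the translation itself, and the machine route saves some paid time whenever cleanup is quicker than translating from scratch.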

James Sparks

PS: When I sent this post, I was informed that it contained Unicode,
but while I assume it is in the quote I pasted in, I can't figure out
where it is exactly. Is there any way to find it?
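
One way to locate stray non-ASCII characters in a pasted passage is to scan it character by character, sketched here in Python. The function name and the sample string are hypothetical; the sample is not the text of the post in question.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line, column, character, Unicode name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Hypothetical sample: a curly apostrophe hiding in otherwise plain text.
sample = "plain first line\nthe company\u2019s data"
for hit in find_non_ascii(sample):
    print(hit)
```

Typical culprits in text copied from a web page are curly quotation marks and dashes, which look almost identical to their plain ASCII counterparts on screen.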

Fred Uleman

Dec 3, 2006, 8:06:55 PM
to hon...@googlegroups.com
re acknowledge vs. declare:

Actually, after I sent the note -- and this sort of thing also happens
after I send translations to clients -- I thought "claim" would be
better. They don't acknowledge responsibility for ... They claim the
credit for ...

--
Fred Uleman
