MMM in the Tool Kit

Ignacio Garcia

Nov 10, 2010, 5:11:52 AM
to Moses.for.Mere.Mortals, i.ga...@uws.edu.au

Jost Zetzsche publishes a widely read Tool Kit for professional
translators dealing with technical matters. In its latest issue, the
178th, he referred to MMM:

+++
Achim Ruopp brought to my attention Moses for Mere Mortals, a
specially packaged and presumably much, much easier-to-use version of
the open-source statistical MT Moses monster. So in theory, it's a
free MT engine that is easy enough for the freelance translator or
smallish LSP to use and feed with their own TM data and come out with
some kind of results for specialized MT purposes. Again, I have no
experience with it and would love some feedback for that as well.
+++

I've pasted below the feedback I sent to him responding to that:

+++
Hi Jost
Since you asked…
I’ve been using Moses for Mere Mortals these last few weeks – not for
translating, though, but for research/training-related reasons.
I first tried to install Moses from the step-by-step guide on
statmt.org and ran into all sorts of problems (the guide is good, but
my Linux literacy is limited). With much less effort I was able to put
Moses together by following the quite clear MMM guides and gathering
some available data from the statmt.org site. And it works! Although
not as well as Google Translate - yet.
MMM also has some Windows add-ins that export Moses output to TMX
and import TMX into the format Moses reads so, in theory, yes, it can
easily be integrated into our TenT workflow.
Would it be really useful, right now, for freelance translators?
Perhaps for those who have collected over the years (or are able to
access) BIG memories closely related to the type of texts they still
translate. For me and for the few freelancers I’m in contact with,
that’s probably not the case.
But even if you had the memories, my guess now is that it would take
too many hours to fine-tune the system(s) – finding out what the right
proportion of “general” versus “in-domain” data should be – to get
results that are significantly better than those of the free online
systems.
Having said that, I loved MMM and I’m sure the future of technical
translation will end up integrating TM (ok, TenTs) and MT in that way.
But for the freelancer (as against corporations and big LSPs) to
benefit, clean “general” and “domain-tagged” monolingual and bilingual
data need to be made available first. And for this, projects such as
those of the TAUS Data Association need to succeed.
Cheers
+++
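
For readers wondering what "the format Moses reads" means in the letter
above: Moses trains on two plain-text files, one segment per line, with
line N of the source file aligned to line N of the target file. The
following is only a minimal sketch of a TMX import step under that
assumption. It is not the code of the MMM add-ins; the function name
tmx_to_parallel and the sample data are hypothetical, language codes
are matched exactly (so "en" does not match "en-US"), and any text
inside inline markup is simply kept as plain text.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: one complete translation unit and one with no target.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Press the start button.</seg></tuv>
<tuv xml:lang="pt"><seg>Prima o botão de início.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Only English here.</seg></tuv></tu>
</body></tmx>"""


def tmx_to_parallel(tmx_text, src_lang, tgt_lang):
    """Return (source_lines, target_lines), aligned by position.

    Translation units that lack either language are skipped so the
    two lists always have the same length.
    """
    root = ET.fromstring(tmx_text)
    src_lines, tgt_lines = [], []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # Newer TMX files use xml:lang, older ones a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Collapse whitespace so each segment fits on one line.
                segs[lang] = " ".join("".join(seg.itertext()).split())
        src = segs.get(src_lang.lower())
        tgt = segs.get(tgt_lang.lower())
        if src and tgt:
            src_lines.append(src)
            tgt_lines.append(tgt)
    return src_lines, tgt_lines


if __name__ == "__main__":
    src, tgt = tmx_to_parallel(SAMPLE, "en", "pt")
    for s, t in zip(src, tgt):
        print(s, "|||", t)
```

Writing each returned list to its own file, one segment per line, gives
the pair of aligned files described above.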

Just wanted to let you know that you've done great work there. I'm
now trying to get some funding myself to see if I can put MMM to some
actual use.

All the best.

Ignacio


Moses for Mere Mortals

Nov 10, 2010, 8:53:28 AM
to mosesform...@googlegroups.com
Hi, Ignacio,

I'm glad that you like this work. We had the same difficulties you had
when we started using Moses (in fact, they were so big that the idea of
MMM was born ;-) ). You will need a good number of aligned segments (in
our case, 14 million, but we already got very good results with just 6
million). Since Moses uses "phrases" (that is, arbitrary and sometimes
very short sequences of words - frequently just one or two - that have
nothing to do with our notion of grammar), I would not be too particular
in my corpus choices (though, if you want it to translate reasonably,
you'll have to find a way to put representative vocabulary in it, and
I'd also choose text styles suited to the task at hand). Some languages
are much more forgiving than others (first and foremost English, but
also the Romance languages). Others that are morphologically richer (DE,
FI and so on) might not give results as good.

I agree with your impression that the synergy between MT and TM is
important.

Thanks for your input,

João Rosas

Tom Hoar

Nov 14, 2010, 11:05:24 PM
to mosesform...@googlegroups.com
Hi João,

I couldn't agree with you more. In my opinion, the three most important aspects of a training corpus, in order of priority, are: 1) matching the data to the purpose, 2) preparing the data for training, and 3) growing the data to an effective size.
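
To make point 2 a little more concrete: preparing parallel data usually includes a cleaning pass that drops segment pairs likely to hurt training. The following is only a minimal sketch of such a pass, assuming whitespace tokenization. The function name clean_parallel and the thresholds are illustrative assumptions and do not come from DoMY or MMM.

```python
def clean_parallel(pairs, min_len=1, max_len=80, max_ratio=9.0):
    """Filter (source, target) segment pairs before training.

    A pair is dropped when either side has fewer than min_len or more
    than max_len whitespace-separated tokens, or when one side is more
    than max_ratio times longer than the other. All limits here are
    illustrative defaults, not recommended values.
    """
    kept = []
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        if not (min_len <= len(s) <= max_len):
            continue
        if not (min_len <= len(t) <= max_len):
            continue
        shorter = max(1, min(len(s), len(t)))  # guards against division by zero
        if max(len(s), len(t)) / shorter > max_ratio:
            continue
        # Re-join tokens so stray whitespace is normalized in the output.
        kept.append((" ".join(s), " ".join(t)))
    return kept


if __name__ == "__main__":
    sample = [
        ("hello   world", "olá mundo"),   # kept, whitespace normalized
        ("", "vazio"),                    # dropped: empty source
        ("a", "b " * 20),                 # dropped: length ratio too high
    ]
    for src, tgt in clean_parallel(sample):
        print(src, "|||", tgt)
```

Filtering whole pairs rather than individual lines keeps the source and target sides aligned.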

We're preparing to announce the BETA release of the Do Moses Yourself (DoMY) product. With today's updates (v 1.31), DoMY covers almost all of the Linux functionality in MMM. Ubuntu PPAs not only automate installation but also let users update to the newest version automatically via the Update Manager. Shell scripts train new models and support all LM types (including KenLM and SRILM, if manually installed outside the PPA). The Recaser can use any LM type. DoMY still lacks scripts to move models to different hosts.

DoMY also includes an embedded version of Corpus Filtergraph. It prepares training data and runs the translation process. Eventually, it will replace all shell scripts. Our W

Our documentation needs a LOT of work, but I think the README file is a good start and each script has a good --help function.

FYI, since you last tested it, we added a step that builds a dedicated LM corpus, yielding a 37,500-segment TM and a 47,500-phrase LM from the exact same sample data. This one enhancement boosted BLEU scores on the sample data from 0.25 to 0.36, although the README file still only reports 0.25.

MMM reaches into the Moses internals in ways that DoMY does not yet address: it lets users edit the MMM scripts to change almost every configurable option, and it places the source/binary files in an accessible location. So, MMM is a better solution for researchers and individuals interested in digging deep.

DoMY users edit config files instead of the scripts, which do not yet support the depth of options in MMM. So, it is focused on individuals and commercial users who want ease and convenience.

We're now looking for BETA testers. If anyone on this forum is interested, please contact me.

Tom
Managing Director
Precision Translation Tools Co., Ltd.
tah...@precisiontranslationtools.com
tah...@gmail.com



Tom Hoar

Nov 14, 2010, 11:07:27 PM
to mosesform...@googlegroups.com
Sorry, I accidentally hit a keyboard sequence that sent the message. I was going to add that "Our Windows functionality will come from Corpus Filtergraph, which is written in Python and runs on Windows quite well".

Regards,
Tom
