Gale-Church sentence alignment

666 views
Skip to first unread message

Christopher Crowner

unread,
Nov 13, 2009, 10:50:48 AM11/13/09
to nltk...@googlegroups.com
Does anyone know if there is a Python implementation of the Gale-Church sentence alignment algorithm available?

In general is there a good source for alignment and other MT algorithms in Python?

Thanks,
Chris Crowner

Nitin Madnani

unread,
Nov 13, 2009, 12:20:03 PM11/13/09
to nltk...@googlegroups.com
Chris,

AFAIK, there is no Python implementation of the Gale-Church algorithm
but it'd be great if there was. I am sure someone has implemented this
in Python somewhere in some MT group in the world but I am not aware
of it. However, there *is* a Perl implementation available as part of
the tools available for processing the raw version of the Europarl
corpus. You can find them here: http://www.statmt.org/europarl/ [look
for the "tools" link]. I have always wanted to port that to Python and
make the algorithm available in NLTK but haven't gotten around to it.

As to your more general question about the availability of alignment
and other MT algorihms in Python: I believe the answer is some but not
a lot. For statistical MT, most systems out there rely on obtaining
word alignments from the unsupervised GIZA++ aligner (written and
released to the community in 1999 by Franz Och, currently at Google).
GIZA++ is written entirely in C++. A more recent word aligner that
incorporates the state-of-the-art research in unsupervised word
alignment is the Berkeley Aligner (http://nlp.cs.berkeley.edu/Main.html#WordAligner
) and is written in Java. The only word aligner in Python I am aware
of is 'anymalign' (previously 'malign') written by Adrien Lardilleux (http://users.info.unicaen.fr/~alardill/anymalign/
). From their RANLP paper (http://users.info.unicaen.fr/~alardill/pub/ranlp09-lardilleux.pdf
), it looks like using word alignments from this aligner in a phrase-
based MT system performs comparably with those obtained from GIZA++
but I have no personal experience with using it.

Moving beyond word alignment, I believe that most state-of-the-art, if
not all, open-source MT decoders are written in either C++ (Moses, http://www.statmt.org/moses/)
or Java (Joshua, http://joshua.sourceforge.net/Joshua/Welcome.html).
David Chiang (http://www.isi.edu/~chiang/) wrote the first
hierarchical phrase-based decoder called "Hiero" in 2005 while he was
at Maryland and coded it pretty much entirely in Python. However,
since then he has ported a lot of it to C (using the Pyrex extension
for Python) but I believe that quite a bit of it remains in Python.
Unfortunately, I don't think he has ever open-sourced Hiero. It may
be worth sending him an email to ask whether he can share the code.
You might also find that he has implemented other parts of the MT
pipeline in Python (like the Gale-Church algorithm).

Hope this helps,
Nitin

Torsten Marek

unread,
Nov 27, 2009, 10:34:55 AM11/27/09
to nltk...@googlegroups.com

Hi Chris,

have a look at

http://www.cl.uzh.ch/kitt/hg/sta/master/file/36c91c797b76/STA/ling/sentence_align.py
http://www.cl.uzh.ch/kitt/hg/sta/master/file/36c91c797b76/STA/ling/char_align.py

It's not a full implementation because it does not handle soft/hard
delimiters, but it wouldn't be too difficult to add those.

The license at the top says GPLv2, but if you need it in another
license, I'll be more than happy to cut that part out of our project and
put it under (for instance) BSD terms---our base library is BSD anyway.


best,


Torsten

--
.: Torsten Marek
.: http://shlomme.diotavelli.net
.: tor...@diotavelli.net -- GnuPG: 1024D/A244C858

signature.asc

Christopher Crowner

unread,
Dec 2, 2009, 8:58:42 AM12/2/09
to nltk...@googlegroups.com
Hi Thorsten,

Thanks very much for the code. From 12 some pages of C code to about 3 of Python - got to love Python!

I ran your code on the turinde.tok and turinen.tok data (attached) and looked at the results. I also ran the hunalign implementation of the Gale_Church C code (which is in fact just the code given in the paper). The results didn't "align" ;-). I spent some time tweaking your code, got a little closer but then just slavishly translated the C code into Python (attached). This matched the outputs (I also matched some Madame Bovary data I have been working with).

So my translation is attached along with the data and at this point I'll defer to you to do with it what you desire. In translating the C I had in mind the closest perhaps "dumbest/simplest" translation - so please excuse the utter lack of sophistication. Please let me know if I can be of any help in what you choose to do (e.g., I can provide you with my tweaking efforts - I spent some tracing the execution and finding where it diverged - could be something simple?)

I am continuing my work in alignment and must say that the work you are doing with the TreeAligner project (along with so much of the European efforts) is of great help and inspiration - thank you!!

Chris
sentence_align.py
turinde.tok
turinen.tok

Steven Bird

unread,
Dec 2, 2009, 3:27:48 PM12/2/09
to nltk-dev
Chris, are you (or if not, is anyone else) interested in doing more
work on this code to get it into NLTK, with guidance from me along the
way?

In short it would need some inline documentation, some doctests, some
minor reorganisation, and some data (preferrably a new corpus in
nltk_data). It would also be good to have some evaluation code.

http://code.google.com/p/nltk/wiki/DevelopersGuide

-Steven Bird

2009/12/3 Christopher Crowner <ccro...@gmail.com>:
> --
>
> You received this message because you are subscribed to the Google Groups
> "nltk-dev" group.
> To post to this group, send email to nltk...@googlegroups.com.
> To unsubscribe from this group, send email to
> nltk-dev+u...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/nltk-dev?hl=en.
>

Torsten Marek

unread,
Dec 2, 2009, 3:58:42 PM12/2/09
to nltk...@googlegroups.com
Hi Chris,


> Thanks very much for the code. From 12 some pages of C code to about 3
> of Python - got to love Python!

> I ran your code on the turinde.tok and turinen.tok data (attached) and
> looked at the results. I also ran the hunalign implementation of the
> Gale_Church C code (which is in fact just the code given in the
> paper). The results didn't "align" ;-). I spent some time tweaking
> your code, got a little closer but then just slavishly translated the
> C code into Python (attached). This matched the outputs (I also
> matched some Madame Bovary data I have been working with).

I remember some differences between the Python code and the C version,
if I remember correctly it was due to numerical issues. The C version
uses scaled negative logarithms (cf. line 256 in your algorithm). I
would assume that the results become the same when you change line 68 in
our code:

return 2 * (1 - norm_cdf(delta)) * params.PRIORS[alignment]

to

return nlog(2 * (1 - norm_cdf(delta))) + nlog(params.PRIORS[alignment])


with

def nlog(x):
return -100 * math.log(x)

and probably fix the class priors to be ints. *Actually*, the original
implementation is the less accurate one in that regard.

> So my translation is attached along with the data and at this point
> I'll defer to you to do with it what you desire. In translating the C
> I had in mind the closest perhaps "dumbest/simplest" translation - so
> please excuse the utter lack of sophistication. Please let me know if
> I can be of any help in what you choose to do (e.g., I can provide you
> with my tweaking efforts - I spent some tracing the execution and
> finding where it diverged - could be something simple?)

As I said, I think it's due to the different weights/probabilities.

> I am continuing my work in alignment and must say that the work you
> are doing with the TreeAligner project (along with so much of the
> European efforts) is of great help and inspiration - thank you!!

I'm glad to hear that! We have more work in parallel treebanks lined up
for next year, including some machine translation projects. It seems
that I have to do some more sentence alignment work to do then, but I'd
like something more sophisticated than Gale/Church, probably already
involving simple translation models.

signature.asc

Torsten Marek

unread,
Dec 2, 2009, 4:03:56 PM12/2/09
to nltk...@googlegroups.com
Hi Steven,


Am Donnerstag, den 03.12.2009, 07:27 +1100 schrieb Steven Bird:
> Chris, are you (or if not, is anyone else) interested in doing more
> work on this code to get it into NLTK, with guidance from me along the
> way?

which code do you mean? The ported C version or the condensed Python
one? I could clean the latter one up and add tests, documentation etc.

> In short it would need some inline documentation, some doctests, some
> minor reorganisation, and some data (preferrably a new corpus in
> nltk_data). It would also be good to have some evaluation code.

Do you have parallel corpora/treebanks in the NLTK data? If you're
interested, I'm sure we could give you some part of SMULTRON (maybe the
sampler which comes with the TreeAligner anyway, which is about 100
sentences German, English and Swedish), although I have to check back
with the department first.

The corpora are in TIGER-XML, but I can contribute the loader for that,
too.

signature.asc

Steven Bird

unread,
Dec 2, 2009, 4:32:26 PM12/2/09
to nltk-dev
2009/12/3 Torsten Marek <shl...@gmx.net>:
> which code do you mean? The ported C version or the condensed Python
> one? I could clean the latter one up and add tests, documentation etc.

Fine -- that would be a very welcome contribution, thanks.

> Do you have parallel corpora/treebanks in the NLTK data?

No.

> If you're
> interested, I'm sure we could give you some part of SMULTRON (maybe the
> sampler which comes with the TreeAligner anyway, which is about 100
> sentences German, English and Swedish), although I have to check back
> with the department first.

Yes please!

> The corpora are in TIGER-XML, but I can contribute the loader for that,
> too.

Sure, but NB: nltk_contrib.tiger

-Steven

Torsten Marek

unread,
Dec 2, 2009, 4:35:50 PM12/2/09
to nltk...@googlegroups.com
> Fine -- that would be a very welcome contribution, thanks.
>
> > Do you have parallel corpora/treebanks in the NLTK data?
>
> No.
>
> > If you're
> > interested, I'm sure we could give you some part of SMULTRON (maybe the
> > sampler which comes with the TreeAligner anyway, which is about 100
> > sentences German, English and Swedish), although I have to check back
> > with the department first.
>
> Yes please!
>
> > The corpora are in TIGER-XML, but I can contribute the loader for that,
> > too.
>
> Sure, but NB: nltk_contrib.tiger

I know. That code also needs some serious updating, given that I spent
the better part of this year improving it for our own releases. This
parser produces its own data structures though; and is contrib, not
core. I remember there was a discussion on a proper graph data structure
as well, but I've lost track of that. Is there some consensus now?

signature.asc

Steven Bird

unread,
Dec 2, 2009, 5:41:07 PM12/2/09
to nltk-dev
2009/12/3 Torsten Marek <shl...@gmx.net>:
> This parser produces its own data structures though; and is contrib, not
> core.

OK -- its a separate issue then.

> I remember there was a discussion on a proper graph data structure
> as well, but I've lost track of that. Is there some consensus now?

I also lost track of that. I wonder about using networkx.

-Steven

Nitin Madnani

unread,
Dec 2, 2009, 7:23:26 PM12/2/09
to nltk...@googlegroups.com
Technically speaking, we *do* have parallel corpora (without treebanks) in the NLTK data: I added the Europarl parallel corpus and a reader module to NLTK a few months ago. It isn't really available from the nltk.downloader module yet (which is something we need to figure out how to do) but it's available if people are interested.

Steven, may be we should finally add the europarl corpus distribution to the data index.xml page and to nltk.downloader?

Nitin
Reply all
Reply to author
Forward
0 new messages