Hi Chris,
have a look at
http://www.cl.uzh.ch/kitt/hg/sta/master/file/36c91c797b76/STA/ling/sentence_align.py
http://www.cl.uzh.ch/kitt/hg/sta/master/file/36c91c797b76/STA/ling/char_align.py
It's not a full implementation because it does not handle soft/hard
delimiters, but it wouldn't be too difficult to add those.
The license at the top says GPLv2, but if you need it in another
license, I'll be more than happy to cut that part out of our project and
put it under (for instance) BSD terms---our base library is BSD anyway.
best,
Torsten
--
.: Torsten Marek
.: http://shlomme.diotavelli.net
.: tor...@diotavelli.net -- GnuPG: 1024D/A244C858
> Thanks very much for the code. From some 12 pages of C code to about 3
> of Python - got to love Python!
> I ran your code on the turinde.tok and turinen.tok data (attached) and
> looked at the results. I also ran the hunalign implementation of the
> Gale_Church C code (which is in fact just the code given in the
> paper). The results didn't "align" ;-). I spent some time tweaking
> your code, got a little closer but then just slavishly translated the
> C code into Python (attached). This matched the outputs (I also
> matched some Madame Bovary data I have been working with).
I remember some differences between the Python code and the C version;
if I recall correctly, they were due to numerical issues. The C version
uses scaled negative logarithms (cf. line 256 in your algorithm). I
would assume that the results become the same when you change line 68 in
our code:

    return 2 * (1 - norm_cdf(delta)) * params.PRIORS[alignment]

to

    return nlog(2 * (1 - norm_cdf(delta))) + nlog(params.PRIORS[alignment])

with

    def nlog(x):
        return -100 * math.log(x)

and probably fix the class priors to be ints. *Actually*, the original
implementation is the less accurate one in that regard.
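
To make the change concrete, here is a minimal, self-contained sketch of
the two scoring variants side by side. Only nlog and the two return
expressions come from the snippets above; the norm_cdf helper, the PRIORS
table and its values, and the function names prob_score and nlog_cost are
stand-ins invented for illustration, not the ones in sentence_align.py.

```python
import math

# Illustrative priors keyed by (source sentences, target sentences).
# These values are placeholders for the sketch, not the project's own.
PRIORS = {(1, 1): 0.89, (2, 1): 0.045, (1, 2): 0.045}


def norm_cdf(z):
    """Standard normal CDF, computed with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def nlog(x):
    """Scaled negative logarithm, as in the snippet above."""
    return -100 * math.log(x)


def prob_score(delta, alignment):
    """Probability-style score: larger means a better match."""
    return 2 * (1 - norm_cdf(delta)) * PRIORS[alignment]


def nlog_cost(delta, alignment):
    """Negative-log cost: smaller means a better match."""
    return nlog(2 * (1 - norm_cdf(delta))) + nlog(PRIORS[alignment])


if __name__ == "__main__":
    # delta is assumed to be non-negative and moderate in size here;
    # for very large values 1 - norm_cdf(delta) underflows to 0.0 and
    # math.log raises ValueError, so real code would need a guard.
    for alignment in PRIORS:
        print(alignment, prob_score(0.5, alignment), nlog_cost(0.5, alignment))
```

Since log(a * b) = log(a) + log(b), nlog_cost is simply nlog applied to
prob_score, so both variants rank alignments identically as long as
everything stays in floating point. The direction flips, though: products
of probabilities are maximised, while sums of negative-log costs are
minimised. Rounding the costs to ints is where small differences between
two implementations can creep in.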
> So my translation is attached along with the data and at this point
> I'll defer to you to do with it what you desire. In translating the C
> I had in mind the closest perhaps "dumbest/simplest" translation - so
> please excuse the utter lack of sophistication. Please let me know if
> I can be of any help in what you choose to do (e.g., I can provide you
> with my tweaking efforts - I spent some time tracing the execution and
> finding where it diverged - could be something simple?)
As I said, I think it's due to the different weights/probabilities.
> I am continuing my work in alignment and must say that the work you
> are doing with the TreeAligner project (along with so much of the
> European efforts) is of great help and inspiration - thank you!!
I'm glad to hear that! We have more work on parallel treebanks lined up
for next year, including some machine translation projects. It seems
that I'll have to do some more sentence alignment work then, but I'd
like something more sophisticated than Gale/Church, probably already
involving simple translation models.
On Thursday, 03.12.2009, 07:27 +1100, Steven Bird wrote:
> Chris, are you (or if not, is anyone else) interested in doing more
> work on this code to get it into NLTK, with guidance from me along the
> way?
Which code do you mean? The ported C version or the condensed Python
one? I could clean up the latter and add tests, documentation, etc.
> In short it would need some inline documentation, some doctests, some
> minor reorganisation, and some data (preferably a new corpus in
> nltk_data). It would also be good to have some evaluation code.
Do you have parallel corpora/treebanks in the NLTK data? If you're
interested, I'm sure we could give you some part of SMULTRON (maybe the
sampler that comes with the TreeAligner anyway, about 100 sentences in
German, English and Swedish), although I have to check back with the
department first.
The corpora are in TIGER-XML, but I can contribute the loader for that,
too.
I know. That code also needs some serious updating, given that I spent
the better part of this year improving it for our own releases. This
parser produces its own data structures, though, and is contrib, not
core. I remember there was a discussion about a proper graph data
structure as well, but I've lost track of that. Is there some consensus now?