Including Europarl into NLTK

Nitin Madnani

Jun 24, 2009, 11:58:33 PM
to nltk...@googlegroups.com
Hi All,

I have been using Europarl for a while now for my MT work and I
finally have some time to try and add it as a corpus to the NLTK data
distribution. I have obtained permission to do so from Philipp Koehn
(who was very generous). However, I need some suggestions as to how
best to implement such an inclusion since this will be the first
multilingual corpus in NLTK. The CESS treebank corpora are
multilingual; however, they are actually included as disparate
corpora: one for Spanish (cess_esp) and one for Catalan (cess_cat). I
would prefer to include the entire Europarl collection as one corpus.

First, a brief introduction to the Europarl collection
(http://www.statmt.org/europarl): there are 11 separate corpora in the
collection, one for each of the following European languages:

Danish
German
Greek
English
Spanish
Finnish
French
Italian
Dutch
Portuguese
Swedish

Each language's corpus is split into several text files, one for each
day of the European Parliament proceedings. However, not every day's
proceedings are available in every language. In addition, the files
for the same day in different languages are not sentence-aligned. For
use in MT, 10 separate sentence-aligned bitexts have been created from
this raw data, pairing each of the other languages with English. So,
there is a Danish-English bitext, a German-English bitext, and so on.

So, the question now arises as to how to incorporate the
multilinguality of the corpus into NLTK. There are several possible
ways that I can see:

(a) Just have a separate corpus for each language with no relationship
between languages. So, such a corpus would be used as follows:

>>> from nltk.corpus import europarl
>>> ensents = europarl.sents('english') # English sentences
>>> frwords = europarl.words('french') # French words ... and so on

This means a language argument would always be required. This option
may be the easiest to implement, but it is not as useful for true
multilingual processing (processing two or more languages at the same
time) as (b) or (c) below, and it is only marginally different from
how we include the CESS corpora.
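To make option (a) concrete, here is a minimal, self-contained sketch
of what a language-keyed corpus object might look like. The class
name, method names, and inline sample data are all illustrative
assumptions, not the actual NLTK reader API:

```python
# Sketch of option (a): one corpus object, dispatching on a language
# argument. Sample data and names are hypothetical.

class EuroparlRaw:
    def __init__(self, data):
        # data maps language name -> list of sentences,
        # where each sentence is a list of tokens
        self._data = data

    def sents(self, language):
        # Return the tokenized sentences for one language.
        return self._data[language]

    def words(self, language):
        # Flatten that language's sentences into one token stream.
        return [w for sent in self._data[language] for w in sent]

sample = {
    "english": [["Resumption", "of", "the", "session"]],
    "french": [["Reprise", "de", "la", "session"]],
}
europarl = EuroparlRaw(sample)
print(europarl.words("english"))  # ['Resumption', 'of', 'the', 'session']
```

There is no linkage between languages here; each call touches a single
language's data, which is what makes this option simple to implement.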

(b) Cross-align all 11 languages to each other using the existing
X-English bitexts, and then have a corpus that is linked at the
sentence as well as the document level (so the 'sents()' method would
return a sentence in each language). However, the problem here is
that, because we are inferring an 11-way alignment by pivoting through
English via the 10 pairwise X-English alignments, some sentences will
not have parallel counterparts in all the languages. We could throw
away those sentences and keep only the ones that are fully aligned;
however, removing sentences would hurt document coherence.
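As a sketch of the pivoting idea in (b), here is a small,
self-contained example that infers a multi-way alignment from
hypothetical X-English bitexts, keeping only the English sentences
present in every bitext. The data structures and function name are
assumptions for illustration, not the real Europarl file format:

```python
# Sketch of option (b): pivot through English. Each bitext maps an
# English sentence to its foreign counterpart; only English sentences
# found in every bitext survive the intersection.

def align_via_english(bitexts):
    """bitexts: dict mapping language -> {english_sent: foreign_sent}.
    Returns a list of dicts, one per fully aligned English sentence."""
    common = set.intersection(*(set(b) for b in bitexts.values()))
    aligned = []
    for en in sorted(common):
        row = {"english": en}
        for lang, bitext in bitexts.items():
            row[lang] = bitext[en]
        aligned.append(row)
    return aligned

bitexts = {
    "french": {"I agree .": "Je suis d'accord .", "Thank you .": "Merci ."},
    "german": {"I agree .": "Ich stimme zu ."},
}
# "Thank you ." has no German counterpart, so only "I agree ." is kept.
print(align_via_english(bitexts))
```

Dropping "Thank you ." here illustrates exactly the coherence worry:
the intersection shrinks as more languages are pivoted in.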

(c) Same as (b), but rather than throwing out sentences that are not
fully aligned, keep them. This would mean that, for some sentences,
the 'sents()' method would return a list containing one or more empty
strings (representing languages in which there was no parallel
counterpart for that sentence). This would certainly make a larger
portion of the corpus available in NLTK than (b).
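Option (c) can be sketched the same way, except that every English
sentence is kept and missing counterparts are filled with empty
strings. Again, the function name and sample data are purely
illustrative:

```python
# Sketch of option (c): pivot through English, but take the union of
# English sentences and fill alignment gaps with "".

def align_with_gaps(bitexts):
    """bitexts: dict mapping language -> {english_sent: foreign_sent}.
    Returns one row per English sentence seen in any bitext."""
    all_en = set.union(*(set(b) for b in bitexts.values()))
    aligned = []
    for en in sorted(all_en):
        row = {"english": en}
        for lang, bitext in bitexts.items():
            # "" marks a language with no counterpart for this sentence.
            row[lang] = bitext.get(en, "")
        aligned.append(row)
    return aligned

bitexts = {
    "french": {"I agree .": "Je suis d'accord .", "Thank you .": "Merci ."},
    "german": {"I agree .": "Ich stimme zu ."},
}
for row in align_with_gaps(bitexts):
    print(row)
```

Compared with the intersection in (b), the union keeps the full
document in every language, at the cost of callers having to handle
the empty-string gaps.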

Of course, there may be a better way of doing it that I haven't
thought of. Comments and suggestions welcome.

Thanks,
Nitin


Nitin Madnani

Jun 29, 2009, 4:47:42 PM
to nltk-dev
I have been thinking more about this and I think it might be best to
include two different versions of Europarl:

(1) 'europarl_raw': This is option (a) above: a separate Europarl
corpus for each language, with no links between the languages.

(2) 'europarl_aligned': A sentence-aligned version of Europarl, as in
(b) above. This would mean throwing out some data and including only
the sentences that can be aligned across all 11 languages.

I think this dichotomy will be useful because some people will just
want large amounts of data in a single European language, without
needing any parallel data in a second language. For these people, it's
more important that the full corpus be available in the desired
language, and that is what (1) provides.

Again, comments and suggestions are welcome!

Nitin Madnani

Jun 30, 2009, 11:04:02 PM
to nltk-dev
I have made progress on this front and have filed an issue so that
progress can be tracked as I continue. You can find the issue here:

http://code.google.com/p/nltk/issues/detail?id=415&sort=-id

Please provide any comments there rather than in this thread so that I
don't have to track two different threads.