importing parallel corpora with NLTK


Dani

Jan 5, 2013, 2:33:21 PM
to nltk-...@googlegroups.com
Hi all,

I am trying to import parallel (bilingual) corpora with NLTK. The only one available in the NLTK collection is Europarl, but it has been removed from the website.

I had a look at:

Europarl: http://www.statmt.org/europarl/
and many others in the OPUS project: http://opus.lingfil.uu.se/

They seem to be encoded in an XML-like format (which I don't know how to deal with), and it would be great to be able to use them in NLTK.
I found something useful here: http://moin.delph-in.net/MtRuleExtraction. I haven't played with it yet, but it explains how to use parallel corpora with Anymalign, which is written in Python.

I also found this post in a Google Group by Nitin Madnani that speaks about the issue:



"Technically speaking, we *do* have parallel corpora (without treebanks) in the NLTK data: I added the Europarl parallel corpus and a reader module to NLTK a few months ago. It isn't really available from the nltk.downloader module yet (which is something we need to figure out how to do) but it's available if people are interested.

Steven, may be we should finally add the europarl corpus distribution to the data index.xml page and to nltk.downloader?

Nitin"

Is there any way to import parallel corpora into NLTK?
Can I access the NLTK version of the Europarl parallel corpus?

I'd like to help with this, but I'm not an expert programmer and I can only program in Python.
Thanks for the support,

Daniele


Steven Bird

Jan 5, 2013, 3:41:52 PM
to nltk-...@googlegroups.com
Hi Daniele,

There is a Europarl sample distributed with the NLTK corpus collection [1], and you can access it as follows:
>>> from nltk.corpus.europarl_raw import german, english
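For German as source and English as target, the two readers can be walked in parallel. The sketch below is only an illustration: `pair_fileids` and `load_europarl_pairs` are hypothetical helper names, it assumes the sample has been installed (for example with `nltk.download('europarl_raw')`), and it assumes matching documents share a file id stem and differ only in the language extension. The sample is paired at the document level only, so sentences still need an aligner.

```python
import os


def pair_fileids(src_ids, tgt_ids):
    """Match file ids that share a stem, ignoring the language extension.

    Assumption: ids look like 'ep-00-01-17.de' and 'ep-00-01-17.en'.
    """
    tgt_by_stem = {os.path.splitext(f)[0]: f for f in tgt_ids}
    pairs = []
    for src in src_ids:
        stem = os.path.splitext(src)[0]
        if stem in tgt_by_stem:
            pairs.append((src, tgt_by_stem[stem]))
    return pairs


def load_europarl_pairs():
    """Return (German sentences, English sentences) per matched document."""
    # Imported here so pair_fileids can be used without the corpus installed.
    from nltk.corpus.europarl_raw import german, english

    pairs = pair_fileids(german.fileids(), english.fileids())
    # Each element pairs two whole documents; sentence counts may differ.
    return [(german.sents(de), english.sents(en)) for de, en in pairs]
```

Calling `load_europarl_pairs()` gives one pair of sentence lists per document, German first and English second.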

There is an implementation of the Gale-Church aligner in nltk_contrib [2].
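As a rough illustration of length-based alignment, here is a sketch that assumes a later NLTK release in which a Gale-Church implementation is available as `nltk.translate.gale_church` (not the nltk_contrib version mentioned above). `align_by_length` is a hypothetical helper name, and character counts are used as the sentence lengths.

```python
from nltk.translate.gale_church import align_blocks


def align_by_length(src_sents, tgt_sents):
    """Pair sentences of one text block using Gale-Church length alignment.

    src_sents and tgt_sents are lists of sentence strings. Only character
    lengths are passed to the aligner; the words themselves are ignored.
    """
    src_lens = [len(s) for s in src_sents]
    tgt_lens = [len(t) for t in tgt_sents]
    # align_blocks returns (source index, target index) pairs. An index can
    # appear in more than one pair when sentences are merged or split.
    links = align_blocks(src_lens, tgt_lens)
    return [(src_sents[i], tgt_sents[j]) for i, j in links]
```

The function expects the sentences of a single block (a paragraph or a document) on each side, in their original order.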

NLTK also includes the Comtrans word-aligned corpus, and you can access it as follows:
>>> from nltk.corpus import comtrans
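To get at the word alignments themselves, something like the following sketch may help. It assumes the corpus has been installed (for example with `nltk.download('comtrans')`) and that the German-English file id is `alignment-de-en.txt`; `aligned_word_pairs` and `comtrans_word_pairs` are hypothetical helper names.

```python
def aligned_word_pairs(words, mots, alignment):
    """Turn (source index, target index) links into (source word, target word)."""
    # Skip links with no target position, then order them by source index.
    links = sorted((i, j) for i, j in alignment if j is not None)
    return [(words[i], mots[j]) for i, j in links]


def comtrans_word_pairs(fileid="alignment-de-en.txt", limit=5):
    """Word pairs for the first `limit` aligned sentences of one Comtrans file."""
    # Imported here so aligned_word_pairs works without the corpus installed.
    from nltk.corpus import comtrans

    result = []
    for sent in comtrans.aligned_sents(fileid)[:limit]:
        # words is the source side, mots the target side.
        result.append(aligned_word_pairs(sent.words, sent.mots, sent.alignment))
    return result
```

Each aligned sentence exposes `words`, `mots` and `alignment`, which is all the first helper needs.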

Some of my students have implemented some of the IBM word alignment models, and I intend to add these to NLTK in future.

-Steven Bird




Daniele Panizza

Jan 8, 2013, 5:44:37 PM
to nltk-...@googlegroups.com
Hi Steven,

Thanks a lot for the quick answer and the info.
I'll go through the Europarl sample and look into the aligner. I'll be back with more questions soon.

dani


vyvian somaya

Dec 15, 2017, 5:35:02 PM
to nltk-users
Hello,
Could you help me import the Europarl parallel corpus, with German as the source language and English as the target language, using NLTK in Python? Please do reply; I'm having a hard time with my project. Thank you.

regards,

vyvian somaya Nellira