to nltk-...@googlegroups.com
Hi all,
I'm trying to import parallel (bilingual) corpora with NLTK. The only one available in the NLTK collection is Europarl, but it has been removed from the website.
The corpora seem to be encoded in some XML-like format (which I don't know how to deal with), and it would be great to be able to use them in NLTK. I found something useful here: http://moin.delph-in.net/MtRuleExtraction. I haven't played with it yet, but it explains how to use parallel corpora with Anymalign, which is written in Python.
I also found this post in a Google Group by Nitin Madnani that speaks about the issue:
"Technically speaking, we *do* have parallel corpora (without treebanks) in the NLTK data: I added the Europarl parallel corpus and a reader module to NLTK a few months ago. It isn't really available from the nltk.downloader module yet (which is something we need to figure out how to do) but it's available if people are interested.
Steven, may be we should finally add the europarl corpus distribution to the data index.xml page and to nltk.downloader?
Nitin"
Is there any way to import the parallel corpora into NLTK? Can I access the NLTK version of the Europarl parallel corpus?
I'd like to help with this, but I'm not an expert programmer and I can only program in Python. Thanks for the support,
Daniele
Steven Bird
Jan 5, 2013, 3:41:52 PM
to nltk-...@googlegroups.com
Hi Daniele,
There is a Europarl sample distributed with the NLTK corpus collection [1], and you can access it as follows:
>>> from nltk.corpus.europarl_raw import german, english
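For anyone following along, here is a minimal sketch of how the two sample readers might be lined up document by document. It assumes that the sample's file IDs share a stem and differ only in a language suffix (for example `ep-00-01-17.de` and `ep-00-01-17.en`), which is worth confirming with `fileids()` on your installation. `paired_fileids` is a hypothetical helper, not part of NLTK. The raw sample is not guaranteed to be sentence-aligned, so this only pairs documents, not sentences.

```python
import os


def paired_fileids(src_ids, tgt_ids):
    """Pair file IDs from two language readers that share the same stem.

    Assumption: names look like '<stem>.<lang>', e.g. 'ep-00-01-17.de'.
    """
    # Index the target files by their name without the language suffix.
    tgt_by_stem = {os.path.splitext(f)[0]: f for f in tgt_ids}
    pairs = []
    for f in src_ids:
        stem = os.path.splitext(f)[0]
        if stem in tgt_by_stem:
            pairs.append((f, tgt_by_stem[stem]))
    return pairs


# Usage with the NLTK sample (needs nltk and nltk.download('europarl_raw')):
#   from nltk.corpus.europarl_raw import german, english
#   for de_file, en_file in paired_fileids(german.fileids(), english.fileids()):
#       print(de_file, len(german.sents(de_file)),
#             en_file, len(english.sents(en_file)))
```

Comparing the sentence counts per file, as in the commented usage, is a quick way to see how far the two sides are from a one-to-one sentence match before running an aligner.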
There is an implementation of the Gale-Church aligner in nltk_contrib [2].
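To give a feel for what a length-based sentence aligner does, here is a simplified sketch in the spirit of Gale-Church. It is not the nltk_contrib code: the cost is an ad hoc absolute difference in character length plus fixed penalties, not the probabilistic cost of the original algorithm, and the name `align_by_length` and both penalty values are assumptions chosen for illustration.

```python
def align_by_length(src, tgt, skip_penalty=50, merge_penalty=10):
    """Align two sentence lists using character lengths only.

    Returns a list of (src_indices, tgt_indices) groups ("beads").
    """
    # Bead shapes: (sentences taken from src, taken from tgt, fixed penalty).
    beads = [(1, 1, 0), (1, 0, skip_penalty), (0, 1, skip_penalty),
             (2, 1, merge_penalty), (1, 2, merge_penalty)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j] = cheapest way to align src[:i] with tgt[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(len_s - len_t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the chosen beads.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    result.reverse()
    return result
```

Each returned pair lists the source and target sentence indices grouped together, so `([1], [1, 2])` means one source sentence was matched to two target sentences.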
NLTK also includes the Comtrans word-aligned corpus, and you can access it as follows:
>>> from nltk.corpus import comtrans
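As a rough illustration of what the word alignments contain, the sketch below turns a set of (source index, target index) pairs into word pairs. `aligned_word_pairs` is a hypothetical helper, not an NLTK function; the commented usage assumes the Comtrans data is installed, that the corpus includes the file `alignment-de-en.txt`, and that every alignment pair holds two integer indices.

```python
def aligned_word_pairs(words, mots, alignment):
    """Turn (i, j) index pairs into (source word, target word) pairs."""
    return [(words[i], mots[j]) for i, j in sorted(alignment)]


# Usage with NLTK (needs nltk and nltk.download('comtrans')):
#   from nltk.corpus import comtrans
#   sent = comtrans.aligned_sents("alignment-de-en.txt")[0]
#   print(aligned_word_pairs(sent.words, sent.mots, sent.alignment))
```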
Some of my students have implemented some of the IBM word alignment models, and I intend to add these to NLTK in future.
to nltk-...@googlegroups.com
Hi Steven,
Thanks a lot for the quick answer and the info. I'll go through the Europarl sample and look into the aligner. I'll be back with more questions soon.
dani
--
vyvian somaya
Dec 15, 2017, 5:35:02 PM
to nltk-users
Hello,
Could you help me import the Europarl parallel corpora, with German as the source language and English as the target language, using NLTK in Python? Please do reply; I'm having a hard time with my project. Thank you.