English-Nepali-Urdu Parallel Corpus

Sarmad Hussain

Sep 29, 2008, 3:06:43 PM

to urdu_co...@yahoogroups.com, Urd...@googlegroups.com, pak...@yahoogroups.com, PANLoca...@yahoogroups.com, Regional Secretariat

We are pleased to release Urdu and Nepali corpora parallel to 100,000 words of common English source text from the Penn Treebank corpus, which is available through the Linguistic Data Consortium (LDC). The text files used are listed in the README files provided with each corpus.

As both Urdu and Nepali corpora are parallel to the same English corpus, they are also parallel to each other.

The work was done at CRULP (www.crulp.org, for Urdu) and MPP (www.mpp.org.np, for Nepali), and was supported by the Language Resource Association (GSK) of Japan and the International Development Research Centre (IDRC) of Canada, through the PAN Localization project (www.PANL10n.net).

The POS-tagged version of the corpora will also be released soon. Please visit http://www.crulp.org/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm to access the corpora. Similar parallel corpora for other participating languages will also be made available through the PAN Localization project later.

Eid Mubarak,

Sarmad

CRULP, NUCES (www.crulp.org, www.nu.edu.pk)

Hardie, Andrew

Sep 29, 2008, 9:36:33 PM

to Urd...@googlegroups.com

Dear Sarmad,

This is very exciting news. Could you tell us anything about the alignment of the corpora? For example, is it sentence-level only or also word-level, and was it done manually or automatically?

Many thanks

Andrew.

Andrew Hardie

Linguistics & English Language

Bowland College

Lancaster University

Lancaster LA1 4YT

United Kingdom

www.ling.lancs.ac.uk/staff/hardie

Sarmad Hussain

Sep 30, 2008, 1:58:38 AM

to Urd...@googlegroups.com

The text was translated sentence by sentence, so the corpora are aligned at the sentence level.

Regards,
Sarmad
