We are pleased to release Urdu and Nepali corpora parallel to 100,000 words of common English source text from the Penn Treebank corpus, available through the Linguistic Data Consortium (LDC). The text files used are listed in the README file provided with each corpus.
As both Urdu and Nepali corpora are parallel to the same English corpus, they are also parallel to each other.
The work was done at CRULP (www.crulp.org) for Urdu and at MPP (www.mpp.org.np) for Nepali, and was supported by the Language Resource Association (GSK) of Japan and the International Development Research Centre (IDRC) of Canada, through the PAN Localization project (www.PANL10n.net).
A POS-tagged version of the corpora will also be released soon. Please visit http://www.crulp.org/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm to access the corpora. Similar parallel corpora for other participating languages will also be made available through the PAN Localization project later.
Eid Mubarak,
Sarmad
CRULP, NUCES (www.crulp.org, www.nu.edu.pk)
Dear Sarmad,
This is very exciting news. Could you tell us anything about the alignment of the corpora? For instance, is it sentence-level only or also word-level, and was it applied manually or automatically?
Many thanks
Andrew.
Andrew Hardie
Linguistics & English Language
Bowland College
Lancaster University
Lancaster LA1 4YT
United Kingdom
The text has been translated sentence by sentence, and is thus aligned at the sentence level.
Regards,
Sarmad