Fwd: Corpora to support book translation


Alisa_IPN

Mar 28, 2012, 7:43:57 PM
to nltk-r...@googlegroups.com


---------- Forwarded message ----------
From: Steven Bird <s...@csse.unimelb.edu.au>
Date: Mar 14 2009, 4:02 pm
Subject: Corpora to support book translation
To: nltk-translation


The majority of examples and exercises in the NLTK book use English
corpora.  Ideally, any translation of the book would use corpora in
the same language as the translation, although these may not be
available (or if they're available, not redistributable).  What
minimal corpus collection should we have to provide the basis for
examples and exercises in the translated book?  Here are some
suggested corpora, along with their uses in the translation.

* text collection: tokenization, word frequency distributions,
concordances
* categorized text: conditional frequency distributions, text
classification
* wordlist: spell checking, identifying out-of-vocabulary items (> 2k
entries?)
* POS-tagged text: lexical categories, sequence labelling, tagger
training (10k words?)
* others?
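
As a rough illustration of what the first three kinds of corpus make
possible, here is a minimal sketch in plain Python. It does not use
NLTK's own API; the toy texts, the wordlist, and the helper names
(tokenize, concordance) are all made up for the example, and a real
translation would substitute target-language data.

```python
import re
from collections import Counter

# Hypothetical toy data standing in for a real target-language corpus.
texts = {
    "news": "The council approved the budget. The budget was large.",
    "fiction": "The dragon slept. The knight approached the dragon quietly.",
}
# Hypothetical wordlist; a usable one would have thousands of entries.
wordlist = {"the", "council", "approved", "budget", "was", "large",
            "dragon", "slept", "knight"}


def tokenize(text):
    """Crude tokenizer: lowercased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def concordance(tokens, target, width=2):
    """Return each occurrence of target with `width` tokens of context."""
    return [
        " ".join(tokens[max(0, i - width): i + width + 1])
        for i, tok in enumerate(tokens)
        if tok == target
    ]


# Text collection: word frequency distribution over all texts.
all_tokens = [tok for text in texts.values() for tok in tokenize(text)]
freq = Counter(all_tokens)

# Categorized text: one frequency distribution per category.
cond_freq = {cat: Counter(tokenize(text)) for cat, text in texts.items()}

# Wordlist: tokens not found in the wordlist (out-of-vocabulary items).
oov = sorted(set(all_tokens) - wordlist)

print(freq.most_common(3))
print(cond_freq["news"]["budget"], cond_freq["fiction"]["budget"])
print(concordance(tokenize(texts["fiction"]), "knight", width=1))
print(oov)
```

Even data this small is enough to show each idea, which suggests the
minimum sizes for illustration can be quite modest.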

Less crucial but still very useful:

* lexicon with part-of-speech, morphology, pronunciation, semantic
domains, text frequency
* more richly annotated text
* larger text collections, with a wider variety of classifications
(genre, topic, etc.)
* semantic network (e.g. WordNet)
* others?

For any corpus heavily used by the book, we need to be able to
redistribute the data, and have an NLTK corpus reader.

How realistic is it to obtain data from the first list above, for the
languages where translations are planned?  Are there other kinds of
data we should include on the list?  If POS-tagged data is not
available, how realistic is it to create a small sample by hand (e.g.
1k words with a simple tagset), for use in illustrations?  What would
be the required minimum number of examples and exercises in the target
language?  Presumably this would be more important in the early
chapters.
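
To give a sense of how a small hand-tagged sample might be used, the
sketch below parses text in a word/TAG format and builds a
most-frequent-tag baseline from it. This is only an assumption about
how such a sample could be stored and used: the sample sentence, the
tagset (DET, NOUN, VERB, ADJ, PUNCT), and the function names are
invented for illustration, and the code is plain Python, not NLTK's
tagger classes.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sample in word/TAG format, simple tagset.
sample = ("the/DET dog/NOUN barks/VERB ./PUNCT "
          "the/DET bark/NOUN is/VERB loud/ADJ ./PUNCT")


def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs.

    rsplit is used so that a word containing '/' keeps it.
    """
    return [tuple(tok.rsplit("/", 1)) for tok in text.split()]


def train_baseline(tagged):
    """Map each lowercased word to its most frequent tag in the sample."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_words(words, model, default="NOUN"):
    """Tag words with the baseline; unseen words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


tagged = parse_tagged(sample)
model = train_baseline(tagged)
print(tag_words(["the", "cat", "barks"], model))
```

A baseline like this is weak, but it is enough to illustrate lexical
categories and the idea of tagger training in the early chapters.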

-Steven Bird
