A Pronouncing Dictionary of American English, also referred to as Kenyon and Knott, was written by John Samuel Kenyon and Thomas A. Knott and first published by the G. & C. Merriam Company in 1944. It provides a phonemic transcription of General American pronunciations of words, using symbols largely corresponding to those of the IPA. A similar work for English pronunciation is the English Pronouncing Dictionary by Daniel Jones, originally published in 1917 and available in revised editions ever since.[1]
Edward Artin, who succeeded Kenyon as the pronunciation editor of Webster's Dictionary, sought to revise the pronouncing dictionary many years after the publication of Webster's Third (1961), but to no avail: none of the publishers Artin approached, including the Merriam company, thought a new edition of the dictionary would be profitable.[2] Forty years after its publication, the pronouncing dictionary was still, according to linguistics historian Arthur J. Bronstein, the "only major pronouncing dictionary of this century to appear in the United States".[3]
One principal application of Kenyon and Knott's system is teaching American English pronunciation to non-native speakers of English. It is widely used for this purpose in Taiwan, where it is known in Chinese as "KK Phonetic Transcription".
The CMU Pronouncing Dictionary (also known as CMUdict) is an open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research.
CMUdict provides a mapping from English orthographic forms to their North American pronunciations. It is commonly used to generate pronunciation representations for speech recognition (ASR), e.g. in the CMU Sphinx system, and for speech synthesis (TTS), e.g. in the Festival system. CMUdict can also serve as a training corpus for building statistical grapheme-to-phoneme (g2p) models[1] that generate pronunciations for words not yet included in the dictionary.
The database is distributed as a plain text file with one entry per line, each consisting of a word and its pronunciation separated by two spaces. If multiple pronunciations exist for a word, the variants are identified with numbered forms (e.g. WORD(1)). The pronunciation is encoded in a modified form of the ARPABET system, with stress marks of levels 0, 1, and 2 added to vowels. A line-initial ;;; token indicates a comment. A derived format, directly suitable for speech recognition engines, is also part of the distribution; it collapses stress distinctions, which are typically not used in ASR.
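The entry format described above can be parsed with a few lines of Python. The sketch below uses made-up sample lines in the cmudict style (the sample pronunciations are illustrative, not copied from the real dictionary file):

```python
# Minimal sketch: parsing cmudict-style lines into a word -> variants map.
def parse_cmudict_lines(lines):
    """Map each word to a list of its pronunciation variants."""
    entries = {}
    for line in lines:
        if line.startswith(";;;"):        # line-initial ;;; marks a comment
            continue
        word, pron = line.split("  ", 1)  # two-space separator
        word = word.split("(")[0]         # strip variant marker, e.g. WORD(1)
        entries.setdefault(word, []).append(pron.split())
    return entries

sample = [
    ";;; a comment line",
    "TOMATO  T AH0 M EY1 T OW2",
    "TOMATO(1)  T AH0 M AA1 T OW2",
]
print(parse_cmudict_lines(sample)["TOMATO"])
```

Collapsing the numbered variant markers back onto the base word, as above, is what lets all pronunciations of a word be retrieved under a single key.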
NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:
In 1, we showed how you could carry out concordancing of a text such as text1 with the command text1.concordance(). However, this assumes that you are using one of the nine texts obtained as a result of doing from nltk.book import *. Now that you have started examining data from nltk.corpus, as in the previous example, you have to employ the following pair of statements to perform concordancing and other tasks from 1:
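To make concrete what concordancing actually computes, here is a toy stand-in that works on any list of tokens; NLTK's Text.concordance() is more elaborate, but the idea of showing each hit with surrounding context is the same:

```python
# A toy concordance over a list of tokens: each occurrence of the target
# word is shown with a window of context on either side.
def concordance(tokens, target, window=3):
    """Return each occurrence of target with `window` tokens of context."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{tok}] {right}")
    return hits

tokens = "the whale surfaced and the whale dived".split()
for line in concordance(tokens, "whale"):
    print(line)
```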
When we defined emma, we invoked the words() function of the gutenberg object in NLTK's corpus package. But since it is cumbersome to type such long names all the time, Python provides another version of the import statement, as follows:
Let's write a short program to display other information about each text, by looping over all the values of fileid corresponding to the gutenberg file identifiers listed earlier and then computing statistics for each text. For a compact output display, we will round each number to the nearest integer, using round().
This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score). Observe that average word length appears to be a general property of English, since it has a recurrent value of 4. (In fact, the average word length is really 3, not 4, since the num_chars variable counts space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.
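The three statistics can be made concrete by computing them on a toy "corpus"; with real data, the raw text, words, and sentences would come from gutenberg.raw(), gutenberg.words(), and gutenberg.sents():

```python
# Recomputing the three statistics on a toy text. Note that num_chars
# counts spaces too, which is why average word length comes out inflated.
raw = "All happy families are alike. Every unhappy family is unhappy."
sents = [s.split() for s in raw.split(". ")]
words = [w for s in sents for w in s]
num_chars = len(raw)
num_words = len(words)
num_sents = len(sents)
num_vocab = len(set(w.lower().strip(".") for w in words))
print(round(num_chars / num_words),   # average word length
      round(num_words / num_sents),   # average sentence length
      round(num_words / num_vocab))   # lexical diversity
```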
The previous example also showed how we can access the "raw" text of the book, not split up into tokens. The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many letters occur in the text, including the spaces between words. The sents() function divides the text up into its sentences, where each sentence is a list of words:
Although Project Gutenberg contains thousands of books, it represents established literature. It is important to consider less formal language as well. NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:
There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators. The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "UserNNN", and manually edited to remove any other identifying information. The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom). The filename contains the date, chatroom, and number of posts; e.g., 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on 10/19/2006.
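Since the filename encodes the date, chatroom, and post count, that metadata can be recovered by parsing the name; a minimal sketch following the naming scheme described above:

```python
# Sketch: unpacking the metadata encoded in a chat-corpus filename of the
# form MM-DD-room_Nposts.xml, as described in the text.
import re

def parse_chatfile_name(fileid):
    m = re.match(r"(\d+)-(\d+)-(\w+)_(\d+)posts\.xml", fileid)
    month, day, chatroom, posts = m.groups()
    return {"month": int(month), "day": int(day),
            "chatroom": chatroom, "posts": int(posts)}

print(parse_chatfile_name("10-19-20s_706posts.xml"))
```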
The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. 1.1 gives an example of each genre (for a complete list, see the corpus documentation).
The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre. Remember to import nltk before doing the following:
Next, we need to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. These are presented systematically in 2, where we also unpick the following code line by line. For the moment, you can ignore the details and just concentrate on the output.
Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could. Would you have predicted this? The idea that word counts might distinguish genres will be taken up again in chap-data-intensive.
The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called "training" and "test"; thus, the text with fileid 'test/14826' is a document drawn from the test set. This split is for training and testing algorithms that automatically detect the topic of a document, as we will see in chap-data-intensive.
Unlike the Brown Corpus, categories in the Reuters Corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. For convenience, the corpus methods accept a single fileid or a list of fileids.
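The overlapping-category idea amounts to a many-to-many mapping that can be queried in either direction. A toy illustration (the fileids and topic names below are invented, not real Reuters identifiers):

```python
# Overlapping categories as a many-to-many mapping: a document can belong
# to several topics, so we keep an index in both directions.
from collections import defaultdict

doc_topics = {
    "test/0001": ["grain", "wheat"],
    "test/0002": ["trade"],
    "training/0003": ["grain", "trade"],
}
topic_docs = defaultdict(list)
for fileid, topics in doc_topics.items():
    for t in topics:
        topic_docs[t].append(fileid)

print(sorted(topic_docs["grain"]))   # documents in a category
print(doc_topics["training/0003"])   # categories of a document
```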
In 1, we looked at the Inaugural Address Corpus, but treated it as a single text. The graph in fig-inaugural used "word offset" as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address. However, the corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension:
Let's look at how the words America and citizen are used over time. The following code converts the words in the Inaugural corpus to lowercase using w.lower(), then checks whether they start with either of the "targets" america or citizen using startswith(). Thus it will count words like American's and Citizens. We'll learn about conditional frequency distributions in 2; for now just consider the output, shown in 1.1.
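The counting step can be sketched in plain Python on a toy word list; with the real corpus, the word list for each address would come from inaugural.words(fileid):

```python
# Counting words whose lowercased form starts with a target prefix,
# over a small invented word list. Note that "city" does not match
# "citizen", but "Citizens" and "American's" do match their targets.
words = ["America", "American's", "the", "Citizens", "citizen", "city"]
targets = ("america", "citizen")
counts = {t: sum(1 for w in words if w.lower().startswith(t))
          for t in targets}
print(counts)
```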
Many text corpora contain linguistic annotations, representing POS tags, named entities, syntactic structures, semantic roles, and so forth. NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research. 1.2 lists some of the corpora. For information about downloading them, and for more examples of how to access NLTK corpora, please consult the Corpus HOWTO.
The last of these corpora, udhr, contains the Universal Declaration of Human Rights in over 300 languages. The fileids for this corpus include information about the character encoding used in the file, such as UTF8 or Latin1. Let's use a conditional frequency distribution to examine the differences in word lengths for a selection of languages included in the udhr corpus. The output is shown in 1.2 (run the program yourself to see a color plot). Note that True and False are Python's built-in boolean values.
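The word-length comparison boils down to one frequency distribution of lengths per language. A sketch using two invented token lists in place of udhr.words(fileid) (the language labels mimic udhr-style fileids; a real run would plot cumulative curves rather than print counts):

```python
# One Counter of word lengths per language, over invented token lists
# standing in for udhr.words(fileid).
from collections import Counter

samples = {
    "English-Latin1": "all human beings are born free and equal".split(),
    "German_Deutsch-Latin1": "alle Menschen sind frei und gleich geboren".split(),
}
length_dist = {lang: Counter(len(w) for w in words)
               for lang, words in samples.items()}
for lang, dist in length_dist.items():
    print(lang, sorted(dist.items()))
```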
Unfortunately, for many languages, substantial corpora are not yet available. Often there is insufficient government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or re-use. Some languages have no established writing system, or are endangered. (See 7 for suggestions on how to locate language resources.)