Specifying Stress Level with nltk.corpus.cmudict.dict

AaronF

Dec 5, 2012, 5:09:03 PM

to nltk-...@googlegroups.com

Hi,

I am new to NLTK and Python, but I have a very specific question about getting a list of phonemes from a corpus using nltk.corpus.cmudict.dict. I have a corpus that I have already tokenized, and I would like to turn it into phonemes using the CMU Dictionary, but I would like the resulting phonemes for each token to correspond only to stress level 1 from the CMU Dictionary. I have noticed that if I run prondict[] on my corpus, it gives me a list of phonemes for each stress level for each token, instead of just for a specific stress level. Is there an easy way to do this with NLTK?

Thanks

Joe

Dec 6, 2012, 1:05:28 AM

to nltk-...@googlegroups.com

This description seems kind of ambiguous. Can you provide a couple of simple examples of the output that you are expecting/trying to obtain, versus the incorrect/unexpected output that you are currently seeing? That would probably make it easier for people to answer your question.

AaronF

Dec 6, 2012, 9:45:36 AM

to nltk-...@googlegroups.com

For example, if in Python I type prondict = nltk.corpus.cmudict.dict() and then type prondict['anointed'], the results are [['AH0', 'N', 'OY1', 'N', 'T', 'AH0', 'D'], ['AH0', 'N', 'OY1', 'N', 'T', 'IH0', 'D'], ['AH0', 'N', 'OY1', 'N', 'AH0', 'D'], ['AH0', 'N', 'OY1', 'N', 'IH0', 'D']]. However, I only want ['AH0', 'N', 'OY1', 'N', 'T', 'IH0', 'D'] as the result. My corpus is very large, so I would like to avoid making the distinction on a token-by-token basis if possible, and I was curious whether anyone else has done this before with NLTK.

Joe

Dec 6, 2012, 4:41:57 PM

to nltk-...@googlegroups.com

Hi,

I guess you should be able to get a specific pronunciation by indexing into the list of pronunciations for that word,

prondict['anointed'][1]

although in general it would probably be simplest to just use the first pronunciation, prondict['anointed'][0]. The pronunciation order should (I think) be the same as in the cmudict source:

https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/
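To apply the same index across a whole tokenized corpus rather than one token at a time, something like the following might work. This is only a sketch: the helper name pick_pronunciation and its fallback behavior are my own assumptions, not part of NLTK, and the small sample_prondict stands in for nltk.corpus.cmudict.dict() so the snippet runs without the corpus installed.

```python
# Stand-in for nltk.corpus.cmudict.dict(); the single entry is copied
# from the output quoted earlier in this thread.
sample_prondict = {
    'anointed': [
        ['AH0', 'N', 'OY1', 'N', 'T', 'AH0', 'D'],
        ['AH0', 'N', 'OY1', 'N', 'T', 'IH0', 'D'],
        ['AH0', 'N', 'OY1', 'N', 'AH0', 'D'],
        ['AH0', 'N', 'OY1', 'N', 'IH0', 'D'],
    ],
}


def pick_pronunciation(token, prondict, index=0):
    """Return one pronunciation for token, or None if the word is unknown.

    Hypothetical helper: falls back to the first pronunciation when the
    requested index does not exist for this word.
    """
    # The dictionary keys are lowercase, so normalize the token first.
    variants = prondict.get(token.lower())
    if not variants:
        return None
    if index < len(variants):
        return variants[index]
    return variants[0]


tokens = ['Anointed', 'xyzzy']
phonemes = [pick_pronunciation(t, sample_prondict, index=1) for t in tokens]
print(phonemes)
```

Words missing from the dictionary come back as None here, so you can decide separately how to handle them.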

It's still not entirely clear to me what you mean about the stress level though, as your desired result still contains two different stress markers (0 and 1).
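For reference, the stress marker is the digit at the end of each vowel phoneme (0 for no stress, 1 for primary stress, 2 for secondary stress); consonants carry no digit. A rough sketch of how one might pull those markers out, or keep only the vowels with a given stress level, is below. The function names are made up for illustration and are not NLTK functions.

```python
def stress_pattern(pron):
    """Return the stress digits of a pronunciation, in order."""
    # Only vowel phonemes end in a digit; consonants are skipped.
    return [int(p[-1]) for p in pron if p[-1].isdigit()]


def phonemes_with_stress(pron, level):
    """Keep only the vowel phonemes whose stress digit equals level."""
    return [p for p in pron if p.endswith(str(level))]


pron = ['AH0', 'N', 'OY1', 'N', 'T', 'IH0', 'D']
print(stress_pattern(pron))           # [0, 1, 0]
print(phonemes_with_stress(pron, 1))  # ['OY1']
```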

Best

AaronF

Dec 7, 2012, 10:24:21 AM

to nltk-...@googlegroups.com

Thank you for your reply; this is what I was looking for.
