N-grams probability and manipulation


MAJED ALJEFRI

Aug 11, 2014, 9:03:20 AM
to nltk-...@googlegroups.com
Hi everyone,

I am developing a spell checker for Arabic text, but I am new to Python.
I want to know how to access the probability of a specific n-gram, say a trigram.
For example:
for w1, w2, w3 in nltk.trigrams(words):
    print w1, w2, w3

Now I want to look up a specific trigram, or find the probability of a specific trigram,
and also find all trigrams that start with w1 and w2.

I also found that people used to build language models with NLTK's NgramModel, but it no longer seems to be there.


Your help is really appreciated

Denzil Correa

Aug 11, 2014, 11:08:23 AM
to nltk-...@googlegroups.com
Have a look at the nltk.probability module. Here's an example; you can extend it to your own use case with FreqDist and a suitable ProbDist subclass.


>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> words = gutenberg.words('shakespeare-hamlet.txt')
>>> print len(words)
37360
>>> words[:100]
['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', ...]
>>> from nltk import FreqDist
>>> fd1 = FreqDist(words)
>>> fd1
<FreqDist with 5447 samples and 37360 outcomes>
>>> from nltk import SimpleGoodTuringProbDist
>>> p = SimpleGoodTuringProbDist(fd1)
>>> p.prob('The')
0.0035236862572628346
>>> p.prob('thou')
0.0024574084004526232
>>> p.prob('fail')
0.09138115631691648
>>> p.prob('love')
0.09138115631691648
>>> 
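The example above estimates unigram probabilities. For the trigram questions (the probability of one specific trigram, and all trigrams that start with w1 and w2), here is a minimal sketch that uses only the Python 3 standard library, so it does not depend on any particular NLTK version. The helper names (build_trigram_counts, trigram_prob, continuations) and the toy sentence are hypothetical and chosen for illustration. The estimate is the plain unsmoothed relative frequency, which gives unseen trigrams a probability of zero; a real spell checker would probably want smoothing on top of this.

```python
from collections import Counter, defaultdict


def trigrams(tokens):
    """Return consecutive (w1, w2, w3) triples from a list of tokens."""
    return zip(tokens, tokens[1:], tokens[2:])


def build_trigram_counts(tokens):
    """Map each (w1, w2) context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in trigrams(tokens):
        counts[(w1, w2)][w3] += 1
    return counts


def trigram_prob(counts, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); 0.0 if the context was never seen."""
    followers = counts.get((w1, w2))
    if not followers:
        return 0.0
    return followers[w3] / sum(followers.values())


def continuations(counts, w1, w2):
    """All observed trigrams that start with (w1, w2), in sorted order."""
    followers = counts.get((w1, w2), Counter())
    return sorted((w1, w2, w3) for w3 in followers)


# Hypothetical toy data; in practice this would be your corpus tokens.
words = "the cat sat on the mat and the cat sat down while the cat ran".split()
counts = build_trigram_counts(words)

print(trigram_prob(counts, "the", "cat", "sat"))  # "sat" follows "the cat" in 2 of 3 cases
print(continuations(counts, "the", "cat"))
```

Storing the counts per (w1, w2) context makes both lookups a single dictionary access, which is the same idea as a conditional frequency distribution keyed on the first two words.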



Regards,
Denzil




Shan Khan

Jan 22, 2017, 8:18:12 AM
to nltk-users
This approach is outdated as of the latest version. Could you please provide an updated way to calculate n-grams?