score = (pab - min_count) / pa / pb * len(vocab)
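For comparison, here is a minimal sketch of the quoted score next to a log-based PMI computed from the same raw counts. The function names and the `total` argument are my own, not part of gensim. The two only line up when min_count is 0 and the scaling factor equals the total count used to normalise the probabilities. Whether len(vocab) is a suitable stand-in for that total is an assumption to check against your corpus.

```python
from math import log2


def phrases_score(pa, pb, pab, min_count, vocab_size):
    """Score exactly as quoted: (pab - min_count) / pa / pb * vocab_size."""
    return (pab - min_count) / pa / pb * vocab_size


def pmi(pa, pb, pab, total):
    """Base-2 PMI from raw counts: log2(P(a, b) / (P(a) * P(b)))."""
    return log2((pab / total) / ((pa / total) * (pb / total)))


# Hypothetical counts, for illustration only.
score = phrases_score(pa=10, pb=20, pab=5, min_count=0, vocab_size=1000)
print(score, log2(score), pmi(pa=10, pb=20, pab=5, total=1000))
```

With these counts the log of the score and the PMI agree, since min_count is 0 and vocab_size equals total.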
1) To calculate PMI, the 'export_phrases' method is convenient, because the formula you wrote gives the PMI value of co-occurring words (as defined by Christopher Manning & Hinrich Schütze, 1999, chapter 5.4, 'Mutual Information').
2) To get a window size of, let's say, 5 words, I first need to preprocess the raw text by splitting it into lines of 5 words each.
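The preprocessing in point 2 could be sketched roughly as follows. `chunk_tokens` is a hypothetical helper name, and plain whitespace tokenisation is an assumption. Note that non-overlapping chunks miss co-occurrences that straddle a chunk boundary.

```python
def chunk_tokens(tokens, size=5):
    """Yield consecutive, non-overlapping chunks of at most `size` tokens."""
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]


text = "a b c d e b g a h"
# One 5-word "sentence" per line; the last line may be shorter.
lines = [" ".join(chunk) for chunk in chunk_tokens(text.split())]
print(lines)  # ['a b c d e', 'b g a h']
```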
from collections import Counter
from math import log


def gen_bigrams(data, window_size=5):
    for idx in range(len(data)):
        window = data[idx: idx + window_size]
        if len(window) < 2:
            break
        w = window[0]
        for next_word in window[1:]:
            yield (w, next_word)


def construct_vocab(data):
    vocab = Counter()
    for (w1, w2) in gen_bigrams(data, window_size=5):
        # count 1gram & 2gram
        vocab.update([w1, w2, (w1, w2)])
    return vocab


def calc_pmi(vocab):
    det = sum(vocab.values())
    for (w1, w2) in filter(lambda el: isinstance(el, tuple), vocab):
        p_a, p_b = float(vocab[w1]), float(vocab[w2])
        p_ab = float(vocab[(w1, w2)])
        yield (w1, w2, log((det * p_ab) / (p_a * p_b), 2))


corpus = ["a", "b", "c", "d", "e", "b", "g", "a", "h"]
vocab = construct_vocab(corpus)

for (w1, w2, pmi) in calc_pmi(vocab):
    print("{}_{}: {:.3f}".format(w1, w2, pmi))