Calculating Pointwise Mutual Information (PMI) using the Phrases Module


Berkay Can

May 20, 2017, 2:44:13 AM
to gensim
Hi,

I have 15 GB of raw text data. I want to calculate the PMI of each word pair that co-occurs within a given window size. My question is:

Does the Phrases.bigram.export_phrases(sentences) method return its score as a PMI value, and if so, is the PMI value calculated from a global, sparse co-occurrence matrix?

Thank you.

Berkay Can

May 20, 2017, 2:49:35 AM
to gensim
Hi,

I also cannot find a way to tune the window size parameter. Any ideas?

Thank you.

On Saturday, May 20, 2017 at 09:44:13 UTC+3, Berkay Can wrote:

Ivan Menshikh

May 20, 2017, 4:30:40 AM
to gensim
Hello brky,

Before calling `Phrases.export_phrases` you should call `add_vocab` (if you don't pass sentences to `Phrases.__init__`). This method calculates the frequencies of words and bigrams (two words in a row; for this reason, there is no `window_size` parameter).

After this, you call `export_phrases`; for every sentence, this method generates the bigrams and calculates a `score`:

score = (pab - min_count) / pa / pb * len(vocab)

where `pab`, `pa` and `pb` are the counts of the bigram and of each word, and `len(vocab)` is the vocabulary size.

If this score > `threshold`, `export_phrases` yields the bigram together with its `score`.
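In code, the workflow looks roughly like this (a minimal sketch with a toy corpus; output details may vary with your gensim version):

from gensim.models import Phrases

sentences = [["new", "york", "is", "in", "new", "york", "state"],
             ["i", "love", "new", "york"]]

bigram = Phrases(min_count=1, threshold=0.1)
bigram.add_vocab(sentences)  # counts word and bigram frequencies

# yields (phrase, score) for every bigram whose score exceeds the threshold
for phrase, score in bigram.export_phrases(sentences):
    print(phrase, score)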

brky

May 20, 2017, 4:51:19 AM
to gensim
Hi Ivan,

Firstly, thank you for your answer and consideration.

Let me tell you what I understand from your answer; then you can confirm or improve my understanding.

1) To calculate PMI, using the 'export_phrases' method is convenient, because the formula you wrote gives the PMI value (as defined by Christopher Manning & Hinrich Schütze, 1999, chapter 5.4, 'Mutual Information') of co-occurring words.
2) To create a window size of, let's say, 5 words, I first need to preprocess the raw text by dividing it into 5-word sentences, one per line.

Lastly, if it is not too much effort for you, could you share some code that calculates PMI values from given raw text? I have some trouble understanding how exactly I should preprocess the raw text into the proper format for the 'Phrases' module.

Thank you in advance.

On Saturday, May 20, 2017 at 11:30:40 UTC+3, Ivan Menshikh wrote:

Ivan Menshikh

May 20, 2017, 7:14:18 AM
to gensim
1) To calculate PMI, using the 'export_phrases' method is convenient, because the formula you wrote gives the PMI value (as defined by Christopher Manning & Hinrich Schütze, 1999, chapter 5.4, 'Mutual Information') of co-occurring words.

It's not really the PMI from Christopher Manning & Hinrich Schütze, but it's very similar and works well in practice.
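For a side-by-side comparison (my own summary, same notation as the earlier formula): textbook PMI is

PMI(a, b) = log( P(a, b) / (P(a) * P(b)) )

while the gensim default score, (pab - min_count) / pa / pb * len(vocab), is a count ratio with no logarithm; apart from the min_count correction and the use of the vocabulary size instead of the corpus size, it is essentially exp(PMI) rescaled by a constant.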

2) To create a window size of, let's say, 5 words, I first need to preprocess the raw text by dividing it into 5-word sentences, one per line.

Could you describe what you mean by "window"?

For example, my sentence: ["a", "b", "c", "d", "e", "b", "c"]
Bigrams for this sentence: ["a_b", "b_c", "c_d", "d_e", "e_b"]
How should the bigrams look after applying `window_size=5` ?

brky

May 20, 2017, 7:40:11 AM
to gensim
What I mean is, for example:
I have a raw text with words ["a", "b", "c", "d", "e", "b", "g", "a", "h"]
I want bigrams for window_size=5, using the 'sliding window' principle. The bigrams will depend on the window size as follows:
First window => [a, b, c, d, e], then bigrams => [a_b, a_c, a_d, a_e]
Second window => [b, c, d, e, b], then bigrams => [b_c, b_d, b_e, b_b]
Third window => [c, d, e, b, g], then bigrams => [c_d, c_e, c_b, c_g]
Fourth window => [d, e, b, g, a], then bigrams => [d_e, d_b, d_g, d_a]
Fifth window => [e, b, g, a, h], then bigrams => [e_b, e_g, e_a, e_h]
For the rest => bigrams [b_g, b_a, b_h, g_a, g_h, a_h]

Then we say that the word pairs in these bigram lists co-occur, and we can calculate PMI from the co-occurrence counts and the single-word frequencies.

This is the logic behind the window size and the PMI calculation I have in mind.

On Saturday, May 20, 2017 at 14:14:18 UTC+3, Ivan Menshikh wrote:

Ivan Menshikh

May 20, 2017, 8:52:57 AM
to gensim
I wrote some code; I hope this helps you:

from collections import Counter
from math import log


def gen_bigrams(data, window_size=5):
    # Slide a window over the token list and pair the first word of
    # each window with every other word inside that window.
    for idx in range(len(data)):
        window = data[idx: idx + window_size]

        if len(window) < 2:  # no word left to pair with
            break

        w = window[0]
        for next_word in window[1:]:
            yield (w, next_word)


def construct_vocab(data):
    # One Counter holds both unigram counts (str keys) and bigram
    # counts (tuple keys).
    vocab = Counter()

    for (w1, w2) in gen_bigrams(data, window_size=5):
        vocab.update([w1, w2, (w1, w2)])
    return vocab


def calc_pmi(vocab):
    det = sum(vocab.values())  # normalizing constant

    # bigrams are the tuple keys, unigrams are plain strings
    for (w1, w2) in filter(lambda el: isinstance(el, tuple), vocab):
        p_a, p_b = float(vocab[w1]), float(vocab[w2])
        p_ab = float(vocab[(w1, w2)])

        # PMI(a, b) = log2(P(a, b) / (P(a) * P(b)))
        yield (w1, w2, log((det * p_ab) / (p_a * p_b), 2))


corpus = ["a", "b", "c", "d", "e", "b", "g", "a", "h"]
vocab = construct_vocab(corpus)

for (w1, w2, pmi) in calc_pmi(vocab):
    print("{}_{}: {:.3f}".format(w1, w2, pmi))

brky

May 20, 2017, 10:24:59 AM
to gensim
Hi Ivan,

Thank you for your effort. I also have code like that for calculating PMI on small files. But when it comes to processing a big file, it requires high CPU usage and efficient memory management; this is why I wrote my question here, on the chance that gensim's 'Phrases' module might help me with that. Do you have any suggestions?

On Saturday, May 20, 2017 at 15:52:57 UTC+3, Ivan Menshikh wrote:

Ivan Menshikh

May 21, 2017, 5:09:42 AM
to gensim
Unfortunately, you need alternative logic for the bigram calculation, which Phrases doesn't support.
Of course, you can remake your corpus, but this will not give you a performance boost. `Phrases.export_phrases` is written in pure Python, and its speed is comparable to the code that I sent you.
The slowest part of the algorithm is compiling the vocab. If you can fit the vocab in memory, you can use the code from my previous message. You should also filter the vocab (remove very rare bigrams; this saves a lot of memory), e.g. with the sketch at the end of this message.

But if your text occupies > 50 GB, you can use a MapReduce approach (with Hadoop MR, Spark or something else; it's a classic MR case, like word count), OR take a sample from the corpus and fit the vocab on that sample.
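Here is that filtering step as a sketch on top of the code from my earlier message (the cutoff of 5 is arbitrary):

def prune_vocab(vocab, min_count=5):
    # drop rare bigram (tuple) keys to save memory; keep unigram counts
    for key in list(vocab):
        if isinstance(key, tuple) and vocab[key] < min_count:
            del vocab[key]
    return vocab

vocab = prune_vocab(construct_vocab(corpus), min_count=5)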



Radim Řehůřek

May 22, 2017, 1:41:24 AM
to gensim
Hello brky,

the Phrases module is actually suitable for processing large corpora. Check out its `max_vocab_size` parameter -- this will automatically prune infrequent words, every once in a while, so you are not limited by RAM.
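For example (a sketch only; the file path and the streaming class are hypothetical, `max_vocab_size` is the real parameter):

from gensim.models import Phrases

class SentenceStream(object):
    # stream tokenized sentences from disk, one per line,
    # so the 15 GB corpus never has to fit in RAM
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.split()

sentences = SentenceStream("corpus.txt")  # hypothetical path
# the internal vocab is pruned whenever it grows past max_vocab_size
bigram = Phrases(sentences, min_count=5, max_vocab_size=40000000)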

As for speed -- we'll be writing a fast (C-compiled) version of Phrases, as part of our Incubator programme during this year's Google Summer of Code. Hopefully parallelized too :)

HTH,
Radim

Michael Sherman

May 22, 2017, 7:49:18 PM
to gensim
brky,


Another note: the default PMI-esque scoring in gensim (from the Mikolov et al. word2vec paper) is not great for this kind of information extraction task, because collocations with very common tokens come out with low scores. For example, "new york" rarely registers very high with this scoring metric, because "new" is such a common token in English. In the corpus I've been working with recently, over 99% of instances of "york" are preceded by "new", but because "new" is so common the score is very low (< 0.5), scoring beneath many incorrect collocations.

I got MUCH better results with normalized PMI, which required me to keep a count of words in the corpus as a class variable that's incremented by add_vocab, and to tweak the score calculation in export_phrases. If there's interest from the gensim community, I can get permission from my corporate masters to contribute the code back to gensim as an optional scoring function to replace the default. But I'm not sure that's a good idea, given that the code is up for a rewrite this summer, and keeping a corpus word count and using log in the scoring function does slow things down a little.

The PMI produced by Ivan's code would be less dramatically off than the PMI-esque score in Mikolov et al., but it will still underperform normalized PMI.

Normalized PMI: https://svn.spraakdata.gu.se/repos/gerlof/pub/www/Docs/npmi-pfd.pdf . FWIW, our results on a "real" corpus of text from documents in a specific domain improved much more than the ones stated in this paper.
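For reference, NPMI divides PMI by -log P(a, b), which bounds the score to [-1, 1]. A minimal sketch (the function name and count-based inputs are mine, not gensim internals):

from math import log

def npmi(count_a, count_b, count_ab, corpus_size):
    # NPMI(a, b) = PMI(a, b) / -log P(a, b), bounded in [-1, 1]
    p_a = count_a / float(corpus_size)
    p_b = count_b / float(corpus_size)
    p_ab = count_ab / float(corpus_size)
    return log(p_ab / (p_a * p_b)) / -log(p_ab)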

Radim Řehůřek

May 23, 2017, 5:43:56 AM
to gensim
Hi Michael,

yes, please clear it up with your corporate overlords :) A better/alternative scoring function would be welcome indeed.

In fact, if this turns out to be a common use-case (adjusting the scoring fnc), we could even make it customizable, via parameters / code injection. The frequency counting is the "hard" thing to do fast in Python; the scoring is trivial in comparison.

And if the normalized PMI is universally better, we could just use that (replace the current one), instead of adding parameters. Thoughts?

Cheers,
Radim

Michael Sherman

May 24, 2017, 7:07:55 PM
to gensim

Michael Sherman

May 24, 2017, 7:10:19 PM
to gensim
Oh, and I'm against replacing the default method with normalized PMI. It may cause backwards-compatibility problems. The default method still works relatively well, and for many corpora there won't be a noticeable difference between the default and NPMI. NPMI is also slower, since it requires keeping a count of the corpus size and calculating logs. And people coming to gensim knowing the Mikolov paper will be caught off guard by a default n-gram scoring method that differs from the method in the paper. So I don't think it's a good idea to change the default, but I do think it's a good idea to offer the option.

iv...@radimrehurek.com

May 27, 2017, 2:40:25 AM
to gensim
I agree with you, Michael. It would be cool as a new option. Let's continue the discussion in issue 1363.

Ramya Y S

Jul 10, 2017, 7:27:27 AM
to gensim
Hi everyone!
I need to train on a wiki corpus for phrase comparison. Which would you suggest is better for comparing phrases: word2vec on top of Phrases, or doc2vec itself?

Hoping for a quick response,
Thanks 

aidana.ka...@nu.edu.kz

May 31, 2018, 4:04:54 AM
to gensim
Hi, brky,

Have you figured out how to write efficient code for the PMI matrix?

Berkay Can

May 31, 2018, 4:07:42 PM
to gen...@googlegroups.com
Hey aidana,

I wrote my own code, which calculates PMIs over a large corpus (over 15 GB) in a multithreaded fashion. In this implementation, I used a sparse co-occurrence matrix. It was on my old computer and might already be erased. If it wasn't, I can post it here.


aidana.ka...@nu.edu.kz

Jun 1, 2018, 5:01:23 AM
to gensim
Oh that's great! I will be very glad if you find it and post it. 
Thank you!!!

Michael Sherman

Jun 1, 2018, 6:48:26 PM
to gensim
Hi! Gensim actually already includes support for calculating NPMI, if you're okay with having your PMI normalized.

See https://radimrehurek.com/gensim/models/phrases.html. When you create a gensim.models.phrases.Phrases object, set the scoring parameter to 'npmi' and that is the scoring you will get.

You can also create a plain PMI scoring function as a pluggable scoring metric for Phrases. If you look at the code at https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py you can see how npmi_scorer is written; if you follow that template and remove the normalization term, you'll have a PMI function you can pass to the scoring parameter.
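A sketch of what that could look like (the six-argument signature mirrors npmi_scorer in the linked file; treat the details as my reading of that source, not official API docs):

from math import log

def pmi_scorer(worda_count, wordb_count, bigram_count,
               len_vocab, min_count, corpus_word_count):
    # npmi_scorer without the -log(pab) normalization term
    pa = worda_count / float(corpus_word_count)
    pb = wordb_count / float(corpus_word_count)
    pab = bigram_count / float(corpus_word_count)
    return log(pab / (pa * pb))

# usage: Phrases(sentences, scoring=pmi_scorer)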