How to get the cooccurrence matrix from cooccurrence.bin?


applebas...@gmail.com

Jan 3, 2017, 11:59:55 PM1/3/17
to GloVe: Global Vectors for Word Representation
Hi,

I would like to see the cooccurrence matrix in text.
How can I read the cooccurrence.bin from the binary structure?
It seems the co-occurrence matrix is only provided as a .bin file, not in both .txt and .bin formats like the vectors are.

Thanks for reading!

Best,
James

cannb...@gmail.com

May 12, 2017, 4:21:03 PM5/12/17
to GloVe: Global Vectors for Word Representation, applebas...@gmail.com
Hey,

I also need the co-occurrence matrix of words in txt format. Did you find any solution to that? If so, can you share, please.
Thank you in advance.

Best,
Berkay

On Wednesday, January 4, 2017 at 07:59:55 UTC+3, applebas...@gmail.com wrote:

Marc

May 12, 2017, 8:35:50 PM5/12/17
to GloVe: Global Vectors for Word Representation, applebas...@gmail.com, cannb...@gmail.com
Hi all,

I implemented a Python wrapper to train, test, and analyze GloVe models.
In it you can find a `read_cooccurrences()` method that does just this.

Hope this helps!
Marc

cannb...@gmail.com

May 13, 2017, 3:36:50 AM5/13/17
to GloVe: Global Vectors for Word Representation, applebas...@gmail.com, cannb...@gmail.com
Hi Marc,

Actually, I don't understand how to get just the co-occurrence matrix from your code. I only need the global co-occurrence matrix, not the whole GloVe training; afterwards I will use this matrix for a Pointwise Mutual Information calculation. Could you share a main method that just calculates the matrix and writes it to a .txt file using your script?

Best regards.

On Saturday, May 13, 2017 at 03:35:50 UTC+3, Marc wrote:

aklouch...@gmail.com

Oct 2, 2018, 6:04:05 AM10/2/18
to GloVe: Global Vectors for Word Representation
Hi,

Did you get an answer or find a way to do this?

Regards 

Guglielmo Reggio

Jul 20, 2021, 5:15:23 AM7/20/21
to GloVe: Global Vectors for Word Representation
Hi,

Have there been any updates on this?

Thank you,
Guglielmo

Guglielmo Reggio

Jul 21, 2021, 10:59:41 AM7/21/21
to GloVe: Global Vectors for Word Representation
Hi,

I've found a solution:

import struct

# Each record in cooccurrence.bin is two C ints (the word indices)
# followed by one C double (the co-occurrence value): 4 + 4 + 8 = 16 bytes.
with open("/path/to/cooccurrence.bin", "rb") as f:
    chunk = f.read(16)
    while chunk:
        cooccur_data = struct.unpack('iid', chunk)
        print(cooccur_data)
        chunk = f.read(16)

Hope this is useful!

Best,
Guglielmo
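[Editor's note] Building on the snippet above, here is a sketch that also maps indices back to words and writes the matrix as text, which is what the earlier posts asked for. It assumes a vocab.txt produced by GloVe's vocab_count tool (one "word count" pair per line); the indices stored by cooccur are 1-based positions into that file.

```python
import struct

def cooccurrence_to_txt(bin_path, vocab_path, out_path):
    # vocab.txt from vocab_count: one "word count" pair per line,
    # ordered by frequency; record indices are 1-based into this list.
    with open(vocab_path, encoding="utf-8") as f:
        vocab = [line.split()[0] for line in f if line.strip()]

    with open(bin_path, "rb") as fin, open(out_path, "w", encoding="utf-8") as fout:
        while True:
            chunk = fin.read(16)
            if len(chunk) < 16:  # stop at end of file (or a truncated record)
                break
            w1, w2, count = struct.unpack("iid", chunk)
            fout.write(f"{vocab[w1 - 1]} {vocab[w2 - 1]} {count}\n")
```

Each output line is then "word1 word2 value", one co-occurring pair per line.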

Sal Aguiñaga

Mar 8, 2024, 3:48:55 PM3/8/24
to GloVe: Global Vectors for Word Representation
Just to clarify: does this output, "(1, 2, 409566.0412529093)",
mean that words 1 and 2 co-occur this many times on average in a given corpus, hence the fractional value?

Murat Aydogdu

Mar 9, 2024, 10:01:32 AM3/9/24
to GloVe: Global Vectors for Word Representation
I am not sure about this solution. I would expect the last number to be a (large) integer, the co-occurrence count for the pair.
I tried a code converter (https://www.codeconvert.ai/java-to-python-converter). According to that, the file structure may be:

    def __init__(self, word1, word2, val, id):
        self.word1 = word1
        self.word2 = word2
        self.val = val
        self.id = id


So maybe there is a fourth number, an identifier for each pair of words.

Sal Aguiñaga

Mar 9, 2024, 1:56:14 PM3/9/24
to GloVe: Global Vectors for Word Representation
Looking at the .c source (I too expected that third element to be an int), they apply a distance weighting that I'm still trying to figure out.
It's not a plain count; do correct me if I'm wrong.

The co-occurrence value is stored in a "real" bigram_table and in a variable r of type "real":
... 
    real *bigram_table = NULL, r;
...
            if ((r = bigram_table[lookup[x-1] - 2 + y]) != 0) {
....
                fwrite(&x, sizeof(int), 1, fid);
                fwrite(&y, sizeof(int), 1, fid);
                fwrite(&r, sizeof(real), 1, fid);
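[Editor's note] The distance weighting is why the values are fractional: in cooccur.c, a pair of words at distance d within the context window contributes 1/d to the running total, not 1. A toy illustration of that scheme (not the GloVe code itself; the real code works on integer word indices and also records the symmetric pair):

```python
from collections import defaultdict

def weighted_cooccurrence(tokens, window=5):
    # Mimics cooccur.c's weighting: a context word at distance d from
    # the current word contributes 1/d to the pair's count.
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i - d >= 0:
                counts[(tokens[i - d], word)] += 1.0 / d
    return counts

counts = weighted_cooccurrence("a b a b a".split(), window=2)
```

Accumulating these 1/d terms almost never lands on an integer, which matches the fractional third field in the records.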

Murat Aydogdu

Mar 10, 2024, 5:04:03 PM3/10/24
to GloVe: Global Vectors for Word Representation
OK, then why do the values have so many digits of precision? I wonder if this has anything to do with how Python represents floats.
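[Editor's note] The long digit strings follow from the 1/distance weighting plus standard float behavior: terms like 1/3 and 1/5 have no exact binary representation, and Python's repr prints the shortest decimal string that round-trips the exact 64-bit double. A small demonstration:

```python
# Accumulating 1/distance increments, as cooccur.c does, yields doubles
# with no short decimal form: 1/3 and 1/5 are not exactly representable
# in binary. Python prints the shortest decimal string that round-trips
# the exact 64-bit value, hence the long tails.
total = 0.0
for distance in (1, 2, 3, 4, 5):
    total += 1.0 / distance
print(total)                        # a long decimal tail (137/60, inexactly)
print(float(repr(total)) == total)  # the printed digits round-trip: True
```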