Re-training the Google News word2vec model


Biswa G Singh

Jan 15, 2017, 8:57:25 AM
to gensim
Hi,

I am trying to train the Google News word2vec model with new sentences.

But it throws an error. Can you please let me know what I am missing? Sorry, I am new to these things.

wv = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
#wv.init_sims(replace=True)
wv.train(doc_list)


Traceback (most recent call last):
  File "word2vec_trained.py", line 66, in <module>
    wv.train(doc_list)
  File "C:\Users\bgsingh\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 695, in train
    if self.corpus_count:
AttributeError: 'Word2Vec' object has no attribute 'corpus_count'


Can you please let me know what's happening here?

Thanks,
Biswa

Lev Konstantinovskiy

Jan 15, 2017, 3:12:07 PM
to gensim
Hi Biswa,

The `load_word2vec_format()` function works with the vectors-only format of the original word2vec.c implementation. That's not enough to continue training; a model so loaded is only good for comparisons of the existing vectors. 
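If it helps, here is a rough sketch of the difference (the file and corpus names are placeholders, and the exact `train()` keyword arguments vary between gensim versions):

from gensim.models import Word2Vec

# The C-format file holds only the vectors, so this object supports
# lookups and similarity queries but carries no training state:
wv = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz",
                                   binary=True)
print(wv.most_similar("king", topn=3))   # works
# wv.train(doc_list)                     # fails: no vocab counts or hidden weights

# A model you train and save yourself keeps its full state and can be
# loaded to continue training:
model = Word2Vec(doc_list, size=300, min_count=2)   # trains on your own corpus
model.save("my_word2vec.model")
model = Word2Vec.load("my_word2vec.model")
model.train(doc_list, total_examples=model.corpus_count, epochs=model.iter)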

See alternatives in this mailing list thread.

Thanks for raising this. There will be a more explicit exception in the next Gensim release.

Regards
Lev

Biswa G Singh

Jan 20, 2017, 7:32:02 AM
to gensim
Thanks, Lev.

Biswa G Singh

Jan 22, 2017, 3:52:27 AM
to gensim
Hi Lev,

Sorry to come back again. I used the Google News pretrained model to generate word vectors for my small corpus, using the following code. Can you please suggest what similarity function I should use to get the semantic similarity of two averaged document vectors? My query is a new sentence, and I want to find the closest semantically similar sentence/document in my small corpus. Thanks for your help, I appreciate it.

import logging

import nltk
import numpy as np
import smart_open
import gensim
from gensim.models import Word2Vec


def w2v_tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sent, language='english'):
            if len(word) < 2:  # skip one-character tokens and punctuation
                continue
            tokens.append(word)
    return tokens


doc_list = []

#with open('new_faq.txt') as alldata:
with smart_open.smart_open('new_faq.txt', encoding="utf-8") as alldata:
    for line_no, line in enumerate(alldata):
        line = line.decode('utf-8')
        tokens = w2v_tokenize_text(line)
        doc_list.append(tokens)

wv = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
wv.init_sims(replace=True)  # precompute unit-normalized vectors in syn0norm


def word_averaging(wv, words):
    all_words, mean = set(), []

    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size)  # was wv.layer_size, which does not exist

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean


def word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, review) for review in text_list])


X_train_word_average = word_averaging_list(wv, doc_list)

Lev Konstantinovskiy

Jan 22, 2017, 9:46:05 AM
to gensim
Hi Biswa,

There are two similarity metrics in gensim word2vec: cosine (`most_similar`) and cosmul (`most_similar_cosmul`). Try them both to see which gives good results for your purpose.
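For example, with the model loaded as in your snippet (a quick sketch):

# Cosine similarity over the unit-normalized vectors (the default metric):
print(wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))

# Multiplicative combination of cosine similarities (Levy & Goldberg, 2014),
# often more robust for analogy-style queries:
print(wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'], topn=3))

# Plain cosine similarity between two individual words:
print(wv.similarity('woman', 'man'))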

If you have a small corpus (around 1 million words) and specifically want semantic similarity, then training a new WordRank embedding is a better fit than Google News.
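If you want to try that, gensim has a wrapper around the WordRank binary; roughly like this sketch, assuming WordRank is compiled locally (the paths are placeholders, and the wrapper's arguments may differ by gensim version):

from gensim.models.wrappers import Wordrank

# corpus.txt: one plain-text sentence per line; wr_path points at the
# directory containing the compiled WordRank binaries.
model = Wordrank.train(wr_path='path/to/wordrank',
                       corpus_file='corpus.txt',
                       out_name='wr_model')
print(model.most_similar('king', topn=3))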

Regards
Lev

Biswa G Singh

Jan 22, 2017, 11:16:17 AM
to gensim
Thanks, Lev. This is very useful. I don't even have a 1-million-word corpus. Is there a pretrained WordRank model that I can load to generate word vectors for my corpus, similar to the pretrained word2vec Google News model?

Regards
Biswa

Biswa G Singh

Jan 22, 2017, 11:15:52 PM
to gensim
Hi Lev, 

One follow-up question, please. The gensim word2vec cosine and cosmul similarities do not take a vector as an argument. The problem in my case is that my corpus is small, so I don't re-train the Google News model on it; instead I generate an average word vector for each of my corpus documents. Now I want to compare similarity between those vectors. Does gensim have a method for that? I currently do something like the following:

from scipy.spatial.distance import cosine

X_train_word_average = word_averaging_list(wv, doc_list)
similarity = 1 - cosine(X_train_word_average[0], X_train_word_average[1])
print(similarity)

Is this method alright?
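
I also notice that word_averaging() already returns unit-length vectors (via gensim.matutils.unitvec), so the cosine should reduce to a plain dot product, which would also let me rank the whole corpus against a query; something like this sketch (the query string is just a placeholder):

import numpy as np

# word_averaging() returns unit-normalized vectors, so a dot product
# equals cosine similarity:
similarity = np.dot(X_train_word_average[0], X_train_word_average[1])

# Ranking every document against a new query sentence:
query_vec = word_averaging(wv, w2v_tokenize_text("how do I reset my password"))
scores = X_train_word_average.dot(query_vec)  # cosine scores, shape (n_docs,)
top5 = np.argsort(scores)[::-1][:5]           # indices of the 5 closest docs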

Appreciate your help
Biswa