Re-training the Google News word2vec model


Biswa G Singh

Jan 15, 2017, 8:57:25 AM
to gensim
Hi,

I am trying to train the Google News word2vec model with new sentences.

But it throws an error. Can you please let me know what I am missing? Sorry, I am new to these things.

wv = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
#wv.init_sims(replace=True)
wv.train(doc_list)


Traceback (most recent call last):
  File "word2vec_trained.py", line 66, in <module>
    wv.train(doc_list)
  File "C:\Users\bgsingh\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 695, in train
    if self.corpus_count:
AttributeError: 'Word2Vec' object has no attribute 'corpus_count'


Can you please let me know what's happening here?

Thanks,
Biswa

Lev Konstantinovskiy

Jan 15, 2017, 3:12:07 PM
to gensim
Hi Biswa,

The `load_word2vec_format()` function works with the vectors-only format of the original word2vec.c implementation. That's not enough to continue training; a model so loaded is only good for comparisons of the existing vectors. 
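If it helps, here is a rough sketch of the difference (the file and corpus names are placeholders, and the exact `train()` keyword arguments vary between gensim versions):

from gensim.models import Word2Vec

# The C-format file holds only the vectors, so this object supports
# lookups and similarity queries but carries no training state:
wv = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz",
                                   binary=True)
print(wv.most_similar("king", topn=3))   # works
# wv.train(doc_list)                     # fails: no vocab counts or hidden weights

# A model you train and save yourself keeps its full state and can be
# loaded to continue training:
model = Word2Vec(doc_list, size=300, min_count=2)   # trains on your own corpus
model.save("my_word2vec.model")
model = Word2Vec.load("my_word2vec.model")
model.train(doc_list, total_examples=model.corpus_count, epochs=model.iter)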

See alternatives in this mailing list thread.

Thanks for raising this. There will be a more explicit exception in the next Gensim release.

Regards
Lev

Biswa G Singh

Jan 20, 2017, 7:32:02 AM
to gensim
Thanks, Lev.

Biswa G Singh

Jan 22, 2017, 3:52:27 AM
to gensim
Hi Lev,

Sorry to come back again. I used the Google News pretrained model to generate word vectors for my small corpus, using the following code. Can you please suggest what similarity function I should use to get the semantic similarity of two averaged document vectors? My query is a new sentence, and I want to find the closest semantically similar sentence/document in my small corpus. Thanks for your help, I appreciate it.

import logging

import nltk
import numpy as np
import smart_open
import gensim
from gensim.models import Word2Vec


def w2v_tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sent, language='english'):
            if len(word) < 2:  # skip one-character tokens and punctuation
                continue
            tokens.append(word)
    return tokens


doc_list = []

#with open('new_faq.txt') as alldata:
with smart_open.smart_open('new_faq.txt', encoding="utf-8") as alldata:
    for line_no, line in enumerate(alldata):
        line = line.decode('utf-8')
        tokens = w2v_tokenize_text(line)
        doc_list.append(tokens)

wv = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
wv.init_sims(replace=True)  # precompute unit-normalized vectors in syn0norm


def word_averaging(wv, words):
    all_words, mean = set(), []

    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size)  # was wv.layer_size, which does not exist

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean


def word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, review) for review in text_list])


X_train_word_average = word_averaging_list(wv, doc_list)

Lev Konstantinovskiy

Jan 22, 2017, 9:46:05 AM
to gensim
Hi Biswa,

There are two similarity metrics in gensim word2vec: cosine (`most_similar`) and cosmul (`most_similar_cosmul`). Try them both to see which gives good results for your purpose.
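For example, with the model loaded as in your snippet (a quick sketch):

# Cosine similarity over the unit-normalized vectors (the default metric):
print(wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))

# Multiplicative combination of cosine similarities (Levy & Goldberg, 2014),
# often more robust for analogy-style queries:
print(wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'], topn=3))

# Plain cosine similarity between two individual words:
print(wv.similarity('woman', 'man'))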

If you have a small corpus (around 1 million words) and specifically want semantic similarity, then training a new WordRank embedding is a better fit than Google News.
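If you want to try that, gensim has a wrapper around the WordRank binary; roughly like this sketch, assuming WordRank is compiled locally (the paths are placeholders, and the wrapper's arguments may differ by gensim version):

from gensim.models.wrappers import Wordrank

# corpus.txt: one plain-text sentence per line; wr_path points at the
# directory containing the compiled WordRank binaries.
model = Wordrank.train(wr_path='path/to/wordrank',
                       corpus_file='corpus.txt',
                       out_name='wr_model')
print(model.most_similar('king', topn=3))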

Regards
Lev

Biswa G Singh

Jan 22, 2017, 11:16:17 AM
to gensim
Thanks, Lev. This is very useful. I don't even have a 1-million-word corpus. Is there a pretrained WordRank model that I can load to generate word vectors for my corpus, similar to the pretrained word2vec Google News model?

Regards
Biswa

Biswa G Singh

Jan 22, 2017, 11:15:52 PM
to gensim
Hi Lev, 

One follow-up question, please. The gensim word2vec cosine and cosmul similarities do not take a vector as an argument. The problem in my case is that my corpus is small, so I don't re-train the Google News model on it; instead I generate an average word vector for each of my corpus documents. Now I want to compare similarity between those vectors. Does gensim have a method for that? I currently do something like the following:

from scipy.spatial.distance import cosine

X_train_word_average = word_averaging_list(wv, doc_list)
similarity = 1 - cosine(X_train_word_average[0], X_train_word_average[1])
print(similarity)

Is this method alright?
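
I also notice that word_averaging() already returns unit-length vectors (via gensim.matutils.unitvec), so the cosine should reduce to a plain dot product, which would also let me rank the whole corpus against a query; something like this sketch (the query string is just a placeholder):

import numpy as np

# word_averaging() returns unit-normalized vectors, so a dot product
# equals cosine similarity:
similarity = np.dot(X_train_word_average[0], X_train_word_average[1])

# Ranking every document against a new query sentence:
query_vec = word_averaging(wv, w2v_tokenize_text("how do I reset my password"))
scores = X_train_word_average.dot(query_vec)  # cosine scores, shape (n_docs,)
top5 = np.argsort(scores)[::-1][:5]           # indices of the 5 closest docs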

Appreciate your help
Biswa