Word2Vector model's odd behaviour


Uroš Poček

Oct 28, 2021, 9:54:14 AM
to Gensim
Hello everyone, 
My name is Uros. I have been working with ML for 2 years, but I am pretty new to NLP and Gensim. I was trying to train my own w2v model from the latest wiki dump, following the instructions on the Gensim website, but I ran into a few problems. After consulting the documentation more closely, I was able to write a script that seems to work fine for my task.
This is my code:

import logging
import gensim
from gensim.models.word2vec import Word2Vec
from gensim.corpora.wikicorpus import WikiCorpus
import multiprocessing
import pickle

if __name__ == '__main__':

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    corpus = gensim.corpora.textcorpus.TextCorpus('enwiki-20210920-pages-articles-multistream.xml.bz2')

    cores = multiprocessing.cpu_count()

    w2v_model = Word2Vec(corpus, workers=cores, min_count=10, vector_size=300)

    filename = 'w2v_fullwiki_model.pickle'
    pickle.dump(w2v_model, open(filename, 'wb'))

At the end (after 5 days) I got my 8 GB .pickle file. But now, when I try to test it with this Gensim code:

w2v_model = pickle.load(open('w2v_fullwiki_model.pickle', 'rb'))
wv = w2v_model.wv

for index, word in enumerate(wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")


I get this output:

word #0/3477406 is (61, 1)
word #1/3477406 is (279, 2)
word #2/3477406 is (45, 1)
word #3/3477406 is (37, 1)
word #4/3477406 is (49, 1)
word #5/3477406 is (52, 1)
word #6/3477406 is (46, 1)
word #7/3477406 is (48, 1)
word #8/3477406 is (46, 2)
word #9/3477406 is (1131, 1)

That is, I don't get words as output; I get tuples instead. As I said, my model is 8 GB, so I don't think it is just gibberish. Is there a way for me to access the words that my model has learned? If not, where did I make a mistake in the training script, and how can I fix it?

Thank you all very much in advance.
I would really appreciate help with this issue.

Good day,
Uros Pocek

Radim Řehůřek

Oct 28, 2021, 2:45:16 PM
to Gensim
Hi Uros,

Did you check the log from when you created your model? Especially near the beginning of the training.

TextCorpus produces bag-of-words vectors (a sequence of lists of (word_id, count) tuples), while Word2Vec expects documents (a sequence of lists of string tokens).
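
To make the mismatch concrete, here is a small sketch that uses only the standard library. The toy documents and the `token_to_id` mapping are made up for illustration; they stand in for a real dictionary and are not gensim objects.

```python
from collections import Counter

# Two toy documents, already tokenized (hypothetical data).
docs = [["the", "cat", "sat", "on", "the", "mat"], ["the", "dog", "sat"]]

# Assign integer ids in first-seen order, as a stand-in for a dictionary.
token_to_id = {}
for doc in docs:
    for token in doc:
        token_to_id.setdefault(token, len(token_to_id))


def to_bow(doc):
    """Return sorted (word_id, count) pairs for one tokenized document."""
    counts = Counter(token_to_id[token] for token in doc)
    return sorted(counts.items())


bow_docs = [to_bow(doc) for doc in docs]

print(docs[0])      # list of string tokens: the shape Word2Vec expects
print(bow_docs[0])  # list of (word_id, count) tuples: bag-of-words
```

A model fed something shaped like `bow_docs` would treat each tuple as if it were a word, which would be consistent with vocabulary entries such as `(61, 1)` in the output above.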

So I think you may have trained on nonsense input – that's why I ask about the log. Didn't you see suspicious items in there?

Best,
Radim

Gordon Mohr

Oct 28, 2021, 2:49:47 PM
to Gensim
I wouldn't expect `TextCorpus` – which expects a plain-text file, & emits bag-of-words representations – to be useful for feeding the article texts from a compressed XML wiki dump to `Word2Vec`.

Try `WikiCorpus`, and also test on a tiny subset first to at least be sure some aspects of the final model look sane (even if not well-trained). For example:

    subset_corpus = itertools.islice(corpus, 10)  # 1st 10 docs only
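
One caveat, added here as a note rather than something stated in this thread: an `islice` object can be consumed only once, while `Word2Vec` iterates over its corpus several times (a vocabulary pass plus the training epochs). A minimal sketch of a restartable wrapper follows; `FirstN` is a hypothetical helper name, and the toy corpus stands in for a real re-iterable corpus.

```python
import itertools


class FirstN:
    """Restartable view over the first n items of a re-iterable corpus."""

    def __init__(self, corpus, n):
        self.corpus = corpus
        self.n = n

    def __iter__(self):
        # A fresh islice is created on every pass, so repeated loops work.
        return itertools.islice(iter(self.corpus), self.n)


# Toy stand-in for a corpus of tokenized documents (hypothetical data).
toy_corpus = [["a", "b"], ["c"], ["d", "e"], ["f"]]

subset = FirstN(toy_corpus, 2)
first_pass = list(subset)
second_pass = list(subset)   # same documents again

one_shot = itertools.islice(toy_corpus, 2)
one_shot_first = list(one_shot)
one_shot_second = list(one_shot)  # empty: the iterator is already exhausted
```

For a subset this small, simply materializing it with `list(itertools.islice(corpus, 10))` would serve the same purpose.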

Separately, the traditional iterable-based interface you're using usually maxes out its speed at 4-12 workers, due to Python GIL synchronization bottlenecks, even if your `cores` might be larger. (And larger `workers` values start hurting throughput, so if you're on a machine with 16, 32, or more cores, blindly using the core count as `workers` will be slower than a smaller number.)
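
A minimal sketch of capping the worker count instead of passing the raw core count. The cap of 8 is an arbitrary pick from the 4-12 range mentioned above, and `pick_workers` is a hypothetical helper, not a gensim function; the best value for a given machine would need to be measured.

```python
import multiprocessing


def pick_workers(cores, cap=8):
    """Use every core on small machines, but never more than `cap`."""
    return max(1, min(cores, cap))


workers = pick_workers(multiprocessing.cpu_count())
print(f"using {workers} workers")
```

The resulting value could then be passed as `workers=workers` when constructing the model.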

- Gordon
