Hello everyone,
My name is Uros and I have been working with ML for 2 years, but I am pretty new to NLP and Gensim. I was trying to train my own w2v model on the latest wiki dump, following the instructions on the Gensim website, but I ran into a few problems. By consulting the documentation more closely, I was able to write a script that seems to work fine for my task.
This is my code:
import logging
import gensim
from gensim.models.word2vec import Word2Vec
from gensim.corpora.wikicorpus import WikiCorpus
import multiprocessing
import pickle

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    corpus = gensim.corpora.textcorpus.TextCorpus('enwiki-20210920-pages-articles-multistream.xml.bz2')
    cores = multiprocessing.cpu_count()
    w2v_model = Word2Vec(corpus, workers=cores, min_count=10, vector_size=300)
    filename = 'w2v_fullwiki_model.pickle'
    pickle.dump(w2v_model, open(filename, 'wb'))
At the end (after 5 days) I got my 8 GB .pickle file. But now, when I try to test it with this Gensim code:
w2v_model = pickle.load(open('w2v_fullwiki_model.pickle', 'rb'))
wv = w2v_model.wv
for index, word in enumerate(wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")
I get this output:
word #0/3477406 is (61, 1)
word #1/3477406 is (279, 2)
word #2/3477406 is (45, 1)
word #3/3477406 is (37, 1)
word #4/3477406 is (49, 1)
word #5/3477406 is (52, 1)
word #6/3477406 is (46, 1)
word #7/3477406 is (48, 1)
word #8/3477406 is (46, 2)
word #9/3477406 is (1131, 1)
That is, I don't get words as output; I get tuples instead. As I said, my model is 8 GB, so I don't think it is just gibberish. Is there a way for me to access the words that my model has learned? If not, where did I make a mistake in the training script, and how can I fix it?
Thank you all very much in advance.
I would really appreciate help with this issue.
Good day,
Uros Pocek
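
A possible explanation, offered as a sketch and not as a confirmed diagnosis: the keys in the output look like (token_id, count) pairs, which is the shape of a bag-of-words document. Iterating a TextCorpus yields bag-of-words vectors, not token lists, so Word2Vec may have treated each tuple as a "word". The script also imports WikiCorpus but never uses it. The code below first mimics the bag-of-words shape with plain Python, then outlines one way to feed real tokens to Word2Vec. The names to_bow, TokenStream and train_from_wiki_dump are hypothetical and only for illustration. Passing dictionary={} to WikiCorpus is an assumption intended to skip building a Dictionary that Word2Vec does not need.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Mimic a bag-of-words conversion: tokens -> sorted (token_id, count) pairs."""
    counts = Counter(tokens)
    for token in counts:
        # Assign the next free integer id to tokens not seen before
        token2id.setdefault(token, len(token2id))
    return sorted((token2id[token], n) for token, n in counts.items())


class TokenStream:
    """Restartable iterable over token lists (Word2Vec makes several passes)."""

    def __init__(self, corpus):
        self.corpus = corpus

    def __iter__(self):
        # get_texts() returns a fresh generator of token lists on each call
        return iter(self.corpus.get_texts())


def train_from_wiki_dump(dump_path, model_path):
    """Hypothetical helper: train on token lists, save in gensim's own format."""
    import multiprocessing
    from gensim.corpora.wikicorpus import WikiCorpus
    from gensim.models.word2vec import Word2Vec

    # Assumption: dictionary={} avoids an extra pass that builds a Dictionary
    wiki = WikiCorpus(dump_path, dictionary={})
    model = Word2Vec(TokenStream(wiki), workers=multiprocessing.cpu_count(),
                     min_count=10, vector_size=300)
    model.save(model_path)  # reload later with Word2Vec.load(model_path)
    return model


if __name__ == '__main__':
    vocab = {}
    # Prints [(0, 2), (1, 1), (2, 1), (3, 1)]: tuples, not words
    print(to_bow(['the', 'cat', 'saw', 'the', 'dog'], vocab))
```

If this explanation is right, the existing 8 GB model has no word strings to recover, because only the tuples were stored as vocabulary keys. Retraining on token lists would be needed. Saving with model.save instead of pickle is a design choice and not a required fix.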