How to shuffle words in word2vec


ssh26

May 8, 2016, 11:30:27 AM
to gensim
Hi,

I have this piece of code:

import gensim
import random


file = open('../../../dataset/output/interaction_jobroles_titles_tags.txt')

read_data = file.read()

data = read_data.split('\n')

sentences = [line.split() for line in data]
print(len(sentences))
print(sentences[1])

model = gensim.models.Word2Vec(min_count=1, window=10, size=300, negative=5)
model.build_vocab(sentences)

for epoch in range(5):
    shuffled_sentences = random.shuffle(sentences)
    #perm_sentences = [sentences_list[i] for i in Idx]
    model.train(shuffled_sentences)
    print(epoch)
print(model)

model.save("../../../dataset/output/wordvectors_jobroles_titles_300d_10w_wordshuffling" + '.model')

If I print a single sentence, the output looks something like this:

['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']

What I need is to shuffle the words before training and then save the model. 

I am not sure whether I am coding it the right way. I end up with this exception:

Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 747, in job_producer
    for sent_idx, sentence in enumerate(sentences):
  File "/usr/local/lib/python3.5/site-packages/gensim/utils.py", line 668, in __iter__
    for document in self.corpus:
TypeError: 'NoneType' object is not iterable

Could anyone suggest how I can shuffle the words?

Thanks.

Gordon Mohr

May 8, 2016, 3:46:27 PM
to gensim
`random.shuffle()` shuffles a list in-place, and returns nothing. So your `shuffled_sentences` is `None`, causing that exception. 
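A minimal sketch of that behaviour, using only the standard library. The token IDs below are made up to resemble the ones in the question, and the seeded `random.Random(0)` is an assumption added only to make the run repeatable. It also shows one way to shuffle the words inside each sentence, which is what the question literally asks for.

```python
import random

# Toy corpus; the IDs are invented stand-ins for the real data.
sentences = [
    ["JO_1", "TI_1", "TA_1", "TA_2"],
    ["JO_2", "TI_2", "TA_3"],
    ["JO_3", "TA_4", "TA_5", "TA_6"],
]

rng = random.Random(0)  # seeded only so the sketch is repeatable

# shuffle() reorders its argument in place and returns None,
# so its return value must not be assigned and passed on.
result = rng.shuffle(sentences)
print(result)  # None

# To shuffle the words within each sentence, shuffle each inner list.
for sentence in sentences:
    rng.shuffle(sentence)

# If the original list must stay untouched, sample() returns a shuffled copy.
shuffled_copy = rng.sample(sentences, k=len(sentences))
```

In the loop from the question, this would mean calling `random.shuffle(sentences)` on its own line and then passing `sentences` itself to `train()`.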

More generally: 

* it's not absolutely necessary to shuffle the order-of-examples every time, though it may help a little. (If the initial ordering is very imbalanced, with certain words only appearing early/late in the ordering, one initial shuffle can be important.)

* each call to `train()` will, by itself, make `iter` (default 5) passes over the corpus and gradually decrease `alpha` from its starting value to `min_alpha`. So when you call `train()` in your own loop to control the training epochs, you probably want to set `iter=1` (so your code explicitly causes 5 passes, not 5*5), and manage `alpha`/`min_alpha` yourself to achieve a gradual decrease of the learning rate over the full session. Alternatively, you can just let the model handle the iterations and `alpha` for you, by choosing your desired `iter` and calling `train()` only once.
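The learning-rate bookkeeping for a hand-written epoch loop could be sketched roughly as follows. `alpha_schedule` is a hypothetical helper, not part of gensim, and the values 0.025 and 0.0001 are assumed starting and final rates chosen for illustration. The actual `train()` call is left as a comment because how the rates are passed in depends on the installed gensim version.

```python
def alpha_schedule(start_alpha, min_alpha, epochs):
    """Return one (start, end) learning-rate pair per epoch, decaying linearly."""
    step = (start_alpha - min_alpha) / epochs
    return [
        (start_alpha - i * step, start_alpha - (i + 1) * step)
        for i in range(epochs)
    ]


# Assumed values for illustration: 5 epochs, decaying from 0.025 to 0.0001.
schedule = alpha_schedule(start_alpha=0.025, min_alpha=0.0001, epochs=5)

for epoch, (alpha, end_alpha) in enumerate(schedule):
    # 1. Shuffle in place here: random.shuffle(sentences)
    # 2. Run exactly one pass over `sentences`, with the learning rate going
    #    from `alpha` to `end_alpha`. Check your gensim version's docs for
    #    how to set these; the mechanism is an assumption left out here.
    print(epoch, round(alpha, 5), round(end_alpha, 5))
```

Each epoch ends at the rate where the next one begins, so the five single-pass calls together cover the same overall decay that one call with `iter=5` would.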

- Gordon

shweta tiwari

May 19, 2016, 6:46:30 AM
to gen...@googlegroups.com
Hi Gordon,

Thanks for the reply and explanation.

