Converting pre-trained word vectors to original C word2vec-tool format


Bo

unread,
May 16, 2016, 10:30:39 AM5/16/16
to gensim
Hi all, I want to check whether using pre-trained word vectors to initialise the training of my Doc2Vec model improves the results or not (using intersect_word2vec_format()). But I am having trouble converting my two sets of pre-trained word vectors to the original C word2vec-tool format.

My first set of pre-trained word vectors is in numpy array format and is loaded using gensim.models.Word2Vec.load(). But when I use save_word2vec_format() it complains:

AttributeError: 'Word2Vec' object has no attribute 'vector_size'


My 2nd set of pre-trained word vectors is in text format, with the 1st column being the word/phrase and the remaining columns its corresponding vector components, all separated by "\t". I can only find links on how to convert bin files to text, not the other way around. Is there any tool that does this?

Thanks! 

Gordon Mohr

unread,
May 16, 2016, 3:52:49 PM5/16/16
to gensim
Is your first model from a much-older release of gensim? After `load()`, you may be able to set the expected field with `model.vector_size = model.size`.

From what tool was the 2nd set of vectors saved? Gensim's Word2Vec should be able to both load/save in the original word2vec.c text/binary formats. 

A note about using `intersect_word2vec_format()` – by default it also *locks* the intersecting word-vectors against further changes. You may or may not want that. To unlock them, reset their slots in `model.syn0_lockf` to 1.0 (instead of 0.0). 
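A minimal sketch of that unlock step, simulating the `syn0_lockf` array with plain numpy (assumes a gensim of this era, where `syn0_lockf` holds one lock-factor float per vocabulary word):

```python
import numpy as np

# model.syn0_lockf holds one "lock factor" per vocabulary word: 0.0 freezes
# that word's vector during further training, 1.0 lets it keep updating.
# intersect_word2vec_format() leaves imported words at 0.0; simulated here
# with a plain numpy array standing in for the model attribute:
syn0_lockf = np.zeros(5, dtype=np.float32)  # as left by the intersect call

# Unlock every word so later training passes can adjust its vector again:
syn0_lockf[:] = 1.0
```

On a real model of that vintage this would be `model.syn0_lockf[:] = 1.0`, applied after the intersect call and before further training.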

- Gordon

Bo

unread,
May 17, 2016, 1:09:49 PM5/17/16
to gensim
Thanks for your reply Gordon!

The first model was trained last year, I believe, but I am not sure whether it was trained with a much-older version of gensim. It contains the pickled model itself along with .syn0.npy and .syn1.npy files. Can you please elaborate on how to set its vector size? The model doesn't have the attribute "size" either.

The 2nd model is Tang's sentiment-specific word vectors: http://ir.hit.edu.cn/~dytang/paper/sswe/14ACL.pdf

Can I also ask: the number of passes used during doc2vec model training should be decided based on the size of the training data, right? My training data is fairly small, containing only 7k tweets. Evaluating on text classification, I found that the performance starts to drop after 3 passes. 

Gordon Mohr

unread,
May 17, 2016, 6:02:04 PM5/17/16
to gensim
Aha, I'd thought the older versions always kept `size` as a field. Try instead `model.vector_size = model.syn0.shape[1]`, to patch `vector_size` to match the dimensionality of the actual re-loaded word-vectors. 
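For example, a hedged sketch of the patch, using a stand-in class since the exact older-gensim model class isn't reproduced here:

```python
import numpy as np

class OldWord2Vec(object):
    """Stand-in for a Word2Vec model pickled by an older gensim: it carries
    a syn0 array of word vectors but neither `size` nor `vector_size`."""
    pass

model = OldWord2Vec()
model.syn0 = np.zeros((1000, 50), dtype=np.float32)  # 1000 words x 50 dims

# Patch in the attribute that save_word2vec_format() expects, taking the
# dimensionality from the second axis of the reloaded vectors:
model.vector_size = model.syn0.shape[1]
```

After this patch, `save_word2vec_format()` should no longer raise the AttributeError, though other mismatches between gensim versions are still possible.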

The Tang format looks pretty simple, but there's no existing support for it. You may be able to use the existing `intersect_word2vec_format()` code as a model to read that format instead. 

Small datasets are always going to give limited results. Sometimes more iterations helps squeeze the best possible out of a small dataset. But if in fact more iterations are resulting in *worse* vectors for your downstream task, I suspect a form of overfitting: the model is large enough to essentially start memorizing idiosyncratic aspects of the (small) dataset, to keep improving at the Doc2Vec training task (predicting words from the doc-vec and/or surrounding words context), in ways that are no longer discovering generalizable relationships for other tasks. 

- Gordon

Bo

unread,
May 18, 2016, 12:08:50 PM5/18/16
to gensim
Thanks Gordon!

I added "137052 50" as the first line of the Tang model file, but when I use `intersect_word2vec_format()` it complains:

Traceback (most recent call last):
  File "doc2vec.py", line 162, in <module>
    train_model.intersect_word2vec_format('../word2vec/sswe-u.txt', binary = False)
  File "/Users/bo/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1146, in intersect_word2vec_format
    raise ValueError("invalid vector on line %s (is this really the text format?)" % (line_no))
ValueError: invalid vector on line 0 (is this really the text format?)


Using readlines(), the first two lines of the file look like:

['137052 50\n',
 '<unk>\t-0.5788147\t0.925568\t0.4648884\t0.3540596\t-2.63636\t2.299568\t-0.6639488\t0.6152149\t-0.1693218\t0.2535918\t0.5239199\t0.03204051\t-0.4096407\t-0.1301239\t0.1830793\t-0.8452936\t-1.4038\t-0.3235289\t-1.467552\t-0.4617254\t1.235173\t-1.01539\t0.8925248\t1.236531\t1.10637\t0.9484738\t-1.053025\t0.4563896\t-1.523564\t-0.01358143\t0.5384203\t-2.002354\t0.884596\t1.269964\t-1.649029\t-0.9661105\t0.04843707\t-0.01786563\t1.134794\t0.7832708\t-1.525468\t-1.791098\t-0.984019\t-0.1604346\t0.2929637\t0.64561\t2.001577\t1.381008\t-0.7404164\t1.558096\r\n']


Is this the same issue that was fixed in https://github.com/piskvorky/gensim/issues/388 ?

Gordon Mohr

unread,
May 18, 2016, 7:05:37 PM5/18/16
to gensim
Looking at gensim's code, it both writes and expects spaces, not tabs, as the dimension delimiters. It appears the original word2vec.c and related programs expect the same. So your file is not in the word2vec.c text format. Maybe replacing the tabs with spaces would help, but there could be other differences, as well.
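A quick hypothetical converter along those lines — it assumes every data line is `word\tv1\t…\tvN` (as in the readlines() sample above), prepends the `count dim` header, and joins fields with single spaces:

```python
def tabs_to_word2vec_text(in_path, out_path):
    """Rewrite a tab-separated vector file into word2vec.c text format:
    a 'vocab_size vector_size' header line, then space-separated rows."""
    rows = []
    with open(in_path) as fin:
        for line in fin:
            fields = line.rstrip('\r\n').split('\t')
            if len(fields) > 1:      # skip blank lines and any tab-free header
                rows.append(fields)
    dim = len(rows[0]) - 1           # first column is the word itself
    with open(out_path, 'w') as fout:
        fout.write('%d %d\n' % (len(rows), dim))
        for fields in rows:
            fout.write(' '.join(fields) + '\n')
```

If you have already added a "137052 50" header by hand, the `len(fields) > 1` check skips it (it contains no tabs), and the counts are recomputed from the data either way. As noted, there could be other format differences beyond the delimiter.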

- Gordon

Bo

unread,
May 19, 2016, 11:37:34 AM5/19/16
to gensim
Thanks Gordon! The tabs were the problem indeed.

Since yesterday I have been testing different ways of using doc2vec for the task of Twitter sentiment classification, and based on my findings I have some questions:

- Are there any rules for choosing the number of training passes and inference steps? I am guessing this decision should be based on the size of my training and testing data and the task at hand, but I can't seem to find any useful rules to follow other than trying different numbers of passes and steps and evaluating over a hold-out validation set. 

- I have experimented with training doc2vec models on several thousand labelled tweets, adding not only unique doc numbers but also the sentiment label (0/1/2) of each tweet as tags. Then I use infer_vector() to generate features for the testing data. Does this make sense? 

- In general I have found that using infer_vector() to generate feature vectors for both my training data (which was used to build the doc2vec model in the first place) and my testing data gives better classification performance than retrieving the stored vectors for the training data and only inferring for the testing data. Is this finding valid, or hard to say?

- I have also trained several 300-dimensional doc2vec models on 2 million unlabelled tweets over 30 training passes. No significant improvement in classification performance. I shall try more training passes. 
