Semi-supervised approach for word2vec using skip-gram


sarath r nair

Aug 19, 2015, 3:32:53 AM
to gensim
Hi Guys,

In the following paper,


at equation (3.3), they sum over all the categories of a word/query (in their case) for a semi-supervised skip-gram approach.

At the end, they learn representations for words as well as categories and compute a cosine similarity. But the paper does not mention how to update the category vectors. So my doubt is:

If I have three categories ['cat_1', 'cat_2', 'cat_3'], I thought of adding a zero matrix of shape (3, self.layer1_size) and updating it the same way you update "model.syn1neg". Is this the correct approach? Or do I have to assign a matrix for the categories randomly, as you do for "model.syn0", and update it the same way as "model.syn0"? It is quite unclear to me how to learn vector representations for the categories.

Gordon Mohr

Aug 19, 2015, 7:00:28 PM
to gensim
While I can't say I completely understand the implication of equation (3.3) on actual (sparser) training steps, I *think* what they're doing is functionally similar to treating the category-tags as if they can be used in place of matching queries, in the construction of individual skip-gram training examples.

They're already treating queries (often multi-word) as if they were single word-tokens. And the query sessions are like sentences. 

So session [q0, q1, q2, q3] is analogous to a sentence [w0, w1, w2, w3]. 

When they know, from their category-labeling, that 'q2' belongs to categories [cat_7, cat_31], then I suspect their actual training examples from a single query-session [q0, q1, q2, q3] are expanded to become somewhat like:

[q0, q1, q2, q3]
[q0, q1, cat_7, q3]
[q0, q1, cat_31, q3]

That is, every category-tag gets to participate in skip-gram training as if it appeared in-place-of the matching queries. 

In that way, the category-tags get vectors in the "same space" as queries. At the end, the vector for (previously hand-categorized) q2 will (probably) be quite close to both cat_7 and cat_31 – but they already knew that. And also, they may get reasonable category-tags for uncategorized queries, like say q1, by finding the cat_N(s) closest to it. 
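That nearest-category lookup is just a cosine-similarity ranking over the learned vectors. A toy illustration, with made-up 2-d vectors standing in for the trained ones:

```python
import numpy as np

def nearest_categories(query_vec, cat_vecs, topn=2):
    """Rank category-tags by cosine similarity to a query vector."""
    sims = {}
    for cat, vec in cat_vecs.items():
        sims[cat] = np.dot(query_vec, vec) / (
            np.linalg.norm(query_vec) * np.linalg.norm(vec))
    return sorted(sims, key=sims.get, reverse=True)[:topn]

# Toy 2-d vectors: q1 points mostly in cat_7's direction.
cat_vecs = {'cat_7': np.array([1.0, 0.1]),
            'cat_31': np.array([0.1, 1.0])}
q1 = np.array([0.9, 0.2])
nearest_categories(q1, cat_vecs, topn=1)  # -> ['cat_7']
```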

If that's a proper understanding, then a quick-and-dirty way to simulate this approach in gensim might be to preprocess your corpus, expanding it to include the known category-for-query substitutions. (This might inadvertently overweight the other queries in the same session that don't have category-substitutions... or maybe that's harmless or offsettable by other parameters. Unsure.)
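A rough sketch of that corpus-expansion preprocessing (the helper function and the `categories` mapping are hypothetical, just to illustrate the category-for-query substitution):

```python
def expand_with_categories(sessions, categories):
    """For each session, also yield variants where a categorized
    query is replaced, in place, by each of its category-tags."""
    for session in sessions:
        yield session
        for i, query in enumerate(session):
            for cat in categories.get(query, []):
                variant = list(session)
                variant[i] = cat
                yield variant

sessions = [['q0', 'q1', 'q2', 'q3']]
categories = {'q2': ['cat_7', 'cat_31']}
expanded = list(expand_with_categories(sessions, categories))
# expanded holds the original session plus one variant per category-tag
```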

- Gordon

sarath r nair

Aug 20, 2015, 1:50:09 AM
to gen...@googlegroups.com
Hi Gordon,

I understand it the same way. For explanation purposes, I will give a small example here.

Let's say my input sentence is ['python', 'django', 'java'].

So this sentence will be chopped into words as ['word1', 'word2', 'word3'].

In gensim I will pass it to train_sg_pair(word, word2), where word = 'python' and word2 = 'django'.

Assume my category for 'python' is 'it_skill'.

So, I need to train it by acquiring the vector corresponding to 'django' from "model.syn0".

Let's say negative samples = 1, so labels = [1, 0].

Conventionally, with no category labels, we find a dot product between

model.syn0['django'] - a 100 x 1 vector
model.syn1neg[['python', 'sampled_word']] - 2 x 100

so I will get 'fb' as the dot product in gensim. I then update model.syn0['django'] and model.syn1neg[['python', 'sampled_word']].

If a label exists for 'python', I want to increase the probability of 'django' given the category of 'python'.

If you look at the paper, the only change between the normal equation and the modified equation (3.3) is

p(word | input_word) + p(word | input_category).

But my main problem: should I update the category vectors the same way I update the output vectors, i.e. model.syn1neg[['python', 'neg_sample']]?

Or

should I update the category vectors as I update the word 'django' in model.syn0?

That is my main problem.

If this reply is confusing, sorry :-)


Gordon Mohr

Aug 20, 2015, 3:15:31 AM
to gensim
My understanding of their paper is that the categories are practically 'synthetic words', and thus are trained identically to the words. They really just drop into the skip-gram pairs, as if they were themselves words. 

So looking into the details of what happens in `train_sg_pair()` may not be necessary – except to satisfy curiosity. You just want to make sure pairs of [category_tag, target_word] are mixed in with all the other usual pairs of [anchor_word, target_word]. And the easiest way to do that would be preprocessing the corpus to expand it with more synthetic text that represents the categories, rather than editing the training code. 

By way of example, consider a longer sentence like the following, and assume a 'window' of 2 for simplicity:

['many', 'projects', 'are', 'preferring', 'python', 'for', 'machine', 'learning', '.']

In plain skip-gram training, the pairs anchored around 'python' would be:

['python', 'are']
['python', 'preferring'] 
['python', 'for'] 
['python', 'machine'] 

Now let's consider 'python' to be a word in the category 'CAT_PROGRAMMING_LANGUAGE'. I believe (but am not sure) that under the Yahoo paper's training method, the category training is performed, at this anchor-point while iterating over the text, by also training the pairs:

['CAT_PROGRAMMING_LANGUAGE', 'are']
['CAT_PROGRAMMING_LANGUAGE', 'preferring'] 
['CAT_PROGRAMMING_LANGUAGE', 'for'] 
['CAT_PROGRAMMING_LANGUAGE', 'machine'] 

Presumably, this substitution would also affect other contexts that overlap that location. That is, 'CAT_PROGRAMMING_LANGUAGE' would also sometimes appear as the 2nd (target) word of the pair. (But this might not be strictly necessary.)

So really, the known categories are best thought of (and probably implemented) as special words, sometimes dropped in. The most quick-and-dirty way to simulate this would be, when you have the one sentence above, to expand your corpus to also include the sentence:

['many', 'projects', 'are', 'preferring', 'CAT_PROGRAMMING_LANGUAGE', 'for', 'machine', 'learning', '.']

That might not be ideal – other words in the sentence, outside the category-influenced window (like 'many', 'projects', etc.) are also now artificially getting more training examples. (Maybe in practice that's not a problem.) Excerpting just the window around the word with category-alternates would minimize that effect – though there'd still be a distortion around the excerpt-edges. But this preprocessing-only approach might be an easy initial way to test most of the technique without modifying library code. If that works, you could then attempt a more precise modification of training, by editing the `train_*` methods (in both the Python and Cython code) to create only the minimal number of new training pairs.
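The excerpting idea could be sketched like this (the helper name and the `window` default are just illustrative, and you'd pick the window to match your model's):

```python
def category_excerpts(sentence, categories, window=2):
    """Yield short excerpts where a categorized word is replaced
    by each of its category-tags, trimmed to +/- `window` tokens,
    so uninvolved words don't get extra training examples."""
    for i, word in enumerate(sentence):
        for cat in categories.get(word, []):
            lo, hi = max(0, i - window), i + window + 1
            excerpt = list(sentence[lo:hi])
            excerpt[i - lo] = cat
            yield excerpt

sentence = ['many', 'projects', 'are', 'preferring', 'python',
            'for', 'machine', 'learning', '.']
categories = {'python': ['CAT_PROGRAMMING_LANGUAGE']}
excerpts = list(category_excerpts(sentence, categories))
# excerpts: [['are', 'preferring', 'CAT_PROGRAMMING_LANGUAGE', 'for', 'machine']]
```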

- Gordon

sarath r nair

Aug 20, 2015, 5:07:11 AM
to gen...@googlegroups.com
Hi Gordon,

First of all, thanks a lot for your reply. I modified the code yesterday as follows.

['many', 'projects', 'are', 'preferring', 'python', 'for', 'machine', 'learning', '.']
['many', 'projects', 'are', 'preferring', 'CAT_PROGRAMMING_LANGUAGE', 'for', 'machine', 'learning', '.']

I am training on, let's say, ('preferring', 'python'), ('for', 'python'), etc., and
I am training on, let's say, ('preferring', 'category_python'), ('for', 'category_python'). These are summed up to get
'fb', as in train_sg_pair.

At the end of one word-pair training step, I am updating

the word ('preferring') and separately updating the category ('category_python'). The code is still running; I will report the results after it finishes.

Thanks
Sarath

sarath r nair

Aug 20, 2015, 5:17:00 AM
to gen...@googlegroups.com
Hey Gordon,

One more doubt. If I am doing the training as you said for

['many', 'projects', 'are', 'preferring', 'python', 'for', 'machine', 'learning', '.']
['many', 'projects', 'are', 'preferring', 'CAT_PROGRAMMING_LANGUAGE', 'for', 'machine', 'learning', '.']


then, after training the first word pair in the words-only case (I mean, let's not consider categories, and no negative samples),

I will train on 'preferring' and 'python'. At the end I update 'model.syn0' for 'preferring' and 'model.syn1neg' for 'python'.

Now let's come to the category case.

I will train on 'preferring' and 'category_python'. At the end I update 'model.syn0' for 'preferring' and 'model.syn1neg' for 'category_python'.

But after the model is trained (in the normal, words-only case), I use 'model.syn0' as the matrix that holds the word vectors for all words; 'model.syn1neg' will not be used, right?

So, if in the category update I store categories in model.syn1neg, will that be a useful representation of the category vectors?


Gordon Mohr

Aug 20, 2015, 2:07:47 PM
to gensim
If you are training category-tokens exactly the same as word-tokens, then the vectors for category-tokens will be in the same array (model.syn0) as word-tokens. 

If you've pre-expanded your text via the category-for-word substitution, before passing it to the gensim Word2Vec constructor/methods, there will be a vocabulary item for the category-token just like any other word. You won't be directly updating any of model.syn0 or model.syn1 (for hierarchical-softmax) or model.syn1neg (for negative-sampling) – you'll just be providing different text. 

At the end, you'll retrieve the vectors for category-tokens the same as word-tokens – they'll both come from model.syn0. 

(I've seen a mention somewhere that the syn1/syn1neg representations can be useful too, perhaps concatenated with same-word syn0 representations. But I haven't tested that, and it's not the basic/original description of word2vec. So for simplicity, we can just consider that part of the model discarded when training is done.)

- Gordon 

sarath r nair

Aug 20, 2015, 10:17:45 PM
to gen...@googlegroups.com

Good information, Gordon. I will certainly try that, and I will let you know how my results turn out. In the meantime, if you find something useful, please share.

Thanks
Sarath
