Find positive and negative words in corpus using Word2Vec's most_similar


Lior Magen

unread,
Aug 10, 2016, 3:03:48 AM8/10/16
to gensim
I'm trying to get a list of positive words and a list of negative words using Word2Vec. 

I've created a small set of generic negative/positive words, and using them I'm trying to do this.

words_list = dict(Positive=['good', 'great', 'best'],
                  Negative=['bad', 'awful', 'terribl', 'disappoint'])
sentiments_dict['Positive'] = [x[0] for x in
                               self.model.most_similar(positive=words_list['Positive'],
                                                       negative=words_list['Negative'],
                                                       topn=most_common, restrict_vocab=1000)]
sentiments_dict['Positive'].extend(words_list['Positive'])
sentiments_dict['Negative'] = [x[0] for x in
                               self.model.most_similar(positive=words_list['Negative'],
                                                       negative=words_list['Positive'],
                                                       topn=most_common, restrict_vocab=1000)]
sentiments_dict['Negative'].extend(words_list['Negative'])


The problem is that in some cases the results are fine, but in others they're pretty poor.

Any suggestions on how to improve this? I believe the basic idea is right, but I guess there's something I'm missing.

jayant jain

unread,
Aug 10, 2016, 9:57:16 AM8/10/16
to gensim
Hi Lior,

That is an interesting approach. What corpus and hyperparameters are you using? 

In general, for semantic similarities, I've noticed skipgram models give much better results than cbow (which is the default for gensim). 

Also, you'll want to train on a relevant corpus which uses the words you're planning to use in a similar context (for example, a word like 'great' is used in very different contexts in a corpus of movie reviews as compared to, say, a corpus of wikipedia articles).

Another thing I'm curious about is your reasoning behind using word2vec for this. word2vec is trained in an unsupervised manner, and in my opinion you'd be much better off using, for example, word embeddings that have been fine-tuned for a sentiment classification task.

Lior Magen

unread,
Aug 10, 2016, 10:09:51 AM8/10/16
to gensim
I'm using my own corpus, and that's exactly why I'm trying to do this task using Word2Vec: I want it to find positive/negative words based on the corpus alone. Each corpus contains ~80,000 documents (each ~20 words long) and belongs to a different category: fast food (where I would like to get words such as 'tasty', 'disgusting'...), coffee makers, vehicles, etc.
About hyperparameters, I'm not so sure what you mean by that.

I'll try using Skipgram, never thought about it before, maybe it'll lead to better results.

I'm not trying to create a sentiment classifier (I don't have a per-category training set); I'm just looking to find corpus-based positive and negative words, so this task must be unsupervised. I'm trying to get only ~10 words per sentiment.

I'll update if the results get better.

jayant jain

unread,
Aug 10, 2016, 11:11:59 AM8/10/16
to gensim
By hyperparameters, I mean the arguments you pass to the Word2Vec class - here is the documentation.
Training the skipgram model involves changing one of those hyperparameters, `sg`. By default it is 0, so the CBOW model is used for training; you can set it to 1 to use the skipgram model.
Other hyperparameters you might want to look into are window size, number of epochs, vector size and learning rate.

Lior Magen

unread,
Aug 10, 2016, 11:18:16 AM8/10/16
to gensim
OK, now I understand: size=200, iter=10, window=7, and the others are defaults. I'll change it to use skip-gram.

jayant jain

unread,
Aug 10, 2016, 11:29:23 AM8/10/16
to gensim
Another reason I'm unsure about whether word2vec is the ideal approach for this is that word2vec distance is not always the same as semantic distance. A brief explanation for this would be - word2vec trains word representations on the basis of context words, and often both synonyms and antonyms appear in similar contexts. 

An example from restaurant reviews would be - The food was __. Both negative and positive words would fit well there, and so antonyms often end up having similar representations (from personal experience, `model.most_similar('good')` almost always seems to have `bad` in the top 5 results).

A few links which I found quite insightful -
1. Messing around with word2vec <https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/>
2. Exploring antonyms with word2vec <https://gist.github.com/kylemcdonald/9bedafead69145875b8c>

Gordon Mohr

unread,
Aug 10, 2016, 1:13:51 PM8/10/16
to gensim
Your approach is plausible enough to be worth trying... but generally there are no firm guarantees about whether the arrangement/directions of word-vectors will fit some desired purpose. Outcomes depend significantly on your training data and other parameter choices. To the extent that directional/distance relationships arise that happen to match human intuitions about meaning, it's still somewhat of a lucky trick, and improving those relationships requires iterative trial-and-error. 

As Jayant notes, what we consider 'antonyms' are often quite-close, in word-vector space, because they appear interchangeably in similar contexts. (It still may be the case that the *direction* of difference between them reflects some interesting meaning, but the nearby 'neighborhood' often has words that contrast in human-understanding.) Some have described this interchangeable-similarity as 'syntactic' or 'typical' similarity, in contrast with 'semantic' or 'topical' similarity – and further noted that larger windows create vectors more indicative of 'semantic'/'topical' similarity. So you may want to experiment with larger window values. 

Regarding the way you're using `most_similar()`:

* Its `positive` and `negative` parameters are there to average-and-difference the supplied words, as if solving the analogies often used as one way to evaluate word-vectors. Averaging together a few words you think of as 'good' might not result in the purest 'positive-sentiment' direction – I'm not sure; definitely try more or fewer words. Likewise, subtracting out a vector representing (one or more) 'negative-sentiment' directions might not result in a purer 'positive' direction – I'm not sure; it may not be necessary (compared to just using positive-sentiment words alone), so only do it if it seems to be helping. (And vice-versa for the 'negative-sentiment' case.) More narrowly-focused differences between closely-associated polar words ('good' - 'bad', 'like' - 'dislike', 'love' - 'hate', etc.) might be worth trying.

* It seems like you may be stemming words. I don't know if this helps or hurts, but can say that many word-vector projects/papers skip this, trusting instead that (with enough examples) all variations of a word still arrange themselves appropriately. 

* Restricting your results to just the most-common 1000 words in your corpus may be limiting; if there's enough data that the vectors for less-common words are still well-defined, I suspect even uncommon (but clear/strong) words indicating certain sentiments will be of interest.
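The 'direction of difference' idea above can be sketched with plain numpy. The word names and vector values here are invented toy data (real vectors would come from a trained model, e.g. `model.wv[word]`): average a few polar-pair differences into a single sentiment axis, then rank words by their projection onto it.

```python
import numpy as np

# Hand-made 4-d toy vectors; chosen so the "sentiment" lives mostly
# in the first dimension. In practice these come from a trained model.
vecs = {
    "good":  np.array([ 1.0, 0.2, 0.1, 0.0]),
    "bad":   np.array([-1.0, 0.2, 0.1, 0.0]),
    "love":  np.array([ 0.9, 0.1, 0.3, 0.1]),
    "hate":  np.array([-0.9, 0.1, 0.3, 0.1]),
    "tasty": np.array([ 0.8, 0.5, 0.0, 0.2]),
    "nasty": np.array([-0.7, 0.5, 0.0, 0.2]),
}

# Average the differences of a few closely-associated polar pairs
# to get a sentiment axis.
pairs = [("good", "bad"), ("love", "hate")]
axis = np.mean([vecs[p] - vecs[n] for p, n in pairs], axis=0)
axis /= np.linalg.norm(axis)

# Project every word onto the axis: positive score -> positive word.
scores = {w: float(np.dot(v / np.linalg.norm(v), axis))
          for w, v in vecs.items()}
print(sorted(scores, key=scores.get, reverse=True))
# -> ['good', 'love', 'tasty', 'nasty', 'hate', 'bad']
```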

- Gordon

Andrey Kutuzov

unread,
Aug 10, 2016, 1:43:56 PM8/10/16
to gen...@googlegroups.com
There are different ways to overcome this issue with antonyms.
See, for example,
http://aclanthology.info/papers/integrating-distributional-lexical-contrast-into-word-embeddings-for-antonym-synonym-distinction


--
Solve et coagula!
Andrey

Lior Magen

unread,
Aug 11, 2016, 7:25:19 AM8/11/16
to gensim
You are absolutely right, and that's the reason I'm using both positive example words and negative example words: the positive/negative combination should offset the shared context directions and return only relevant words.

Lior Magen

unread,
Aug 11, 2016, 8:47:17 AM8/11/16
to gensim
Some updates:

First of all, thank you all for the help, you helped me so much.

I've tried using Skip-Gram instead of CBOW and it did lead to better results, but I have a pre-trained CBOW model, so retraining it with SG is not really an option for me; although the results are better, the improvement isn't worth retraining the model.

The text I'm using is stemmed. I've tried to train on non-stemmed text, but the results stay pretty much the same.

I've tried using most_similar with only positive input (and only negative input for negative sentiment), but it led to worse results, which makes sense: just like Jayant explained, the words 'bad' and 'good' appear in identical contexts, so when I was looking for words similar to 'good' it returned words like 'tasty' and 'delicious' but also the word 'bad', so it's no good.

I've used restrict_vocab=1000 because I want the most popular words: not just positive/negative words, but the most popular of them.

A side note - I was working on a fast food category and in that case finding positive words was more accurate than finding negative words.

In the end I got the best results with the following settings:

Algorithm = CBOW
sentiments_dict['Positive'] = [x[0] for x in
                               self.model.most_similar(positive=['good', 'best', 'love'],
                                                       negative=['bad', 'hate'],
                                                       topn=most_common, restrict_vocab=3000)]
sentiments_dict['Positive'].extend(['good', 'best', 'love'])

sentiments_dict['Negative'] = [x[0] for x in
                               self.model.most_similar(positive=['bad', 'awful', 'hate', 'product'],
                                                       negative=['good', 'love', 'product'],
                                                       topn=most_common, restrict_vocab=2000)]
sentiments_dict['Negative'].extend(['bad', 'awful', 'hate'])


Category: Fast food
Highlight positive words:['delici', 'solid', 'yummi', 'favorit', 'great', 'fantast', 'fastest', 'good', 'best', 'love']
Highlight negative words:['horribl', 'terribl', 'unaccept', 'nasti', 'poor', 'disrespect', 'lousi', 'bad', 'awful', 'hate']

Category: Diapers
Highlight positive words: ['excel', 'fantast', 'softest', 'great', 'terrif', 'amaz', 'dryness', 'good', 'best', 'love']
Highlight negative words: ['terribl', 'horribl', 'instant', 'immedi', 'pain', 'bleed', 'burn', 'bad', 'awful', 'hate']

jayant jain

unread,
Aug 11, 2016, 8:08:44 PM8/11/16
to gensim
Thanks for the updates. Good to hear that you've gotten better results.

You've mentioned previously that your corpus has around 80,000 documents with around 20 words each. Training on a corpus that size should take very little time (assuming you have the fast version of word2vec with cython set up), so it might actually be a good idea to retrain the model with different hyperparameters and see what works best.

As Gordon noted above, increasing the window size leads to more semantic information being captured (which seems intuitive): in a small window, `good` and `bad` might have identical contexts, but in a larger window that should hopefully change (I can imagine `bad` co-occurring with other negative words in a larger window).

Lior Magen

unread,
Aug 14, 2016, 5:20:40 AM8/14/16
to gensim
Thanks for the tips, I'll try to increase the window size. 