Embedding vector for word, query, sentence and paragraph

Nick Liu

Dec 3, 2015, 6:53:41 PM
to gensim

Hi,

I’ve played with word2vec for almost a year and learned a great deal. I’d like to share my experiences here to help others better use this powerful tool:

1. Word vector

a. It’s a little bit challenging to figure out the best training parameters for word2vec. Fortunately, Levy & Goldberg’s paper pointed us in the right direction:

Improving Distributional Similarity with Lessons Learned from Word Embeddings

From their paper:

· SGNS (Skip-Grams with Negative Sampling) is a robust baseline. While it might not be the best method for every task, it does not significantly underperform in any scenario. Moreover, SGNS is the fastest method to train, and cheapest (by far) in terms of disk space and memory consumption.

· With SGNS, prefer many negative samples.

My own experience with word2vec + deep learning also shows SGNS (cbow=0 and hs=0) is the best option.
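
For concreteness, here is a minimal gensim sketch of those settings (gensim 4.x parameter names; older versions use size instead of vector_size). The corpus file name and negative=15 are illustrative assumptions, not tuned recommendations.

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical corpus: one whitespace-tokenized query/sentence per line.
sentences = LineSentence("queries.txt")

model = Word2Vec(
    sentences,
    vector_size=300,  # embedding dimension
    sg=1,             # skip-gram (corresponds to cbow=0 in the original C tool)
    hs=0,             # no hierarchical softmax ...
    negative=15,      # ... use negative sampling, with many negative samples
    window=5,
    min_count=5,
    workers=4,
)
model.save("sgns.model")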

b. For SGNS, here is what I believe really happens during the training:

If two words appear together, the training will try to increase their cosine similarity. If two words never appear together, the training will reduce their cosine similarity. So if there are a lot of user queries such as “auto insurance” and “car insurance”, then the “auto” vector will be similar to the “insurance” vector, and the “car” vector will also be similar to the “insurance” vector. Eventually the “auto” vector will be very similar to the “car” vector, because both of them are similar to the “insurance”, “loan” and “repair” vectors. This intuition is useful if you want to design your training data to better meet the goal of your information retrieval task.
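
As a quick, purely hypothetical sanity check of this intuition, after training on such a query log you would expect something like the following (the scores are expectations, not measured results):

# Using the model trained above.
print(model.wv.similarity("auto", "car"))       # expected to be high
print(model.wv.most_similar("auto", topn=5))    # likely includes "car", "insurance", ...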

2. Query vector

A query usually has 1-10 words. I have implemented and tried two query vector predictors.

a. Sentence vector based on Quoc Le & Tomas Mikolov’s paper: Distributed Representations of Sentences and Documents

b. Avg predictor, which averages the word vectors of the words in a query. You can check out the source code here:

https://github.com/nliu86/Fixed-length-vector-predictor-for-text

The avg predictor is implemented with multi-threading and is super fast. For my experiment with word2vec + deep learning, the avg predictor slightly beats the sentence vector.
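
The linked repo is the real multi-threaded implementation; the following is only a single-threaded sketch of the same idea, with my own function name and the simplifying assumption that out-of-vocabulary words are skipped:

import numpy as np

def avg_predictor(text, wv):
    # Average the vectors of the in-vocabulary words; return zeros if none are known.
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    if not vecs:
        return np.zeros(wv.vector_size)
    return np.mean(vecs, axis=0)

query_vec = avg_predictor("auto insurance quote", model.wv)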

3. Sentence and paragraph vector

We have to set our expectations reasonably here: there is no magic that can accurately transform sentences and paragraphs, with their infinite possibilities, into a 300-dimension vector. Can you imagine spitting out 300 random numbers instead of saying a whole sentence to convey your meaning? Words and phrases appear in many contexts, so we can exploit those contexts. Sentences and paragraphs, however, usually appear only once in the training corpus; the only thing we can exploit is the relationship between a sentence and the words in it. By doing so, we get something similar to the avg predictor mentioned earlier.

So for sentences and paragraphs, the best way to represent them is to first remove all the stop words, then pick a dominant word set from the remaining words and apply the avg predictor to that set. If we don’t pick a dominant word set, the avg predictor will average every word in the sentence and the resulting vector will be super noisy. I used a special training data set to train word2vec and then used vector clustering to pick dominant word sets from this article: Disneyland Bought Extra Land For A Billion-Dollar Park Expansion. The sets look like [disney disneyland park] and [disney disneyland star_wars], which are good enough for my purpose.
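
The post does not spell out the exact clustering procedure, so the following is only my guess at one way to pick a dominant word set: cluster the content-word vectors with k-means and average the largest cluster. The stop-word list, cluster count and function name are all assumptions.

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

STOP_WORDS = {"the", "a", "an", "and", "of", "for", "to", "in", "on"}  # tiny illustrative list

def dominant_set_vector(text, wv, n_clusters=3):
    words = [w for w in text.lower().split() if w not in STOP_WORDS and w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    X = np.array([wv[w] for w in words])
    k = min(n_clusters, len(words))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    dominant = Counter(labels).most_common(1)[0][0]   # label of the largest cluster
    return X[labels == dominant].mean(axis=0)         # avg predictor over the dominant word set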

4. Tri-letter gram vector

Another way to use word2vec is to transform the training data into tri-letter gram format. Say we have the query “best hotel deal”; we can transform it into “bes est hot ote tel dea eal”, then use word2vec to train a vector for each tri-letter gram. If we combine this with the query avg predictor and deep learning, we get something similar to DSSM, but much simpler:

Learning Deep Structured Semantic Models for Web Search using Clickthrough Data

Tri-letter gram vectors can be useful for detecting the meanings of misspellings, unknown words and domain names. For example, if someone types “cooool”, we can figure out that its meaning is similar to “cool”.
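
A small sketch of the tri-letter-gram transformation described above (the function name and the handling of words shorter than three letters are my own choices). The output tokens can then be fed to word2vec just like ordinary words:

def tri_letter_grams(text):
    grams = []
    for word in text.lower().split():
        if len(word) < 3:
            grams.append(word)  # keep very short tokens as-is (an assumption)
        else:
            grams.extend(word[i:i + 3] for i in range(len(word) - 2))
    return grams

print(tri_letter_grams("best hotel deal"))
# ['bes', 'est', 'hot', 'ote', 'tel', 'dea', 'eal']
print(tri_letter_grams("cooool"))
# ['coo', 'ooo', 'ooo', 'ool']  -- overlaps heavily with the tri-letter grams of "cool"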

5. Embedding vectors and deep learning

Embedding vectors are married to deep learning. Without deep learning, we lose a lot of the benefits of embedding vectors. For my predictive modeling problem with 1 TB of training data and 200 million rows, the deep-learning-based model outperforms a simple neural network model by about 10%.

 

In summary, here is what I recommend if you plan to use word2vec: choose the right training parameters and training data for word2vec, use the avg predictor for queries, sentences and paragraphs (code here), and apply deep learning to the resulting vectors.

 

Let me know if you have any questions.

 

Nick

Stephen Oates

Dec 5, 2015, 5:46:51 AM
to gensim
Wow - thanks so much - this is really useful.

Radim Řehůřek

Dec 6, 2015, 4:22:26 AM
to gensim, word2vec...@googlegroups.com, Matthew Honnibal
Thanks Nick, great stuff! Let me cross-post this to the word2vec-toolkit mailing list as well.

I wish more people posted experimental results. By the way, I talked about that Levy & Goldberg paper in this YouTube talk.

Also, regarding named entity recognition and chunking for word2vec -- Matthew Honnibal has been experimenting with this in his spaCy library recently, with some interesting results:

Best,
Radim

Lachlan Miller

Dec 6, 2015, 8:25:54 PM
to gensim
Very interesting findings, thanks for the post!

Rafael Garcia Vega

Dec 11, 2015, 9:37:46 AM
to gensim
Really impressive work. Currently I'm working on a method to semantically retrieve information from a tweet dataset. I'll give your approach a try and let you know. Right now I'm trying to find a dominant word for each tweet using k-means.

Rafa

Parkway

Dec 14, 2015, 4:19:26 AM
to gensim
@nick Thank-you for sharing your experiences. A question:

Under 'Sentence and paragraph vector', it says "there is no magic that can accurately transform sentences and paragraphs, with their infinite possibilities, into a 300-dimension vector", but each number is in the range [-1, 1]. Doesn't that allow for almost infinite possibilities?

Nick Liu

Dec 15, 2015, 5:23:54 PM
to gensim, word2vec...@googlegroups.com, honn...@gmail.com
Thanks Radim! In the future I will update my post with more details.

I checked your YouTube talk. It's a good summary of all the important work about word2vec!

Matthew Honnibal's experiments look very interesting. It's definitely an improvement over word2vec. I will try it out someday.

Thanks,

Nick

Nick Liu

Dec 15, 2015, 6:09:33 PM
to gensim
For short sentences, maybe there is a way to transform them into a 300-dimension vector. For example, we can replace each letter with its ASCII code so we get a sequence of numbers whose length is less than 300. But for long sentences and paragraphs, this sequence of numbers can go on and on. How is it possible to project such a sequence of numbers, of potentially unbounded length, into 300 dimensions without losing information? I seriously doubt it's possible.

I believe there is a big mistake in the sentence vector paper, which assumes we can train vectors for sentences the same way as for words and phrases. Words and phrases have a lot of contexts; the same word or phrase may appear hundreds or thousands of times in the training corpus, but the same sentence usually appears only once. If you check the source code of the sentence vector implementation, you will see that the sentence vector is mostly influenced by the words inside the sentence, not by its contexts. That's a huge difference from word vector training.

Ralph Tigoumo

Dec 23, 2015, 12:01:05 AM
to gensim
Hi Nick,

I really enjoyed your analysis, thanks so much for sharing your insights!

You claim that word2vec averaging (the avg predictor) can beat doc2vec. Do you have any figures to support the claim? I'm interested in how much it actually beats doc2vec, and would love to see some of your actual figures and which dataset you used for evaluation.

Thanks in advance :)

Nick Liu

Jan 21, 2016, 3:45:11 PM
to gensim
Hi, Ralph,
Thank you for your interest in my analysis! The data set consists of 200 million ad samples from a search engine. Each sample has 3 fields: query, keyword, click/no-click.

Both query and keyword are transformed into 100-dimension vectors using my avg predictor or the sentence vector. Then I apply a deep neural network to the transformed dataset. AUC is used to evaluate model performance. Here are the rough numbers on the test data set:

Best baseline model AUC: 0.6865
word2vec avg predictor + deep neural network: 0.6977
word2vec sentence vector + deep neural network: 0.6968
word2vec avg predictor + simple neural network: 0.6421

Based on our past experiences, an AUC improvement of 0.001 is significant. Hope the numbers will be helpful for you.
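
For anyone who wants to reproduce the general shape of this setup, here is a rough placeholder sketch, not the actual pipeline or data: each row would be the concatenation of the 100-dimension query and keyword vectors from the avg predictor, the labels are click/no-click, and AUC is computed on a held-out split. The random arrays below exist only to make the snippet runnable.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))     # placeholder for [query_vec | keyword_vec] rows
y = rng.integers(0, 2, size=1000)    # placeholder click / no-click labels

clf = MLPClassifier(hidden_layer_sizes=(256, 128, 64))  # illustrative architecture
clf.fit(X[:800], y[:800])
print("AUC:", roc_auc_score(y[800:], clf.predict_proba(X[800:])[:, 1]))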

Thanks,

Nick