Can document vectors be used for text classification?


Satya Gunnam

May 4, 2017, 8:40:59 PM
to gensim
I have a use case where I need to filter or classify some text.
I have seen the simple bag-of-words approach in scikit-learn
using CountVectorizer, etc. It looks simple and seems to work well.
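For concreteness, the kind of baseline I mean is roughly this (toy data, not my real corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "good service fast delivery",
    "awful service slow delivery",
    "fast and good",
    "slow and awful",
]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # sparse document-term count matrix
clf = LogisticRegression().fit(X, labels)

pred = clf.predict(vectorizer.transform(["fast good delivery"]))[0]
```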

I want to know whether document vectors trained by the doc2vec model can be
fed to a logistic regression model, and whether that would give better results
than the approach above in terms of accuracy or error rate.
If yes, do you have any examples you can refer me to?


Shiva Manne

May 6, 2017, 6:38:42 PM
to gensim

Hey Satya,
Yes, feeding document vectors to conventional machine learning techniques like logistic regression and SVMs is a valid and popular approach for your task. The original Doc2Vec paper reports results comparable to the state of the art on various text classification and sentiment analysis tasks using this model. You should definitely go ahead and try Doc2Vec. In case your results are not satisfactory, you could also look at variations of LSTMs/CNNs. Here are links to related notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

Regards,
Shiva.

Satya Gunnam

May 7, 2017, 11:45:13 PM
to gensim
Thanks. I did find this notebook:

It looks at 7 different ways of doing text classification, starting with bag-of-words and going up to doc2vec.

The thing that bothers me is that of all the 7 methods it tries, doc2vec-trained vectors
fed to KNN/logistic regression have much lower accuracy than the simple bag-of-words approach.

Ivan Menshikh

May 8, 2017, 8:12:27 AM
to gensim
Hello, Satya

Don't worry, everything depends on the "tuning" of the models (higher-quality embeddings, model parameters), the source dataset, and the upper model (logistic regression or something else).

In a real text classification task, I have successfully used LDA, LSI and Doc2vec embeddings with a RandomForest. This worked well for my task (significantly better than the bag-of-words approach).

Satya Gunnam

May 8, 2017, 1:04:42 PM
to gensim
Ok, thanks for the information.

I am going to use doc2vec, test it on my corpus/use case, and report back with my findings.

Satya Gunnam

May 11, 2017, 6:51:04 PM
to gensim
I have tried pretty much all the models from that notebook, starting from bag-of-words.

The accuracy ranges from 70-80%.

I had around 1300 samples with 3 classes and almost equal data for each class.

I tried to tweak the word2vec/doc2vec model params to the best of my knowledge.

I am seeing some obvious misses in the predictions.

What can I do to improve accuracy further?

Is it something to do with the data itself? I have not really looked at the data yet.

Ivan Menshikh

May 12, 2017, 1:59:49 AM
to gensim
The quality of the model depends on the dataset, of course.

To improve accuracy, you can do several things:
- Extend your dataset (1300 samples is not enough for topic models and embeddings like doc2vec).
- Check the balance between classes (enough objects for each class, no skew between classes).
- Preprocess the texts more carefully (stemming, tokenization, filtering too-infrequent/too-frequent tokens).
- Tune model parameters ONLY on a validation set (if you tune on the test set, you will overfit).
- Use a more complex (nonlinear) model as the upper-level model.
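As an illustration of the validation-set point (the split ratios and placeholder data here are just an example):

```python
from sklearn.model_selection import train_test_split

docs = [f"doc {i}" for i in range(1300)]   # stand-ins for 1300 documents
labels = [i % 3 for i in range(1300)]      # stand-ins for the 3 classes

# 60% train, 20% validation, 20% test, stratified to preserve class balance
X_train, X_rest, y_train, y_rest = train_test_split(
    docs, labels, test_size=0.4, stratify=labels, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Tune Doc2Vec / classifier parameters against (X_val, y_val) only;
# evaluate the final chosen model once on (X_test, y_test).
```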


From my practical observations, the weak point of LDA/Doc2Vec is short texts; keep this in mind.
