gensim word2vec - binary text classification

Venkatesh Adiga

Sep 26, 2017, 9:13:39 AM

to gensim

Hi,

I am new to gensim and am trying to implement binary text classification for an email filter (spam or non-spam) based only on the message content. I am finding it difficult to use word2vec with the scikit-learn GridSearchCV API. Has anyone tried using GridSearchCV with word2vec vectors instead of word tokens? Please let me know if you have any information.

Thanks in advance.
Regards
Venkatesh

Ivan Menshikh

Sep 27, 2017, 7:01:35 AM

to gensim

Hi Venkatesh,

Please look at the sklearn API integration notebook; it contains the examples you need.

FYI, don't forget to upgrade gensim to the latest version (3.0.0).

Venkatesh Adiga

Oct 11, 2017, 9:18:32 AM

to gensim

Thank you, Ivan. I have switched my model from Word2Vec to Doc2Vec, as Word2Vec was not meeting my requirements for SMS spam prediction. I have upgraded from 2.3 to 3.0, so I now have access to D2VTransformer.

But I could not find clear help on D2VTransformer.
I am trying to implement a logistic regression prediction function on top of a Doc2Vec model that is already trained and saved to disk. I have also saved the logistic regression model as a pickle file.
My intent is to convert a received SMS message into a Doc2Vec vector and predict the type of that message.
So the following items need to be integrated:
1) Stored Doc2Vec model
2) Stored logistic regression model
3) Conversion of the received SMS to a Doc2Vec instance and updating the saved model
4) Prediction on the new Doc2Vec instance

Can anyone please give me some tips on this?

Thank you.
Venkatesh

Ivan Menshikh

Oct 12, 2017, 1:09:28 AM

to gensim

Sorry, I don't follow. What is your problem, more concretely?

Venkatesh Adiga

Oct 12, 2017, 1:56:20 AM

to gensim

I want to implement a Python method that predicts whether a received SMS is spam or ham, using Doc2Vec and a logistic regression classifier. I have completed training on an SMS repository, and have saved the Doc2Vec model to disk and the logistic regression model as a pickle file.
When an SMS is received, the new method should predict its type (spam or ham) using the saved Doc2Vec model and the pickled logistic regression model.

I would like to know which steps I should follow.

Thanks
Venkatesh

Gordon Mohr

Oct 12, 2017, 2:45:27 PM

to gensim

The demo notebook included in gensim that Ivan mentions in another thread (https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb) is a good example to study - it applies Doc2Vec to a binary "positive"-"negative" (sentiment) classification.
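For the specific steps asked about (load the saved Doc2Vec model, load the pickled classifier, turn the incoming SMS into a vector, predict), a minimal sketch might look like the following. The function names and the lowercase-and-split tokenization are assumptions made for illustration; the tokenization should match whatever was used on the training texts. Note that `infer_vector()` computes a vector for new text without modifying the saved model, so no model update should be needed for prediction.

```python
import pickle


def load_models(doc2vec_path, classifier_path):
    """Load the saved Doc2Vec model and the pickled classifier from disk."""
    # Imported here so predict_sms() below has no hard dependency on gensim
    # and works with any object that exposes infer_vector().
    from gensim.models.doc2vec import Doc2Vec

    doc2vec = Doc2Vec.load(doc2vec_path)
    with open(classifier_path, "rb") as handle:
        classifier = pickle.load(handle)
    return doc2vec, classifier


def predict_sms(text, doc2vec, classifier):
    """Infer a vector for one SMS and return the classifier's predicted label."""
    # Assumption: training texts were lowercased and split on whitespace.
    tokens = text.lower().split()
    # infer_vector() takes a list of word tokens and leaves the model unchanged.
    vector = doc2vec.infer_vector(tokens)
    # scikit-learn classifiers expect 2-D input: one row per sample.
    return classifier.predict([vector])[0]
```

Typical use would be to call `load_models()` once at startup with your own file paths, then call `predict_sms(message, doc2vec, classifier)` for each incoming message. Inference is randomized, so repeated calls on the same text can give slightly different vectors.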

Note that if your real goal is the best possible spam-vs-ham determination, rather than just trying/learning Doc2Vec on an available and interesting problem, other classification techniques based on other more categorical text features, even just "bag of words", might perform better. Doc2Vec typically benefits from somewhat longer documents - for example the IMDB example has texts of hundreds-of-words, rather than SMS-sized dozens-or-hundreds-of-characters. (It's worth trying as one among many feature-engineering techniques, I'd just avoid expectations, high or low, as to where it'd compare with other techniques.)

- Gordon

Ivan Menshikh

Oct 12, 2017, 3:02:24 PM

to gensim

I agree with Gordon: for short texts, Doc2Vec may not be very effective. For short texts, please look at this thread and at the classical naive Bayes method (yes, it works very well for short texts).
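To make the naive Bayes suggestion concrete, here is a dependency-free sketch of multinomial naive Bayes over bag-of-words counts with add-one smoothing. The class name, the whitespace tokenizer, and the four toy messages are invented for illustration and are not real SMS data. In practice, scikit-learn's `CountVectorizer` combined with `MultinomialNB` would be the usual way to do this.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayesSketch:
    """Multinomial naive Bayes on word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Start from the log prior, then add log likelihoods per word.
            score = math.log(doc_count / total_docs)
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # skip words never seen during training
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training messages, invented for illustration only.
train_texts = [
    "win a free prize now",
    "free cash claim your prize",
    "are we meeting for lunch today",
    "call me when you get home",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesSketch().fit(train_texts, train_labels)
```

Calling `model.predict(message)` returns the label with the highest log score. With so little training data the result only illustrates the mechanics, not real-world accuracy.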
