Training SVM as published in "Twitter Sentiment Classification using Distant Supervision"


Breno Arosa

Jan 5, 2017, 7:59:14 AM
to sentim...@googlegroups.com

Hi,

I'm trying to reproduce the classifiers published in "Twitter Sentiment Classification using Distant Supervision" to use as a baseline for my research, which is tweet sentiment classification in pt-BR.

I'm using the dataset provided at http://help.sentiment140.com/for-students. With Naive Bayes I was able to get performance similar to the published results.
However, I couldn't manage to train an SVM classifier, since the dataset is fairly large (1.6M tweets).
I tried the scikit-learn implementation with a linear kernel and unigrams (33k features); all matrices are already in sparse representation.
My best attempt so far is a bagging ensemble of smaller SVMs (32k samples each), which runs in a couple of hours.
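
For reference, a rough sketch of the bagging setup I mentioned (the number of estimators below is a placeholder, not something from the paper; only the 32k subsample size matches what I described):

    from sklearn.ensemble import BaggingClassifier
    from sklearn.svm import SVC

    # Ensemble of linear-kernel SVMs, each trained on a 32k-tweet subsample,
    # since a single SVC does not finish on the full 1.6M training set.
    # X_train is the sparse unigram matrix, y_train the polarity labels.
    clf = BaggingClassifier(SVC(kernel="linear"),
                            n_estimators=50,
                            max_samples=32000,
                            n_jobs=-1)
    clf.fit(X_train, y_train)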

Am I missing some detail?
Could you explain how you trained the SVM classifier that produced the published results?

Thank you in advance,
Breno.

Alec Go

Jan 6, 2017, 9:54:43 AM
to sentim...@googlegroups.com
Hi Breno,

I believe we used libsvm, which works better for large data sizes.
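
For illustration only, one way to go from a scikit-learn sparse matrix to libsvm's command-line trainer looks roughly like this (the kernel and cost options below are placeholders, not the settings behind the published results):

    # Sketch: export the sparse unigram features to libsvm's text format and
    # train a linear-kernel model with the libsvm command-line tools.
    from sklearn.datasets import dump_svmlight_file

    # X: scipy.sparse matrix from the vectorizer, y: polarity labels
    dump_svmlight_file(X, y, "sentiment140.train")

    # Then, from a shell (hyperparameters are placeholders):
    #   svm-train -t 0 -c 1 sentiment140.train sentiment140.model
    #   svm-predict sentiment140.test sentiment140.model predictions.txt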

Alec


Phyllis Liu

Apr 17, 2018, 9:41:30 PM
to Sentiment140
Hi Breno,

I am doing the same thing you described in this post. However, when I used sklearn.feature_extraction.text.CountVectorizer to extract unigram features, I got 268515 features (after replacing usernames and URLs, and reducing repeated letters to two occurrences). It is impossible to train an SVM with this many features. The parameters of the CountVectorizer are:

    CountVectorizer(analyzer='word', binary='boolean', decode_error=u'strict',
                    dtype=<type 'numpy.int64'>, encoding=u'utf-8', input='content',
                    lowercase=True, max_df=1.0, max_features=None, min_df=1,
                    ngram_range=(1, 1), preprocessor=None, stop_words='english',
                    strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
                    tokenizer=None, vocabulary=None)

How did you get 33k unigram features? Which feature extractor did you use? Did you do any feature selection?
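
For reference, this is the pruning I am considering trying next; the min_df and max_features values are pure guesses on my part, not values from the paper:

    from sklearn.feature_extraction.text import CountVectorizer

    # Guessed pruning: drop tokens that appear in fewer than 5 tweets and keep
    # only the 33,000 most frequent unigrams.
    vectorizer = CountVectorizer(binary=True, stop_words='english',
                                 min_df=5, max_features=33000)
    X = vectorizer.fit_transform(tweets)      # tweets: preprocessed strings
    print(len(vectorizer.vocabulary_))        # resulting vocabulary size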

Thank you very much.
Phyllis


On Thursday, January 5, 2017 at 11:59:14 PM UTC+11, Breno Vieira Arosa wrote:

Phyllis Liu

Apr 17, 2018, 10:00:20 PM
to Sentiment140
Hi Alec,

In Table 2 of your paper, you mention that the number of features is 364464 after all kinds of feature reduction. I am trying to reproduce it, but I get 268515 unigram features instead. Am I misunderstanding the meaning of this table? Is the number of features (364464) the same as the number of unigram features? If I am going to train an SVM classifier, how can I reduce the number of features so that the classifier works well? I would greatly appreciate any tips you can give me.
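
In the meantime I am planning to try a linear SVM through LinearSVC (liblinear), which can handle sparse high-dimensional input without any feature selection; C=1.0 below is just a placeholder:

    from sklearn.svm import LinearSVC

    # liblinear's cost scales with the number of non-zero entries, so the full
    # 268515-dimensional sparse matrix can be used as-is (C is a placeholder).
    clf = LinearSVC(C=1.0)
    clf.fit(X_train, y_train)        # X_train: scipy.sparse from CountVectorizer
    print(clf.score(X_test, y_test))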



Thank you very much.
Phyllis

On Saturday, January 7, 2017 at 1:54:43 AM UTC+11, Alec wrote: