Multiclass classification with gensim word2vec for feature extraction


Praveen049

Aug 18, 2018, 11:25:21 AM
to Gensim
Hi
  I am trying to implement a multiclass text classification solution using gensim word2vec for feature extraction.

I am building the model as below:

    def g_word2vec(raw, save=False, path=None):
        """
        Word vectors generated from the raw text, which is basically all the
        text fields in the DataFrame which will be used for prediction.
        """
        print(type(raw))
        cleaned = [default_clean(d) for d in raw]
        sentences = list(tokenize(cleaned))

        from gensim.models import Word2Vec

        model = Word2Vec(sentences=sentences,  # tokenized sentences, list of lists of strings
                         size=100,      # size of embedding vectors
                         workers=5,     # how many threads
                         min_count=5,   # minimum frequency per token, filtering rare words
                         sample=0.05,   # weight of downsampling common words
                         sg=1,          # should we use skip-gram? if 0, then cbow
                         iter=5,
                         hs=0)

        X = model[model.wv.vocab]

        if save:
            model.wv.save_word2vec_format('{}/wv.bin'.format(path), binary=False)

        w2v = dict(zip(model.wv.index2word, model.wv.syn0))
        return model, w2v


I have implemented a few types of feature extractors to make use of the word vectors:

    class MeanEmbeddingVectorizer():
        def __init__(self, word2vec):
            self.word2vec = word2vec
            # if a text is empty we should return a vector of zeros,
            # matching the dimensionality of the word vectors
            self.dim = len(next(iter(word2vec.values())))

        def fit(self, X):
            return self

        def transform(self, X):
            return np.array([
                np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                        or [np.zeros(self.dim)], axis=0)
                for words in X
            ])
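For what it's worth, here is a tiny self-contained run of the averaging logic above, with a hypothetical two-word vocabulary (toy vectors, for illustration only). Note the dimensionality must come from a vector, not from `len(word2vec)` (which is the vocabulary size):

```python
import numpy as np

# hypothetical 3-dimensional word vectors, illustration only
w2v = {
    'cat': np.array([1.0, 0.0, 0.0]),
    'dog': np.array([0.0, 1.0, 0.0]),
}
# dimensionality comes from a vector, not from len(w2v) (vocab size)
dim = len(next(iter(w2v.values())))

def mean_embed(words):
    # average the vectors of in-vocabulary words; all-zeros if none match
    vecs = [w2v[w] for w in words if w in w2v] or [np.zeros(dim)]
    return np.mean(vecs, axis=0)

print(mean_embed(['cat', 'dog', 'unseen']))  # -> [0.5 0.5 0. ]
print(mean_embed(['nothing', 'matches']))    # -> [0. 0. 0.]
```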

    class ModelInferVectorizer():
        def __init__(self, word2vec, model):
            self.word2vec = word2vec
            self.word2weight = None
            self.model = model
            self.dim = len(word2vec)

        def fit(self, X):
            return self

        def transform(self, X):
            return np.array([self.model.infer_vector(sentences, steps=50, alpha=0.25)
                             for sentences in X])



I use one of these custom vectorizers for feature extraction and then use SVM for multiclass classification.

The classification accuracy is lower than using the sklearn tfidf vectorizer with SVM.

Is my approach a reasonable way of using the word vectors?
If yes, is there any error in my approach that would give lower accuracy than plain tfidf?

Are there any other ways of using word2vec for multiclass classification?

Thanks

Praveen

Gordon Mohr

Aug 19, 2018, 10:23:05 PM
to Gensim
A simple mean of all a text's word-vectors, as in your `MeanEmbeddingVectorizer`, is a fairly crude way to create a summary vector for the text. In particular, it doesn't weight any words as being more significant (or give any chance for downstream classifiers to do so), immediately collapses the representation to just the N dimensions of your dense word-embedding (here 100), and would be highly dependent on the quality of the word-vectors. It might work OK as a quick baseline. 

By comparison, the `TfidfVectorizer` and resulting "bag of words" vector-representation of texts *will* inherently scale rarer words as more important, and also maintains the M separate dimensions (presence or absence of a word) as a 'sparse' embedding (where M is the size of the whole vocabulary, & much larger than 100 dimensions). Downstream classifiers also then have a chance to further learn that some words are more significant for their purposes. 

So while there might be a dense text embedding that would be a top performer in your downstream tasks, it'd probably need to be more sophisticated & carefully tuned than a simple mean-of-word-vectors, in order to outperform the (somewhat larger) `TfidfVectorizer` representation. 

Something based on much-larger word-vectors (400d? 1000d?) might help, if there's enough training data. Something that weighted the word-vectors before averaging, perhaps even by TF-IDF calculations, might help. Something based on the related `Doc2Vec` algorithm, which explicitly learns a text-vector not based on a simple average might do a little better, if tuned & supported with enough training data. The 'FastText' word2vec variant might do better, since it has a 'classification' training mode where the word-vectors are specifically optimized, given some known labels, to work well as inputs to an average-then-classify process.

A few separate notes on your apparent setup:

* `sample=0.05` is a very non-typical setting which might result in negligible downsampling; common values for this parameter range from 1e-3 (0.001) to 1e-6 (0.000001), going smaller and becoming more useful with much larger training corpuses

* word-vector quality is very dependent on the size and quality of training data; if your data is thin finding more data to help with word-vector training, or re-using domain-compatible word-vectors from elsewhere, might help

* your `ModelInferVectorizer` couldn't possibly work with a `Word2Vec` model, because `infer_vector()` only exists on `Doc2Vec` models. If you do use a `Doc2Vec` model (which, per the above, is worth a try), that's a very atypically large starting `alpha` for `infer_vector()`. Be sure to use the latest (3.5.0+) gensim, which has some inference fixes, and while it may make sense to try larger-than-default `steps` values, especially on short texts, other `alpha` choices probably aren't necessary.
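To make the first point concrete, here is a sketch of the downsampling keep-probability formula from the original word2vec.c (gensim follows the same scheme; exact threshold handling is an implementation detail), showing why `sample=0.05` barely downsamples anything:

```python
import math

def keep_prob(word_freq, sample):
    # probability an occurrence of a word is kept, where word_freq is the
    # word's fraction of the corpus (word2vec.c downsampling formula)
    if sample <= 0 or word_freq <= sample:
        return 1.0
    p = (math.sqrt(word_freq / sample) + 1) * (sample / word_freq)
    return min(p, 1.0)

# a very common word making up 1% of all tokens:
print(keep_prob(0.01, 0.05))             # -> 1.0 (sample=0.05: never downsampled)
print(round(keep_prob(0.01, 0.001), 3))  # -> 0.416 (sample=1e-3: ~42% of occurrences kept)
```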

- Gordon

Praveen049

Aug 20, 2018, 6:02:47 AM
to Gensim
Hi Gordon
  Thank you for your inputs.

Yes, a tfidf-weighted embedding is something I am trying now. Below is the code:
    class TfidfEmbeddingVectorizer():
        def __init__(self, word2vec):
            self.word2vec = word2vec
            self.word2weight = None
            self.dim = len(next(iter(word2vec.values())))

        def fit(self, X):
            tfidf = TfidfVectorizer(encoding='utf-8')
            tfidf.fit(X)
            # words not seen by the tfidf vectorizer get the maximum idf,
            # i.e. they are treated as maximally rare
            max_idf = max(tfidf.idf_)
            self.word2weight = defaultdict(
                lambda: max_idf,
                [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
            return self

        def transform(self, X):
            return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec]
                        or [np.zeros(self.dim)], axis=0)
                for words in X
            ])
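A small sanity check of the `defaultdict` fallback in `fit` above, with hypothetical idf values standing in for a fitted `tfidf.idf_`: any word the tfidf vectorizer never saw gets the maximum idf, i.e. it is treated as maximally rare:

```python
from collections import defaultdict

# hypothetical idf values, standing in for a fitted tfidf.idf_
idf = {'the': 1.0, 'gensim': 3.0, 'word2vec': 5.0}
max_idf = max(idf.values())

word2weight = defaultdict(lambda: max_idf, idf.items())

print(word2weight['gensim'])      # -> 3.0
print(word2weight['never_seen'])  # -> 5.0 (falls back to the max idf)
```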


I also made the changes to the w2v creation that you recommended (lower downsampling and 1000d).

With these changes the test accuracy is a bit better, but still lower than plain tfidf + SVM.

I will try the FastText approach that you recommended.

Regarding sample set size, I have 60,000 samples and around 500 labels.

Br
Praveen

Gordon Mohr

Aug 20, 2018, 12:55:36 PM
to Gensim
Word2Vec quality is sensitive to the total number of training contexts – essentially length in words, not texts. So if your 60,000 texts are only 5-10 words each, you've only got a 300,000–600,000 word corpus, which would be very small for Word2Vec training. (You might need to use more than the default 5 training passes with a small corpus, and in any case might not be able to get good giant 1000d vectors from a small dataset.) On the other hand, if they're each 1,000-10,000 word articles, you'd have 60 million to 600 million words – much better for getting useful word-vectors. (But still perhaps not enough for giant 1000d vectors, which I've only seen tried occasionally, with very large corpuses.)

So: be aware word-vector quality is very sensitive to corpus size (in total words/contexts), and no one vector-size is necessarily best – it can depend on data size/quality/character, and once you have an objective repeatable evaluation score you can optimize for, vector size can be adjusted along with many other parameters in your setup.

- Gordon