A simple mean of all a text's word-vectors, as in your `MeanEmbeddingVectorizer`, is a fairly crude way to create a summary vector for the text. In particular, it doesn't weight any words as being more significant (or give any chance for downstream classifiers to do so), immediately collapses the representation to just the N dimensions of your dense word-embedding (here 100), and would be highly dependent on the quality of the word-vectors. It might work OK as a quick baseline.
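For concreteness, a plain mean-of-word-vectors text vector, roughly what a `MeanEmbeddingVectorizer` does, can be sketched like this (the toy `word_vectors` dict is a stand-in for a real trained model's word-vector lookup):

```python
import numpy as np

# Hypothetical stand-in for trained word-vectors: a dict mapping
# word -> 100-dimensional vector (in practice, a gensim KeyedVectors).
rng = np.random.default_rng(0)
word_vectors = {w: rng.standard_normal(100) for w in ["cat", "sat", "mat"]}

def mean_embedding(tokens, wv, dim=100):
    """Average the vectors of all in-vocabulary tokens; zeros if none."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# "the" is out-of-vocabulary here, so only "cat" and "sat" contribute -
# and they contribute equally, with no notion of which word matters more.
doc_vec = mean_embedding(["the", "cat", "sat"], word_vectors)
```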
By comparison, the `TfidfVectorizer` and the resulting "bag of words" vector-representation of texts *will* inherently scale rarer words as more important, and also maintains M separate dimensions (one per vocabulary word) as a 'sparse' representation, where M is the size of the whole vocabulary and much larger than 100 dimensions. Downstream classifiers then also have a chance to further learn that some words are more significant for their purposes.
So while there might be a dense text embedding that would be a top performer in your downstream tasks, it'd probably need to be more sophisticated & carefully tuned than a simple mean-of-word-vectors, in order to outperform the (somewhat larger) `TfidfVectorizer` representation.
Something based on much-larger word-vectors (400d? 1000d?) might help, if there's enough training data. Something that weighted the word-vectors before averaging, perhaps even by TF-IDF calculations, might help. Something based on the related `Doc2Vec` algorithm, which explicitly learns a text-vector rather than taking a simple average, might do a little better, if tuned and supported with enough training data. The 'FastText' word2vec variant might do better still, since it has a 'classification' training mode where the word-vectors are specifically optimized, given some known labels, to work well as inputs to an average-then-classify process.
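The weighted-averaging idea can be sketched as below. This is a toy illustration, not a tuned recipe: the `wv` dict and the tiny corpus are made up, and a real version would use trained embeddings and, say, `TfidfVectorizer`'s `idf_` values:

```python
import math
import numpy as np

# Hypothetical word-vectors and a tiny corpus for computing IDF weights.
rng = np.random.default_rng(1)
wv = {w: rng.standard_normal(100) for w in ["cat", "sat", "mat", "the"]}

corpus = [["the", "cat", "sat"], ["the", "mat"], ["the", "cat"]]
n_docs = len(corpus)
# Unsmoothed IDF: log(N / document-frequency); 0 for ubiquitous words.
idf = {w: math.log(n_docs / sum(w in doc for doc in corpus)) for w in wv}

def tfidf_weighted_mean(tokens, wv, idf, dim=100):
    """Weight each in-vocabulary word-vector by its IDF before averaging."""
    pairs = [(idf[t], wv[t]) for t in tokens if t in wv]
    if not pairs:
        return np.zeros(dim)
    total = sum(w for w, _ in pairs)
    return sum(w * v for w, v in pairs) / total
```

With this weighting, a word like "the" that appears in every text gets IDF 0 and drops out of the average entirely, so rarer words dominate the text-vector.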
A few separate notes on your apparent setup:
* `sample=0.05` is a very atypical setting which might result in negligible downsampling; common values for this parameter range from 1e-3 (0.001) down to 1e-6 (0.000001), with smaller values becoming more useful as the training corpus gets much larger
* word-vector quality is very dependent on the size and quality of the training data; if your data is thin, finding more data to help with word-vector training, or re-using domain-compatible word-vectors from elsewhere, might help
* your `ModelInferVectorizer` couldn't possibly work with a `Word2Vec` model, because `infer_vector()` only exists on `Doc2Vec` models. If you do use a `Doc2Vec` model (which, per the above, is worth a try), note that your starting `alpha` for `infer_vector()` is atypically large. Be sure to use the latest (3.5.0+) gensim, which has some inference fixes; and while it may make sense to try larger-than-default `steps` values, especially on short texts, other `alpha` choices probably aren't necessary.
- Gordon