Gensim 4.0 beta: fastText, word2vec, doc2vec

Radim Řehůřek

Nov 21, 2020, 6:34:00 PM
to Gensim
Hi all,

we just released Gensim 4.0.0beta.

If you want to help, please install 4.0.0beta and let us know how it went!

$ pip install --upgrade --pre gensim

Gensim 4.0 contains massively optimized (RAM, CPU) versions of popular algorithms like word2vec, fastText, doc2vec:

[Attached image: Em3nL3_XcAAQddq.jpeg]

Cheers,
Radim

Tedo Vrbanec

Dec 2, 2020, 2:44:18 PM
to Gensim
I would like to report some problems. First this one:
from gensim.matutils import softcossim
ImportError: cannot import name softcossim

Vít Novotný

Dec 3, 2020, 3:17:21 AM
to Gensim
Greetings,

the gensim.matutils.softcossim function has been deprecated in favor of the gensim.similarities.termsim.SparseTermSimilarityMatrix.inner_product method and/or the gensim.similarities.docsim.SoftCosineSimilarity class. Unlike softcossim, inner_product and SoftCosineSimilarity support efficient computation of the soft cosine similarity not only between documents, but also between corpora.

Best,
Vítek

On Wednesday, December 2, 2020 at 20:44:18 UTC+1, tedo.v...@gmail.com wrote:

Radim Řehůřek

Dec 4, 2020, 7:34:23 AM
to Gensim
Right. That function had been marked as "deprecated" for 2 years, and finally removed in 4.0.

Tedo, does that help?

Best,
Radim




Tedo Vrbanec

Dec 15, 2020, 3:45:36 PM
to Gensim
[model.word_vectors[model.dictionary[word]] for word in words_in_model]
AttributeError: 'Word2Vec' object has no attribute 'word_vectors'

Tedo Vrbanec

Dec 15, 2020, 3:49:56 PM
to Gensim
Yes, thanks. But I am very worried. There are so many changes in Gensim 4, and my program code is so complex that I no longer know what I was doing and why. :( I'm afraid I'll have to rewrite everything, and I've already spent a couple of years on it.

Gordon Mohr

Dec 16, 2020, 6:08:32 PM
to Gensim
On Tuesday, December 15, 2020 at 12:45:36 PM UTC-8 tedo.v...@gmail.com wrote:
[model.word_vectors[model.dictionary[word]] for word in words_in_model]
AttributeError: 'Word2Vec' object has no attribute 'word_vectors'

As far as I can tell, there was no `.word_vectors` property on Gensim class `Word2Vec` in gensim-3.8.x, nor do I recall it in any earlier version. 

So if this code was previously working, I don't know where it was, & the error you're getting doesn't seem related to gensim-4.0.0beta changes.

(Had you created your own identically-named `Word2Vec` class, perhaps as a subclass of Gensim's `Word2Vec`, with extra convenience features, and now that hasn't been used/adapted?)

Yes, thanks. But I am very worried. There are so many changes in Gensim 4, and my program code is so complex that I no longer know what I was doing and why. :( I'm afraid I'll have to rewrite everything, and I've already spent a couple of years on it.

For those who used high-level public APIs, prior code is likely to work without changes. If you're using some lower-level, previously-marked-deprecated or similarly-replaced functionality, in almost every case the same functions are now available in just a slightly-renamed or relocated way. A few short cycles of – (1) run the code; (2) see the errors triggered, use the hints in the error or release notes to adjust things; (3) try again – should almost always be sufficient to port old code forward. 

If you've created your own hairball of lots of deep interactions with the older code's internal state, not via public methods, the task could be harder. But if your own code is stable & inscrutable, you always have the choice of staying on the older gensim almost indefinitely. Only if you need recent improvements would you need to upgrade, and ultimately the changes in interface aren't very large.

But if you do have examples of older code that worked in gensim-3.8.3, & in 4.0.0beta or later doesn't or generates errors, & you ask here, we'll be able to point to the necessary changes pretty easily.

- Gordon

Tedo Vrbanec

Dec 19, 2020, 3:00:42 AM
to Gensim
I must inform you that I do not have any custom code or class. This (`.word_vectors`) is an option from Gensim.

santosh.b...@gmail.com

Dec 19, 2020, 5:29:17 PM
to Gensim
I installed the beta version, then trained and saved a word2vec model on Wikipedia text. I was able to use it after loading:
model = gensim.models.Word2Vec.load(file)

Recently, I had to downgrade to version 3.8.0, as the Python package WEFE (https://wefe.readthedocs.io/en/latest/about.html) depends on it. After doing this, I am unable to load the file as above. I now get the following errors:
AttributeError: 'Word2Vec' object has no attribute 'trainables'
AttributeError: 'EuclideanKeyedVectors' object has no attribute 'vocab' 

Not sure how to resolve this issue.

Raffaele Mancuso

Dec 20, 2020, 4:48:15 PM
to Gensim
Dear Santosh,

you can find a version of WEFE with experimental support for gensim 4 here

Best regards

santosh.b...@gmail.com

Dec 21, 2020, 7:51:26 AM
to Gensim
Many thanks, Raffaele. This is super helpful.

Regards
sbs

Radim Řehůřek

Dec 21, 2020, 8:13:12 AM
to Gensim
Thank you for the update Raffaele!

Radim

Gordon Mohr

Dec 21, 2020, 4:43:22 PM
to Gensim
On Saturday, December 19, 2020 at 12:00:42 AM UTC-8 tedo.v...@gmail.com wrote:
I must inform you that I do not have any custom code or class. This (`.word_vectors`) is an option from Gensim.

I can't find `.word_vectors` mentioned in the gensim-3.8.3 docs or source code. I'd guess some other library you're using, or code you've run, has added it to the specific `Word2Vec` model class or object you're using.

If you were to try to make a minimal example, in `gensim-3.8.3` or any other recent version, of instantiating a `Word2Vec` model & trying to access this property, I suspect you'd get the same 'has no attribute' error. For example:

```
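from gensim.models import Word2Vec
# Train a throwaway model on a one-word corpus
model = Word2Vec([['nada']], min_count=1)
model.word_vectors  # expected to fail: 'Word2Vec' object has no attribute 'word_vectors'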
```

- Gordon

Tedo Vrbanec

Dec 23, 2020, 5:09:32 PM
to Gensim
I finally took enough time to study the problem. Gordon, I'm sorry. You were right. This is a feature of the GloVe model, which is not part of Gensim. :(

Tedo Vrbanec

Dec 28, 2020, 6:13:02 PM
to Gensim
words = [word for word in words_embeddings.wv.key_to_index if word.isalpha()]
gives:
AttributeError: 'numpy.ndarray' object has no attribute 'wv'

Gordon Mohr

Dec 28, 2020, 8:25:05 PM
to Gensim
From the error, it is likely that in other not-shown code, you have assigned a `numpy.ndarray` type object into the `words_embeddings` variable. 

- Gordon


Tedo Vrbanec

Dec 29, 2020, 12:27:40 PM
to Gensim
I used
w2v_model = w2v_model.wv.get_normed_vectors() #In Gensim 4 allowed
instead of previous
w2v_model.init_sims(replace=True) # for Gensim 3.8

So, now I got this error
AttributeError: 'numpy.ndarray' object has no attribute 'wv'

What code do I have to use instead of
w2v_model.init_sims(replace=True)

Gordon Mohr

Dec 29, 2020, 2:23:16 PM
to Gensim
Per the notes about migrating to Gensim-4 (<https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#5-no-more-init_sims>), in most cases `init_sims()` is no longer necessary. In most cases where it was called, it can now just *not* be called. (Calling it yourself with `replace=True` no longer offers any notable RAM savings, but does destroy raw magnitude information that could still be useful in many applications.)

If you were already using `.get_normed_vectors()`, it will still work, but as the migration notes warn, now, each call creates the full array anew - so older code that was considering that call cheap/instant may need updating to retain the calculated array itself for reuse, rather than count on the model to always provide it. 

But if you weren't already using `.get_normed_vectors()`, there's no need to add it now. And in neither Gensim-3.8 nor Gensim 4.0 would it be a good idea to clobber the value of your `w2v_model` variable with the return-value of `get_normed_vectors()`, as that method returns a big `numpy.ndarray`, not a `Word2Vec` or `KeyedVectors` instance with their convenience methods. 

If for some particular reason you really do need to *destructively* change the vectors inside a Gensim `KeyedVectors` to all be unit-normalized:

* `w2v_model.wv.init_sims(replace=True)` will still work in Gensim-4.0, though it will generate a deprecation warning. 

* `w2v_model.wv.unit_normalize_all()` is a new, explicitly-documented method for achieving that aim. See: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.unit_normalize_all

However, it's rare to really want/need to do that. And once you do, it won't make sense to save/train the original enclosing `Word2Vec` model. (So, if you've really moved a model to a lookup-vectors-only role, and further applied this destructive mutation, it'd be clearer to pull the `KeyedVectors` alone out of the `w2v_model` – that is, its `w2v_model.wv` object – and save/load/operate-on that object, without the potential confusion/bugs of keeping a full `Word2Vec` model around after you've fouled its raw weights.)

- Gordon

Tedo Vrbanec

Dec 29, 2020, 7:10:21 PM
to Gensim
That was very helpful, Gordon. Thank you!

Tedo Vrbanec

Dec 30, 2020, 6:24:12 AM
to Gensim
I can't compute WMD using the wmdistance method. I get (when executing *.pyx) AttributeError: 'Doc2Vec' object has no attribute 'wmdistance'.

Gordon Mohr

Dec 30, 2020, 7:59:30 PM
to Gensim
Indeed, the `Doc2Vec` class doesn't have a `wmdistance()` method, so such an error is expected.

The `KeyedVectors` class does have such a method.

The `.wv` property of a `Doc2Vec` model is a `KeyedVectors` object, if you want to do a WMD calculation using the word-vectors of a `Doc2Vec` model. But remember only some modes of `Doc2Vec` train word-vectors. (The otherwise very fast and useful `dm=0` PV-DBOW mode will leave the word-vectors as meaningless random-initialized values, unless you also add the `dbow_words=1` option.)

- Gordon