Errors When Installing Version 3.8.3


Andy Weasley

Oct 16, 2023, 5:37:34 PM
to Gensim
Hi Gensim Team,

Hope this message finds you well. Since I need some particular features to replicate the code in a paper, I have to use version 3.8.3. However, I encountered some errors when installing it on Google Colab (Python version: 3.10.12). The error message is below.
I also tried to clone the 3.8.3 branch to my drive and manually install it by running setup.py, but a similar problem occurred. Does version 3.8.3 support newer Python environments? And how can I solve the installation problem? I greatly appreciate your assistance.

error.png

Gordon Mohr

Oct 17, 2023, 12:25:09 AM
to Gensim
Gensim 3.8.3 was released 3.5 years ago, in May 2020, and supported all the then-available versions of Python 3.x: 3.5, 3.6, 3.7 and 3.8.

It *might* be possible to get a 3.8.3 version working on later Python 3.x versions, but there's little demand for that.

The latest Gensim 4.3.2, released August 2023, has many fixes & improvements, and except for a few cases where functionality was completely discarded, it'd be easier to get older code working in the latest Gensim than to get an older Gensim working in the latest Python.

And, converting older source to the latest Gensim will result in code supportable in future Gensim & Python releases, rather than being one-off work with little forward value.

If you absolutely need to use Gensim-3.8.3, you should roll your Python runtime back to a supported version. (I know that may be hard or impossible in hosted environments – which have their own legitimate reasons to support only more-recent Python versions. But it's an option for local runtimes/notebook-servers.)

Unless you precisely need to verify the exact code in an older paper, I would expect that performing the minimal changes to make that code work in Gensim 4+ would still be a substantive replication, in most cases & respects. The algorithms are the same – just a bit more memory- & speed-efficient. (So: you'd not expect runtime measures to be comparable, but outputs would be. It's highly unlikely any paper's code/results relied on any since-fixed bugs.)

And if your goal was *exact* replication, even the changes to get Gensim to work under a later Python, and the use of a different Python runtime than the original work, would make a patched-Gensim-3.8.3 in a different Python-3.x somewhat different.

If you try to run the code in Gensim-4.3.2 & run into problems, typical code can be fixed with a few changes of methods/properties/idioms – usually just a few lines of code changes, most summarized at the help page: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4
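As a rough illustration of the typical scale of those changes, here's a hedged sketch of the most common 3.x-to-4.x replacements. (The stand-in classes below are hypothetical, just so the shape of the fix is visible without a Gensim install; the attribute names `wv`, `key_to_index`, and `index_to_key` are the real documented 4.x names.)

```python
# Common Gensim 3.x -> 4.x idiom changes (summarized from the migration
# guide; exactly which ones apply depends on your code):
#
#   model[word]              -> model.wv[word]
#   model.wv.vocab           -> model.wv.key_to_index
#   model.wv.index2word      -> model.wv.index_to_key
#   Word2Vec(..., size=100)  -> Word2Vec(..., vector_size=100)
#   Word2Vec(..., iter=5)    -> Word2Vec(..., epochs=5)

class FakeKeyedVectors:
    """Minimal stand-in for gensim's KeyedVectors (illustrative only)."""
    def __init__(self, vectors):
        self.key_to_index = {k: i for i, k in enumerate(vectors)}
        self._vectors = vectors

    def __contains__(self, word):
        return word in self._vectors

    def __getitem__(self, word):
        return self._vectors[word]

class FakeModel:
    """Stand-in for a trained Word2Vec model exposing `.wv`."""
    def __init__(self, vectors):
        self.wv = FakeKeyedVectors(vectors)

model = FakeModel({"note": [0.1, 0.2]})

# 4.x style: membership tests and vector lookups go through model.wv
assert "note" in model.wv
assert model.wv["note"] == [0.1, 0.2]
```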

If the paper code made more extensive algorithmic changes – such as editing or extending older, changed classes – the adaptation will be a bit harder, but still shouldn't be too extensive and will still be worth it: the Gensim 4 changes simplify many internals, lower memory use, speed common operations, & provide better extension-points.

So if you run into specific problems, let me know here with details (& even a pointer to the code you're trying to use), and I'll point you in the right direction. 

- Gordon

Andy Weasley

Oct 17, 2023, 1:30:21 AM
to Gensim
Hi Gordon,

Thank you for your valuable information about the 4.x versions. I understand that newer versions are more efficient and have higher performance. My group is trying to use code that encodes music in an NLP way in our project. The original code is from the paper https://www.nature.com/articles/s41598-023-40332-0#Abs1 and its code link is https://github.com/SirawitC/NLP-based-music-processing-for-composer-classification/blob/main/Musical_AI_Composer_classification_with_NLP_based_approaches.ipynb

Since we are undergraduates and relatively new to NLP and gensim, we currently are not able to write custom code for the newer version of gensim. As we run their code, we get errors when encoding words to vectors, which cause various odd exceptions in the word-to-vector part (error messages from the last three cells are attached). We guess that the newer version of gensim puts an upper limit on words in some places, but we're not sure, and the author of the paper mentioned above says we should use version 3.8.3.

Therefore, is there any way we could install 3.8.3 on the most recent Python, or do we have to use a lower-version Python environment on our local device? Only one or two functions from the code trigger the error (Word2Vec and several similar functions Word2Vec calls), so if you are able to provide some suggestions or insights about revising the code for the newer version of gensim, we would greatly appreciate your help, though we understand completely that you are busy. If not, just let us know the installation process, or that we have to switch to a local device. That would still be a great help to us.

Sincerely,
Mingyang


Error Messages:
Screenshot 2023-10-16 at 10.20.56 PM.png
WechatIMG881.jpg

Gordon Mohr

Oct 17, 2023, 4:15:04 PM
to Gensim
As mentioned, gensim-3.8.3 wasn't tested/supported for Pythons later than Python-3.8. And, I believe any effort to make the (obsolete, slower, buggier) gensim-3.8.3 work would be wasteful compared to the likely similar-or-smaller effort it'd take to adapt older code to work with gensim-4.3.2 under whatever latest Python-3.x you like.

If you really needed to run gensim-3.8.3 for some reason, the most straightforward way is to choose an older Python-3.8 as your runtime, in a notebook server where you have that option – such as one run locally rather than subject to the version choices of some cloud service.

But now having seen fragments of your error & the code you're trying to run, I don't yet see any evidence that Gensim's behavior, & changes across versions, have anything to do with the error you're hitting. It's tough to see everything that's going on – if forwarding stack traces for others to help, it's best to (a) expand & show *all* frames (leaving none collapsed/hidden); & (b) paste tracebacks as text, rather than screenshots (which may trim details, be opaque to later indexing, etc.).

Still, it looks like your notebook code inside `get_sentence_vec_avg()` is simply buggy, with respect to whatever data you're running it on. It tries to create a vector-average for some text where *none* of the 'words' are in the model. Thus, every word-lookup generates an exception, prints "Not in vocab", and the `temp` variable is never initialized, even once. The code lacks any handling for this potential case, and assumes `temp` holds something – generating the `UnboundLocalError` you see.
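A minimal sketch of that failure mode, with hypothetical names standing in for the notebook's code (a plain dict plays the role of the model's vocabulary):

```python
def buggy_sentence_avg(sentence, vocab):
    """Mimics the failing pattern: `temp` is assigned only on a
    successful lookup, and only ever holds the *last* looked-up value."""
    for word in sentence:
        try:
            temp = vocab[word]          # raises KeyError for unknown words
        except KeyError:
            print("Not in vocab")
    # If *every* word was unknown, `temp` was never assigned:
    return temp / len(sentence)

try:
    buggy_sentence_avg(["xx", "yy"], {"known": 1.0})
except UnboundLocalError as err:
    print(type(err).__name__)  # UnboundLocalError, as in the notebook
```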

And I'd expect you to get the exact same error with gensim-3.8.3, because at the very high level at which you're using the Gensim `Word2Vec` class, essentially nothing has changed: the exact same "words" (tokens) will be in the model, or not. So you're barking up the wrong tree when trying to fix this with Gensim version twiddling.

Separately, the `get_sentence_vec_avg()` function is kind-of-a-head-scratcher in other ways. For a sense of all its problems, I'll defer to ChatGPT-4, whose opinion I've attached as an output screenshot. (I've not checked its suggested code – which adds the convention that a 'sentence' with no words/known-words gets `None` instead of a vector, which other code would also need to handle – but its points about the original function look correct.) But the TL;DR: the function doesn't even do what its name claims – instead just clobbering the 'average' with the *last* 'word' in each 'sentence' – so it's hard to see how this code ever worked properly.
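For reference, a corrected averaging function might look like the hedged sketch below – my own rewrite, not the ChatGPT output or the notebook's code, with a plain dict standing in for `model.wv`. It sums known-word vectors, divides by the count of *known* words, and returns `None` when nothing matched (a flag value callers would need to handle):

```python
def sentence_vec_avg(sentence, lookup):
    """Average the vectors of the words in `sentence` that exist in
    `lookup` (a dict stand-in for model.wv); None if none are known."""
    total, known = None, 0
    for word in sentence:
        vec = lookup.get(word)
        if vec is None:
            continue                  # skip out-of-vocab words quietly
        total = vec if total is None else [a + b for a, b in zip(total, vec)]
        known += 1
    if known == 0:
        return None                   # no known words: caller must handle
    return [x / known for x in total]

vocab = {"a": [1.0, 3.0], "b": [3.0, 1.0]}
print(sentence_vec_avg(["a", "b", "zz"], vocab))  # [2.0, 2.0]
print(sentence_vec_avg(["zz"], vocab))            # None
```

Note the divisor is the count of known words, not `len(sentence)` – dividing by the full sentence length would systematically shrink the average whenever any word is out-of-vocab.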

I'd be happy to help adapt this notebook to gensim-4+ – if any of its problems/errors were related to that. But it's got more foundational problems. 

Are you perhaps running it with a tinier test amount of data than the original paper authors, and thus hitting exceptions (no known words) they didn't? If so, you may dodge those errors with different data, but you'd still have the problem it's not really doing an average where it claims to be. 

(As one last aside: the notebook shows using a `min_count=1`, which is almost always a mistake when using `Word2Vec` on natural-language texts: the algorithm's performance, & downstream uses, tend to do *better* when rare words are ignored/discarded.)

- Gordon


chatgpt-on-gensim-groups-question.jpeg

Andy Weasley

Oct 17, 2023, 6:53:41 PM
to Gensim
Hi Gordon,

I really appreciate your information. I now understand that the code does not seem to fail due to a version issue. We are currently trying to run it on our local device with Python 3.8 and revise the code according to suggestions from ChatGPT. The Gensim version explanation was proposed by the author of this paper, so we thought it could be a useful solution to try (the original info from the author is attached as a file).

In case you would like to see the complete error stack, I reran the original code from the paper and attached the complete traceback below:
Below is from Implementation Example - AVG (at the end of the notebook)
(More Not in vocab)
Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab Not in vocab
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-27-1466847965f5> in <cell line: 1>()
----> 1 KNN, RFC, Logis, Avg, SVM, MLP = Main("/content/drive/MyDrive/Composer Classification/replicate-to-word.txt","m",13000,5000,5,True,False)
      2 print("Avg:", Avg)
      3 print("F1-score of KNN:",KNN)
      4 print("F1-score of RFC:",RFC)
      5 print("F1-score of Logistic:",Logis)

2 frames
<ipython-input-25-f0ff530d43cd> in Main(corpus, modelName, vocabSize, maxSenLength, window, Avg, cov)
      1 def Main(corpus, modelName, vocabSize, maxSenLength, window, Avg=False, cov=False):
      2     sentences = sentencePiece(corpus, modelName, vocabSize, maxSenLength)
----> 3     sentencesLst = Word2Vec(window, sentences, Avg, cov)
      4     data, label = createLabel(sentencesLst)
      5     sentence_w_label_100 = pd.DataFrame({"sentence": data, "label":label})

<ipython-input-24-a541c0b834d7> in Word2Vec(Window, sentences, Avg, SD)
     11         return sentenceLstAvgwithCov
     12     elif(Avg and (not SD)):
---> 13         sentencesLstAvg = get_sentence_vec_avg(sentences,model)
     14         return sentencesLstAvg
     15     elif((not Avg) and SD):

<ipython-input-11-3b0a1c56996b> in get_sentence_vec_avg(sentences, model)
      8         except:
      9             print("Not in vocab")
---> 10     l.append(temp/len(sentence))
     11     return l

UnboundLocalError: local variable 'temp' referenced before assignment
The following is from Implementation - SD:
many "Not in vocab" lines in the output follow this warning:
Not in vocab Not in vocab
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:223: RuntimeWarning: invalid value encountered in divide
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

Then several thousand "Not in vocab" lines are output, and it fails after about 15 minutes:
Not in vocab Not in vocab
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-b6696d07b217> in <cell line: 1>()
----> 1 KNN, RFC, Logis, Avg, SVM, MLP = Main("/content/drive/MyDrive/Composer Classification/replicate-to-word.txt","m",13000,5000,5,False,True)
      2 print("Avg:", Avg)
      3 print("F1-score of KNN:",KNN)
      4 print("F1-score of RFC:",RFC)
      5 print("F1-score of Logistic:",Logis)

4 frames
<ipython-input-25-f0ff530d43cd> in Main(corpus, modelName, vocabSize, maxSenLength, window, Avg, cov)
     13         y.append(map_label[i])
     14     PredictorScaler=StandardScaler()
---> 15     PredictorScalerFit=PredictorScaler.fit(X)
     16     X=PredictorScalerFit.transform(X)
     17     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_data.py in fit(self, X, y, sample_weight)
    822         # Reset internal state before fitting
    823         self._reset()
--> 824         return self.partial_fit(X, y, sample_weight)
    825
    826     def partial_fit(self, X, y=None, sample_weight=None):

/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y, sample_weight)
    859
    860         first_call = not hasattr(self, "n_samples_seen_")
--> 861         X = self._validate_data(
    862             X,
    863             accept_sparse=("csr", "csc"),

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    563             raise ValueError("Validation should be done on X, y or both.")
    564         elif not no_val_X and no_val_y:
--> 565             X = check_array(X, input_name="X", **check_params)
    566             out = X
    567         elif no_val_X and not no_val_y:

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    900     # If input is 1D raise error
    901     if array.ndim == 1:
--> 902         raise ValueError(
    903             "Expected 2D array, got 1D array instead:\narray={}.\n"
    904             "Reshape your data either using array.reshape(-1, 1) if "

ValueError: Expected 2D array, got 1D array instead:
array=[nan nan nan ... nan]  (long run of all-nan values elided)
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The following is from Implementation - Avg and SD, which is similar to the first running case.
Not in vocab Not in vocab
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:223: RuntimeWarning: invalid value encountered in divide
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-31-49ec9ef0eec0> in <cell line: 1>()
----> 1 KNN, RFC, Logis, Avg, SVM, MLP = Main("/content/drive/MyDrive/Composer Classification/replicate-to-word.txt","m",13000,5000,5,True,True)
      2 print("Avg:", Avg)
      3 print("F1-score of KNN:",KNN)
      4 print("F1-score of RFC:",RFC)
      5 print("F1-score of Logistic:",Logis)

2 frames
<ipython-input-25-f0ff530d43cd> in Main(corpus, modelName, vocabSize, maxSenLength, window, Avg, cov)
      1 def Main(corpus, modelName, vocabSize, maxSenLength, window, Avg=False, cov=False):
      2     sentences = sentencePiece(corpus, modelName, vocabSize, maxSenLength)
----> 3     sentencesLst = Word2Vec(window, sentences, Avg, cov)
      4     data, label = createLabel(sentencesLst)
      5     sentence_w_label_100 = pd.DataFrame({"sentence": data, "label":label})

<ipython-input-24-a541c0b834d7> in Word2Vec(Window, sentences, Avg, SD)
      8         )
      9     if(Avg and SD):
---> 10         sentenceLstAvgwithCov = get_sentence_vec_avg_with_cov2(sentences,model)
     11         return sentenceLstAvgwithCov
     12     elif(Avg and (not SD)):

<ipython-input-12-64401f3e5348> in get_sentence_vec_avg_with_cov2(sentences, model)
     12     data = np.array(cov)
     13     sd = np.std(data,axis=0)
---> 14     z = temp/len(sentence)
     15     z = z.tolist()
     16     z += sd.tolist()

UnboundLocalError: local variable 'temp' referenced before assignment
Screenshot 2023-10-17 at 3.24.53 PM.png

Gordon Mohr

Oct 17, 2023, 8:42:27 PM
to Gensim
Note that if you fix that buggy function, you're no longer reproducing the paper. You will have fixed a blatant (fatal?) bug in the paper, & thus be testing something other than what the original authors did. (I would add: this kind of bug would be a 'red flag' that makes me view all code from the same source with suspicion of severe defects.)

As I noted, the ChatGPT-4 suggested function adds a new convention for a case the old code didn't handle: using `None` as a flag value when a 'sentence' had no words or no known words. That's likely to break other things unless other adjustments are made. 

Given these errors, I don't yet see any reason to believe Gensim version differences are involved, so the effort to run under Python-3.8 and gensim-3.8.3 may just land you at the same error. But, if you're able to try, I guess you might as well.

The rich text you've pasted is *also* a bad way to share error info, compared to plain text. And, I'm not seeing more traceback stack frames than before. But to the extent that: "UnboundLocalError: local variable 'temp' referenced before assignment" is what you're hitting, that's a bug in the notebook code that seems unrelated to Gensim. 

Screenshots of comments elsewhere are also a bad way to share that context, compared to an actual link where it would be possible to see the history of the interaction. 

Are you sure you're using the exact same data as the original authors? I would suggest enabling logging & watching the output of each step carefully. Especially check that the corpus & `Word2Vec` step are doing the real, intended work – with your own added cells displaying interim info for confirmation. (For example: what's the `len(model.wv)` for your `Word2Vec` model? Is that the number of words you expected it to learn?)

If you're getting tons of "Not in vocab" lines, where the original authors didn't, that implies the true divergence is *before* those log-lines, and before the error you're focusing on – your `Word2Vec` model doesn't even have the same words in it as theirs had – and perhaps your early corpus-prep steps are broken, in code, versions, or available local data.

- Gordon

Andy Weasley

Nov 23, 2023, 2:48:37 AM
to Gensim

Hi Gordon,

Thank you for your information. I contacted the author of the paper to ask about their code, and for the buggy functions they updated every "model[word]" to "model.wv[word]". This modification solved the problem. Thank you for your assistance all the same.

Best wishes,
Mingyang

Gordon Mohr

Nov 23, 2023, 8:19:50 AM
to Gensim
Thanks for the update! But keep in mind: given the rather blatant failure-to-do-what-was-intended in the original `get_sentence_vec_avg()` function, there's reason to doubt whether the original paper's code even tested what it purported to test. It likely needs a close code-review rather than rote reproduction. Good luck!

- Gordon
