In my experience, when people are *trying* to get perfectly-reproducible results from `Word2Vec`, they are often misinterpreting the value & best uses of this model. It finds broad but inexact tendencies, after a training process which inherently makes use of a lot of randomness. Thus it is best used, and evaluated, in some fashion which tolerates lots of 'jitter' in alternate runs.
Similarly, as the algorithm only does interesting things with large, varied corpora of training texts, trying to deduce anything about its benefits or operation from toy-sized or contrived datasets is usually misguided. At best, you do it, gradually understand why it's not a very useful exercise, then move on, treating such limited/contrived tests only as examples of what *not* to do.
So, a contrived corpus of 19 2-word sentences is so far from what `Word2Vec` works well on that the trivial oddities you seek to understand in the toy run very likely shed no light on (or even mislead about) what should be expected in a more appropriate run.
If you want to learn the algorithm, and the ins-and-outs of the Gensim implementation, it would be far more beneficial to run with larger, more realistic training data; accept the normal random jitter that comes from things like multi-worker training; and learn to overlook superficial differences in model outputs in favor of fuzzier but more salient evaluations of the relative quality of end results.
That said, with enough digressive effort, it should still be possible to force determinism.
It's unclear how & when you've tried to set `PYTHONHASHSEED`. It would have to be set *before the interpreter was launched*; you can't do it from Python itself. And that environment variable has to be set in the *environment the interpreter launches from, and reads its environment variables from*. In many cases, like some online notebook services, that might be hard to effect.
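For example, from a shell (a minimal illustration; the exact hash values shown will vary by platform, so none are given here):

```shell
# PYTHONHASHSEED must be set in the environment *before* the interpreter
# starts; string hashing is seeded once at startup, so assigning it from
# inside a running Python process has no effect. With a fixed seed, the
# hash() of a given str repeats across runs:
PYTHONHASHSEED=0 python3 -c "print(hash('interface'))"
# Without it (or with PYTHONHASHSEED=random), that value changes per run.
```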
Also, your attached code doesn't show `workers=1`, which is usually required to prevent thread scheduling from introducing random reordering of training examples. (Though, with such a tiny corpus, only a single thread might be doing anything anyway.)
Even with such extra constraints, reproducibility would only be definitively expected with typical input when using the exact same `window` value, as well. Gensim, following Google's original implementation, actually uses the `window` value as a *maximum* window, where the effective windows for each center-word are some random value *up to* the configured `window` value.
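The reduced-window behavior can be mimicked in plain Python (a sketch of the idea only, not Gensim's actual internal Cython RNG):

```python
import random

random.seed(0)
window = 5  # the configured value acts as a *maximum*

# For each center word, an effective window is drawn uniformly from
# 1..window; simulate many such draws:
draws = [random.randint(1, window) for _ in range(10_000)]

# A neighbor at distance d from the center word is included only when
# the draw is >= d, so nearer words are weighted more heavily overall:
for d in range(1, window + 1):
    share = sum(draw >= d for draw in draws) / len(draws)
    print(f"distance {d}: included in ~{share:.0%} of contexts")
```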
Now, with your atypical data of only 2-word sentences (per the message text, not the message attachment), all effective windows will always be just `1`, and so I suppose there's a chance runs with different `window` values, with all other sources of randomness locked down, might reproduce, because the exact same sequence of pseudorandom draws, used in the exact same roles, might happen.
But it's possible something deep in the code still varies when your `window` value varies, perhaps requesting extra unused pseudorandom numbers. Such an inefficiency in a non-typical setup, showing up as unconcerning output variability only when seeking a generally-unwise level of perfect reproducibility, would be neither a surprise nor a high priority to understand or fix.
- Gordon