fasttext model produces very different vectors from train to train


l1muba1

Jun 4, 2024, 7:11:19 AM
to Gensim
I did read Q11 of the FAQ.

The corpus is not identical between runs; it changes slightly.
It is around 10 million sentences, and changes by 2 to 8 percent from one training run to the next, measured over the latest five runs:

```plaintext
fn='df.4560.dat' total=8781469 added=0 (0.00%) removed=0 (0.00%) (avg_words_per_sentence=10.42)
fn='df.4577.dat' total=8976940 added=209578 (2.33%) removed=14107 (0.16%) (avg_words_per_sentence=10.40)
fn='df.4594.dat' total=9296148 added=332069 (3.57%) removed=12861 (0.14%) (avg_words_per_sentence=10.43)
fn='df.4617.dat' total=10017840 added=800577 (7.99%) removed=78885 (0.79%) (avg_words_per_sentence=10.43)
fn='df.4634.dat' total=10164549 added=179520 (1.77%) removed=32811 (0.32%) (avg_words_per_sentence=10.46)
```

I picked a few words and, for each one, calculated the pairwise cosine distances across these five models.
A typical result looks like this:

```
[[0.   0.66 0.63 0.64 0.63]
 [0.66 0.   0.64 0.63 0.68]
 [0.63 0.64 0.   0.67 0.72]
 [0.64 0.63 0.67 0.   0.61]
 [0.63 0.68 0.72 0.61 0.  ]]
```
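For reference, a matrix like the one above can be produced with a small helper. The model paths and the probe word below are placeholders, not the actual files from these runs:

```python
import numpy as np

def pairwise_cosine_distances(vectors):
    """Pairwise cosine-distance matrix for a list of equal-length vectors."""
    V = np.asarray(vectors, dtype=float)
    unit = V / np.linalg.norm(V, axis=1, keepdims=True)  # normalize rows
    return 1.0 - unit @ unit.T  # 1 - cosine similarity

# Hypothetical usage with five separately trained models:
# from gensim.models import FastText
# models = [FastText.load(p) for p in paths_to_five_models]
# vecs = [m.wv["apple"] for m in models]  # "apple" is an arbitrary probe word
# print(np.round(pairwise_cosine_distances(vecs), 2))
```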

So, to my understanding, every time I train a new model, vectors change quite significantly.
Is this the expected behavior? Or am I doing something wrong?

Parameters:

```python
vector_size=300,
window=3,
epochs=10,
sg=1,
shrink_windows=False,
alpha=0.05,
sample=0.0001,
```
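These settings map directly onto gensim's `FastText` constructor. A sketch of how they would be passed, where `sentences` (an iterable of token lists) is a placeholder:

```python
# The parameters listed above, collected so one dict drives every run.
params = dict(
    vector_size=300,       # dimensionality of the word vectors
    window=3,              # context window on each side
    epochs=10,             # passes over the corpus
    sg=1,                  # skip-gram rather than CBOW
    shrink_windows=False,  # fixed, not sampled, window size
    alpha=0.05,            # initial learning rate
    sample=0.0001,         # downsampling threshold for frequent words
)

# Hypothetical training call:
# from gensim.models import FastText
# model = FastText(sentences, **params)
```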

Gensim version is 4.3.2.

Gordon Mohr

Jun 4, 2024, 1:25:01 PM
to Gensim
This is expected behavior, per Q11 in the FAQ. Every model run creates a new "space". Vectors from one run/space are not meaningfully comparable to vectors from other runs/spaces, even vectors for the exact same word.

Think of it this way:

There are no inherently correct or best coordinates for a single word, like say 'apple', in a generic 300-dimensional space. There's only a *useful* place, with regard to the distances/angles to other related words, as can be learned from a corpus showing examples of all relevant words' contextual usages. And because of inherent randomness in how the algorithm runs, including its sensitivity to tiny changes in training order or slight vocabulary/text changes, even stabs at stability, like always initializing each word to the same pseudorandom starting vector before training, don't reliably force the whole run to land it in a similar place. Rather, the word 'apple', and all words related to it, will land in some 300-dimensional constellation that, from run to run, has similarly useful neighborhoods/directions - but potentially at very different coordinates in the giant high-dimensional volume.

If you need word-vectors to be reliably comparable, they should come from the same training run, so that they went through a common iterative tug-of-war with all other words. This might mean that, instead of churning your training texts by some fraction each run, you do a larger composite training run with all the texts of the last N periods, covering all the words that need compatible vectors.
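A minimal sketch of that composite-run idea, assuming each snapshot can be loaded as a list of token lists (the loader name and file list are placeholders, and deduplicating shared sentences is one possible choice, not something prescribed above):

```python
def pool_corpora(corpora):
    """Merge several corpus snapshots into one training corpus, keeping a
    single copy of each sentence so heavily overlapping snapshots don't
    over-weight the sentences they share."""
    seen = set()
    pooled = []
    for corpus in corpora:
        for sentence in corpus:
            key = tuple(sentence)
            if key not in seen:
                seen.add(key)
                pooled.append(sentence)
    return pooled

# Hypothetical composite run over the last N snapshots, reusing the asker's
# parameters:
# corpora = [load_sentences(fn) for fn in last_n_snapshot_files]
# from gensim.models import FastText
# model = FastText(pool_corpora(corpora), vector_size=300, window=3,
#                  epochs=10, sg=1, shrink_windows=False, alpha=0.05,
#                  sample=0.0001)
```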

- Gordon
