strange gensim word2vec behavior


visual gcp

15 Mar 2022, 12:55:11
to Gensim
    from gensim.models import Word2Vec

    # two tiny sentences; min_count=1 keeps every word in the vocabulary
    model = Word2Vec(sentences=[['a', 'b'], ['c', 'd']], window=9999999, min_count=1)
    model.wv.most_similar('a', topn=10)

The above code gives the following result:

    [('d', 0.06363436579704285),
     ('b', -0.010543467476963997),
     ('c', -0.039232250303030014)]

Shouldn't 'b' be ranked first, since it's the only word that appears near 'a'?

alistair...@gmail.com

15 Mar 2022, 17:17:35
to Gensim
This is a silly example. These models learn by semantic similarity, i.e. words that appear in similar contexts are given similar vectors. There is no shared context for any word here.
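For instance, here's a minimal sketch (assuming gensim 4.x; the words and corpus are invented for illustration) of what "shared context" means: two words dropped into the same pool of random contexts end up with similar vectors:

    import random
    from gensim.models import Word2Vec

    random.seed(0)
    context_words = [f'ctx{i}' for i in range(50)]

    # 'tea' and 'coffee' are used interchangeably inside the same random
    # contexts, so their training signals have the same distribution
    corpus = []
    for _ in range(5000):
        left, right = random.sample(context_words, 2)
        corpus.append([left, random.choice(['tea', 'coffee']), right])

    model = Word2Vec(sentences=corpus, window=2, min_count=1, epochs=5)
    print(model.wv.similarity('tea', 'coffee'))  # typically high, e.g. > 0.8

In the two-sentence example above there is no analogous pair: no two words ever appear inside the same context.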

Gordon Mohr

15 Mar 2022, 20:39:22
to Gensim
That's right: there's too little here for the algorithm to deliver its usual benefits. 

I wrote more about why it's not a good idea to test/understand word2vec via tiny/contrived training data in an answer to the same question that was posted on StackOverflow:


- Gordon

visual gcp

15 Mar 2022, 22:02:55
to Gensim
Hi Gordon,

Thanks for the comment.

Can you comment on the code given by ezw93? Below is a summary:

    import random
    from gensim.models import Word2Vec

    # 50,000 copies each of three two-word sentences, shuffled together
    nested_list = []
    for _ in range(50000):
        nested_list.append(['a', 'b'])
    for _ in range(50000):
        nested_list.append(['c', 'd'])
    for _ in range(50000):
        nested_list.append(['a', 'x'])
    random.shuffle(nested_list)

    model = Word2Vec(sentences=nested_list, window=9999999, min_count=1)

    words = ['a', 'b', 'c', 'd', 'x']
    for word in words:
        print(word, model.wv.most_similar(word, topn=10))

This returns:

    a [('c', 0.11672252416610718), ('d', 0.11632005870342255), ('x', 0.09789041429758072), ('b', 0.0978466272354126)]
    b [('x', 0.999595046043396), ('c', 0.10307613760232925), ('a', 0.0978466272354126), ('d', 0.09400281310081482)]
    c [('a', 0.11672253161668777), ('d', 0.11085666716098785), ('b', 0.10307613760232925), ('x', 0.0969843715429306)]
    d [('a', 0.11632007360458374), ('c', 0.11085667461156845), ('x', 0.10299163311719894), ('b', 0.09400279819965363)]
    x [('b', 0.9995951652526855), ('d', 0.10299164056777954), ('a', 0.09789039939641953), ('c', 0.0969843715429306)]

`x` and `b` often occur in a similar context (next to an `a`). All the other similarities between the representations are pretty much negligible.

Why wouldn't the algorithm classify 'a' and 'b', or 'c' and 'd', or 'a' and 'x' as similar, since given, say, 'a', the CBOW algorithm will use 'b' to predict 'a' and use 'a' to predict 'b'?
Or is the explanation by ezw93 in his comment valid?

Gordon Mohr

16 Mar 2022, 13:51:37
to Gensim
It's still a toy-sized/synthetic example. As mentioned, a tiny vocabulary/corpus can't usefully train a larger model. 

Repeating a tiny amount of data N times doesn't create any more of the competing-alternate-usage variety that word2vec needs. It may give the model more *cycles* to learn the peculiarities of the limited data, but you could achieve that same effect with more `epochs` rather than with artificially-repeated examples. (And that's usually better, because contrasting examples will then be interleaved/alternating during training, rather than presenting 50,000 identical texts in a row, super-reinforcing one peculiar example before any balance from contrasting examples.)
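A minimal sketch of that alternative, assuming gensim 4.x (whose `Word2Vec` constructor accepts an `epochs` parameter; the epoch count below is arbitrary):

    from gensim.models import Word2Vec

    # the three distinct sentences once each, instead of 50,000 copies apiece
    corpus = [['a', 'b'], ['c', 'd'], ['a', 'x']]

    # each epoch streams the corpus in order, so the contrasting examples
    # alternate on every pass instead of arriving in runs of 50,000
    model = Word2Vec(sentences=corpus, window=9999999, min_count=1, epochs=5000)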

And why would `'a'` & `'b'` be similar, based on predicting neighbors? The observable neighbors of `'a'` are only and exactly `'b'` and `'x'`. The observable neighbors of `'b'` are only and exactly `'a'`. There is no overlap in membership between the set-of-neighbors `['b', 'x']` and the set-of-neighbors `['a']`, so why would the vectors for `'a'` & `'b'`, trained to predict those neighbors, necessarily become similar?
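That can be checked directly with a small tabulation (hypothetical helper code, not part of the thread's example):

    from collections import defaultdict

    corpus = [['a', 'b'], ['c', 'd'], ['a', 'x']]

    # record each word's observed in-window neighbors (the giant window
    # makes every co-occurring word in a sentence a neighbor)
    neighbors = defaultdict(set)
    for sentence in corpus:
        for word in sentence:
            neighbors[word].update(w for w in sentence if w != word)

    print(neighbors['a'])                   # {'b', 'x'}
    print(neighbors['b'])                   # {'a'}
    print(neighbors['a'] & neighbors['b'])  # set(): nothing shared
    print(neighbors['b'] & neighbors['x'])  # {'a'}: why b & x score high

The only overlapping neighbor sets belong to `'b'` and `'x'`, which both neighbor `'a'`, matching the one high similarity in the output above.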

- Gordon 