Thanks for the comment.
Could you comment on the code given by ezw93? Below is a summary:
```python
import random

from gensim.models import Word2Vec

# 150k two-word "sentences": 50k copies of each pair.
nested_list = []
for _ in range(50000):
    nested_list.append(['a', 'b'])
for _ in range(50000):
    nested_list.append(['c', 'd'])
for _ in range(50000):
    nested_list.append(['a', 'x'])
random.shuffle(nested_list)

# Default sg=0, i.e. CBOW; the huge window makes every word in a
# sentence a context word for every other word.
model = Word2Vec(sentences=nested_list, window=9999999, min_count=1)

words = ['a', 'b', 'c', 'd', 'x']
for word in words:
    print(word, model.wv.most_similar(word, topn=10))
```
This returns something like the following (only four neighbours per word, since the vocabulary has five words, and the exact numbers vary between runs because training is stochastic):
```
a [('c', 0.11672252416610718), ('d', 0.11632005870342255), ('x', 0.09789041429758072), ('b', 0.0978466272354126)]
b [('x', 0.999595046043396), ('c', 0.10307613760232925), ('a', 0.0978466272354126), ('d', 0.09400281310081482)]
c [('a', 0.11672253161668777), ('d', 0.11085666716098785), ('b', 0.10307613760232925), ('x', 0.0969843715429306)]
d [('a', 0.11632007360458374), ('c', 0.11085667461156845), ('x', 0.10299163311719894), ('b', 0.09400279819965363)]
x [('b', 0.9995951652526855), ('d', 0.10299164056777954), ('a', 0.09789039939641953), ('c', 0.0969843715429306)]
```
`x` and `b` often occur in a similar context (next to an `a`), so they end up close together. All the other pairs of representations show essentially no noticeable similarity.
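To check that claim directly against the training data, here is a minimal sketch that continues the snippet above and reuses its `nested_list`:

```python
from collections import Counter

# Tally, for each word, which words share a two-word "sentence" with it.
contexts = {w: Counter() for w in ['a', 'b', 'c', 'd', 'x']}
for pair in nested_list:
    for w in pair:
        for other in pair:
            if other != w:
                contexts[w][other] += 1

print(contexts['b'])  # Counter({'a': 50000}): b only ever appears next to a
print(contexts['x'])  # Counter({'a': 50000}): so does x
print(contexts['c'])  # Counter({'d': 50000}): c's context profile is unique
```

`b` and `x` have identical context profiles, while every other word's profile is unique to it.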
Why wouldn't the algorithm classify 'a' and 'b', 'c' and 'd', or 'a' and 'x' as similar, since given, say, the pair ['a', 'b'], the CBOW algorithm will use 'b' to predict 'a' and use 'a' to predict 'b'?
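To make that question concrete, here is a sketch of how one might compare the input vectors that `most_similar` uses with the output (context) vectors; this assumes gensim's default negative-sampling setup, which, as far as I understand, stores the output matrix in `model.syn1neg`:

```python
import numpy as np

def cos(u, v):
    # Plain cosine similarity between two vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

idx = model.wv.key_to_index   # word -> row index
in_vecs = model.wv            # input vectors (what most_similar compares)
out_vecs = model.syn1neg      # output vectors (assumes default negative sampling)

# Input vector of 'a' vs output vector of 'b': the pairing that
# co-occurrence in CBOW actually trains to score highly.
print(cos(in_vecs['a'], out_vecs[idx['b']]))

# Input vector of 'a' vs input vector of 'b': the pairing that
# most_similar measures, which co-occurrence alone need not align.
print(cos(in_vecs['a'], in_vecs['b']))
```

If my understanding is right, co-occurrence pushes a word's input vector towards its neighbours' output vectors, not towards their input vectors, which would explain the numbers above.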
Or is the explanation given by ezw93 in his comment valid?