strange gensim word2vec behavior


visual gcp

15 Mar 2022, 12:55:11
to Gensim
    from gensim.models import Word2Vec

    # two tiny sentences; min_count=1 keeps every word in the vocabulary
    model = Word2Vec(sentences=[['a', 'b'], ['c', 'd']], window=9999999, min_count=1)
    model.wv.most_similar('a', topn=10)

The above code gives the following result:

    [('d', 0.06363436579704285),
     ('b', -0.010543467476963997),
     ('c', -0.039232250303030014)]

Shouldn't 'b' be ranked first, since it's the only word that appears near 'a'?

alistair...@gmail.com

15 Mar 2022, 17:17:35
to Gensim
This is a silly example. These models learn by semantic similarity, i.e. words that appear in similar contexts are given similar vectors. There is no shared context for any word here.
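For instance, here's a minimal sketch (assuming gensim 4.x; the words and corpus are invented for illustration) of what "shared context" means: two words dropped into the same pool of random contexts end up with similar vectors:

    import random
    from gensim.models import Word2Vec

    random.seed(0)
    context_words = [f'ctx{i}' for i in range(50)]

    # 'tea' and 'coffee' are used interchangeably inside the same random
    # contexts, so their training signals have the same distribution
    corpus = []
    for _ in range(5000):
        left, right = random.sample(context_words, 2)
        corpus.append([left, random.choice(['tea', 'coffee']), right])

    model = Word2Vec(sentences=corpus, window=2, min_count=1, epochs=5)
    print(model.wv.similarity('tea', 'coffee'))  # typically high, e.g. > 0.8

In the two-sentence example above there is no analogous pair: no two words ever appear inside the same context.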

Gordon Mohr

15 Mar 2022, 20:39:22
to Gensim
That's right: there's too little here for the algorithm to deliver its usual benefits. 

I wrote more about why it's not a good idea to test/understand word2vec via tiny/contrived training data in an answer to the same question that was posted on StackOverflow:


- Gordon

visual gcp

15 Mar 2022, 22:02:55
to Gensim
Hi Gordon,

Thanks for the comment.

Can you comment on the code given by ezw93? Below is a summary:

    import random
    from gensim.models import Word2Vec

    # 50,000 copies each of three two-word sentences, shuffled together
    nested_list = []
    for _ in range(50000):
        nested_list.append(['a', 'b'])
    for _ in range(50000):
        nested_list.append(['c', 'd'])
    for _ in range(50000):
        nested_list.append(['a', 'x'])
    random.shuffle(nested_list)

    model = Word2Vec(sentences=nested_list, window=9999999, min_count=1)

    words = ['a', 'b', 'c', 'd', 'x']
    for word in words:
        print(word, model.wv.most_similar(word, topn=10))

This returns:

    a [('c', 0.11672252416610718), ('d', 0.11632005870342255), ('x', 0.09789041429758072), ('b', 0.0978466272354126)]
    b [('x', 0.999595046043396), ('c', 0.10307613760232925), ('a', 0.0978466272354126), ('d', 0.09400281310081482)]
    c [('a', 0.11672253161668777), ('d', 0.11085666716098785), ('b', 0.10307613760232925), ('x', 0.0969843715429306)]
    d [('a', 0.11632007360458374), ('c', 0.11085667461156845), ('x', 0.10299163311719894), ('b', 0.09400279819965363)]
    x [('b', 0.9995951652526855), ('d', 0.10299164056777954), ('a', 0.09789039939641953), ('c', 0.0969843715429306)]

`x` and `b` often occur in a similar context (next to an `a`). All the other similarities between the representations are pretty much negligible.

Why wouldn't the algorithm classify 'a' and 'b', or 'c' and 'd', or 'a' and 'x' as similar, since given, say, 'a', the CBOW algorithm will use 'b' to predict 'a' and use 'a' to predict 'b'?
Or is the explanation by ezw93 in his comment valid?

Gordon Mohr

16 Mar 2022, 13:51:37
to Gensim
It's still a toy-sized/synthetic example. As mentioned, a tiny vocabulary/corpus can't usefully train a larger model. 

Repeating a tiny amount of data N times doesn't create any more of the competing-alternate-usage variety that word2vec needs. It may give the model more *cycles* to learn the peculiarities of the limited data, but you could achieve that same effect with more `epochs` rather than with artificially-repeated examples. (And that's usually better, because contrasting examples will then be interleaved/alternating during training, rather than presenting 50,000 identical texts in a row, super-reinforcing one peculiar example before any balance from contrasting examples.)
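A minimal sketch of that alternative, assuming gensim 4.x (whose `Word2Vec` constructor accepts an `epochs` parameter; the epoch count below is arbitrary):

    from gensim.models import Word2Vec

    # the three distinct sentences once each, instead of 50,000 copies apiece
    corpus = [['a', 'b'], ['c', 'd'], ['a', 'x']]

    # each epoch streams the corpus in order, so the contrasting examples
    # alternate on every pass instead of arriving in runs of 50,000
    model = Word2Vec(sentences=corpus, window=9999999, min_count=1, epochs=5000)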

And why would `'a'` & `'b'` be similar, based on predicting neighbors? The observable neighbors of `'a'` are only and exactly `'b'` and `'x'`. The observable neighbors of `'b'` are only and exactly `'a'`. There is no overlap in membership between the set-of-neighbors `['b', 'x']` and the set-of-neighbors `['a']`, so why would the vectors for `'a'` & `'b'`, trained to predict those neighbors, necessarily become similar?
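That can be checked directly with a small tabulation (hypothetical helper code, not part of the thread's example):

    from collections import defaultdict

    corpus = [['a', 'b'], ['c', 'd'], ['a', 'x']]

    # record each word's observed in-window neighbors (the giant window
    # makes every co-occurring word in a sentence a neighbor)
    neighbors = defaultdict(set)
    for sentence in corpus:
        for word in sentence:
            neighbors[word].update(w for w in sentence if w != word)

    print(neighbors['a'])                   # {'b', 'x'}
    print(neighbors['b'])                   # {'a'}
    print(neighbors['a'] & neighbors['b'])  # set(): nothing shared
    print(neighbors['b'] & neighbors['x'])  # {'a'}: why b & x score high

The only overlapping neighbor sets belong to `'b'` and `'x'`, which both neighbor `'a'`, matching the one high similarity in the output above.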

- Gordon 