Word pair (cumberbatch, actor) similarity using Numberbatch vectors


Asma Shaukat

May 28, 2017, 7:54:58 AM
to conceptnet-users
Hey all

       I want to get word-pair similarity, as similarities for some pairs are given on the ConceptNet blog: https://blog.conceptnet.io/category/software-and-scripts/
       The blog mentions that similarity was obtained via pointwise multiplication. Does this mean the dot product?

       I obtained the following vectors for "actor" and "cumberbatch" from the numberbatch-en embeddings file:
       actor 0.2481 0.0814 0.0120 -0.0724 0.0014 -0.1751 -0.1650 0.1921 -0.0128 0.0955 -0.0855 -0.0543 -0.0379 -0.1053 0.0486 -0.0305 -0.0276 -0.0458 0.2320 0.0719 0.0511 -0.0040 0.0131 0.0045 0.0199 -0.0007 -0.0373 -0.0072 0.0961 0.0450 0.0245 0.0598 -0.0294 -0.1010 0.0301 -0.0105 -0.1012 0.0163 -0.0323 -0.0680 -0.1010 0.0038 -0.0610 -0.1323 0.0225 -0.0822 -0.0227 -0.0330 -0.0359 -0.0568 0.0578 0.0790 -0.0224 0.1045 0.0211 0.0242 -0.0561 -0.0291 -0.0413 0.0012 -0.0690 0.0096 0.0786 -0.0405 0.0478 -0.0476 0.0201 -0.0408 0.0800 0.0335 -0.0563 -0.0501 -0.0244 -0.0540 -0.0704 -0.0344 -0.0289 0.0209 0.0907 0.0297 -0.0663 -0.1213 -0.0863 -0.0420 0.0503 0.0330 0.0241 -0.0168 -0.0553 -0.0082 0.0544 -0.0786 0.0375 -0.1305 0.0158 0.0518 -0.0508 -0.0051 -0.0399 -0.0065 -0.0032 0.0860 0.0102 -0.1350 -0.0685 -0.0098 0.0739 -0.0692 -0.0101 -0.0456 0.0379 -0.0608 -0.0519 -0.1771 -0.0729 -0.0592 0.0708 0.0054 -0.0502 -0.0867 0.0402 -0.0057 -0.1047 0.0390 0.0040 -0.0179 -0.0278 -0.0320 0.0614 0.0765 -0.0655 -0.0029 0.0163 0.0697 -0.0123 0.0282 0.0248 0.0484 -0.0615 0.0285 0.0616 -0.0164 -0.0688 0.1251 0.0544 -0.0113 0.0100 -0.0489 0.0214 0.0282 0.0131 -0.0787 -0.0677 0.1096 -0.0310 0.0230 0.0155 -0.0012 0.0439 -0.0376 -0.0103 0.0180 0.0268 0.0358 0.0426 -0.0930 -0.0350 -0.0296 -0.0334 -0.0169 0.0139 -0.0575 -0.0406 0.0606 -0.0460 0.0514 0.0596 -0.0653 0.0186 0.0158 -0.0690 0.0104 0.0825 -0.0101 -0.0038 -0.0143 -0.0411 -0.0315 0.0199 -0.0210 0.0149 0.0672 -0.0055 -0.0017 -0.0901 0.0111 -0.0087 -0.0073 -0.0150 0.0455 -0.0075 -0.0675 0.0426 -0.0653 -0.0077 0.0437 -0.0363 -0.0408 -0.0771 0.0204 0.0205 0.0004 -0.0674 -0.0735 -0.0002 -0.0882 -0.0085 0.0171 -0.0470 0.0000 -0.0799 0.0795 0.0135 -0.0603 0.0608 -0.0061 -0.0147 0.0658 0.0158 0.0665 -0.0766 -0.0367 0.0062 -0.1243 0.0189 0.0076 0.0498 -0.0114 0.0275 -0.0572 0.1218 -0.0155 0.0170 0.0022 0.0097 -0.0622 -0.0418 -0.0039 0.0548 -0.0640 -0.0295 -0.0151 0.0251 0.0346 0.0020 -0.0481 0.0941 0.0540 0.0588 0.0678 -0.0008 0.0104 -0.0062 
0.0632 0.0038 -0.0182 0.0988 -0.0044 0.0481 0.0331 0.0242 -0.0301 0.0253 -0.0105 -0.0141 0.0666 -0.0062 -0.0247 -0.0429 0.0260 -0.0324 -0.0516 -0.0169 0.0342 0.0231 0.0535 0.0674 0.0068 0.0151 -0.0540 -0.0162 0.0990 0.0079 0.0168 0.0110 -0.0026 -0.0015 0.0428 -0.0114 -0.0073
cumberbatch -0.1321 -0.0086 -0.0840 0.1177 -0.2368 0.0706 0.0943 -0.0740 0.0893 -0.0868 0.0366 0.1622 0.0001 0.1451 0.0362 -0.0144 0.0259 0.0284 -0.1084 -0.0385 0.1329 0.1896 -0.0540 -0.0821 0.1964 0.0617 -0.0236 -0.0490 0.1763 0.0634 0.1933 -0.0133 -0.0737 0.0277 -0.0103 -0.0316 0.0283 0.1256 0.0476 0.0295 0.0749 -0.0213 -0.0014 -0.1070 -0.0206 -0.0509 -0.1155 -0.0178 -0.0344 -0.0436 0.0045 -0.1284 -0.0850 0.0495 -0.0482 0.0233 0.0432 -0.1680 0.0037 0.0066 -0.1000 -0.0359 0.0381 0.0342 0.0543 -0.1231 0.0103 0.0326 -0.0286 0.0263 0.0039 0.1240 0.0163 -0.0613 -0.0154 -0.0158 0.0414 -0.0801 -0.0502 0.0621 0.0080 0.0234 0.0253 -0.0067 0.0017 0.0148 0.0340 -0.0724 -0.0254 0.0046 0.0069 0.0541 -0.0810 -0.0881 -0.0389 0.0752 -0.0403 -0.0447 0.0066 0.0083 -0.0230 0.0287 0.0098 -0.0139 0.0340 -0.0029 -0.0326 -0.0397 0.0338 0.0479 0.0387 0.0772 -0.0670 0.0882 -0.0038 0.0277 0.0100 -0.0027 -0.0053 0.0323 0.0276 0.0274 -0.1201 0.0369 -0.0093 -0.0132 0.0130 -0.0114 0.0748 -0.0181 0.0314 -0.0601 -0.0542 -0.0215 0.0753 0.0548 0.0349 -0.0278 -0.0142 0.0219 0.0041 0.0703 -0.0079 -0.0542 0.0461 -0.1178 0.0627 -0.0108 -0.0040 0.0451 0.0383 -0.0568 0.0375 -0.0702 -0.0362 0.0023 0.0567 -0.0214 0.0698 -0.1030 0.0132 0.0358 0.0210 0.0490 0.0360 -0.0243 0.0306 -0.0562 0.0414 0.0452 -0.0662 -0.0561 0.0573 -0.0720 -0.0106 0.0800 -0.0085 0.0394 -0.0150 0.0419 -0.0314 -0.0715 -0.0194 0.0578 -0.0359 0.0380 0.0466 -0.0211 0.0281 -0.0606 0.0662 0.0189 -0.0589 0.0135 -0.0456 -0.0335 -0.0400 -0.0149 0.0132 -0.0188 -0.0248 -0.0460 0.0217 -0.0775 0.0152 -0.0690 0.0892 0.0306 0.0211 -0.0022 0.0398 0.0072 0.0756 -0.0846 0.0415 -0.0555 0.0384 -0.0197 0.0741 0.0227 0.0269 0.0978 -0.0001 0.0127 -0.0546 0.0417 -0.0053 0.0514 0.0462 -0.0060 -0.0349 0.0054 0.0109 -0.0139 -0.0141 -0.0037 0.0166 0.0439 0.0075 0.0050 0.0496 -0.0200 0.0441 0.0146 0.0182 -0.0301 0.0209 -0.0367 -0.0481 -0.0233 -0.0913 0.0833 -0.0861 0.0061 0.0649 -0.0336 -0.0304 0.0128 0.0085 0.0072 -0.0004 0.0116 0.0098 -0.0601 -0.0247 0.0366 
-0.0117 0.0653 0.0249 0.0452 -0.0107 0.0062 -0.0194 0.0248 0.0420 -0.0591 0.0747 -0.0056 0.0221 0.0390 -0.0133 -0.0422 -0.0253 0.0136 0.1025 0.0172 -0.0500 -0.0513 0.1050 0.0168 0.0071 -0.0443 0.0060 0.0770 -0.0355 -0.0140 0.0287 0.0104 -0.0561 0.0292

      but I got a similarity of -0.0707 instead of 0.35. If anyone knows the correct way to compute pair similarity, please let me know.
      I also want to know which similarity method for the Numberbatch vectors was used in the paper "An Ensemble Method to Produce High Quality Word Embeddings", where the Spearman rank correlation is 0.861. Which embeddings were used? Are they different from the ones available in numberbatch-en-17.04.txt.gz, or some modified form?




Rob Speer

May 29, 2017, 4:02:35 AM
to conceptnet-users

I'm traveling and can't get you the exact details right now, but:

The example and paper you're referring to are from the 16.04 version (April 2016). There should be instructions for reproducing that version in the conceptnet-numberbatch Git repository.

However, you might as well keep using 17.04. Although that one example (Cumberbatch and actor) broke, it was a silly example, and this data was never focused on representing facts about specific people; the vectors perform even better on all evaluations now.

The paper you're referring to was our first attempt to publish; two later papers were published.

The similarity metric is cosine similarity: the dot product of the normalized vectors.

The current conceptnet5 repository has a script to reproduce our most recently published results. You need the conceptnet5 code, not just the vectors, because you need to be able to produce vectors for terms that are in the vocabulary of ConceptNet but don't correspond to a row in the matrix. (We could have included all these additional vectors, but it's an unreasonable waste of RAM when they can just be inferred.)
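
For readers following along, the lookup-and-compare step Asma is attempting can be sketched in a few lines of Python. This is a minimal illustration, not the conceptnet5 code: it assumes a plain-text embeddings file in which each line is a word followed by its vector components, as in the numberbatch-en files.

```python
import math

def load_vectors(path):
    """Read a plain-text embedding file: one word per line, then its components."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) < 2:
                continue  # skip header or blank lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine_similarity(u, v):
    # Dot product of the two vectors after normalizing each to unit length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With uncorrupted vectors, `cosine_similarity(vectors['actor'], vectors['cumberbatch'])` should then reproduce the blog's figure.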



Asma Shaukat

Jun 17, 2017, 1:27:35 AM
to conceptnet-users, r...@luminoso.com
Thank you, sir, for your informative reply. I tried to read about how ConceptNet computes the weights of related terms for a word like "tea kettle" at this link:
http://api.conceptnet.io//related/c/en/tea_kettle?filter=/c/en. Is there an algorithm behind it for calculating such weights, or are these weights predefined by humans? Please share a link to any related reading material if available.

Rob Speer

Jun 18, 2017, 1:11:20 PM
to conceptnet-users

By the way: I just realized from someone else's report that the en-17.04 vectors were corrupted. That would explain the problem. You should go to the site to download en-17.04b.

Rob Speer

Jun 19, 2017, 1:57:13 PM
to conceptnet-users
To answer your second question: the results from the /related API are just the dot products from a reduced-memory version of ConceptNet Numberbatch. If you've been trying to get related results that match, I hope that the fix to the data will help.
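
What that lookup amounts to can be sketched as follows (the function and variable names are hypothetical, not the actual API code): score every row of the term matrix by its dot product with the query term's vector, and return the highest-scoring labels.

```python
def related(query_vec, matrix, labels, top_n=5):
    # Score each row by its dot product with the query vector
    # and return the top_n (label, score) pairs, best first.
    scores = [(sum(a * b for a, b in zip(query_vec, row)), label)
              for row, label in zip(matrix, labels)]
    scores.sort(reverse=True)
    return [(label, score) for score, label in scores[:top_n]]
```

If the rows are unit-normalized, these dot products are cosine similarities, which is why the /related weights line up with the similarity metric above.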

Asma Shaukat

Jul 7, 2017, 10:20:31 AM
to conceptnet-users, r...@luminoso.com
Thanks, Robert. Due to the previous bug in the ConceptNet vectors, I was confused about the results. Now things are working.
     I want to reproduce the results given in the paper "ConceptNet 5.5: An Open Multilingual Graph of General Knowledge" for different datasets like MEN, RW, etc.
     As mentioned in the paper, the provided splits of the MEN dataset were used for testing and training. I am using the same testing/training split, with the lemma forms of the words as provided in the dataset. I downloaded the vectors "conceptnet-numberbatch-201609-en-main". I am working in MATLAB. I obtained vectors for both the testing and training datasets from the downloaded vectors file (attached is a file containing the testing vectors).
Then I computed cosine similarity, but there is a small difference in my results: on the testing data I got 0.8612 (0.866 in the paper), and on the training data I got 0.8538 (0.859 in the paper). Maybe I should ignore such a difference, but if it is due to some important reason, ignoring it may lead me toward wrong experiments. Are the training and testing vectors I used right? Is some additional L1 normalization across features required before taking the cosine similarity, or something else?


Best Wishes.
Testing_numberbatch-enVectorsForMEN.txt

Rob Speer

Jul 7, 2017, 1:15:09 PM
to Asma Shaukat, conceptnet-users
Great that it's almost working.

My best guess about where the difference comes from is in the strategy for looking up out-of-vocabulary words. Are you implementing anything for OOV words? In the AAAI paper, our OOV strategy was to look up the terms in ConceptNet and average the known vectors of their neighbors, if those exist. This means that reproducing the exact results will depend on the exact contents of ConceptNet that they're used with.
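
The averaging step of that OOV strategy can be sketched as follows (a simplified illustration; the real code also handles the ConceptNet lookup that finds the neighbors in the first place):

```python
def oov_vector(neighbor_vectors):
    # Average the known vectors of a term's ConceptNet neighbors
    # to stand in for a term that has no row of its own.
    # Assumes at least one neighbor has a known vector.
    n = len(neighbor_vectors)
    dim = len(neighbor_vectors[0])
    return [sum(vec[i] for vec in neighbor_vectors) / n for i in range(dim)]
```

Because the result depends on which neighbors exist, reproducing it exactly requires the same ConceptNet contents that were used in the paper.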

I forget whether there are any OOV words in the MEN dataset. There might have been one or two, which would explain a small discrepancy. If there aren't any, the difference must be elsewhere. If the problem is indeed OOV words, it would show up as a much larger difference on the RW dataset.

My recommended way of getting the exact evaluation results is to set up a machine to run the ConceptNet build process, check out the 'aaai2017' tag of the conceptnet5 repository, and run './scripts/reproduce-evaluation.sh'. Unfortunately this now requires the extra step of making sure that xmltodict==0.10.2 is installed, because versions of xmltodict after that changed in a way that breaks the build. I'm now looking into making 'aaai2017' a branch that I update to fix things like that.

Rob Speer

Jul 10, 2017, 12:05:32 PM
to Asma Shaukat, conceptnet-users
I just re-ran the build from the 'aaai2017' tag, using './scripts/reproduce-evaluation.sh', and can confirm that the results that come out are the results in the paper.