word2vec for proper nouns

Ishaan Arora

Mar 23, 2017, 7:26:43 AM
to gensim

I have trained a word2vec model on a movie dataset whose columns include star cast, director name and other similar features. The text is not free-flowing (it is comma-separated). As a result, the similarity and score functions don't produce satisfactory results, because the generated embeddings are not up to the mark.

  1. Is word2vec the right approach for a problem with a large number of proper nouns and no free-flowing text?

  2. If yes, which parameters should I tune when training on proper nouns?

Andrey Kutuzov

Mar 23, 2017, 9:22:24 AM
to gen...@googlegroups.com
Hi Ishaan,

I personally think that distributional models (like word2vec) will not
be of much use in this case. Their power comes exactly from what you are
missing in your dataset - typical word co-occurrences.

You will probably be much better off simply using your 'columns' as
features in classification/clustering/whatever.
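
A minimal sketch of what "columns as features" could look like, assuming each record is available as a dict of column name to value (the movie record and the function name are made up for illustration). Each column/value combination becomes one binary feature, which most classifiers and clustering tools can consume after vectorization:

```python
def record_to_features(record):
    """Turn a dict of column -> value (or list of values) into sparse binary features."""
    features = {}
    for column, value in record.items():
        # A column such as the cast can hold several values; treat each separately.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            features[f"{column}={v}"] = 1
    return features


# Hypothetical movie record, not taken from the original dataset.
movie = {"director": "Jane Doe", "cast": ["Actor A", "Actor B"], "genre": "drama"}
features = record_to_features(movie)
print(features)
```

Feature dicts of this shape can then be turned into a numeric matrix, for example with scikit-learn's `DictVectorizer`, if that library is an option.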

--
Solve et coagula!
Andrey

Ishaan Arora

Mar 23, 2017, 10:02:33 AM
to gen...@googlegroups.com
Hi Andrey,
Thanks for replying. I am also using word2vec for another use case. Could you please validate that one as well?

These are a few of the addresses I am training my word2vec model on:

[['WORDSWORTH', 'HOUSE', '21', 'FAIRFAX', 'ROAD', 'BIRMINGHAM', '', 'GBR'], 
['THE', 'HOLLIES', '2', 'FRIESTON', 'ROAD', 'GRANTHAM', 'CAYTHORPE', 'NG32', 'GBR']]

When I use the score function, I get the following results:

model.score(["WORDSWORTH HOUSE 21 FAIRFAX ROAD".split()])[0]
-30.27762
model.score(["WORDSWORTH GBR".split()])[0]
-19.615669

What I am trying to achieve is that the score in the first case should be better than in the second, since the first sequence's context is present in the training data and the second's is not. Is word2vec an appropriate way to go for such a use case?

Thanks
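
One likely reason the shorter query scores higher: `score` returns a log-likelihood that is summed over the predictions made for a sentence, so longer sentences accumulate more negative terms and raw scores of different lengths are not directly comparable. The sketch below applies a simple per-token normalization to the two numbers reported above. The normalization is an illustrative heuristic, not something gensim does, and it is only approximate because the number of summed terms also depends on the `window` setting.

```python
def per_token(log_score, tokens):
    """Average log-likelihood per token (rough length normalization)."""
    return log_score / len(tokens)


long_query = "WORDSWORTH HOUSE 21 FAIRFAX ROAD".split()
short_query = "WORDSWORTH GBR".split()

# Raw scores copied from the message above.
long_avg = per_token(-30.27762, long_query)
short_avg = per_token(-19.615669, short_query)

print(long_avg, short_avg)
```

After normalization the five-token address comes out ahead of the two-token query, which is the ordering the question expects.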

Andrey Kutuzov

Mar 23, 2017, 12:16:47 PM
to gen...@googlegroups.com
Hi Ishaan,

You won't get much insight from training word2vec models on toy
examples. Try training on a large corpus and see what happens.

Also, what is the final aim that you are trying to achieve? The 'score'
function is as a rule used as a sort of classifier, in cases when you
have 2 or more models trained on different corpora.
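
To illustrate the "score as a classifier" idea, here is a minimal sketch: train one model per corpus, score the query under each, and pick the model with the highest log-likelihood. To keep the example self-contained it uses a simple add-one-smoothed unigram model as a stand-in for a trained Word2Vec model. The class, the function and the tiny corpora are all made up for illustration.

```python
import math
from collections import Counter


class UnigramModel:
    """Add-one-smoothed unigram model; a stand-in for a trained Word2Vec model."""

    def __init__(self, sentences):
        self.counts = Counter(tok for sent in sentences for tok in sent)
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen tokens

    def score(self, sentence):
        """Log-likelihood of a token list under this model."""
        return sum(
            math.log((self.counts[tok] + 1) / (self.total + self.vocab_size))
            for tok in sentence
        )


def classify(sentence, models):
    """Return the label of the model that assigns the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(sentence))


# Hypothetical toy corpora, one per class.
address_corpus = [["FAIRFAX", "ROAD", "BIRMINGHAM", "GBR"],
                  ["FRIESTON", "ROAD", "GRANTHAM", "GBR"]]
organization_corpus = [["SERVICES", "LIMITED"], ["HOLDINGS", "LIMITED"]]

models = {
    "address": UnigramModel(address_corpus),
    "organization": UnigramModel(organization_corpus),
}

print(classify(["ROAD", "GBR"], models))
print(classify(["SERVICES", "LIMITED"], models))
```

Both queries have the same length here, so the comparison across models is not distorted by sentence length.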

Ishaan Arora

Mar 23, 2017, 12:44:39 PM
to gen...@googlegroups.com
Just to clarify, the two examples in my previous mail were only 2 samples out of the 30 million examples I am training my word2vec model on. I will explain the end goal with this example.
Say my training CSV file has column headers like

ORGANIZATION NAME,CITY,STATE,POSTCODE,COUNTRY
and say one of the training sentences is

SERVICES LIMITED,KINGSTON HULL,ENGLAND,EY1,GBR

So the token 'SERVICES' must have high similarity to 'LIMITED', less similarity to 'KINGSTON', and extremely low similarity to 'GBR'.

Can a word2vec model learn this, given that there is no free-flowing text and the tokens are mostly proper nouns (i.e. names)?


Andrey Kutuzov

Mar 23, 2017, 2:37:14 PM
to gen...@googlegroups.com
When I talk about 'end goal', I mean the real downstream task that you
are trying to solve. What exactly do you want your model to be good for?

Certainly, just calculating similarities between words is not your
downstream task (if you are not a linguist interested in semantics). Do
you aim to find organizations that are similar to the organization in a
hypothetical user query? Or you want to do something else?
The answer to this question determines whether word2vec can be helpful
to your cause or not.

But as I've already said, my impression is that distributional models
will probably not be very useful on this data.

Ishaan Arora

Mar 23, 2017, 10:06:35 PM
to gen...@googlegroups.com
The end goal is for a hypothetical user query to output a confidence score that represents the probability of that string of tokens occurring together. So a query like (see the example in my previous mail)
"SERVICES LIMITED" must output a higher score, because those tokens are highly likely to be found together in the ORGANIZATION column, but a query like
"LIMITED GBR" (where "LIMITED" is a token mostly found in ORGANIZATION and "GBR" is found in COUNTRY) must output a relatively low score, because these tokens are very unlikely to be found together.

This is quite an intricate use case and needs validation from a word2vec expert like you.

Gordon Mohr

Mar 23, 2017, 11:42:03 PM
to gensim
That still seems like it may be an interim goal. (Why are these co-occurrence probabilities important?)

But, if the value you want is really the co-occurrence probability, word2vec is a roundabout way to approximate that. It does both more and less than what you need. 

Depending on your data size, why not count the co-occurrences and report them exactly? For example, rather than trying to figure out a way to make word2vec output scores that vaguely fit your goals, record exactly the co-occurrences of different tokens, in the same or different fields, and report exactly those co-occurrence rates? (If you need to weight co-occurrences in different fields differently, that'd be an explicit choice to be tuned in your code for your real end-goal.)
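
A minimal sketch of the exact-counting approach, assuming each record is a dict of field name to raw text. It counts only pairs of tokens that share a field, which is one explicit weighting choice among many; the function names, the second record and the Jaccard-style rate are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def count_within_field(records):
    """Count tokens and token pairs that appear together in the same field value."""
    token_counts = Counter()
    pair_counts = Counter()
    for record in records:
        for text in record.values():
            tokens = sorted(set(text.split()))  # sorted so each pair has one canonical key
            token_counts.update(tokens)
            pair_counts.update(combinations(tokens, 2))
    return token_counts, pair_counts


def cooccurrence_rate(a, b, token_counts, pair_counts):
    """Share of field values containing a or b that contain both (Jaccard index)."""
    both = pair_counts[tuple(sorted((a, b)))]
    either = token_counts[a] + token_counts[b] - both
    return both / either if either else 0.0


# First record follows the example in this thread; the second is hypothetical.
records = [
    {"ORGANIZATION NAME": "SERVICES LIMITED", "CITY": "KINGSTON HULL", "COUNTRY": "GBR"},
    {"ORGANIZATION NAME": "HOLDINGS LIMITED", "CITY": "LEEDS", "COUNTRY": "GBR"},
]
token_counts, pair_counts = count_within_field(records)

print(cooccurrence_rate("SERVICES", "LIMITED", token_counts, pair_counts))
print(cooccurrence_rate("LIMITED", "GBR", token_counts, pair_counts))
```

Tokens never seen in the data simply get a rate of 0.0 here, and cross-field pairs could be counted in a second `Counter` with their own weight if they matter for the real end goal.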

- Gordon

Ishaan Arora

Mar 24, 2017, 1:41:08 AM
to gen...@googlegroups.com
Hi Gordon,
Co-occurrences cannot be calculated exactly, because the test queries will not exactly match the training data the word2vec model was trained on. There might be tokens that are not in the pre-existing vocabulary, and in some cases tokens are arranged in a different order from the one the model was trained on.
But the expectation is that word2vec should still output a high co-occurrence probability for similar/near tokens and a low probability for dissimilar/far tokens.


Gordon Mohr

Mar 24, 2017, 3:03:51 AM
to gensim
Word2Vec also can't tell you anything about tokens which are not in the training vocabulary, and if you were creating your own co-occurrence statistics, you'd be free to either take ordering into account, or not. (In Word2Vec, ordering comes into play with regard to the 'window' parameter, which doesn't map well to your arbitrary-list-of-fielded-records – where I suspect the fact that tokens appear in adjacent fields is no more important than appearing in any fields.)

Your data and goal don't particularly fit usual word2vec patterns – so whether it's useful will likely require some improvised data-prep and iterative experimentation. For example, it might make sense to split your records into pairs of fields, and/or sometimes treat fields as multi-token but other times as a joined token, and/or prefix tokens with field-names, to fully benefit from the fieldedness, but not be unduly influenced by the arbitrary ordering of fields in a record. That is, your example record of:

*ORGANIZATION NAME*,*CITY*,*STATE*,*POSTCODE*,*COUNTRY*
SERVICES LIMITED,KINGSTON HULL,ENGLAND,EY1,GBR

...*might* induce better vectors if preprocessed to be more like...

    import itertools

    tokens = ['org:SERVICES', 'org:LIMITED', 'city:KINGSTON_HULL', 'state:ENGLAND', 'pc:EY1', 'country:GBR']
    # print every unordered pair of field-prefixed tokens
    for c in itertools.combinations(tokens, 2):
        print(c)

('org:SERVICES', 'org:LIMITED')
('org:SERVICES', 'city:KINGSTON_HULL')
('org:SERVICES', 'state:ENGLAND')
('org:SERVICES', 'pc:EY1')
('org:SERVICES', 'country:GBR')
('org:LIMITED', 'city:KINGSTON_HULL')
('org:LIMITED', 'state:ENGLAND')
('org:LIMITED', 'pc:EY1')
('org:LIMITED', 'country:GBR')
('city:KINGSTON_HULL', 'state:ENGLAND')
('city:KINGSTON_HULL', 'pc:EY1')
('city:KINGSTON_HULL', 'country:GBR')
('state:ENGLAND', 'pc:EY1')
('state:ENGLAND', 'country:GBR')
('pc:EY1', 'country:GBR')


But again, only testing/tinkering, including with a wide range of word2vec parameters, will reveal if some improvised word2vec model built on this sort of data will have the end-characteristics you seek. 
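
The pairing shown above could be wrapped into a small helper that turns each CSV record into two-token "sentences". This is only a sketch: the function name, the short field prefixes and the `joined_fields` parameter are assumptions for illustration, and which fields to join is exactly the kind of choice that needs experimentation.

```python
from itertools import combinations


def record_to_pairs(record, joined_fields=()):
    """Build field-prefixed tokens from a record and return all unordered pairs.

    record: dict mapping a short field prefix to the raw field text.
    joined_fields: prefixes whose words are joined into one token with '_'.
    """
    tokens = []
    for prefix, text in record.items():
        if prefix in joined_fields:
            tokens.append(f"{prefix}:{'_'.join(text.split())}")
        else:
            tokens.extend(f"{prefix}:{word}" for word in text.split())
    # Each pair becomes its own short "sentence", so field order no longer matters.
    return [list(pair) for pair in combinations(tokens, 2)]


record = {
    "org": "SERVICES LIMITED",
    "city": "KINGSTON HULL",
    "state": "ENGLAND",
    "pc": "EY1",
    "country": "GBR",
}
pairs = record_to_pairs(record, joined_fields={"city"})
print(len(pairs))
print(pairs[0])
```

The resulting list of token lists has the shape gensim expects for training sentences, so the pairs from all records could be concatenated (or streamed) and passed to Word2Vec for the kind of experimentation described above.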

- Gordon
