how are features represented in the spouse example

Bruce Ho

Mar 31, 2020, 5:22:27 PM
to deepdive-users
Does anyone understand this statement?
"we are using a sparse storage representation- you could think of a spouse candidate (p1_id, p2_id) as being represented by a vector of length L = COUNT(DISTINCT feature), consisting of all zeros except for at the indexes specified by the rows with key (p1_id, p2_id)"

I thought about it for quite a while but still don't get it. First, there appear to be infinitely many possible feature values, so COUNT(DISTINCT feature) would be rather inconvenient to work with. The example features given include things like "W_LEMMA_L_1_R_3_[elder]_[will stick he]", which is specific to this one sentence and is not shared throughout the entire article collection. I suppose you could convert such features into a vector space representation, but no details are provided.

And the phrase "consisting of all zeros except for at the indexes specified by the rows with key (p1_id, p2_id)" makes it all the more confusing. The index mentioned appears to be a row index, which is unique to a (p1, p2) combination, so it would have nothing to do with indexing in the feature space.

Furthermore, the phrase "consisting of all zeros except" makes it sound like they are using one-hot encoding, which is a completely different approach from vector space encoding.

If anyone figured this out, please post.

Bruce Ho

Apr 3, 2020, 12:33:18 AM
to deepdive-users
I think it makes more sense now. The features are lemma sequences of the words between the two mentions, plus n-grams surrounding each mention. With a large enough vocabulary, you can get some repeating patterns as features. The encoding is probably some kind of simple hash table.
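
A minimal sketch of that reading, in case it helps anyone else. It is not DeepDive's actual implementation, only one plausible interpretation of the quoted statement. It assumes the feature table has one row per (p1_id, p2_id, feature), that COUNT(DISTINCT feature) is taken over the features actually observed in that table (so L is finite), and that each distinct feature string is given an integer index through a plain dictionary. The rows, the ids, and the names FEATURE_A and FEATURE_B are made up for illustration; only the W_LEMMA string comes from the example above.

```python
from collections import defaultdict

# Hypothetical rows of a (p1_id, p2_id, feature) table. Illustrative data only.
rows = [
    ("p1", "p2", "FEATURE_A"),
    ("p1", "p2", "FEATURE_B"),
    ("p3", "p4", "FEATURE_A"),
    ("p3", "p4", "W_LEMMA_L_1_R_3_[elder]_[will stick he]"),
]


def build_feature_index(rows):
    """Assign each distinct feature string an integer index, in order of first appearance."""
    index = {}
    for _, _, feature in rows:
        if feature not in index:
            index[feature] = len(index)
    return index


def sparse_candidates(rows, index):
    """Sparse form: for each (p1_id, p2_id) key, keep only the indexes that are non-zero."""
    active = defaultdict(set)
    for p1_id, p2_id, feature in rows:
        active[(p1_id, p2_id)].add(index[feature])
    return dict(active)


def to_dense(active_indexes, length):
    """Expand one candidate into the length-L vector of zeros and ones described in the quote."""
    vector = [0] * length
    for i in active_indexes:
        vector[i] = 1
    return vector


feature_index = build_feature_index(rows)
L = len(feature_index)  # plays the role of COUNT(DISTINCT feature)
sparse = sparse_candidates(rows, feature_index)

for key, active in sparse.items():
    print(key, to_dense(active, L))
```

Under this reading, "the indexes specified by the rows with key (p1_id, p2_id)" are feature indexes, not row indexes: each row for a candidate names one feature, and that feature's position in the vector is set to 1. A candidate usually has several rows, so the vector is multi-hot rather than one-hot, and a feature that occurs in only one sentence simply occupies a position that is non-zero for a single candidate.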