Hi, John.


张东旭

Aug 17, 2014, 11:03:31 AM
to stream...@googlegroups.com
What is the meaning of equiv_id? How can we use it as coref resolution?

John R Frank

Aug 17, 2014, 12:40:37 PM
to 张东旭, stream...@googlegroups.com

> What is the meaning of equiv_id? How can we use it as coref resolution?


Within a single document (StreamItem), the equiv_id values are generated by a tagger's within-doc coref model. For example, the Serif tagger connects mentions of all three types (name, pronoun, and nominal) into coref chains. The mention_type property distinguishes the three types.

Since a single mention can span multiple tokens, the mention_id property connects multi-token mentions.
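
To make the two properties concrete, here is a minimal sketch of how tokens could be grouped into coref chains. It assumes nothing about the streamcorpus API: `Token` is a hypothetical stand-in carrying only the fields discussed here, the id values are made up, and `coref_chains` is an illustrative helper, not part of any library.

```python
from collections import OrderedDict, namedtuple

# Hypothetical stand-in for a tagger token; real tokens carry more fields.
Token = namedtuple("Token", ["token", "mention_id", "equiv_id"])

def coref_chains(tokens):
    """Map each equiv_id to the list of mention strings in that chain."""
    chains = OrderedDict()
    for tok in tokens:
        # -1 means "not set", so such tokens belong to no mention or chain.
        if tok.equiv_id == -1 or tok.mention_id == -1:
            continue
        mentions = chains.setdefault(tok.equiv_id, OrderedDict())
        # Tokens sharing a mention_id are parts of one multi-token mention.
        mentions.setdefault(tok.mention_id, []).append(tok.token)
    return {
        equiv_id: [" ".join(words) for words in mentions.values()]
        for equiv_id, mentions in chains.items()
    }

# Toy document: "John Smith is the chairman. He founded Acme."
tokens = [
    Token("John", 0, 0), Token("Smith", 0, 0), Token("is", -1, -1),
    Token("the", 1, 0), Token("chairman", 1, 0), Token(".", -1, -1),
    Token("He", 2, 0), Token("founded", -1, -1),
    Token("Acme", 3, 1), Token(".", -1, -1),
]
chains = coref_chains(tokens)
print(chains)
```

In this toy data, chain 0 collects a name, a nominal, and a pronoun that all refer to the same person, while chain 1 holds a single mention.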

The within-doc coref chains are not resolved to an external reference such as Freebase. We hope to receive a bundle of community metadata with such entity linking output, but I don't yet have a time frame for this.

Does this answer your question?

John

张东旭

Aug 18, 2014, 11:58:39 PM
to stream...@googlegroups.com, zhangd...@gmail.com
Do you mean that if two tokens have the same mention_id value, they refer to the same query?
But I decrypted my relevant documents, read mention_id from them, and found that all of the values equal -1, which means null. (streamcorpus.Chunk[i].body.sentences['serif'][j].tokens[k].mention_id)
Is that normal?
I still don't quite understand the meaning of mention_id and equiv_id.
 

On Monday, August 18, 2014 at 12:40:37 AM UTC+8, John Frank wrote:

John R. Frank

Aug 19, 2014, 4:43:58 AM
to 张东旭, stream...@googlegroups.com
> Do you mean that if two tokens have same value of mention_id, they will
> refer to the same query?  

It means that the NER tagger's model is asserting that they are part of
the same mention. E.g., in the sentence "The chairman is named John
Smith.", there is a two-token mention of "John Smith", and both tokens
would have the same mention_id.
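
As a rough illustration of that sentence, the sketch below joins consecutive tokens that share a mention_id. The (text, mention_id) pairs are made up for the example, and `mentions` is a hypothetical helper rather than anything provided by streamcorpus.

```python
from itertools import groupby

# Toy (token text, mention_id) pairs; -1 marks tokens outside any mention.
sentence = [
    ("The", -1), ("chairman", -1), ("is", -1), ("named", -1),
    ("John", 0), ("Smith", 0), (".", -1),
]

def mentions(pairs):
    """Return (mention_id, text) for each run of tokens sharing a mention_id."""
    found = []
    for mention_id, run in groupby(pairs, key=lambda pair: pair[1]):
        if mention_id == -1:
            continue  # skip runs of untagged tokens
        found.append((mention_id, " ".join(text for text, _ in run)))
    return found

result = mentions(sentence)
print(result)
```

This relies on the tokens of one mention being adjacent, which is an assumption of the sketch.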


> But I decrypted mention_id from my relative documents and found all of
> them equal -1 which means null. 
> (streamcorpus.Chunk[i].body.sentences['serif'][j].tokens[k].mention_id)
> Is that normal?

They shouldn't *all* be -1.
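
One quick way to check is to tally how many tokens carry a mention_id other than -1. The sketch below walks the same attribute path quoted above (body.sentences['serif'], then tokens, then mention_id); the `SimpleNamespace` objects are hypothetical stand-ins for real StreamItems so that the example runs on its own, and `mention_id_counts` is an illustrative name.

```python
from collections import Counter
from types import SimpleNamespace as NS

def mention_id_counts(stream_items, tagger_id="serif"):
    """Count tokens with and without a mention_id for one tagger."""
    counts = Counter()
    for si in stream_items:
        if si.body is None:
            continue  # nothing to inspect for this item
        for sentence in si.body.sentences.get(tagger_id, []):
            for tok in sentence.tokens:
                key = "tagged" if tok.mention_id != -1 else "untagged"
                counts[key] += 1
    return counts

# Stand-in for one StreamItem with a single three-token sentence.
fake_item = NS(body=NS(sentences={"serif": [
    NS(tokens=[NS(mention_id=0), NS(mention_id=0), NS(mention_id=-1)]),
]}))
counts = mention_id_counts([fake_item])
print(counts)
```

If "tagged" comes out as zero across many documents, it is worth checking that the right tagger id is being read.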

You might try using streamcorpus_dump, which gets installed when you `pip
install streamcorpus`. It generates output like the attached, which came
from these commands:

wget http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-kba-filtered/2012-01-31-13/news-20-515f8142060b05f74b7c61d7aa1ba68e-ea1dd37e19a1e9354e8b669d11088fe7-a58a20497edcc70b8b23093c8fb3161f-a708f2e101fd56bae21c56257b4cd27c.sc.xz.gpg

streamcorpus_dump --smart-dump news-20-515f8142060b05f74b7c61d7aa1ba68e-ea1dd37e19a1e9354e8b669d11088fe7-a58a20497edcc70b8b23093c8fb3161f-a708f2e101fd56bae21c56257b4cd27c.sc.xz.gpg > news-20-515f8142060b05f74b7c61d7aa1ba68e-ea1dd37e19a1e9354e8b669d11088fe7-a58a20497edcc70b8b23093c8fb3161f-a708f2e101fd56bae21c56257b4cd27c.dump.txt


Note that --smart-dump suppresses output of mention_id=-1 entries.

jrf
news-20-515f8142060b05f74b7c61d7aa1ba68e-ea1dd37e19a1e9354e8b669d11088fe7-a58a20497edcc70b8b23093c8fb3161f-a708f2e101fd56bae21c56257b4cd27c.dump.txt.gz