> Do you mean that if two tokens have same value of mention_id, they will
> refer to the same query?
It means that the NER tagger's model is asserting that they are part of
the same mention. E.g., in the sentence "The chairman is named John
Smith.", there is a two-token mention of "John Smith", and both tokens
would have the same mention_id.
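
To make the grouping concrete, here is a minimal sketch. The Token below is a hypothetical stand-in, not the real streamcorpus class, and it carries only the two fields used here. The mention_id values are made up for illustration; a real tagger may assign different ids or tag more tokens.

```python
from collections import namedtuple

# Hypothetical stand-in for a streamcorpus token: only `token` and
# `mention_id` are modelled; the real object has more fields.
Token = namedtuple("Token", ["token", "mention_id"])


def group_mentions(tokens):
    """Map each mention_id (other than -1) to the token strings sharing it."""
    mentions = {}
    for tok in tokens:
        if tok.mention_id == -1:
            # -1 is the null value: the token is not part of any mention.
            continue
        mentions.setdefault(tok.mention_id, []).append(tok.token)
    return mentions


# Illustrative ids only: "John" and "Smith" share mention_id 0.
sentence = [
    Token("The", -1), Token("chairman", -1), Token("is", -1),
    Token("named", -1), Token("John", 0), Token("Smith", 0),
    Token(".", -1),
]

print(group_mentions(sentence))
```

The same function should work on the tokens of a real sentence, since it only reads the `token` and `mention_id` attributes.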
> But I decrypted mention_id from my relative documents and found all of
> them equal -1 which means null.
> (streamcorpus.Chunk[i].body.sentences['serif'][j].tokens[k].mention_id)
> Is that normal?
They shouldn't *all* be -1.
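
One quick way to check is to tally null versus non-null values across a whole chunk. This is only a sketch: it follows the access path you quoted (body.sentences['serif'][j].tokens[k].mention_id), the function name is made up, and the fake objects at the bottom exist just to show the shape of the input. Pass it any iterable of stream items, such as an opened chunk.

```python
from collections import Counter
from types import SimpleNamespace


def mention_id_counts(stream_items, tagger_id="serif"):
    """Count tokens with mention_id == -1 versus any other value."""
    counts = Counter()
    for si in stream_items:
        if si.body is None:
            continue  # nothing tagged for this item
        for sentence in si.body.sentences.get(tagger_id, []):
            for tok in sentence.tokens:
                key = "null" if tok.mention_id == -1 else "in_mention"
                counts[key] += 1
    return counts


# Fake stream item, built only to illustrate the expected structure.
def _tok(mention_id):
    return SimpleNamespace(mention_id=mention_id)

fake_sentence = SimpleNamespace(tokens=[_tok(-1), _tok(0), _tok(0), _tok(-1)])
fake_item = SimpleNamespace(
    body=SimpleNamespace(sentences={"serif": [fake_sentence]})
)

print(mention_id_counts([fake_item]))
```

If "in_mention" comes out as zero for every document, it is worth double-checking the tagger_id key before concluding that the data has no mentions.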
You might try using streamcorpus_dump, which gets installed when you `pip
install streamcorpus`. It generates output like the attached, which came
from these commands:
wget http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-kba-filtered/2012-01-31-13/news-20-515f8142060b05f74b7c61d7aa1ba68e-ea1dd37e19a1e9354e8b669d11088fe7-a58a20497edcc70b8b23093c8fb3161f-a708f2e101fd56bae21c56257b4cd27c.sc.xz.gpg
streamcorpus_dump --smart-dump news-20-515f8142060b05f74b7c61d7aa1ba68e-ea1dd37e19a1e9354e8b669d11088fe7-a58a20497edcc70b8b23093c8fb3161f-a708f2e101fd56bae21c56257b4cd27c.sc.xz.gpg > news-20-515f8142060b05f74b7c61d7aa1ba68e-ea1dd37e19a1e9354e8b669d11088fe7-a58a20497edcc70b8b23093c8fb3161f-a708f2e101fd56bae21c56257b4cd27c.dump.txt
Note that --smart-dump suppresses output of mention_id=-1 entries.
jrf