Re: extracting stream items from the TS-specific corpus subset


John R. Frank

Jul 22, 2014, 6:53:44 AM
to kita3...@gmail.com, stream...@googlegroups.com
Hi Sayaka,

LingPipe and Serif were only run on documents that cld recognized as
English or for which cld failed to recognize any language. From a manual
sample of the documents that cld could not recognize, about 40% appeared
to be English.

Serif was slower to run than LingPipe and produced a richer set of tags,
so you might consider defaulting to Serif and falling back to LingPipe.

- Serif's within-doc coref chains tend to be more complete

- Serif recognizes nominals and incorporates them into within-doc coref
chains, see Token.mention_type and Token.equiv_id

- Serif has sentence parsing, see Token.parent_id and
Token.dependency_path

- Serif populates StreamItem.body.relations['serif'] with some KBP
relations.
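
The Serif-first, LingPipe-fallback idea can be sketched roughly as below. This is only an illustration, not code from the streamcorpus package: pick_sentences is a hypothetical helper, and it assumes StreamItem.body.sentences is a dict keyed by tagger id ('serif', 'lingpipe') in the same way as body.relations['serif'] above. The SimpleNamespace objects are stand-ins for real stream items so the snippet runs on its own.

```python
from types import SimpleNamespace


def pick_sentences(stream_item, preferred=("serif", "lingpipe")):
    """Return (tagger_id, sentences) for the first preferred tagger
    that produced any sentences, or (None, []) if none did.

    Hypothetical helper: assumes stream_item.body.sentences behaves
    like a dict mapping tagger id -> list of sentences.
    """
    body = getattr(stream_item, "body", None)
    sentences_by_tagger = getattr(body, "sentences", None) or {}
    for tagger_id in preferred:
        sentences = sentences_by_tagger.get(tagger_id)
        if sentences:  # skip taggers that are missing or produced nothing
            return tagger_id, sentences
    return None, []


# Stand-in stream items (not real StreamItem objects).
both = SimpleNamespace(body=SimpleNamespace(
    sentences={"serif": ["s1", "s2"], "lingpipe": ["l1"]}))
lingpipe_only = SimpleNamespace(body=SimpleNamespace(
    sentences={"lingpipe": ["l1"]}))
untagged = SimpleNamespace(body=SimpleNamespace(sentences={}))

for item in (both, lingpipe_only, untagged):
    tagger_id, sentences = pick_sentences(item)
    print(tagger_id, len(sentences))
```

A stream item for which this returns (None, []) has no sentences from either tagger, for example because cld recognized it as non-English.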

For KBA 2014, only documents with Serif tags are included in the
evaluation.

The TempSumm discussion forum can answer your questions about which
documents are considered valid for TS:
http://www.trec-ts.org/contact/google-group

jrf




On Mon, 21 Jul 2014, kita3...@gmail.com wrote:

> Hello,
>
> I am one of the participants in the TREC 2014 Temporal Summarization (TS) track.
> I use the TS-specific corpus subset (http://s3.amazonaws.com/aws-publicdatasets/trec/ts/index.html) and have some problems.
> Would you give me any advice?
>
>
> 1.
> We are extracting stream items from the TS-specific corpus subset with the attached Java program.
> However, we cannot find the "lingpipe" component in stream items from 2013-02-03-00 to 2013-04-20-23, nor get any sentences from them.
> Could you tell me how to solve this?
>
>
> 2.
> Our attached program doesn't output any sentences when we extract stream items from some chunk files.
> (e.g.
> http://s3.amazonaws.com/aws-publicdatasets/trec/ts/streamcorpus-2014-v0_3_0-ts-filtered/2012-12-01-00/WEBLOG-28-5bb0b1c7835d2c5d866613709ffe4cef-fcc077e4c970142b466ea34e44635aa3-33d7ff7263ea9a35c5ebe5d77c632f21.sc.xz.gpg,
> http://s3.amazonaws.com/aws-publicdatasets/trec/ts/streamcorpus-2014-v0_3_0-ts-filtered/2012-12-01-00/MAINSTREAM_NEWS-50-55d1c5ee21680cc9e7245f8ebf35a977-72dd9a25cb94c83d8695fd00c32f53ed-14a973f07cd891951b98d750e219ac86.sc.xz.gpg)
>
> What does this case mean?
> May I ignore these chunk files?
>
>
> Thank you for your consideration.
> Sayaka
>
>
>

kita3...@gmail.com

Jul 23, 2014, 1:49:05 AM
to stream...@googlegroups.com, kita3...@gmail.com
Hi John,

Thank you very much for your helpful answer.
I will use Serif in place of LingPipe.

Regards,
Sayaka


On Tuesday, July 22, 2014 at 19:53:44 UTC+9, John Frank wrote: