Re: extracting stream items from the TS-specific corpus subset

John R. Frank

Jul 22, 2014, 6:53:44 AM
Hi Sayaka,

LingPipe and Serif were only run on documents that cld recognized as
English, or for which cld failed to recognize any language. From a manual
sample of the documents that cld could not recognize, about 40% appeared
to be English.

Serif was slower to run than LingPipe and produced a richer set of tags,
so you might consider defaulting to Serif and falling back to LingPipe.
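A minimal sketch of that fallback, assuming the streamcorpus convention that `StreamItem.body.sentences` is a dict keyed by tagger_id (`'serif'`, `'lingpipe'`); the `Body` class below is only a stand-in for a real StreamItem body:

```python
# Prefer Serif sentences; use LingPipe only when Serif output is absent.
def pick_sentences(body):
    """Return (tagger_id, sentences), preferring 'serif' over 'lingpipe'."""
    for tagger_id in ('serif', 'lingpipe'):
        sentences = getattr(body, 'sentences', {}).get(tagger_id)
        if sentences:  # present and non-empty
            return tagger_id, sentences
    return None, []

# Minimal stand-in for StreamItem.body, for illustration only.
class Body:
    def __init__(self, sentences):
        self.sentences = sentences

# Falls back to 'lingpipe' when 'serif' is missing.
tagger, sents = pick_sentences(Body({'lingpipe': ['s1', 's2']}))
```

Documents with neither tagger's output would yield `(None, [])`, which matches the empty-sentence case raised below.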

- Serif's within-doc coref chains tend to be more complete

- Serif recognizes nominals and incorporates them into within-doc coref
chains; see Token.mention_type and Token.equiv_id

- Serif has sentence parsing, see Token.parent_id and

- Serif populates StreamItem.body.relations['serif'] with some KBP

For KBA 2014, only documents with Serif tags are included in the

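The coref-chain fields above can be walked with a short sketch: tokens sharing a non-negative equiv_id belong to one within-doc chain, and mention_type distinguishes name, nominal, and pronoun mentions. The `Token` class here is a stand-in for the streamcorpus Thrift Token, and the `-1` sentinel for "not in any chain" is an assumption about that schema:

```python
from collections import defaultdict

# Stand-in for the streamcorpus Thrift Token (illustration only).
class Token:
    def __init__(self, token, equiv_id=-1, mention_type=None):
        self.token = token
        self.equiv_id = equiv_id          # -1 means not in a coref chain
        self.mention_type = mention_type  # e.g. 'NAME', 'NOM', 'PRO'

def coref_chains(sentences):
    """Map equiv_id -> list of token strings in that within-doc chain."""
    chains = defaultdict(list)
    for sent in sentences:
        for tok in sent:
            if tok.equiv_id >= 0:
                chains[tok.equiv_id].append(tok.token)
    return dict(chains)

sents = [
    [Token('Obama', equiv_id=0, mention_type='NAME'), Token('said')],
    [Token('the'), Token('president', equiv_id=0, mention_type='NOM')],
]
# chain 0 collects both the name and the nominal mention
```

This is where Serif's nominal handling pays off: the name and the nominal land in the same chain, which LingPipe's chains would typically leave separate.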
The TempSumm discussion forum can answer your questions about which
documents are considered valid for TS:


On Mon, 21 Jul 2014, wrote:

> Hello,
> I am one of the participants in the TREC 2014 Temporal Summarization (TS) track.
> I use the TS-specific corpus subset ( and have some problems.
> Would you give me any advice?
> 1.
> We were extracting stream items from the TS-specific corpus subset with the attached Java program.
> However, we cannot find the "lingpipe" component in stream items from 2013-02-03-00 to 2013-04-20-23, and we cannot get sentences from them.
> Could you advise us on how to resolve this?
> 2.
> Our attached program doesn't output any sentences when we extract stream items from some chunk files.
> (e.g.
> What does this case mean?
> May I ignore these chunk files?
> Thank you for your consideration.
> Sayaka

Jul 23, 2014, 1:49:05 AM
Hi John,

Thank you very much for your helpful answer.
I will use Serif in place of LingPipe.


On Tuesday, July 22, 2014 at 7:53:44 PM UTC+9, John Frank wrote: