Sentence processing question

Willis, Craig

Jun 18, 2015, 6:28:03 PM

to tre...@googlegroups.com

Hello,

I'm new to the temporal summarization track and am having a problem processing sentences from the streamitems.

I've been looking at the TS 2013 ground-truth data to confirm my streamcorpus pipeline and use of the "serif" parsed sentences. Specifically, I've been using the "matches.tsv" file to confirm sentence IDs and term positions. For example, for query 1 the 2013 matches.tsv contains the following entries for streamid 1701a631ea289156f9d2d37391720ce8:

grep 1701a631ea289156f9d2d37391720ce8 matches.tsv

1 1329921060-1701a631ea289156f9d2d37391720ce8-9 VMTS13.01.055 47 88 0

1 1329921060-1701a631ea289156f9d2d37391720ce8-9 VMTS13.01.051 0 11 0

1 1329921060-1701a631ea289156f9d2d37391720ce8-9 VMTS13.01.050 25 46 0

1 1329921060-1701a631ea289156f9d2d37391720ce8-4 VMTS13.01.088 166 183 0

1 1329921060-1701a631ea289156f9d2d37391720ce8-4 VMTS13.01.058 112 138 0

1 1329921060-1701a631ea289156f9d2d37391720ce8-4 VMTS13.01.056 188 205 0

1 1329921060-1701a631ea289156f9d2d37391720ce8-4 VMTS13.01.050 111 164 0

If I'm understanding things, this indicates that several nuggets were matched against sentences 4 and 9. Sentence 9 is at least 88 characters long (given the match_end value) and sentence 4 is at least 205 characters long.
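To make that reading concrete, here is a minimal sketch that computes the largest match_end per sentence from rows like the ones above. It assumes whitespace-separated columns in the order shown (query, update ID, nugget ID, match_start, match_end, and a final flag), no header row, and that the number after the last hyphen in the update ID is the sentence index. The class and method names are made up for illustration and are not part of any track tooling.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MatchSpans {
    /** Assumption: the sentence index is the number after the last '-' in the update ID. */
    static int sentenceIndex(String updateId) {
        return Integer.parseInt(updateId.substring(updateId.lastIndexOf('-') + 1));
    }

    /** Largest match_end seen for each sentence index, keyed by sentence index. */
    static Map<Integer, Integer> maxMatchEnd(List<String> rows) {
        Map<Integer, Integer> maxEnd = new TreeMap<>();
        for (String row : rows) {
            String[] f = row.trim().split("\\s+");
            if (f.length < 5) {
                continue; // skip blank or malformed rows
            }
            int idx = sentenceIndex(f[1]);
            int end = Integer.parseInt(f[4]); // assumed match_end column
            maxEnd.merge(idx, end, Math::max);
        }
        return maxEnd;
    }

    public static void main(String[] args) {
        String doc = "1329921060-1701a631ea289156f9d2d37391720ce8";
        List<String> rows = List.of(
            "1 " + doc + "-9 VMTS13.01.055 47 88 0",
            "1 " + doc + "-9 VMTS13.01.051 0 11 0",
            "1 " + doc + "-9 VMTS13.01.050 25 46 0",
            "1 " + doc + "-4 VMTS13.01.088 166 183 0",
            "1 " + doc + "-4 VMTS13.01.058 112 138 0",
            "1 " + doc + "-4 VMTS13.01.056 188 205 0",
            "1 " + doc + "-4 VMTS13.01.050 111 164 0");
        System.out.println(maxMatchEnd(rows)); // prints {4=205, 9=88}
    }
}
```

Whether those values are character offsets into the token-joined text or into some other rendering of the sentence is something the sketch does not settle.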

However, when I output the "serif" sentences for this streamid, I see the following (sentence numbers are based on list index):

(0) Argentina declares two-day mourning period after train crash kills 50 - CNN.com

(1) SET EDITION : U.S .

(2) INTERNATIONAL MEXICO ARABIC

(3) TV : CNN CNNi CNN en Espanol HLN

(4) Sign up Log in

(5) Home Video NewsPulse U.S. World Politics Justice Entertainment Tech Health Living Travel Opinion iReport Money Sports

(6) Share this on :

(7) Facebook Twitter Digg delicious reddit MySpace StumbleUpon LinkedIn

(8) Argentina declares two-day mourning period after train crash kills 50

(9) By the CNN Wire Staff updated 12:13 AM EST , Thu February 23 , 2012

(10) Argentina train crash moment of impact

....

With the following Java implementation:

    // Sentences are keyed by the name of the tagger that produced them.
    Map<String, List<Sentence>> parsers = item.body.sentences;
    List<Sentence> sentences = parsers.get("serif");

    // Print each sentence as "(index) token token ...".
    int sentenceNum = 0;
    for (Sentence s : sentences) {
        List<Token> tokens = s.tokens;
        System.out.print("(" + sentenceNum + ") ");
        for (Token token : tokens) {
            System.out.print(token.token + " ");
        }
        System.out.print("\n");
        sentenceNum++;
    }

Either my use of the serif-parsed sentences is wrong or I've misunderstood the fields in matches.tsv. Sentences 4 and 9 do not contain useful information and are nowhere near as long as indicated by the match_end fields.

Any guidance would be appreciated.

Thank you,

Craig

Stuart Mackie

Jun 19, 2015, 5:50:57 AM

to tre...@googlegroups.com

Hi,

I thought we were to use the "lingpipe" sentence indexing for the 2013 topics, 1-10.

If I use "serif" indexing on these topics, I get the same output as you have quoted.

But, if I use "lingpipe" indexing, I get:

Sentence 4, "Argentina declares two-day mourning period after train crash kills 50 - CNN. com SET EDITION : U.S. "
Sentence 9, "Dozens dead in rush hour train crash Argentina train crash kills more than 40 Passengers told reporters the crash sounded like a bomb blast. "

Which matches up with the qrels.
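As a rough sketch of the comparison, the tagger name can be made a parameter so the same sentence index can be looked up under "serif" and "lingpipe" side by side. The `Token` and `Sentence` classes below are hypothetical stand-ins that model only the fields used in the snippet earlier in the thread (`tokens` and `token`), and the data in `main` is toy data, not the real stream item.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TaggerCompare {
    // Hypothetical stand-ins for the streamcorpus classes; only the fields
    // used in the original snippet are modelled here.
    static class Token {
        String token;
        Token(String token) { this.token = token; }
    }

    static class Sentence {
        List<Token> tokens = new ArrayList<>();
    }

    /** Builds a toy sentence from a list of words. */
    static Sentence of(String... words) {
        Sentence s = new Sentence();
        for (String w : words) {
            s.tokens.add(new Token(w));
        }
        return s;
    }

    /** Joins a sentence's tokens with single spaces. */
    static String render(Sentence s) {
        StringBuilder sb = new StringBuilder();
        for (Token t : s.tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.token);
        }
        return sb.toString();
    }

    /** Text of the sentence at the given index for the given tagger, or null if absent. */
    static String sentenceAt(Map<String, List<Sentence>> parsers, String tagger, int index) {
        List<Sentence> sentences = parsers.get(tagger);
        if (sentences == null || index < 0 || index >= sentences.size()) {
            return null;
        }
        return render(sentences.get(index));
    }

    public static void main(String[] args) {
        // Toy data: the two taggers split the same text differently,
        // so the same index points at different sentences.
        Map<String, List<Sentence>> parsers = Map.of(
            "serif", List.of(of("Sign", "up"), of("Log", "in")),
            "lingpipe", List.of(of("Sign", "up", "Log", "in")));
        for (String tagger : List.of("serif", "lingpipe")) {
            System.out.println(tagger + " [0]: " + sentenceAt(parsers, tagger, 0));
        }
    }
}
```

With the real stream item, the map would come from `item.body.sentences` rather than being built by hand.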

Stuart.

Willis, Craig

Jun 19, 2015, 10:32:47 AM

to Stuart Mackie, tre...@googlegroups.com

Thank you, Stuart.

Craig
