Hello,
I'm new to the temporal summarization track and am having a problem processing sentences from the streamitems.
I've been looking at the TS 2013 ground-truth data to confirm my streamcorpus pipeline and use of the "serif" parsed sentences. Specifically, I've been using the "matches.tsv" file to confirm sentence IDs and term positions. For example, for query 1, the 2013 matches.tsv contains the following entries for streamid 1701a631ea289156f9d2d37391720ce8:
grep 1701a631ea289156f9d2d37391720ce8 matches.tsv
1 1329921060-1701a631ea289156f9d2d37391720ce8-9 VMTS13.01.055 47 88 0
1 1329921060-1701a631ea289156f9d2d37391720ce8-9 VMTS13.01.051 0 11 0
1 1329921060-1701a631ea289156f9d2d37391720ce8-9 VMTS13.01.050 25 46 0
1 1329921060-1701a631ea289156f9d2d37391720ce8-4 VMTS13.01.088 166 183 0
1 1329921060-1701a631ea289156f9d2d37391720ce8-4 VMTS13.01.058 112 138 0
1 1329921060-1701a631ea289156f9d2d37391720ce8-4 VMTS13.01.056 188 205 0
1 1329921060-1701a631ea289156f9d2d37391720ce8-4 VMTS13.01.050 111 164 0
If I'm understanding things, this indicates that several nuggets were matched against sentences 4 and 9. Sentence 9 is at least 88 characters long (given the match_end value) and sentence 4 is at least 205 characters long.
However, when I output the "serif" sentences for this streamid, I see the following (sentence numbers are based on list index):
(0) Argentina declares two-day mourning period after train crash kills 50 - CNN.com
(1) SET EDITION : U.S .
(2) INTERNATIONAL MEXICO ARABIC
(3) TV : CNN CNNi CNN en Espanol HLN
(4) Sign up Log in
(5) Home Video NewsPulse U.S. World Politics Justice Entertainment Tech Health Living Travel Opinion iReport Money Sports
(6) Share this on :
(7) Facebook Twitter Digg delicious reddit MySpace StumbleUpon LinkedIn
(8) Argentina declares two-day mourning period after train crash kills 50
(9) By the CNN Wire Staff updated 12:13 AM EST , Thu February 23 , 2012
(10) Argentina train crash moment of impact
....
I produced this output with the following Java code:
    Map<String, List<Sentence>> parsers = item.body.sentences;
    List<Sentence> sentences = parsers.get("serif");
    int sentenceNum = 0;
    for (Sentence s : sentences) {
        List<Token> tokens = s.tokens;
        System.out.print("(" + sentenceNum + ") ");
        for (Token token : tokens) {
            System.out.print(token.token + " ");
        }
        System.out.print("\n");
        sentenceNum++;
    }
Either my use of the serif-parsed sentences is wrong or I've misunderstood the fields in matches.tsv. Sentences 4 and 9 do not contain useful information and are nowhere near as long as indicated by the match_end fields.
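To make the mismatch concrete, here is a minimal sketch of the check I am doing by hand. It assumes that the update ID has the form timestamp-docid-sentenceindex and that match_start and match_end are character offsets into the sentence text; both are my reading of matches.tsv and may be wrong. The class and method names (MatchCheck, sentenceIndex, fits) are made up for illustration, and the sentence strings are the space-joined tokens from my output above, so their lengths may differ slightly from the original text.

```java
import java.util.Map;

public class MatchCheck {
    // Extracts the trailing sentence index from an update ID, assuming the
    // form <timestamp>-<docid>-<sentenceIndex> (my assumption, not confirmed).
    public static int sentenceIndex(String updateId) {
        return Integer.parseInt(updateId.substring(updateId.lastIndexOf('-') + 1));
    }

    // Returns true if the span [matchStart, matchEnd] could lie inside the
    // sentence text, treating both values as character offsets (assumption).
    public static boolean fits(String sentenceText, int matchStart, int matchEnd) {
        return 0 <= matchStart
                && matchStart <= matchEnd
                && matchEnd <= sentenceText.length();
    }

    public static void main(String[] args) {
        // Sentences 4 and 9 as printed by my serif loop (tokens joined by spaces).
        Map<Integer, String> sentences = Map.of(
                4, "Sign up Log in",
                9, "By the CNN Wire Staff updated 12:13 AM EST , Thu February 23 , 2012");

        // Two of the matches.tsv rows: update ID, match_start, match_end.
        String[][] rows = {
                {"1329921060-1701a631ea289156f9d2d37391720ce8-4", "188", "205"},
                {"1329921060-1701a631ea289156f9d2d37391720ce8-9", "47", "88"},
        };

        for (String[] row : rows) {
            int idx = sentenceIndex(row[0]);
            String text = sentences.get(idx);
            boolean ok = fits(text, Integer.parseInt(row[1]), Integer.parseInt(row[2]));
            System.out.println("sentence " + idx + " (length " + text.length()
                    + "), match " + row[1] + "-" + row[2] + " fits: " + ok);
        }
    }
}
```

Neither span fits under these assumptions, which is why I suspect that either my sentence indexing or my reading of the offsets is off.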
Any guidance would be appreciated.
Thank you,
Craig