TempEval-3
as part of
SemEval-2013
International Workshop on Semantic Evaluations
an ACL-SIGLEX event
First Call for Participation
http://www.cs.york.ac.uk/semeval-2013
SemEval-2013, the 7th International Workshop on Semantic Evaluations,
invites participation in a broad range of tasks in the semantic analysis
of natural language. The aim of the SemEval series of workshops is to
extend the current state-of-the-art in semantic analysis and to help
create high quality annotated datasets in a range of increasingly
challenging problems in natural language semantics. For more
information on SemEval, please see
http://en.wikipedia.org/wiki/SemEval.
Abstract
The aim of TempEval is to advance research on temporal information
processing, which could eventually benefit NLP applications such as
question answering, textual entailment and summarization. TempEval-3
follows on from previous TempEval events, incorporating: a three-part
task structure covering event, temporal expression and temporal
relation extraction; the use of the complete set of TimeML temporal
relations, rather than the simplified set used in previous editions; a
dataset ten times larger than before; and single overall performance
scores, which allow the participating systems to be ranked on each
task and overall.
Introduction
Temporal annotation is a time-consuming task for humans, which has
limited the size of annotated data in previous TempEvals. Current
systems, however, are performing close to the inter-annotator
reliability, which suggests that larger corpora could be built
starting with automatically annotated data. One of the main goals of
this TempEval edition is to explore whether there is value in adding a
large automatically created silver standard to a hand-crafted gold
standard. For some tasks, a larger automatically annotated corpus may
prove more useful than a small hand-annotated one.
TempEval-3, a temporal evaluation task, is a follow-up to TempEval-1
and TempEval-2. It differs from its predecessors in the following respects:
(i) size of the corpus: the dataset comprises about 500K tokens of
silver standard data and about 100K tokens of gold standard data for
training, compared with the roughly 50K-token corpus used in
TempEval-1 and TempEval-2;
(ii) temporal relation task: the temporal relation classification
tasks are to be performed from raw text, i.e. participants need to
extract events and temporal expressions first, determine which ones to
link and then obtain the relation types;
(iii) tasks not independent: participants must annotate temporal
expressions and events in order to do the relation task;
(iv) temporal relation types: the full set of temporal interval
relations in TimeML (Pustejovsky et al., 2005) is used, rather than
the reduced set used in earlier TempEvals;
(v) annotation: most of the corpus was automatically annotated by
state-of-the-art systems from TempEval-2; a portion of the corpus,
including the test dataset, is human-reviewed;
(vi) evaluation: we will report a temporal awareness score for
evaluating temporal relations, allowing systems to be ranked with a
single score.
TempEval 3 Tasks
Each of the tasks proposed for TempEval-3 corresponds to one of the
main TimeML tags. They are:
Task A: Temporal expression extraction and normalization
Determine the extent of the time expressions in a text as defined by
the TimeML TIMEX3 tag. In addition, determine the value of the
features TYPE and VAL. The possible values of TYPE are time, date,
duration, and set; the value of VAL is a normalized value as defined by
the TIMEX3 standard. The main attribute to annotate is VAL.
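As a concrete illustration, a minimal normalizer might map a few expressions to TIMEX3-style TYPE/VAL pairs relative to the document creation time. This is only a hypothetical sketch (the lookup table, function name and return format are our own); real systems use rule engines or machine learning:

```python
from datetime import date, timedelta

def normalize(expression: str, dct: date) -> dict:
    """Toy normalizer: map a time expression to TIMEX3-style TYPE and
    VAL attributes, relative to the document creation time (DCT).
    Illustrative only -- a tiny hand-written lookup table."""
    table = {
        "yesterday": {"type": "DATE",
                      "value": (dct - timedelta(days=1)).isoformat()},
        "today":     {"type": "DATE", "value": dct.isoformat()},
        "two hours": {"type": "DURATION", "value": "PT2H"},  # ISO 8601 duration
        "every week": {"type": "SET", "value": "P1W"},
    }
    return table[expression.lower()]

print(normalize("yesterday", date(2013, 2, 15)))
# {'type': 'DATE', 'value': '2013-02-14'}
```

Note that durations and sets normalize to ISO 8601 period strings rather than calendar dates, as the TIMEX3 standard requires.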
Task B: Event extraction
As in TempEval-2, participants will determine the extent of the events
in a text as defined by the TimeML EVENT tag. In addition, systems may
determine the value of the features CLASS, TENSE, ASPECT, POLARITY,
MODALITY and also identify if the event is a main event or not. The
main attribute to annotate is CLASS.
Task C: Annotating temporal relations
Identify the pairs of temporal entities (events or temporal
expressions) that have a temporal link and classify the temporal
relation between them as a TLINK. Possible pairs of entities that can
have a temporal link are: (i) event and temporal expressions in the
same sentence, (ii) event and document creation time, (iii) main
events of consecutive sentences and (iv) pairs of events in the same
sentence. For this task, we now require that the participating systems
determine which entities need to be linked.
The relation labels will be the same as in TimeML, i.e.: before,
after, includes, is-included, during, simultaneous, immediately after,
immediately before, identity, begins, ends, begun-by and ended-by.
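The four linkable pair types above can be enumerated mechanically. The sketch below assumes a simple (hypothetical) sentence representation of our own devising, with each sentence holding its events, temporal expressions and main event:

```python
def candidate_pairs(sentences, dct="DCT"):
    """Enumerate entity pairs eligible for a TLINK, following the four
    rules in the task description.  Each sentence is assumed to be a
    dict with 'events', 'timexes' and 'main_event' (our own toy format)."""
    pairs = []
    for sent in sentences:
        for e in sent["events"]:
            for t in sent["timexes"]:
                pairs.append((e, t))       # (i) event + timex, same sentence
            pairs.append((e, dct))         # (ii) event + document creation time
        evs = sent["events"]
        for i in range(len(evs)):
            for j in range(i + 1, len(evs)):
                pairs.append((evs[i], evs[j]))  # (iv) event pairs, same sentence
    for s1, s2 in zip(sentences, sentences[1:]):
        pairs.append((s1["main_event"], s2["main_event"]))  # (iii) consecutive main events
    return pairs

sents = [{"events": ["said"], "timexes": ["yesterday"], "main_event": "said"},
         {"events": ["rose", "fell"], "timexes": [], "main_event": "rose"}]
print(candidate_pairs(sents))
```

A participating system must both select which of these candidate pairs to link and assign each link one of the relation labels above.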
Task selection
Participants may choose to do task A, B, or C. Choosing task C
(relation annotation) entails doing tasks A and B (interval
annotation). However, a participant may perform only task C by
applying existing tools to carry out tasks A and B.
Dataset Creation
In TempEval-3, we release new data, and also significantly review and
modify existing corpora.
Reviewing Existing Corpora
We considered the existing TimeBank (Pustejovsky et al., 2003),
TempEval-1, TempEval-2 and AQUAINT data for review in TempEval-3.
TimeBank v1.2, TempEval-1 and TempEval-2 had the same documents but
different relation types and sometimes different sets of events. We will
refer to this body of temporally-annotated newswire documents as
TimeBank. For both TimeBank and AQUAINT, we cleaned up the formatting
of all files to make them easier to review and read, made all files
XML- and TimeML-schema compatible, and added some missing events and
temporal expressions. In AQUAINT, we added the temporal relations
between events and the DCT (document creation time), which were
missing for many documents in that corpus. For the TimeBank documents
in particular, we borrowed
the events from the TempEval-2 corpus and the temporal relations from
the TimeBank corpus, which contains a full set of temporal relations
(TempEval-2 used a simpler, coarse-grained set of temporal relations).
Automatically Creating A New Large Corpus (Silver Standard)
A large portion of the TempEval-3 data was automatically generated
using a temporal merging system. We collected a half-million-token
text corpus from English Gigaword and automatically annotated it using
TIPSem, TIPSem-B (Llorens et al., 2010) and TRIOS (UzZaman and Allen,
2010). These systems were re-trained on the TimeBank and AQUAINT
corpora, using the TimeML temporal relation set.
We then merged these three state-of-the-art system outputs using our
merging algorithm (Llorens et al., 2012). In our merging configuration,
all entities and relations suggested by the best system (TIPSem) are
added to the merge output. Suggestions from two other systems (TRIOS
and TIPSem-B) are added to the merge output if they are supported by
at least 2 of the 3 systems overall. The weights used in our
configuration are: TIPSem 0.36, TIPSemB 0.32, TRIOS 0.32.
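The weighted-voting idea behind the merge can be sketched as follows. This is a simplified stand-in for the actual algorithm of Llorens et al. (2012), using only the behavior described above; the function name and data format are our own:

```python
def merge(annotations, weights, threshold):
    """Weighted-voting merge (simplified sketch).  `annotations` maps
    each system name to the set of entities/relations it suggested;
    an item is kept if the summed weight of supporting systems reaches
    the threshold."""
    merged = set()
    for suggested in annotations.values():
        for item in suggested:
            score = sum(w for sys, w in weights.items()
                        if item in annotations[sys])
            if score >= threshold:
                merged.add(item)
    return merged

anns = {"TIPSem": {"e1", "e2"}, "TIPSemB": {"e2", "e3"}, "TRIOS": {"e3", "e4"}}
w = {"TIPSem": 0.36, "TIPSemB": 0.32, "TRIOS": 0.32}
print(merge(anns, w, threshold=0.36))
```

With a threshold of 0.36, a suggestion from TIPSem alone (weight 0.36) passes, while a suggestion from TRIOS or TIPSem-B alone (0.32) needs support from at least one other system, matching the configuration described above.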
This automatically created corpus is referred to as silver data. A
portion of the silver data is currently being human-reviewed for
release as additional gold training data, alongside the reviewed and
re-curated versions of TimeBank and AQUAINT. The parts described in
Table 1 comprise our released dataset.
Table 1: Available corpus released for TempEval-3. (*: reviewing in progress)

Corpus                  Tokens     Purpose     Standard
TimeBank                61,418     Training    Gold
AQUAINT                 33,973     Training    Gold
TempEval-3 Silver       666,309    Training    Silver
TempEval-3 Gold         20,000*    Training    Gold
TempEval-3 Evaluation   20,000*    Evaluation  Gold
With the TempEval-3 release, both a very large automatically annotated
corpus (silver data) and smaller human-annotated or human-reviewed
corpora (gold data) will be available, allowing task participants and
future research to explore the benefits of each.
Evaluation
Evaluation on tasks A and B will use a standard F-score (incorporating
precision and recall) on extents, and F-score/Kappa on attributes,
computed over the response extents that overlap with the key extents.
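A minimal sketch of overlap-based extent scoring, assuming extents are (start, end) token offsets. This illustrates the lenient-matching idea only; the official scorer may differ in details:

```python
def extent_f1(system, key):
    """Lenient extent F1: a predicted extent matches if it overlaps any
    key extent (and vice versa for recall).  Extents are half-open
    (start, end) token-offset pairs.  Illustrative sketch only."""
    if not system or not key:
        return 0.0
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    tp_p = sum(any(overlaps(s, k) for k in key) for s in system)
    tp_r = sum(any(overlaps(k, s) for s in system) for k in key)
    p, r = tp_p / len(system), tp_r / len(key)
    return 2 * p * r / (p + r) if p + r else 0.0

# One of two predictions overlaps the single key extent:
print(extent_f1([(0, 2), (5, 6)], [(1, 3)]))  # precision 0.5, recall 1.0
```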
Evaluation on task C will use our proposed graph-based evaluation
metric (see UzZaman and Allen (2011) for details). This metric uses
temporal closure to reward relation annotations that are equivalent
but distinct, and computes precision and recall on that basis. The
resulting temporal awareness score is a combined measure of a system's
performance: it evaluates how well a system extracts events and
temporal expressions and identifies the temporal relations between them.
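To make the closure idea concrete, the sketch below computes transitive closure over "before" relations only (full TimeML closure handles all thirteen relation types) and scores a system against the closed key, so that annotations equivalent to the key are rewarded even when they differ from it literally. The function names and simplification are ours, not the official scorer:

```python
from itertools import product

def closure(relations):
    """Transitive closure over (a, b) 'before' pairs -- a simplified
    stand-in for full TimeML interval closure."""
    rels = set(relations)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(rels), repeat=2):
            if b == c and (a, d) not in rels:  # a < b and b < d  =>  a < d
                rels.add((a, d))
                changed = True
    return rels

def precision_recall(system, key):
    """Closure-aware scoring: a system relation is correct if it lies in
    the closure of the key, and symmetrically for recall."""
    key_c, sys_c = closure(key), closure(system)
    p = sum(r in key_c for r in system) / len(system)
    r = sum(r in sys_c for r in key) / len(key)
    return p, r

# The system annotated A<C instead of B<C; closure of the key entails
# A<C, so precision is perfect, while recall only credits A<B.
key = [("A", "B"), ("B", "C")]
system = [("A", "B"), ("A", "C")]
print(precision_recall(system, key))  # (1.0, 0.5)
```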
For details, check the task description paper here:
http://arxiv.org/pdf/1206.5333v1.pdf
Naushad UzZaman, Hector Llorens, James F. Allen, Leon Derczynski, Marc
Verhagen, James Pustejovsky. 2012. TempEval-3: Evaluating Events, Time
Expressions, and Temporal Relations. arXiv:1206.5333v1.
References:
Llorens, H., E. Saquete, and B. Navarro (2010), “TIPSem (English and
Spanish): Evaluating CRFs and Semantic Roles in TempEval-2.” In
Proceedings of the 5th International Workshop on Semantic Evaluation,
284–291, Association for Computational Linguistics.
Pustejovsky, J., P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer,
D. Radev, B. Sundheim, D. Day, L. Ferro, et al. (2003), “The TimeBank
corpus.” In Corpus Linguistics, volume 2003, 40.
Pustejovsky, J., B. Ingria, R. Sauri, J. Castano, J. Littman, R.
Gaizauskas, A. Setzer, G. Katz, and I. Mani (2005), “The specification
language TimeML.” The Language of Time: A reader, 545–557.
UzZaman, N. and J.F. Allen (2010), “TRIPS and TRIOS system for
TempEval-2: Extracting temporal information from text.” In Proceedings
of the 5th International Workshop on Semantic Evaluation, 276–283,
Association for Computational Linguistics.
UzZaman, N. and J.F. Allen (2011), “Temporal Evaluation.” In
Proceedings of The 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies (Short Paper),
Portland, Oregon, USA.
Llorens, H., N. UzZaman, and J.F. Allen (2012), “Merging Temporal
Annotations.” In Proceedings of the TIME Conference.
Verhagen, M., R. Gaizauskas, F. Schilder, M. Hepple, J. Moszkowicz,
and J. Pustejovsky (2009), “The TempEval challenge: identifying
temporal relations in text.” Language Resources and Evaluation, 43,
161–179.
Verhagen, M., R. Sauri, T. Caselli, and J. Pustejovsky (2010),
“SemEval-2010 task 13: TempEval-2.” In Proceedings of the 5th
International Workshop on Semantic Evaluation, 57–62, Association for
Computational Linguistics.
Task Organizers:
James Allen (ja...@cs.rochester.edu), University of Rochester
Leon Derczynski (le...@dcs.shef.ac.uk), University of Sheffield
Hector Llorens (hector...@gmail.com), University of Alicante
James Pustejovsky (jam...@cs.brandeis.edu), Brandeis University
Naushad UzZaman (nau...@cs.rochester.edu), University of Rochester [Primary Contact]
Marc Verhagen (ma...@cs.brandeis.edu), Brandeis University
Important Dates:
September 12, 2012 First Call for participation
November 1, 2012 onwards Full Training Data available for participants
February 15, 2013 Test set ready
February 15, 2013 Registration Deadline [for Task Participants]
March 1, 2013 onwards Start of evaluation period [Task Dependent]
March 15, 2013 End of evaluation period
April 9, 2013 Paper submission deadline [TBC]
April 23, 2013 Reviews Due [TBC]
May 4, 2013 Camera ready Due [TBC]
Summer 2013 Workshop co-located with ACL or NAACL [TBC]
--
Leon R A Derczynski
NLP Research Group
Department of Computer Science
University of Sheffield
Regent Court, 211 Portobello
Sheffield S1 4DP, UK
+45 8715 6234
http://www.dcs.shef.ac.uk/~leon/