Determining DCT

Leon Derczynski

Jun 22, 2011, 9:16:43 AM

to tempor...@googlegroups.com

Hi,

Does anyone have any methods for estimating document creation time for
unannotated documents?

All the best,

Leon

--
Leon R A Derczynski
NLP Research Group

Department of Computer Science
University of Sheffield
Regent Court, 211 Portobello
Sheffield S1 4DP, UK

+44 114 22 21931
http://www.dcs.shef.ac.uk/~leon/

Philippe Muller

Jun 22, 2011, 11:11:03 AM

to Temporal Information in Language

Hi Leon,

Do you mean the year, or something more or less precise?
In any case, that's a pretty hard task.
It was a shared task at the French text mining campaign (DEFT) in both 2010
and 2011, I think,
and I vaguely remember that scores were pretty low across the board.
http://deft2011.limsi.fr/

philippe

Leon Derczynski

Jun 22, 2011, 1:01:30 PM

to tempor...@googlegroups.com

Hi Philippe,

Anything would be good I think, no matter how vague. This task looks
really interesting - and it'd be great to port the task over to English
too, where I imagine it'd likely be equally hard. I'll see how Google
Translate fares with the proceedings!

For newswire the task's sometimes a little easier; even though DCT info
often isn't explicit in the main text, it is usually buried in the header:

AP900815-0044
AP-NR-08-15-90 1337EDT
u i PM-GulfRdp 8thLd-Writethru 08-15 1334
PM-Gulf Rdp, 8th Ld-Writethru,a0605,1368
Saddam Seeks End To War With Iran; Bush To Urge Jordan To Close
Port
Eds: SUBS 28th graf pvs, Crown Prince... to CORRECT spelling of
Hassan; pick up 29th graf pvs, `A CBS...'
LaserPhotos WX6,7,XSAV1,NY5,10,TOK1,XAAFB1,AMM1, LaserColor XAAFB1

Building regular expressions or something similar for extracting the day
(and even the time, in this example) would be a decent approach for the
existing newswire resources but, like all regex-based methods, it is
intrinsically fragile. A generic approach would be best, especially for
less structured genres.
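
As a rough illustration of the regex idea, here is a minimal Python sketch
that pulls a DCT out of a header like the AP one quoted above. The field
layout (MM-DD-YY followed by HHMM and a three-letter zone on the second line,
YYMMDD in the document ID) is inferred from that single example only, the
function name extract_dct is made up for illustration, and two-digit years
are assumed to fall in the 1900s. It shows exactly the fragility mentioned:
any header that deviates from this layout simply yields nothing.

```python
import re
from datetime import datetime

# Second header line, e.g. "AP-NR-08-15-90 1337EDT".
# Layout assumed from one example: MM-DD-YY, then HHMM and a zone code.
DATELINE_RE = re.compile(
    r"^AP-[A-Z]{2}-(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{2})[ \t]+"
    r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<zone>[A-Z]{3})[ \t]*$",
    re.MULTILINE,
)

# Fallback: document ID, e.g. "AP900815-0044" (assumed to encode YYMMDD).
DOCID_RE = re.compile(
    r"^AP(?P<year>\d{2})(?P<month>\d{2})(?P<day>\d{2})-\d{4}[ \t]*$",
    re.MULTILINE,
)


def extract_dct(header, century=1900):
    """Return (datetime, zone_or_None) for an AP-style header, else None.

    `century` is an assumption: the header only carries a two-digit year.
    The zone is returned as the raw string; no UTC conversion is attempted.
    """
    m = DATELINE_RE.search(header)
    if m:
        try:
            dct = datetime(
                century + int(m["year"]), int(m["month"]), int(m["day"]),
                int(m["hour"]), int(m["minute"]),
            )
            return dct, m["zone"]
        except ValueError:
            pass  # digits matched but do not form a valid date/time

    # No usable dateline: fall back to the day encoded in the document ID.
    m = DOCID_RE.search(header)
    if m:
        try:
            dct = datetime(
                century + int(m["year"]), int(m["month"]), int(m["day"])
            )
            return dct, None
        except ValueError:
            return None
    return None


if __name__ == "__main__":
    sample = (
        "AP900815-0044\n"
        "AP-NR-08-15-90 1337EDT\n"
        "u i PM-GulfRdp 8thLd-Writethru 08-15 1334\n"
    )
    print(extract_dct(sample))
```

Trying the dateline first and the document ID second means the time of day is
recovered when it is present, while documents that carry only an ID still get
day-level granularity.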