Does anyone have any methods for estimating document creation time (DCT)
for unannotated documents?
All the best,
Leon
--
Leon R A Derczynski
NLP Research Group
Department of Computer Science
University of Sheffield
Regent Court, 211 Portobello
Sheffield S1 4DP, UK
+44 114 22 21931
http://www.dcs.shef.ac.uk/~leon/
Anything would be good, I think, no matter how vague. This task looks
really interesting, and it'd be great to port it over to English too,
where I imagine it'd be equally hard. I'll see how Google Translate
fares with the proceedings!
For newswire the task's sometimes a little easier; even though DCT info
often isn't explicit in the main text, it is usually buried in the header:
AP900815-0044
AP-NR-08-15-90 1337EDT
u i PM-GulfRdp 8thLd-Writethru 08-15 1334
PM-Gulf Rdp, 8th Ld-Writethru,a0605,1368
Saddam Seeks End To War With Iran; Bush To Urge Jordan To Close
Port
Eds: SUBS 28th graf pvs, Crown Prince... to CORRECT spelling of
Hassan; pick up 29th graf pvs, `A CBS...'
LaserPhotos WX6,7,XSAV1,NY5,10,TOK1,XAAFB1,AMM1, LaserColor XAAFB1
Building regular expressions or something similar for extracting the
date (and even the time, in this example) would be a decent approach for
the existing newswire resources but, like all regex-based methods, it is
intrinsically fragile. A generic approach would be best, especially for
less structured genres.
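As a rough sketch of the regex route for the header above (Python). The
names AP_SLUG and extract_dct are made up for illustration, and the
field layout (MM-DD-YY, then HHMM and a three-letter timezone) and the
1900s default century are guesses from this single example, not a
documented AP format:

```python
import re
from datetime import datetime

# Hypothetical pattern for a slug line such as "AP-NR-08-15-90 1337EDT".
# Assumed layout: MM-DD-YY, whitespace, HHMM, three-letter timezone.
AP_SLUG = re.compile(
    r"^AP-NR-(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{2})\s+"
    r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<tz>[A-Z]{3})\s*$",
    re.MULTILINE,
)


def extract_dct(text, century=1900):
    """Return (naive datetime, timezone abbreviation), or None.

    The two-digit year is resolved against `century`, which is an
    assumption that only suits archives from a single known century.
    """
    match = AP_SLUG.search(text)
    if match is None:
        return None
    try:
        stamp = datetime(
            century + int(match["year"]),
            int(match["month"]),
            int(match["day"]),
            int(match["hour"]),
            int(match["minute"]),
        )
    except ValueError:
        # Digits matched the pattern but do not form a valid date/time.
        return None
    return stamp, match["tz"]


header = (
    "AP900815-0044\n"
    "AP-NR-08-15-90 1337EDT\n"
    "u i PM-GulfRdp 8thLd-Writethru 08-15 1334\n"
)
print(extract_dct(header))
```

On the header above this gives 15 August 1990, 13:37, with "EDT" kept
as a plain string rather than converted to an offset. Any header that
deviates from the assumed layout simply returns None, which is exactly
the fragility I mean.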
All the best,
Leon