can you please clarify the WARC spec for us with regard to the
WARC-Date field (part 1) and warcinfo records in WARCs
transformed from ARCs (part 2)? these issues came up when
comparing the arc2warc output of Heritrix (2.0.2) and
WARC Tools (r242).
----------------------------------------------------------------
part 1
----------------------------------------------------------------
according to the WARC spec[1] ISO/DIS 28500 (v0.18):
5.4 WARC-Date
"The timestamp shall represent the instant that
data capture for record creation began."
this may mean that the creation date of the WARC file itself
(converted from an original ARC) would not be captured. also,
WARC files converted from ARCs that predate the WARC format might
have a WARC-Date value that predates the WARC format.
is this what we want?
this issue came up when comparing the output of:
1) Heritrix's Arc2Warc.java class, and
2) WARC Tools' arc2warc
given an ARC file whose date is:
2008-12-19 23:22:43
converting the ARC to a WARC with Heritrix gives:
WARC-Date: 2009-01-05T22:25:39Z
in the first record (a warcinfo record), which is the creation
date of the WARC file.
while converting to a WARC with WARC Tools gives:
WARC-Date: 2008-12-19T23:22:43Z
in the first record (which is a response record - see part 2).
so, do we want the WARC-Date field in the warcinfo record
to be the date of the first record, or the creation date
of the WARC file itself?
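to make the two candidates concrete, here is a minimal python
sketch. it is not taken from either tool: the function name is
made up for illustration, and it assumes the ARC date is in UTC.
it formats the ARC date quoted above, and the time at which a
conversion runs, as WARC-Date values:

```python
from datetime import datetime, timezone


def to_warc_date(dt):
    """Format an aware datetime as a UTC WARC-Date with second precision."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# date of the original ARC as quoted above (assumption: it is UTC)
arc_date = datetime.strptime(
    "2008-12-19 23:22:43", "%Y-%m-%d %H:%M:%S"
).replace(tzinfo=timezone.utc)

# option A: the warcinfo record carries the date from the original ARC
option_a = to_warc_date(arc_date)

# option B: the warcinfo record carries the time the conversion ran
option_b = to_warc_date(datetime.now(timezone.utc))

print("option A  WARC-Date:", option_a)
print("option B  WARC-Date:", option_b)
```

option A corresponds to the value seen in the WARC Tools output,
option B to the kind of value seen in the Heritrix output.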
attachments:
arc2warc-arc.txt: head of original ARC file
arc2warc-h2.txt : head of WARC from Heritrix's Arc2Warc.java
arc2warc-wt.txt : head of WARC from WARC Tools' arc2warc
----------------------------------------------------------------
part 2
----------------------------------------------------------------
even more conspicuously, the WARC produced by WARC Tools gives
the first record as type:
WARC-Type: response
with a target URI of:
WARC-Target-URI: filedesc://...arc
which yields a significantly different record than the Heritrix
transformed WARC, which gives a 'warcinfo' record as the initial
record of the transformed WARC file. (see attachments)
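for anyone who wants to repeat the comparison, a rough python
sketch for pulling the named fields out of the first record of a
WARC follows. the function name and the sample bytes are made up
for illustration (the sample only mimics the WARC Tools case
described above), and the parser is deliberately simplistic: it
assumes one "Name: value" field per line and ignores folded
continuation lines.

```python
import io


def first_record_headers(stream):
    """Return (version_line, headers) for the first record in a binary stream."""
    version = stream.readline().decode("utf-8").strip()
    headers = {}
    for raw in iter(stream.readline, b""):
        line = raw.decode("utf-8").rstrip("\r\n")
        if not line:
            # a blank line ends the header block
            break
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers


# illustrative sample only, not copied from the attachments
sample = (
    b"WARC/0.18\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Date: 2008-12-19T23:22:43Z\r\n"
    b"\r\n"
)

version, headers = first_record_headers(io.BytesIO(sample))
print(version, headers["WARC-Type"], headers["WARC-Date"])
```

running the same function over a file opened in binary mode for
each converter's output would show the differing WARC-Type and
WARC-Date of the initial record side by side.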
furthermore, the WARC spec states in section "4 File and record
model":
All 'warcinfo' 'request', 'metadata' and 'revisit'
records shall not have a payload.
but Heritrix's Arc2Warc class outputs a warcinfo record that has
a "Filedesc:" payload.
please let us know what you think of these differences so we can
determine how best to converge.
thanks,
/st...@archive.org