WARC spec clarification on transformed WARCs


st...@archive.org

Jan 14, 2009, 3:42:22 PM
to warc-...@googlegroups.com, Gordon Mohr
hi WARC Tools,

can you please clarify the WARC spec with regard to the
WARC-Date field (part 1) and warcinfo records in WARCs
transformed from ARCs (part 2) for us? these issues came up when
comparing Heritrix (2.0.2) and warc tools (r242) arc2warc
output.

----------------------------------------------------------------
part 1
----------------------------------------------------------------
according to the WARC spec[1] ISO/DIS 28500 (v0.18):

5.4 WARC-Date
"The timestamp shall represent the instant that
data capture for record creation began."

this may mean that the creation date of the WARC file itself
(from an original ARC) would not be captured. also, WARC files
converted from ARCs which predate the WARC format might have a
WARC-Date field which predates the WARC format.

is this what we want?

this issue came up when comparing the output of:

1) Heritrix's Arc2Warc.java class, and
2) WARC Tools' arc2warc

given an ARC file whose date is:

2008-12-19 23:22:43

converting the ARC to a WARC with Heritrix gives:

WARC-Date: 2009-01-05T22:25:39Z

in the first record (a warcinfo record), which is the creation
date.

while converting to a WARC with warc tools gives:

WARC-Date: 2008-12-19T23:22:43Z

in the first record (which is a response record - see part 2).

so, do we want the WARC-Date field in the warcinfo record
to be the date of the first record, or the creation date
of the WARC file itself?
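
For reference, the date conversion itself is mechanical; the open question above is only which instant to record. A minimal Python sketch, assuming the ARC date is already in UTC and arrives either in the form quoted above or as a 14-digit ARC timestamp. The function name to_warc_date is illustrative and is not taken from Heritrix or WARC Tools:

```python
from datetime import datetime, timezone


def to_warc_date(arc_date: str) -> str:
    """Format an ARC capture date as a WARC-Date value (ISO 8601, UTC)."""
    # Assumption: the ARC date is already UTC, so no offset is applied.
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y%m%d%H%M%S"):
        try:
            parsed = datetime.strptime(arc_date, fmt)
        except ValueError:
            continue  # try the next known layout
        return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError("unrecognized ARC date: {!r}".format(arc_date))


print(to_warc_date("2008-12-19 23:22:43"))  # prints 2008-12-19T23:22:43Z
```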

attachments:

arc2warc-arc.txt: head of Original ARC file
arc2warc-h2.txt : head of WARC from Heritrix's Arc2Warc.java
arc2warc-wt.txt : head of WARC from WARC tools arc2warc

----------------------------------------------------------------
part 2:
----------------------------------------------------------------
even more conspicuously, the warc tools transformed WARC gives
the first record as type:

WARC-Type: response

with a target URI of:

WARC-Target-URI: filedesc://...arc

which yields a significantly different record than the Heritrix
transformed WARC, which gives a 'warcinfo' record as the initial
record of the transformed WARC file. (see attachments)

furthermore, the WARC spec states in section "4 File and record
model":

All 'warcinfo' 'request', 'metadata' and 'revisit'
records shall not have a payload.

but Heritrix's Arc2Warc class outputs a warcinfo record that has
a "Filedesc:" payload.


please let us know what you think of these differences so we can
determine how best to converge.


thanks,
/st...@archive.org


[1] http://archive-access.sourceforge.net/warc/


Gordon Paynter

Jan 19, 2009, 3:31:03 PM
to warc-...@googlegroups.com, Gordon Mohr, Clement Oury
Hi Steve:

While I cannot answer your questions myself, I did send them to Clement
at BNF, who made the following response (which I hope he will not mind
my sharing). I hope you find it useful.

Gordon



Hi Gordon,

Here are a few comments on the WARC questions (Part 2 comes before
Part 1).

Part 2:

As far as I know, the Warcinfo record has been designed to play the
role of the "filedesc" of the ARC format.
However, the Warcinfo record of a migrated WARC file shall describe the
migration process (and it is not possible to have two warcinfo records
within the same WARC file).

On the other hand, an ARC filedesc record can't be considered a real
"response", so it should not be migrated into a WARC "response" record.

A solution may be to create a Warcinfo record describing a migration
process,
AND
to create a metadata record containing the content of the ARC filedesc
record.

On the question of the payload:
The payload in the WARC standard is defined as a "Data object referred
to, or contained by a WARC record as a meaningful subset of the content
block" (p. 3).

Defining a "meaningful subset" is useful, because one might want to
check the data integrity of the payload (that is, the file harvested
from the net, without the HTTP response headers), or identify its format.

In the Warcinfo record given as an example of the output of Heritrix's
ARC2WARC class, the text written after the headers seems to be only the
block of the record, so there is no inconsistency with the standard.
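
The block/payload distinction above can be illustrated with a short Python sketch. It assumes a "response" record whose content block is a raw HTTP response, and that the payload is whatever follows the blank line ending the HTTP headers. The helper name split_payload and the sample bytes are made up for illustration. The standard names two separate digest fields, WARC-Block-Digest and WARC-Payload-Digest, which is why the two hashes are computed separately here:

```python
import hashlib
from typing import Tuple


def split_payload(block: bytes) -> Tuple[bytes, bytes]:
    """Split an HTTP response block into (http_headers, payload)."""
    head, sep, body = block.partition(b"\r\n\r\n")
    if not sep:
        # No header terminator found: treat everything as headers, no payload.
        return block, b""
    return head, body


# Hypothetical content block of a "response" record.
block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
http_headers, payload = split_payload(block)

# The block digest covers the whole content block; the payload digest
# covers only the harvested file.
print("block  :", hashlib.sha1(block).hexdigest())
print("payload:", hashlib.sha1(payload).hexdigest())
```

A record whose content block has no such embedded structure (for example, the plain text in the warcinfo record discussed above) has a block but no separately identifiable payload.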

Part 1:

This seems to be a very critical issue.

In my opinion, a WARC response record migrated from an ARC record shall
have the same date as the original ARC record.
That is:
an ARC record whose date is 2008-12-19 23:22:43
shall be migrated into a response record with WARC-Date:
2008-12-19T23:22:43Z

On the other hand, the migrated WARC response record should be linked
to the Warcinfo record describing the migration process, whose date
should be WARC-Date: 2009-01-05T22:25:39Z

The date of the metadata record containing the "filedesc" shall also be
2009-01-05T22:25:39Z, but it will be necessary to put the original date
of the ARC filedesc record somewhere else in the WARC metadata record.

This solution makes it possible to record:
- the original harvest date
- the migration date
It also seems a good fit for access tools such as the Wayback Machine.

It has three shortcomings:
- this solution is not formally written in the standard (but the
standard gives no rule for managing migrated WARC files)
- the WARC response record dates predate the WARC format (but that is
not a real problem, in my opinion)
- it is not very consistent with the way we treat conversion records
(they shall have the WARC-Date of their own creation, not that of the
original WARC record; see the example in the standard, p. 24).

-... but it seems to me the best solution!
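
The date assignment proposed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the two dates are the ones quoted in this thread, the target URI is a placeholder, and the helper names (new_record_id, migrated_headers) are hypothetical. WARC-Warcinfo-ID is the standard's field for linking a record to its warcinfo record; content blocks and the remaining mandatory headers are omitted for brevity:

```python
import uuid
from typing import Dict, List

MIGRATION_DATE = "2009-01-05T22:25:39Z"   # when the ARC was converted
ARC_RECORD_DATE = "2008-12-19T23:22:43Z"  # original harvest date


def new_record_id() -> str:
    """Return a fresh record identifier in urn:uuid form."""
    return "<urn:uuid:{}>".format(uuid.uuid4())


def migrated_headers(target_uri: str) -> List[Dict[str, str]]:
    """Build headers for the warcinfo, metadata and response records."""
    warcinfo_id = new_record_id()
    warcinfo = {
        "WARC-Type": "warcinfo",
        "WARC-Record-ID": warcinfo_id,
        "WARC-Date": MIGRATION_DATE,  # describes the migration itself
    }
    metadata = {
        "WARC-Type": "metadata",      # would carry the ARC filedesc content
        "WARC-Record-ID": new_record_id(),
        "WARC-Date": MIGRATION_DATE,
        "WARC-Warcinfo-ID": warcinfo_id,
    }
    response = {
        "WARC-Type": "response",
        "WARC-Record-ID": new_record_id(),
        "WARC-Target-URI": target_uri,
        "WARC-Date": ARC_RECORD_DATE,  # keeps the original harvest date
        "WARC-Warcinfo-ID": warcinfo_id,
    }
    return [warcinfo, metadata, response]


for headers in migrated_headers("http://example.org/"):
    for name, value in headers.items():
        print("{}: {}".format(name, value))
    print()
```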

I hope these few ideas will be useful. Please let me know your
opinion on these topics.

Clément

- - - - - - - - - -
Clément Oury
Digital Curator
Digital Legal Deposit

Bibliothèque nationale de France
Quai François-Mauriac
75706 Paris Cedex 13
tel. 33 (0)1 53 79 46 93

siznax

Jan 28, 2009, 3:16:16 PM
to warc-tools
Gordon and Clement,

thanks for your thoughtful response.

your suggestions sound perfectly reasonable. i'll try
to restate them below so that you can confirm that we
have reached a consensus.

given the following WARC states, the following conditions
should apply:

1) original WARC
warcinfo record should serve as ARC "filedesc" record,
with optional WARC generation

2) migrated WARC (ARC->WARC)
a) warcinfo record should serve as migration description,
warcinfo/WARC-Date should be migrated WARC creation date
b) metadata record should contain content of ARC "filedesc"
record, metadata/WARC-Date should be migrated WARC creation
date, ARC "filedesc" date should also be in this record,
and possibly the WARC generation could be indicated here
c) response records should have the same date as each
corresponding ARC record

3) second-generation WARC (WARC->ARC->WARC)
a) same conditions as (2), and
b) warcinfo record should indicate WARC generation

i believe we would need to agree then on the form of the
fields for:

2b) original ARC "filedesc" date in migrated WARC metadata
record, e.g. metadata/"ARC-Filedesc-Date" with ISO8601 date.

1,2b,3b) WARC generation specified in warcinfo record,
e.g. warcinfo/"WARC-Generation" with integer value
indicating: 0=original WARC, 1=migrated WARC,
2=second-generation WARC, etc.

i'm not sure if "WARC-Generation" is necessary, but it seems
potentially useful.
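
To make the two proposed fields concrete, here is a small Python sketch. Note that "WARC-Generation" and "ARC-Filedesc-Date" are only the names floated in this message; neither is defined in the WARC standard, and the function name proposed_fields is likewise hypothetical:

```python
from typing import Dict, Optional


def proposed_fields(
    generation: int, arc_filedesc_date: Optional[str] = None
) -> Dict[str, Dict[str, str]]:
    """Return the extra (non-standard) fields per record type.

    generation: 0=original WARC, 1=migrated WARC, 2=second-generation, etc.
    """
    if generation < 0:
        raise ValueError("generation must be >= 0")
    fields = {
        "warcinfo": {"WARC-Generation": str(generation)},
        "metadata": {},
    }  # type: Dict[str, Dict[str, str]]
    if generation >= 1:
        # A migrated WARC keeps the original ARC filedesc date in metadata.
        if arc_filedesc_date is None:
            raise ValueError("a migrated WARC needs the ARC filedesc date")
        fields["metadata"]["ARC-Filedesc-Date"] = arc_filedesc_date
    return fields


print(proposed_fields(0))
print(proposed_fields(1, "2008-12-19T23:22:43Z"))
```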


thanks so much,
/st...@archive.org