WARC start and end dates

Christian Lund

Nov 1, 2016, 3:52:05 AM

to Common Crawl

I am trying to find the easiest way to determine the start and end date of an archive set. Assuming that the WARC files are generated chronologically and packed likewise, I should be able to check the date of the first file and the last file.

However, it might also be useful to know the date range of an archive file before opening it, so perhaps date ranges could be added to the Path files (e.g. warc.paths.gz, wat.paths.gz, etc.).

...or is there a better way to determine date ranges for sets?

Sebastian Nagel

Nov 1, 2016, 5:05:36 AM

to common...@googlegroups.com

Hi Christian,

Thanks for the hint, and for the good idea of making it easier to find out the fetch times
of the pages contained in a WARC file.

The time stamp contained in the WARC file name
CC-MAIN-20160924173739-00014-ip-10-143-35-109.ec2.internal.warc.gz
is the time at which the batch ("segment") this WARC file belongs to was generated.
That is not really useful, because the content of the WARC file was actually fetched
up to 9 days after the generation date. Maybe it would be better to let this
time stamp indicate the time a segment was fetched (beginning and/or end)?
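
To make the naming scheme concrete, here is a minimal Python sketch that pulls the segment generation time and the file's sequence number out of a WARC file name like the one above. The function name `segment_generation_time` is only illustrative, and treating the time stamp as UTC is an assumption.

```python
import re
from datetime import datetime, timezone

# CC-MAIN-<14-digit time stamp>-<5-digit sequence number>-<host>.warc.gz
_NAME = re.compile(r"CC-MAIN-(\d{14})-(\d{5})-")


def segment_generation_time(warc_name):
    """Return (generation time, sequence number) parsed from a WARC file name.

    The time stamp is the time the segment was generated, not the time the
    pages were fetched. Interpreting it as UTC is an assumption.
    """
    match = _NAME.search(warc_name)
    if match is None:
        raise ValueError(f"unexpected WARC file name: {warc_name!r}")
    generated = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return generated.replace(tzinfo=timezone.utc), int(match.group(2))


name = "CC-MAIN-20160924173739-00014-ip-10-143-35-109.ec2.internal.warc.gz"
generated, sequence = segment_generation_time(name)
print(generated.isoformat(), sequence)
```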

One batch/segment is fetched within 2 hours. Pages contained in one WARC file
are random, so when you take the fetch times from one WARC file per segment,
you'll get the time range of this segment. All other WARC files should contain
fetch times in the same time range (maybe a few minutes off).
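
Taking the fetch times from one WARC file could look roughly like the following Python sketch. It uses only the standard library, walks the record headers, and keeps the earliest and latest WARC-Date of the "response" records. The function names are illustrative, and it assumes WARC-Date values of the form 2016-09-24T20:56:13Z and a Content-Length header on every record.

```python
import gzip
from datetime import datetime, timezone


def parse_warc_date(value):
    """Parse a WARC-Date such as 2016-09-24T20:56:13Z as a UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def fetch_time_range(stream, record_type="response"):
    """Return (earliest, latest) WARC-Date over records of the given type.

    `stream` is a binary file object positioned at the start of an
    uncompressed WARC stream, e.g. the result of gzip.open(path, "rb").
    """
    earliest = latest = None
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            raw = stream.readline().strip()
            if not raw:
                break  # empty line ends the record header
            name, _, value = raw.partition(b":")
            headers[name.strip().lower()] = value.strip().decode("ascii", "replace")
        # Skip the record body so payload lines are never mistaken for headers.
        stream.read(int(headers.get(b"content-length", "0")))
        if headers.get(b"warc-type") != record_type or b"warc-date" not in headers:
            continue
        when = parse_warc_date(headers[b"warc-date"])
        earliest = when if earliest is None else min(earliest, when)
        latest = when if latest is None else max(latest, when)
    return earliest, latest
```

For a downloaded file this would be called as `with gzip.open("example.warc.gz", "rb") as f: print(fetch_time_range(f))`, where the path is a placeholder.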

> However, it might also be useful to know the date range of an archive file before opening it, so
> perhaps date ranges could be added to the Path files (e.g. warc.paths.gz, wat.paths.gz, etc.).

Getting the exact fetch times would mean reading every WARC file, which is a lot of computation.
Instead, one could read the index files, only 300 files, but assigning the exact time
ranges would still take a lot of computation.
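
The index-based approach could be sketched in Python as below. It assumes each index line has the form `<url key> <14-digit time stamp> <JSON>` and that the JSON object carries a `filename` field naming the WARC file; both are assumptions about the index layout rather than something stated in this thread, and `time_range_per_file` is an illustrative name.

```python
import json


def time_range_per_file(cdx_lines):
    """Map each WARC file name to its (earliest, latest) capture time stamp.

    Time stamps stay 14-digit strings (YYYYMMDDhhmmss); strings in that
    format sort in chronological order, so min/max work directly.
    """
    ranges = {}
    for line in cdx_lines:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue  # not an index line in the assumed format
        _url_key, stamp, payload = parts
        filename = json.loads(payload).get("filename")
        if filename is None:
            continue
        lo, hi = ranges.get(filename, (stamp, stamp))
        ranges[filename] = (min(lo, stamp), max(hi, stamp))
    return ranges
```

Every index file still has to be read in full, which is where the computation mentioned above goes.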

Best,
Sebastian

Christian Lund

Nov 1, 2016, 6:18:34 AM

to Common Crawl

Hi Sebastian,

Actually, what I was looking for was a way to separate the data extracted from one crawl from the data extracted from another, and obviously the easiest way is to use the crawl set name, e.g. CC-MAIN-2016-36. So to solve my current issue, I'll simply tag data from the August crawl with "2016-36", and when we start processing September the data will be tagged with "2016-40".

Out of curiosity, what does the DD in the naming convention of CC-MAIN-YYYY-DD stand for?

Getting back to the date ranges, I think they could be useful in any case. I was thinking the start and end date/time could be added to the Path files, e.g. for the September crawl:

crawl-data/CC-MAIN-2016-40/segments/1474738659496.36/wat/CC-MAIN-20160924173739-00000-ip-10-143-35-109.ec2.internal.warc.wat.gz;2016-09-24T20:00:13Z;2016-09-24T22:00:54Z

crawl-data/CC-MAIN-2016-40/segments/1474738659496.36/wat/CC-MAIN-20160924173739-00001-ip-10-143-35-109.ec2.internal.warc.wat.gz;...

Here the two fields after the path are the start and end times of the segment. However, I realise that it is more complicated than simply checking the first and last entries in the segment, since the data is not ordered chronologically inside the segment.
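
To illustrate the proposal, a reader for such extended path lines could look like this Python sketch. The `path;start;end` layout is only the format suggested above, not something the existing path files contain, and `PathEntry` and `parse_path_line` are illustrative names.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class PathEntry(NamedTuple):
    path: str
    start: Optional[datetime]  # start of the segment's fetch time range
    end: Optional[datetime]    # end of the segment's fetch time range


def _parse_time(value):
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_path_line(line):
    """Parse 'path;start;end', falling back to a plain path without dates."""
    fields = line.strip().split(";")
    if len(fields) == 1:
        return PathEntry(fields[0], None, None)  # today's format: path only
    if len(fields) != 3:
        raise ValueError(f"expected 'path;start;end', got: {line!r}")
    path, start, end = fields
    return PathEntry(path, _parse_time(start), _parse_time(end))
```

Accepting a bare path as well keeps such a reader usable with path files that carry no dates.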

And a couple of extra questions about segment time stamps.

For each segment there are some additional time stamps at the top of the file. From the first segment in the September crawl it states:

WARC-Type: warcinfo
WARC-Date: 2016-10-03T12:07:27Z

WARC-Type: metadata
WARC-Date: 2016-10-02T01:42:15Z

What do these time stamps refer to, given that the crawl data contained in the file was generated around 2016-09-24T20:56:13Z?

Sebastian Nagel

Nov 1, 2016, 7:29:28 AM

to common...@googlegroups.com

Hi Christian,

> Out of curiosity, what does the DD in the naming convention of CC-MAIN-YYYY-DD stand for?

That's the number of the week in which the crawl was finished and the post-processing started.
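
As a small illustration, the label can be derived from a date in Python. This sketch assumes ISO 8601 week numbering, which the thread does not confirm, and `crawl_label` is an illustrative name.

```python
from datetime import date


def crawl_label(day):
    """Build a CC-MAIN-YYYY-WW label from a date, assuming ISO week numbers."""
    year, week, _weekday = day.isocalendar()
    return f"CC-MAIN-{year}-{week:02d}"


# 2016-10-03 is the date of the warcinfo record quoted in this thread.
print(crawl_label(date(2016, 10, 3)))
```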

> Getting back to the date ranges, I think they could be useful in any case. I was thinking the start
> and end date/time could be added to the Path files, e.g. for the September crawl:

Yes, I hear you. I'll keep it on the list of things we could improve.

> WARC-Type: warcinfo
> WARC-Date: 2016-10-03T12:07:27Z
>
> WARC-Type: metadata
> WARC-Date: 2016-10-02T01:42:15Z
>
> What do these time stamps refer to, given that the crawl data contained in the file was generated
> around 2016-09-24T20:56:13Z?

That's the time the WARC file was written. The current work flow is:

1 generate 100 batches/segments, and prepare a fetch list for every segment
2 fetch all segments, each segment 2 hours = 100 * 2h ~ 8-9 days
3 write WARC files, etc.

Step 1 was done on Sept 24, step 2 started immediately thereafter, and step 3 started early in October
after all segments had been fetched.
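
The arithmetic behind "8-9 days" can be checked with a short Python sketch. It assumes the segments are fetched strictly one after another, starting at the generation time taken from the file name; the real schedule may differ, so this is only a rough estimate, and `estimated_fetch_end` is an illustrative name.

```python
from datetime import datetime, timedelta, timezone


def estimated_fetch_end(start, segments=100, hours_per_segment=2):
    """Estimate when fetching ends if segments run strictly in sequence."""
    return start + timedelta(hours=segments * hours_per_segment)


# Generation time from the file name CC-MAIN-20160924173739-... (UTC assumed).
start = datetime(2016, 9, 24, 17, 37, 39, tzinfo=timezone.utc)
end = estimated_fetch_end(start)
print(end.isoformat())                 # estimated end of fetching
print((end - start) / timedelta(days=1))  # total duration in days
```

100 segments at 2 hours each come to 200 hours, a little over 8 days, which puts the end of fetching in early October.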

Best,
Sebastian
