Hi Christian,
thanks for the hint and the good idea to make it easier to find out the fetch times
of the pages contained in a WARC file.
The time stamp contained in the WARC file name
CC-MAIN-20160924173739-00014-ip-10-143-35-109.ec2.internal.warc.gz
is the time when the batch ("segment") this WARC file belongs to was generated.
This is not really useful, because the content of the WARC file is actually
fetched up to 9 days after the generation date. Maybe it would be better to let
this time stamp indicate when the segment was fetched (beginning and/or end)?
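
For reference, here is a minimal sketch of how that time stamp could be pulled out of a file name. The function name `segment_timestamp` is made up for illustration, and the pattern is only inferred from the example name above, so other crawls may need a different one. No time zone is attached, because the file name does not carry one.

```python
import re
from datetime import datetime

# Pattern inferred from the single example file name in this thread (an assumption):
# "CC-MAIN-" + 14-digit time stamp + "-" + 5-digit number + "-" + host part.
_NAME_RE = re.compile(r"CC-MAIN-(\d{14})-\d{5}-")


def segment_timestamp(warc_name):
    """Return the time stamp embedded in a WARC file name, or None if absent.

    Per the discussion above, this is the segment generation time,
    not the time the pages were fetched.
    """
    match = _NAME_RE.search(warc_name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
```

Calling it on the file name above gives 2016-09-24 17:37:39.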
One batch/segment is fetched within 2 hours. The pages in one WARC file are a
random sample of the segment, so when you take the fetch times from one WARC
file per segment, you'll get the time range of this segment. All other WARC
files of the segment should contain fetch times in the same time range (maybe a
few minutes off).
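
Taking the fetch times from one WARC file could look roughly like the sketch below. It uses only the Python standard library and assumes that each record carries `WARC-Type`, `WARC-Date` and `Content-Length` headers, and that `WARC-Date` has the form `2016-09-24T18:00:00Z` (no fractional seconds). The function names are made up for illustration.

```python
import gzip
import io
from datetime import datetime, timezone


def iter_warc_headers(stream: io.BufferedIOBase):
    """Yield the header fields (lower-cased names) of each WARC record."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the record body so that payload bytes are never parsed as headers.
        stream.read(int(headers.get("content-length", 0)))
        yield headers


def fetch_time_range(stream: io.BufferedIOBase):
    """Return (earliest, latest) WARC-Date over all response records, or None."""
    dates = [
        # Assumed date format; adjust if the files use fractional seconds.
        datetime.strptime(h["warc-date"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        for h in iter_warc_headers(stream)
        if h.get("warc-type") == "response" and "warc-date" in h
    ]
    return (min(dates), max(dates)) if dates else None


def warc_file_time_range(path):
    """Convenience wrapper for a gzip-compressed WARC file on disk."""
    with gzip.open(path, "rb") as stream:
        return fetch_time_range(stream)
```

Only response records are counted, since other record types (for example the leading warcinfo record) do not correspond to a page fetch.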
> However, it might also be useful to know the date range of an archive file before opening it, so
> perhaps date ranges could be added to the Path files (eg. warc.paths.gz, wat.paths.gz, etc).
Getting the exact fetch times would mean reading every WARC file, which is a lot
of computation. Instead, one could read the index files, only 300 files. But
assigning the exact time range to each WARC file would still be a lot of computation.
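
To give an idea of the index-based variant, here is a rough sketch that collects the earliest and latest capture time per WARC file from index lines. It assumes each index line has the form `<url key> <14-digit time stamp> <JSON>` and that the JSON part contains a `filename` field. Please check this against the actual index files before relying on it. The function name is made up for illustration.

```python
import json


def time_ranges_by_file(lines):
    """Map each WARC file name to its (earliest, latest) capture time stamp.

    Assumes lines of the form "<url key> <timestamp> <JSON>", where the
    JSON object has a "filename" field. Lines that do not fit are skipped.
    """
    ranges = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue
        _key, timestamp, payload = parts
        try:
            filename = json.loads(payload).get("filename")
        except ValueError:
            continue  # JSON part could not be parsed
        if filename is None:
            continue
        # 14-digit time stamps sort correctly as plain strings.
        lo, hi = ranges.get(filename, (timestamp, timestamp))
        ranges[filename] = (min(lo, timestamp), max(hi, timestamp))
    return ranges
```

The lines could come from `gzip.open(index_file, "rt")` for each of the index files. The result has to be kept in memory for all WARC files of the crawl, so this is still not cheap.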
Best,
Sebastian