On Tue, Jun 23, 2015 at 1:20 AM, Sree Aurovindh Viswanathan
<sreeau...@gmail.com> wrote:
> I am trying to extract all WARC files for a given TLD. I have seen that
> index.commoncrawl.org lists five different indexes. Each index has a month
> associated with it (e.g. the December 2014 index).
>
> 1) Does that mean the entire web is crawled each month? Or is a different
> subset of the web crawled each month and released as it becomes available?
To the best of my knowledge, different, overlapping subsets of the web
are crawled in each crawl, but I haven't seen a comprehensive analysis
as to the degree of overlap for recrawls or the breadth of the total
crawl.
> 2) Are all available indexes accessible through the index.commoncrawl.org
> service? In other words, will it be possible for me to access indexes of
> web pages released before December 2014? If so, how?
The new index structure was put in place only recently. What you see
is all that's available. I don't know if there's any plan to go back
and index earlier crawls.
The index files which are used by the index service are also available
for download. If you're looking at a big TLD (e.g. .com), you'd
probably want to access the index files directly rather than through
the web service.
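For a smaller TLD, the web service route might look roughly like the
Python sketch below. It is only an illustration under some assumptions:
that each index has a query endpoint named like "CC-MAIN-2014-52-index",
that a "*.tld" wildcard with output=json is accepted, and that each
result line is a JSON object with "filename", "offset" and "length"
fields. The function names are made up for this example; check the
index server's own documentation before relying on any of it.

```python
import json
from urllib.parse import urlencode

# Assumption: one query endpoint per crawl index on this host.
INDEX_SERVER = "http://index.commoncrawl.org"


def build_query_url(index_name, tld):
    """Build a query URL asking one index for all captures under a TLD."""
    # Assumption: "*.tld" is understood as a domain wildcard.
    params = urlencode({"url": "*." + tld, "output": "json"})
    return "{}/{}?{}".format(INDEX_SERVER, index_name, params)


def parse_records(body):
    """Turn a response body (one JSON object per line) into
    (filename, offset, length) tuples locating each record in a WARC file."""
    records = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        entry = json.loads(line)
        records.append(
            (entry["filename"], int(entry["offset"]), int(entry["length"]))
        )
    return records
```

The URL from build_query_url() can be fetched with
urllib.request.urlopen() and the decoded body handed to
parse_records(). For something the size of .com the result set would be
very large, which is the reason to work from the downloadable index
files instead.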
Tom