Common Crawl index access and TLD


Sree Aurovindh Viswanathan

Jun 23, 2015, 1:20:56 AM
to common...@googlegroups.com
Hi,

I am trying to extract all WARC files for a given TLD. I have seen that index.commoncrawl.org lists five different indexes. Each index has a month associated with it (e.g. the December 2014 index).

1) Does that mean the entire web is crawled each month? Or is a different subset of the web crawled each month and released as it becomes available?
2) Are all available indexes accessible through the index.commoncrawl.org service? In other words, will it be possible for me to access indexes of web pages released before December 2014? If so, how?


Thanks
Sree Viswanathan

Tom Morris

Jun 23, 2015, 1:46:44 AM
to common...@googlegroups.com
On Tue, Jun 23, 2015 at 1:20 AM, Sree Aurovindh Viswanathan
<sreeau...@gmail.com> wrote:

> I am trying to extract all WARC files for a given TLD. I have seen that
> index.commoncrawl.org lists five different indexes. Each index has a month
> associated with it (e.g. the December 2014 index).
>
> 1) Does that mean the entire web is crawled each month? Or is a different
> subset of the web crawled each month and released as it becomes available?

To the best of my knowledge, different, overlapping subsets of the web
are crawled in each crawl, but I haven't seen a comprehensive analysis
as to the degree of overlap for recrawls or the breadth of the total
crawl.

> 2) Are all available indexes accessible through the index.commoncrawl.org
> service? In other words, will it be possible for me to access indexes of
> web pages released before December 2014? If so, how?

The new index structure was just put in place recently. What you see
is all that's available. I don't know if there's any plan to go back
and index earlier crawls.

The index files which are used by the index service are also available
for download. If you're looking at a big TLD (e.g. .com), you'd
probably want to access the index files directly rather than through
the web service.
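
As a rough sketch of what filtering the downloaded index files for one TLD might look like: this assumes each line of an index file has the form `<SURT url key> <timestamp> <JSON record>`, so that all hosts under a TLD share a key prefix such as `de,`. The line layout, the helper names and the sample records below are illustrative assumptions, not something stated in this thread, so check them against the real index data before relying on them.

```python
import json


def surt_tld_prefix(tld):
    """Return the SURT key prefix assumed to be shared by hosts under a TLD.

    Example: ".de" -> "de," (SURT keys reverse the host name, so
    example.de becomes "de,example)/...").
    """
    return tld.strip(".").lower() + ","


def iter_tld_records(lines, tld):
    """Yield parsed index records whose URL key falls under the given TLD.

    `lines` is any iterable of text lines, e.g. an opened (decompressed)
    index file. Lines for other TLDs are skipped without being parsed.
    """
    prefix = surt_tld_prefix(tld)
    for line in lines:
        line = line.strip()
        if not line.startswith(prefix):
            continue
        # Assumed layout: key, timestamp, then a JSON object with the
        # WARC file name, byte offset and record length.
        urlkey, timestamp, payload = line.split(" ", 2)
        record = json.loads(payload)
        record["urlkey"] = urlkey
        record["timestamp"] = timestamp
        yield record


# Hypothetical sample lines, made up for illustration only.
SAMPLE_LINES = [
    'com,example)/ 20150421000000 {"url": "http://example.com/", '
    '"filename": "crawl-data/example-a.warc.gz", "offset": "100", "length": "2000"}',
    'de,example)/seite 20150421000001 {"url": "http://example.de/seite", '
    '"filename": "crawl-data/example-b.warc.gz", "offset": "5000", "length": "1500"}',
    'de,example,www)/ 20150421000002 {"url": "http://www.example.de/", '
    '"filename": "crawl-data/example-c.warc.gz", "offset": "0", "length": "900"}',
]

if __name__ == "__main__":
    for rec in iter_tld_records(SAMPLE_LINES, ".de"):
        print(rec["url"], rec["filename"], rec["offset"], rec["length"])
```

Because the key is sorted by reversed host name, all records for one TLD would sit together in the files, which is why a simple prefix test is enough in this sketch.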

Tom

Sree Aurovindh Viswanathan

Jun 23, 2015, 1:50:43 AM
to common...@googlegroups.com
Thank you.

Regards,
Sree Viswanathan

vanaja jayaraman

Jun 29, 2015, 5:37:15 AM
to common...@googlegroups.com

Hi Tom,

You said:

> The index files which are used by the index service are also available for download.

I want to use the complete index file with
https://github.com/trivio/common_crawl_index/blob/master/bin/remote_copy

    mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792')


Where can I download the entire index file? The index mentioned above is only a partial one. (Ref: https://groups.google.com/forum/#!msg/common-crawl/EfR1YHvtWrY/ImnW7Z0rgq4J)


Thanks in Advance,

Vanaja Jayaraman

Tom Morris

Jun 29, 2015, 1:56:58 PM
to common...@googlegroups.com
On Mon, Jun 29, 2015 at 5:37 AM, vanaja jayaraman <vanaj...@gmail.com> wrote:

> Hi Tom,
>
> You said:
>
> > The index files which are used by the index service are also available for download.
>
> I want to use the complete index file with
> https://github.com/trivio/common_crawl_index/blob/master/bin/remote_copy
>
>     mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792')
>
> Where can I download the entire index file? The index mentioned above is only a partial one. (Ref: https://groups.google.com/forum/#!msg/common-crawl/EfR1YHvtWrY/ImnW7Z0rgq4J)

There have been two different indexes with different format files and different software serving them.  That message refers to the old index.  The current index, which is being added to each month as that month's crawl is complete, is a different format, saved in a different place on S3 and accessed using different software.


The blog post includes a pointer to the index data.
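
For orientation, here is a minimal sketch of querying the current index service over HTTP rather than through the old `remote_copy` script. It assumes the service answers requests of the form `<server>/<collection>-index?url=<pattern>&output=json` with one JSON object per line. The collection name `CC-MAIN-2015-18`, the field names in the sample response and the helper functions are assumptions made for illustration, so verify them against the index service's own documentation.

```python
import json
from urllib.parse import urlencode

# Index service host mentioned earlier in this thread.
INDEX_SERVER = "http://index.commoncrawl.org"


def build_query_url(collection, url_pattern, server=INDEX_SERVER):
    """Build a query URL for one monthly index (endpoint layout is assumed)."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{server}/{collection}-index?{query}"


def parse_response(body):
    """Parse a response body containing one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


# Hypothetical response body, made up for illustration only.
SAMPLE_BODY = (
    '{"url": "http://example.com/", "filename": "crawl-data/example-a.warc.gz", '
    '"offset": "100", "length": "2000"}\n'
    '{"url": "http://example.com/about", "filename": "crawl-data/example-b.warc.gz", '
    '"offset": "7000", "length": "1200"}\n'
)

if __name__ == "__main__":
    print(build_query_url("CC-MAIN-2015-18", "*.example.com"))
    for rec in parse_response(SAMPLE_BODY):
        print(rec["url"], rec["filename"], rec["offset"], rec["length"])
```

The request itself is left out on purpose: the URL could be fetched with any HTTP client, and each returned record would then identify a WARC file plus the byte range of one capture.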

Tom

Vanaja Jayaraman

Jun 30, 2015, 12:03:56 AM
to common...@googlegroups.com
Thanks for your response.

Can I use the current index file, which is in the new format, with the script below?

https://github.com/trivio/common_crawl_index/blob/master/bin/remote_copy

If so, what path do I need to specify in the line below?

    mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792')

Also, please let me know the path of the complete index file on S3.




--
Vanaja. J

Tom Morris

Jun 30, 2015, 10:55:06 AM
to common...@googlegroups.com
On Tue, Jun 30, 2015 at 12:03 AM, Vanaja Jayaraman <vanaj...@gmail.com> wrote:
> Thanks for your response.
>
> Can I use the current index file, which is in the new format, with the script below?
>
> https://github.com/trivio/common_crawl_index/blob/master/bin/remote_copy

No. One of the things implied by saying the formats are different (i.e. new and old) is that they're not compatible.

If you go back and read my last email, it includes a link to software which is compatible with the new format.

Tom