Missing Indexes from November 2014 and before


zbagz

Aug 31, 2016, 3:03:17 AM8/31/16
to Common Crawl
Looks like there are some missing indexes from 2014 and earlier crawls. Are there any plans on adding these?

This https://github.com/ikreymer/webarchive-indexing seems to contain the MapReduce jobs for generating the index files, so maybe I could generate these myself by running the jobs on EMR and make the indexes available to future users. Is there any interest in hosting the generated indexes in the public S3 bucket?

Thanks.

Sebastian Nagel

Sep 1, 2016, 6:14:28 AM9/1/16
to common...@googlegroups.com
Hi,

the reason why all indexes before CC-MAIN-2014-52 are missing is simple:
Ilya Kreymer wrote the Common Crawl indexer at just that time, and no
prior data has been indexed. If there is a general interest in providing
indexes for older data, please comment on this thread. Thanks!

Yes, of course, you could create the missing indexes yourself,
but as mentioned, we could do this as well. And yes, it's not a bad idea to upload
the indexes to the public data set bucket and add them to index.commoncrawl.org.
Which way to take depends mostly on whether you need the indexes immediately or not.

Thanks,
Sebastian

zbagz

Sep 1, 2016, 1:08:41 PM9/1/16
to Common Crawl
Hi Sebastian, thanks for your prompt reply. Yes, I'd love to have the older indexes as well; that would be awesome. I don't need them immediately, as I'm currently working with the 2015 and 2016 data, but I will be moving on to the older crawls in the coming weeks.

Indexes are super useful for me because they let me avoid downloading the whole dataset. All I have to do is run https://github.com/ikreymer/cc-index-server on a small EC2 instance and then query its API from my MapReduce jobs to extract the data I need.
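The lookup-then-fetch pattern described above can be sketched roughly as follows. This is a minimal illustration, not a definitive client: the collection name is just an example, and a self-hosted cc-index-server would be queried through the same CDX API as the public endpoint.

```python
import gzip
import json
import urllib.parse
import urllib.request

# Example CDX index endpoint; a self-hosted cc-index-server exposes the
# same API, so only this URL would change.
INDEX = "http://index.commoncrawl.org/CC-MAIN-2016-36-index"

def byte_range(offset, length):
    # HTTP Range headers are inclusive on both ends.
    return "bytes=%d-%d" % (offset, offset + length - 1)

def lookup(url):
    # Each line of the response is one JSON capture record carrying the
    # WARC filename plus the byte offset and length of the record.
    query = urllib.parse.urlencode({"url": url, "output": "json"})
    with urllib.request.urlopen(INDEX + "?" + query) as resp:
        return [json.loads(line) for line in resp.read().decode().splitlines()]

def fetch_record(capture):
    # Range request: download only the single gzipped WARC record,
    # not the multi-gigabyte WARC file that contains it.
    req = urllib.request.Request(
        "https://commoncrawl.s3.amazonaws.com/" + capture["filename"],
        headers={"Range": byte_range(int(capture["offset"]),
                                     int(capture["length"]))})
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

The point of the index is exactly this: a lookup returns the offset and length of each capture, so a job touches only the few kilobytes it actually needs.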

It would be really nice to have all these indexes available for people like me who want to analyze the datasets at a large scale (across different dates) without having to download everything, which would be extremely inefficient and therefore costly.

I'd be looking at these (in this order):

== 2008 ==
crawl-001/2008/

== 2009 ==
crawl-001/2009/
crawl-002/2009/

== 2010 ==
crawl-002/2010/

== 2012 ==
parse-output/

== 2013 ==
crawl-data/CC-MAIN-2013-20/
crawl-data/CC-MAIN-2013-48/

== 2014 ==
crawl-data/CC-MAIN-2014-15/
crawl-data/CC-MAIN-2014-35/
crawl-data/CC-MAIN-2014-49/
Honestly, I don't know how costly this could be, and maybe I'm asking too much here... but I could probably run some of these myself and contribute them to the public bucket so that future users can benefit too.

Thanks.

zbagz

Sep 8, 2016, 12:05:14 PM9/8/16
to Common Crawl
Any update on this, guys?

Thank you.

Sebastian Nagel

Sep 14, 2016, 4:57:19 PM9/14/16
to common...@googlegroups.com
Hi,

> Indexes are super useful for me to avoid downloading the whole dataset. All I have to do is run
> this https://github.com/ikreymer/cc-index-server on a small ec2 instance and then query this api
> from my mapreduce jobs to extract the data needed.

Thanks for running your own index server. Ours has been under quite heavy load lately.
No doubt because the "indexes are super useful"!

Generating one monthly index costs approx. $80 if run on EMR, and $40 if a custom Hadoop cluster is
launched. We use 50 EC2 spot instances of type m3.xlarge, or similar types if available at a cheaper
spot price. Generating the index takes 10-12 hours and requires that all WARC files are processed,
which is about 30 TB for each monthly data set.

I have to discuss the possibility and get a formal OK to generate the missing 10 indexes for 2013
and 2014, probably when running the next monthly crawl. The older data has a different format (ARC
instead of WARC files) and a different directory structure. It should be doable with a small amount
of programming effort.

If you want to do the processing on your own, note that there are a few small modifications in our fork
of Ilya's repository; see https://github.com/commoncrawl/webarchive-indexing

Best,
Sebastian

he...@markthomson.ca

Sep 25, 2016, 12:24:52 AM9/25/16
to Common Crawl
I'd just like to second the request for the earlier indexes. It would be nice to have at least all the WARC files indexed.

This is such an awesome resource that it feels a little wrong asking for "more", but what the heck :-)

Regardless: awesome work, guys. Many thanks.
Mark