Hi,
> Indexes are super useful for me to avoid downloading the whole dataset. All I have to do is run
> this
> https://github.com/ikreymer/cc-index-server on a small ec2 instance and then query this api
> from my mapreduce jobs to extract the data needed.
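For anyone else wanting to do the same, querying such an index server from a job might look roughly like the sketch below. The endpoint and collection name are assumptions (a locally running cc-index-server and an example monthly collection); adjust them to your own deployment. Each CDX result line carries the WARC filename, byte offset, and record length, which is all you need to fetch just that record with an HTTP Range request instead of downloading whole files.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: a cc-index-server running locally; the collection
# name "CC-MAIN-2015-18" is only an example.
INDEX_API = "http://localhost:8080/CC-MAIN-2015-18-index"

def build_query(url_pattern):
    """Build a CDX API query URL that returns one JSON object per line."""
    return INDEX_API + "?" + urlencode({"url": url_pattern, "output": "json"})

def parse_cdx_line(line):
    """Extract the WARC filename, offset, and record length from one
    CDX JSON result line; these locate a single record in the archive."""
    rec = json.loads(line)
    return rec["filename"], int(rec["offset"]), int(rec["length"])

# Example of the kind of line the API returns (fields abbreviated,
# filename is a placeholder, not a real path):
sample = '{"filename": "crawl-data/example.warc.gz", "offset": "1234", "length": "5678"}'
print(parse_cdx_line(sample))
```

From a MapReduce job you would issue the query over HTTP, then hand each (filename, offset, length) triple to a fetcher that requests only that byte range from S3.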
Thanks for running your own index server. Ours has been under quite some load lately,
no doubt because the "indexes are super useful"!
Generating one monthly index costs approx. $80 if run on EMR, and $40 if a custom Hadoop cluster is
launched. We use 50 EC2 spot instances of type m3.xlarge, or similar types if available at a cheaper
spot price. Generating the index takes 10-12 hours and requires processing all WARC files,
about 30 TB per monthly data set.
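As a back-of-the-envelope check of those figures (the spot price and EMR surcharge below are assumptions for illustration, not actual billing data):

```python
# Rough cost estimate for one monthly index run, per the numbers above.
instances = 50        # EC2 spot instances
hours = 11            # midpoint of the 10-12 hour run time
spot_price = 0.07     # assumed m3.xlarge spot price, USD per instance-hour
emr_surcharge = 0.07  # assumed EMR fee, USD per instance-hour

cluster_cost = instances * hours * spot_price
emr_cost = instances * hours * (spot_price + emr_surcharge)
print(f"custom cluster: ~${cluster_cost:.0f}, EMR: ~${emr_cost:.0f}")
```

With these assumed rates the totals land near the $40 and $80 figures quoted above.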
I have to discuss the possibility and get a formal OK to generate the missing 10 indexes for 2013
and 2014, probably when running the next monthly crawl. The older data has a different format (ARC
instead of WARC files) and a different directory structure, but it should be doable with little
programming effort.
If you want to do the processing on your own, there are a few small modifications in our fork
of Ilya's repository, see
https://github.com/commoncrawl/webarchive-indexing
Best,
Sebastian