Hi,
this command should download only the CC-MAIN-2017-47 index:
aws --no-sign-request s3 sync s3://commoncrawl/cc-index/collections/CC-MAIN-2017-47/ \
collections/CC-MAIN-2017-47/ --exclude "*" --include "*/cluster.idx" --include "*/metadata.yaml"
Please note that the 150 MB cluster.idx is not the full index, which still lives on
s3://commoncrawl/. All 300 index files are 250 GB in total. If you do not run the index server
on AWS (preferably in the us-east-1 region), the server will be slow as it needs to fetch data
from S3.
> I would like to find the source WARC files for a very large set of URLs (millions)
If it's about millions of results but only tens of thousands of queries, it's no problem to use
index.commoncrawl.org :) You'd probably be done in a couple of hours, one day at most.
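For the query route, a minimal Python sketch of building one request and reading the answer might look like this. It assumes the public server exposes the collection at a path ending in "-index" and accepts the `url` and `output=json` parameters; the function names are made up for illustration.

```python
import json
from urllib.parse import urlencode

# Assumption: the query endpoint is the collection name plus "-index".
INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2017-47-index"


def build_query_url(url_pattern):
    """Return the query URL for one URL or pattern (e.g. "example.com/*")."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return INDEX_ENDPOINT + "?" + params


def parse_response(body):
    """Parse a response body with one JSON object per line into dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```

Each parsed record should carry the WARC filename, offset and length of one capture. Fetching the URLs one at a time with a short pause in between keeps the load on the server low.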
If you want to send millions of queries, please follow Tom's advice to process the
index off-line. Just download all 300 files listed in
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-47/cc-index.paths.gz
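A rough sketch of the off-line route in Python, under a few assumptions: each index line has the form "SURT-key timestamp JSON", the JSON carries `url`, `filename`, `offset` and `length` fields, and the paths listing may contain non-index entries such as cluster.idx next to the gzipped index files. The function names are hypothetical.

```python
import gzip
import json

# Base URL taken from the paths listing link above.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def index_file_urls(paths_gz_bytes):
    """Turn the gzipped cc-index.paths listing into download URLs.

    Assumption: only entries ending in ".gz" are index files.
    """
    text = gzip.decompress(paths_gz_bytes).decode("utf-8")
    paths = [line.strip() for line in text.splitlines()]
    return [BASE_URL + p for p in paths if p.endswith(".gz")]


def parse_cdx_line(line):
    """Split one index line into (SURT key, timestamp, JSON record)."""
    surt_key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return surt_key, timestamp, json.loads(payload)


def find_warc_locations(cdx_lines, wanted_urls):
    """Yield (url, filename, offset, length) for each wanted URL found."""
    wanted = set(wanted_urls)
    for line in cdx_lines:
        _, _, record = parse_cdx_line(line)
        if record.get("url") in wanted:
            yield (record["url"], record["filename"],
                   int(record["offset"]), int(record["length"]))
```

The lines of a downloaded index file can be fed in with `gzip.open(path, "rt")`. Note that this matches on the exact URL string, so URLs that differ only in normalization would be missed.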
Best,
Sebastian
On 07/31/2018 08:52 AM, ziqi zhang wrote:
> Thank you Tom
>
> I have installed the tool and followed the instructions to 'install the index', but I have
> realised that it is attempting to index *all* collections.
> On Mon, Jul 30, 2018 at 3:01 PM ziqi zhang <ziqizha...@gmail.com> wrote:
>
> Hello
>
> I would like to find the source WARC files for a very large set of URLs (millions) and I
> understand that I can do this using the index server:
>
> http://index.commoncrawl.org/CC-MAIN-2017-47
> (as for this specific dump)
>
> However I also do not want to overload the server and I understand that you can actually
> build an offline index, as suggested here: https://index.commoncrawl.org/. But I struggle
> to find information on how to do this. What files do I need to download, and what tools or
> code if any should I use to import and create the index?
>
> I'd appreciate some help on this, many thanks!
>
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com.
> To post to this group, send email to common...@googlegroups.com.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.