Creating offline index server


ziqi zhang

Jul 30, 2018, 3:01:04 PM
to Common Crawl
Hello

I would like to find the source WARC files for a very large set of URLs (millions), and I understand that I can do this using the index server:

http://index.commoncrawl.org/CC-MAIN-2017-47 (for this specific dump)

However, I also do not want to overload the server, and I understand that you can actually build an offline index, as suggested here: https://index.commoncrawl.org/. But I am struggling to find information on how to do this. What files do I need to download, and what tools or code, if any, should I use to import them and create the index?
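For context, the per-URL lookup I have in mind is roughly the following (a minimal Python sketch against the public CDX API; the example URL is just a placeholder for my real list):

# Minimal sketch of querying the public index server for a single URL.
# "commoncrawl.org/" is a placeholder for one of my real target URLs.
import json
import requests

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2017-47-index"

def lookup(url):
    # The CDX API returns one JSON object per line, one per capture of the URL.
    resp = requests.get(CDX_API, params={"url": url, "output": "json"})
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines() if line]

for capture in lookup("commoncrawl.org/"):
    # "filename", "offset" and "length" locate the record in its source WARC file.
    print(capture["filename"], capture["offset"], capture["length"])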

I'd appreciate some help on this, many thanks!

Tom Morris

Jul 31, 2018, 1:12:17 AM
to common...@googlegroups.com
The repo for the index server software is here: https://github.com/commoncrawl/cc-index-server
but you may want to consider just working from the raw index files in a batch job.

The index files are typically announced in the blog post for each crawl release. They are segmented but sorted, so they are easy to process in a batch (e.g. Hadoop) job.
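For instance, a filter pass over one downloaded index shard could look roughly like this (a Python sketch; the shard filename and target URLs are hypothetical placeholders):

# Rough sketch: scan one cc-index shard for a set of target URLs.
# "cdx-00000.gz" and the target set are hypothetical placeholders.
import gzip
import json

targets = {"http://example.com/", "http://commoncrawl.org/"}

with gzip.open("cdx-00000.gz", "rt", encoding="utf-8") as shard:
    for line in shard:
        # Each line is "<SURT key> <timestamp> <JSON metadata>".
        _key, _timestamp, meta_json = line.split(" ", 2)
        meta = json.loads(meta_json)
        if meta.get("url") in targets:
            # filename/offset/length point at the record in its source WARC file.
            print(meta["filename"], meta["offset"], meta["length"])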

Of course, if you want, you could stand up a private index server and query it interactively, but I suspect batch MapReduce-style access could be faster.

Tom


ziqi zhang

Jul 31, 2018, 2:52:02 AM
to Common Crawl
Thank you Tom

I have installed the tool and followed the instructions to 'install the index', but realised that it is attempting to index all collections.

Is there any way to index just one or a few specific collections?

The command reads:

aws --no-sign-request s3 sync s3://commoncrawl/cc-index/collections/ collections/ --exclude "*" --include "*/cluster.idx" --include "*/metadata.yaml"

I have tried three things, all of which failed:

1) I tried to replace parts of the command, for example replacing "*/cluster.idx" with "CC-MAIN-2017-47/cluster.idx" and "*/metadata.yaml" with "CC-MAIN-2017-47/metadata.yaml".

But this does not seem to work, as it only downloads a 29-byte yaml file.

2) I also tried 

aws --no-sign-request s3 sync s3://commoncrawl/cc-index/collections/CC-MAIN-2017-47 collections/ --exclude "*" --include "*/cluster.idx" --include "*/metadata.yaml"

which downloads a 160 MB file into:

cc-index-server/collections/indexes

but then running "cdx-server" generates the error below:

Exception: Dir "collections/indexes/indexes" does not exist for "index_paths"
 

3) I tried 

aws --no-sign-request s3 sync s3://commoncrawl/cc-index/collections/ collections/CC-MAIN-2017-47 --exclude "*" --include "*/cluster.idx" --include "*/metadata.yaml"

But it still seems to download all collections, as the terminal shows messages indicating that it is processing collections starting from 2013.


Also, is there any way to know how big the index is going to be?

Thanks again! 

Sebastian Nagel

Jul 31, 2018, 3:27:01 AM
to common...@googlegroups.com
Hi,

this command should download only the CC-MAIN-2017-47 index:

aws --no-sign-request s3 sync s3://commoncrawl/cc-index/collections/CC-MAIN-2017-47/ \
collections/CC-MAIN-2017-47/ --exclude "*" --include "*/cluster.idx" --include "*/metadata.yaml"

Please note that the 150 MB cluster.idx is not the full index, which still lives on s3://commoncrawl/.
All 300 index files are 250 GB in total. If you do not run the index server on AWS (preferably in
the us-east-1 region), the server will be slow, as it needs to fetch data from S3.
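If you want to check the size of one collection's index files up front, something like this works (a sketch using boto3 with anonymous access):

# Sketch: sum the sizes of one collection's index files on S3 (anonymous access).
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
total = 0
for page in s3.get_paginator("list_objects_v2").paginate(
        Bucket="commoncrawl",
        Prefix="cc-index/collections/CC-MAIN-2017-47/indexes/"):
    total += sum(obj["Size"] for obj in page.get("Contents", []))
print("%.1f GiB" % (total / 2**30))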

> I would like to find the source WARC files for a very large set of URLs (millions)

If it's about millions of results but only tens of thousands of queries, it's no problem to use
index.commoncrawl.org :) You'll probably be done in a couple of hours, one day at most.
If you want to send millions of queries, please follow Tom's advice and process the
index offline. Just download all 300 files listed in
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-47/cc-index.paths.gz
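For example, something along these lines would fetch them (a Python sketch that shells out to the AWS CLI; it assumes the CLI is installed and a local indexes/ directory as the target):

# Sketch: download every index shard listed in cc-index.paths.gz via the AWS CLI.
import gzip
import subprocess
import urllib.request

PATHS_URL = ("https://commoncrawl.s3.amazonaws.com/"
             "crawl-data/CC-MAIN-2017-47/cc-index.paths.gz")

with urllib.request.urlopen(PATHS_URL) as resp:
    paths = gzip.decompress(resp.read()).decode("utf-8").split()

for path in paths:
    if path.endswith(".gz"):  # keep the cdx-*.gz shards, skip cluster.idx etc.
        subprocess.run(["aws", "--no-sign-request", "s3", "cp",
                        "s3://commoncrawl/" + path, "indexes/"],
                       check=True)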

Best,
Sebastian


ziqi zhang

Aug 1, 2018, 4:43:23 AM
to Common Crawl
Thank you, it is working now.

