Hi,
thanks again for the notice about the index server availability.
> Is there any documentation to setup mirror?
Please have a look at the project
https://github.com/commoncrawl/cc-index-server
First, you need to install the files
cluster.idx
metadata.yaml
for at least one monthly crawl. The script install-collections.sh
will install them for all *50* monthly crawls. Please see this
discussion on how to download fewer:
https://groups.google.com/d/msg/common-crawl/2xT4OEISYiM/YedFmUrXAQAJ
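If you only want a single crawl, a rough sketch of which two objects to fetch could look like the following. The key layout (cc-index/collections/<crawl>/...) and the crawl ID CC-MAIN-2019-35 are assumptions used for illustration; please check them against install-collections.sh before relying on them.

```python
# Sketch only: lists the two index files needed for one monthly crawl.
# ASSUMPTION: the keys follow the layout cc-index/collections/<crawl>/...
# Verify this against install-collections.sh in the cc-index-server repo.

def index_file_keys(crawl_id):
    """Return the (assumed) S3 keys of cluster.idx and metadata.yaml."""
    prefix = f"cc-index/collections/{crawl_id}"
    return [
        f"{prefix}/indexes/cluster.idx",
        f"{prefix}/metadata.yaml",
    ]


# Hypothetical crawl ID, used only as an example.
for key in index_file_keys("CC-MAIN-2019-35"):
    print(key)
```

The printed keys can then be passed to whatever download tool you prefer, keeping the same relative paths under the local collections directory.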
Second, to run the index server locally there are two options:
- the script run-uwsgi.sh
- or a Dockerfile
I would recommend running the Docker container:
git clone https://github.com/commoncrawl/cc-index-server.git
cd cc-index-server
docker build . -t cc-index
docker run --rm --publish 8080:8080 -ti cc-index
The server should now respond on
http://localhost:8080/
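To check that the local server actually answers queries, something like the sketch below may help. It assumes the local server exposes the same query form as the public index server (<collection>-index?url=...&output=json, one JSON record per line) and that the crawl CC-MAIN-2019-35 is installed; both are assumptions, so adjust the names to your setup.

```python
# Sketch only: builds and sends a query to a locally running index server.
# ASSUMPTIONS: the endpoint is <base>/<collection>-index, and output=json
# returns one JSON object per line. The collection name is hypothetical.
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base, collection, url_pattern):
    """Return the query URL for one collection and one URL pattern."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{base.rstrip('/')}/{collection}-index?{params}"


def query_index(base, collection, url_pattern, timeout=30):
    """Fetch matching index records as a list of dicts (needs a running server)."""
    query_url = build_query_url(base, collection, url_pattern)
    with urlopen(query_url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```

With the container running, query_index("http://localhost:8080", "CC-MAIN-2019-35", "commoncrawl.org/*") should return the matching records, provided that crawl's cluster.idx and metadata.yaml were installed.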
For production:
- you should run the server on AWS in the us-east-1 region.
Most of the index is stored on S3 in this region;
accessing it from outside the AWS cloud is possible but much
slower.
- alternatively, you may set up a "real" server using
nginx + uwsgi ( + certbot )
Best,
Sebastian