CommonCrawl Index Server responds with "502 Bad Gateway"

jackun...@gmail.com

Jul 14, 2017, 6:45:38 AM

to Common Crawl

Hi there,

Lately I have been getting a lot of 502 responses from the index server at http://index.commoncrawl.org/ when using the API, no matter which index I query.

Does anyone know if there are any issues?

Thanks!

Sebastian Nagel

Jul 14, 2017, 7:04:33 AM

to common...@googlegroups.com

Hi,

I have to check the logs to find the reason. The server
is a small one (single CPU) which from time to time runs
out of TCP memory.

If you're using the CDX client (https://github.com/ikreymer/cdx-index-client),
please try setting the number of parallel requests to a low number via
-p 2
The default is 2*CPUs_on_client_machine, so if you run the requests from
a powerful server, most of the requests will time out.
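
For scripts that call the index API directly rather than through the CDX client, the same advice (few parallel requests, patience with 502s) could be sketched as follows. This is only a minimal illustration, not part of any Common Crawl tooling: the function names, the set of retryable status codes, the retry count and the delays are all assumptions chosen for the example.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Gateway-style errors treated as worth retrying (an assumption for this sketch).
RETRYABLE = {502, 503, 504}


def fetch_with_backoff(url, fetch=None, retries=5, base_delay=1.0, sleep=time.sleep):
    """Fetch url, retrying with exponential backoff on retryable HTTP errors."""
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=30) as resp:
                return resp.read()
    for attempt in range(retries):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Give up on non-retryable codes or once the attempts are used up.
            if err.code not in RETRYABLE or attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


def fetch_all(urls, workers=2, **kwargs):
    """Fetch several URLs with at most `workers` requests in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch_with_backoff(u, **kwargs), urls))
```

Calling `fetch_all(list_of_query_urls, workers=2)` mirrors the `-p 2` setting above; the `fetch` and `sleep` parameters exist only so the retry logic can be exercised without touching the network.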

The good news: today I started migrating the server to a
more powerful machine, which should be ready in the coming week.

Thanks,
Sebastian

> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

jackun...@gmail.com

Jul 14, 2017, 7:47:58 AM

to Common Crawl

Awesome, thanks for the quick reply! (I was only doing single requests).

Max Jacobson

Oct 16, 2018, 11:37:45 PM

to Common Crawl

Hello,

I believe the server is down again. Could anyone suggest an alternative that allows higher-volume or higher-frequency searches in a way that is kind to the infrastructure?

Best,

Max

Chillar Anand

Oct 16, 2018, 11:41:31 PM

to Common Crawl

I am also planning to set up a mirror for this index.

Is there any documentation on how to set up a mirror?

Sebastian Nagel

Oct 17, 2018, 4:05:47 AM

to common...@googlegroups.com

Hi Max,

The server is responding but still under heavy load.

Yes, there are alternatives:

1. the columnar index
http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
Highly recommended for analytical workloads and if only a few fields
are required (e.g., only domain and HTTP status). We maintain a collection
of example queries to start with:
https://github.com/commoncrawl/cc-index-table

2. download the CDX files and process them offline, see
https://groups.google.com/d/msg/common-crawl/3QmQjFA_3y4/vTbhGqIBBQAJ
This is certainly more efficient if you need to query millions of URLs.

3. run your own index server (see the following posts in this discussion)

4. host and domain names with page counts are also part of the crawl statistics,
see https://groups.google.com/d/msg/common-crawl/vsD4vBpDdG0/SckaJ6OfAgAJ

Note that options 1 and 3 require an AWS account, or rather, you should run
the workloads in the AWS us-east-1 cloud region where the data is stored.
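
To make option 2 a little more concrete, here is a rough sketch of processing downloaded CDX lines offline. It assumes each line has the form "SURT key, timestamp, JSON object" separated by single spaces, with fields such as `url`, `status`, `filename`, `offset` and `length` in the JSON part; please check this against the files you actually download. The function names are hypothetical.

```python
import json


def parse_cdx_line(line):
    """Split one CDX line into (surt_key, timestamp, fields)."""
    # Split on the first two spaces only; the JSON part may contain spaces.
    surt_key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return surt_key, timestamp, json.loads(payload)


def filter_by_status(lines, status="200"):
    """Yield (url, filename, offset, length) for captures with the given status."""
    for line in lines:
        _, _, fields = parse_cdx_line(line)
        if fields.get("status") == status:
            # offset and length are assumed to be stored as strings in the JSON.
            yield (fields["url"], fields["filename"],
                   int(fields["offset"]), int(fields["length"]))
```

Since `filter_by_status` only iterates over lines, it can be fed an open (decompressed) file object and run over many files without loading them into memory.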

Best,
Sebastian

Sebastian Nagel

Oct 17, 2018, 4:51:16 AM

to common...@googlegroups.com

Hi,

thanks again for the notice about the index server availability.

> Is there any documentation to setup mirror?

Please have a look at the project
https://github.com/commoncrawl/cc-index-server

First, you need to install the files
cluster.idx
metadata.yaml
for at least one monthly crawl. The script install-collections.sh
installs them for all *50* monthly crawls. Please see this
discussion on how to download fewer:
https://groups.google.com/d/msg/common-crawl/2xT4OEISYiM/YedFmUrXAQAJ

Second, to run the index server locally there are two options:
- the script run-uwsgi.sh
- or a Dockerfile

I would recommend running the Docker container:
git clone https://github.com/commoncrawl/cc-index-server.git
cd cc-index-server
docker build . -t cc-index
docker run --rm --publish 8080:8080 -ti cc-index

The server should now respond on http://localhost:8080/
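
Once the container is up, queries against the local server could be built roughly like this. The sketch assumes the local server accepts the same query style as the public one, i.e. a path of the form `<collection>-index` with `url` and `output=json` parameters, and returns one JSON object per line. The collection name in the usage note is only an example and must be one you actually installed; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_query_url(base, collection, url_pattern, **params):
    """Build an index query URL for one collection (endpoint layout assumed)."""
    query = {"url": url_pattern, "output": "json", **params}
    return f"{base.rstrip('/')}/{collection}-index?{urlencode(query)}"


def parse_json_lines(body):
    """Parse a response body that holds one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```

For instance, `build_query_url("http://localhost:8080/", "CC-MAIN-2018-39", "example.com/*")` produces a URL that can be opened in a browser or fetched with any HTTP client, and the response text can then be passed to `parse_json_lines`.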

For production:
- you should run the server on AWS in the us-east-1 region.
Most of the index is stored on S3 in this region;
accessing it from outside the AWS cloud is possible but much
slower.
- alternatively, you may set up a "real" server using
nginx + uwsgi ( + certbot )

Best,
Sebastian

Max Jacobson

Oct 17, 2018, 10:24:57 AM

to Common Crawl

Thanks Sebastian, that's incredibly helpful, especially the context about the alternatives.

Chillar Anand

Oct 18, 2018, 3:33:19 AM

to Common Crawl

Thanks @Sebastian for providing detailed instructions.

I will set up a mirror for the index server.

Chillar Anand

Oct 18, 2018, 5:46:35 AM

to Common Crawl

A mirror of the Common Crawl index is available at http://ccindex.avilpage.com/
