cc-index-server returning errors


Erik Wickstrom

May 31, 2017, 7:51:55 PM
to common...@googlegroups.com
Hi,

I'm hosting my own copy of the cc-index-server (from https://github.com/commoncrawl/cc-index-server).  It downloaded the collections and started up without any issues, but when I try to run a query, I get the following error:


A server error occurred. Please contact the administrator.

Here is the traceback from the server log:

cc-index-server │ 172.17.0.3 - - [31/May/2017 23:07:52] "GET /CC-MAIN-2017-17-index?url=commoncrawl.org HTTP/1.1" 500 59
cc-index-server │ 2017-05-31 23:08:33,546: [DEBUG]: Loading 1 blocks from s3://commoncrawl/cc-index/collections/CC-MAIN-2017-17/indexes/cdx-00249.gz:416445277+148604
cc-index-server │ 2017-05-31 23:08:33,546: [DEBUG]: path=/
cc-index-server │ 2017-05-31 23:08:33,546: [DEBUG]: auth_path=/commoncrawl/
cc-index-server │ 2017-05-31 23:08:33,546: [DEBUG]: Method: HEAD
cc-index-server │ 2017-05-31 23:08:33,546: [DEBUG]: Path: /
cc-index-server │ 2017-05-31 23:08:33,546: [DEBUG]: Data:
cc-index-server │ 2017-05-31 23:08:33,546: [DEBUG]: Headers: {}
cc-index-server │ 2017-05-31 23:08:33,547: [DEBUG]: Host: commoncrawl.s3.amazonaws.com
cc-index-server │ 2017-05-31 23:08:33,547: [DEBUG]: Port: 443
cc-index-server │ 2017-05-31 23:08:33,547: [DEBUG]: Params: {}
cc-index-server │ 2017-05-31 23:08:33,547: [DEBUG]: Token: None
cc-index-server │ 2017-05-31 23:08:33,547: [DEBUG]: StringToSign:
cc-index-server │ HEAD
cc-index-server │
cc-index-server │
cc-index-server │ Wed, 31 May 2017 23:08:33 GMT
cc-index-server │ /commoncrawl/
cc-index-server │ 2017-05-31 23:08:33,547: [DEBUG]: Signature:
cc-index-server │ AWS XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYY
cc-index-server │ 2017-05-31 23:08:33,548: [DEBUG]: Final headers: {'Date': 'Wed, 31 May 2017 23:08:33 GMT', 'Content-Length': '0', 'Authorization': u'AWS XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYY', 'User-Agent': 'Boto/2.47.0 Python/2.7.13 Linux/4.9.27-moby'}
cc-index-server │ 2017-05-31 23:08:33,548: [DEBUG]: encountered BadStatusLine exception, reconnecting
cc-index-server │ 2017-05-31 23:08:33,548: [DEBUG]: establishing HTTPS connection: host=commoncrawl.s3.amazonaws.com, kwargs={'port': 443, 'timeout': 70}
cc-index-server │ 2017-05-31 23:08:34,416: [DEBUG]: Token: None
cc-index-server │ 2017-05-31 23:08:34,416: [DEBUG]: StringToSign:
cc-index-server │ HEAD
cc-index-server │
cc-index-server │
cc-index-server │ Wed, 31 May 2017 23:08:33 GMT
cc-index-server │ /commoncrawl/
cc-index-server │ 2017-05-31 23:08:34,417: [DEBUG]: Signature:
cc-index-server │ AWS XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYY
cc-index-server │ 2017-05-31 23:08:34,417: [DEBUG]: Final headers: {'Date': 'Wed, 31 May 2017 23:08:33 GMT', 'Content-Length': '0', 'Authorization': u'AWS XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYY', 'User-Agent': 'Boto/2.47.0 Python/2.7.13 Linux/4.9.27-moby'}
cc-index-server │ 2017-05-31 23:08:35,064: [DEBUG]: Response headers: [('date', 'Wed, 31 May 2017 23:08:34 GMT'), ('x-amz-id-2', 'UG54DlZsL8E7YZ3dP9ouoTlvpUj+oTD5PYe+dG6eyVUiKrv8+4JBp51mPT+eha7s6sNC6Zbzy+4='), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', '7157866D1F505C8E'), ('x-amz-bucket-region', 'us-east-1'), ('content-type', 'application/xml')]
cc-index-server │ Traceback (most recent call last):
cc-index-server │ File "/usr/local/lib/python2.7/wsgiref/handlers.py", line 85, in run
cc-index-server │ self.result = application(self.environ, self.start_response)
cc-index-server │ File "/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 25, in __call__
cc-index-server │ return self.handle_methods(env, start_response)
cc-index-server │ File "/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 78, in handle_methods
cc-index-server │ response = self.handle_exception(env, e, True)
cc-index-server │ File "/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 89, in handle_exception
cc-index-server │ status = exc.status()
cc-index-server │ TypeError: 'int' object is not callable
cc-index-server │ 172.17.0.3 - - [31/May/2017 23:08:35] "GET /CC-MAIN-2017-17-index?url=commoncrawl.org HTTP/1.1" 500 59
cc-index-server │ 2017-05-31 23:11:13,441: [DEBUG]: Loading 1 blocks from s3://commoncrawl/cc-index/collections/CC-MAIN-2017-17/indexes/cdx-00235.gz:801238783+201224
cc-index-server │ 2017-05-31 23:11:13,441: [DEBUG]: path=/
cc-index-server │ 2017-05-31 23:11:13,442: [DEBUG]: auth_path=/commoncrawl/
cc-index-server │ 2017-05-31 23:11:13,442: [DEBUG]: Method: HEAD
cc-index-server │ 2017-05-31 23:11:13,442: [DEBUG]: Path: /
cc-index-server │ 2017-05-31 23:11:13,442: [DEBUG]: Data:
cc-index-server │ 2017-05-31 23:11:13,443: [DEBUG]: Headers: {}
cc-index-server │ 2017-05-31 23:11:13,443: [DEBUG]: Host: commoncrawl.s3.amazonaws.com
cc-index-server │ 2017-05-31 23:11:13,443: [DEBUG]: Port: 443
cc-index-server │ 2017-05-31 23:11:13,443: [DEBUG]: Params: {}
cc-index-server │ 2017-05-31 23:11:13,444: [DEBUG]: establishing HTTPS connection: host=commoncrawl.s3.amazonaws.com, kwargs={'port': 443, 'timeout': 70}
cc-index-server │ 2017-05-31 23:11:13,444: [DEBUG]: Token: None
cc-index-server │ 2017-05-31 23:11:13,444: [DEBUG]: StringToSign:
cc-index-server │ HEAD
cc-index-server │
cc-index-server │
cc-index-server │ Wed, 31 May 2017 23:11:13 GMT
cc-index-server │ /commoncrawl/
cc-index-server │ 2017-05-31 23:11:13,445: [DEBUG]: Signature:
cc-index-server │ AWS XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYY
cc-index-server │ 2017-05-31 23:11:13,445: [DEBUG]: Final headers: {'Date': 'Wed, 31 May 2017 23:11:13 GMT', 'Content-Length': '0', 'Authorization': u'AWS XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYY', 'User-Agent': 'Boto/2.47.0 Python/2.7.13 Linux/4.9.27-moby'}
cc-index-server │ 2017-05-31 23:11:14,377: [DEBUG]: Response headers: [('date', 'Wed, 31 May 2017 23:11:13 GMT'), ('x-amz-id-2', 'wfA/bfEKKda5kBUD5fSqx0h7TG5o+ScO+8aiA6jbMe1gIpnBUW4TrtoyP3ucOWSNdZep2hAyz5o='), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', '6298E832BCE2A426'), ('x-amz-bucket-region', 'us-east-1'), ('content-type', 'application/xml')]
cc-index-server │ Traceback (most recent call last):
cc-index-server │ File "/usr/local/lib/python2.7/wsgiref/handlers.py", line 85, in run
cc-index-server │ self.result = application(self.environ, self.start_response)
cc-index-server │ File "/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 25, in __call__
cc-index-server │ return self.handle_methods(env, start_response)
cc-index-server │ File "/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 78, in handle_methods
cc-index-server │ response = self.handle_exception(env, e, True)
cc-index-server │ File "/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 89, in handle_exception
cc-index-server │ status = exc.status()
cc-index-server │ TypeError: 'int' object is not callable
cc-index-server │ 172.17.0.3 - - [31/May/2017 23:11:14] "GET /CC-MAIN-2017-17-index?url=http%3A%2F%2Fwww.howtocleanstuff.net%2Fhow-to-remove-mold-from-cement-walls-and-floors%2F&output=json HTTP/1.1" 500 59
convox │ 1 files uploaded
I'm starting the server in Docker with the following Dockerfile:

FROM python:2.7
RUN apt-get -qq update && apt-get -qqy install awscli

# Install dependencies
COPY ./requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt

# Add the cc-index-server code into the image
COPY ./ /opt/webapp/
WORKDIR /opt/webapp
RUN ./install-collections.sh
CMD ./cdx-server

Thanks for your help!
Erik

Sebastian Nagel

Jun 1, 2017, 6:55:01 AM
to common...@googlegroups.com
Hi Erik,

first, thanks for the Dockerfile. Is it ok to add it to the project on github
(and also push it upstream to https://github.com/ikreymer/cc-index-server)?
It's pretty nice for debugging, thanks!

I was able to reproduce your problem (attached the modified Dockerfile):

My first attempt succeeded, but when comparing my log output with the one you sent,
I noticed that my cdx server does not send any AWS credentials (access key + signature [1]),
which appear masked in your log:

AWS XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYY
'Authorization': u'AWS XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY', 'User-Agent': ...

I've added the following lines to the Dockerfile which make boto use authenticated requests [2]:

ENV AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXXXXXXX
ENV AWS_SECRET_ACCESS_KEY=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY

If the access keys are valid, everything succeeds. If they are not (e.g., XXX + YYY), the error
is reproducible. It's a problem in pywb [3] when logging the error; I've opened a ticket and PR to
fix it [4].
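
The failure mode can be reproduced in isolation. The following is a minimal sketch, not pywb's actual code: FakeS3Error and FakeWbError are hypothetical stand-ins, and it assumes (as the traceback suggests) that the handler calls exc.status() while the S3 error carries status as a plain integer attribute.

```python
class FakeS3Error(Exception):
    """Hypothetical stand-in: `status` is an int attribute."""
    def __init__(self, status, reason):
        Exception.__init__(self, "%d %s" % (status, reason))
        self.status = status
        self.reason = reason


class FakeWbError(Exception):
    """Hypothetical stand-in: `status` is a method returning a status line."""
    def status(self):
        return "400 Bad Request"


def naive_status(exc):
    # Same call as in the traceback; raises TypeError when `status` is an int.
    return exc.status()


def safe_status(exc):
    # Defensive variant: call `status` only if it is callable,
    # otherwise fall back to a generic server error.
    status = getattr(exc, "status", None)
    if callable(status):
        return status()
    return "500 Internal Server Error"


if __name__ == "__main__":
    err = FakeS3Error(403, "Forbidden")
    try:
        naive_status(err)
    except TypeError as e:
        print("naive handler failed: %s" % e)
    print(safe_status(err))
    print(safe_status(FakeWbError()))
```

With a handler like safe_status, the original exception (here the 403) is no longer hidden behind the TypeError and can be logged as-is.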

Now the error is properly logged:

AWS XXXXXXXXXXXXXXXXXXXX:YHgPFqt1y1Sti7SlunIjrbOQPXM=
2017-06-01 10:29:36,779: [DEBUG]: Final headers: {'Date': 'Thu, 01 Jun 2017 10:29:36 GMT',
'Content-Length': '0', 'Authorization': u'AWS XXXXXXXXXXXXXXXXXXXX:YHgPFqt1y1Sti7SlunIjrbOQPXM=',
'User-Agent': 'Boto/2.47.0 Python/2.7.13 Linux/4.10.0-20-generic'}
2017-06-01 10:29:37,229: [DEBUG]: Response headers: [('date', 'Thu, 01 Jun 2017 10:29:36 GMT'),
('x-amz-id-2', '7hSZwSMYjrG+ptmuFy+lnwDoKOdFSmW7FkldyLC4Fb4Z8F7qvZ8Gg8893SHPt4SmTYzik6RDul8='),
('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', 'F3ECD1302EC0287F'),
('x-amz-bucket-region', 'us-east-1'), ('content-type', 'application/xml')]
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 65, in
handle_methods
...
File "/usr/local/lib/python2.7/site-packages/boto/s3/connection.py", line 539, in head_bucket
raise err
S3ResponseError: S3ResponseError: 403 Forbidden


Of course, I don't know whether this was also the cause in your case. Please try the fix in
pywb by copying the changed wsgi_wrappers.py over to
/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py
in the Docker image/container.

However, the simplest solution is to avoid authentication. Make sure that no access keys
are exposed to boto via environment variables or boto config files.
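
A quick way to check this inside the container is a small script like the one below. It is only a sketch: the helper name find_credential_sources is made up for illustration, and the list of config file locations is an assumption based on where boto 2 is commonly documented to look, so verify it against your boto version.

```python
import os

# Environment variables and config files that may expose AWS credentials
# to boto. The file list is an assumption; adjust it for your setup.
ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
CONFIG_FILES = ("/etc/boto.cfg", "~/.boto", "~/.aws/credentials")


def find_credential_sources(environ=None, config_files=CONFIG_FILES):
    """Return a list of places where credentials appear to be configured."""
    if environ is None:
        environ = os.environ
    # Non-empty environment variables count as exposed credentials.
    found = ["env:" + name for name in ENV_VARS if environ.get(name)]
    # Existing config files are reported without inspecting their contents.
    for path in config_files:
        expanded = os.path.expanduser(path)
        if os.path.isfile(expanded):
            found.append("file:" + expanded)
    return found


if __name__ == "__main__":
    sources = find_credential_sources()
    if sources:
        print("Possible credential sources: " + ", ".join(sources))
    else:
        print("No credential sources found; requests should be anonymous.")
```

An empty result means none of the checked locations would hand keys to boto. A reported config file may still contain no credentials, so inspect it before deleting it.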


Best,
Sebastian


[1] http://docs.aws.amazon.com/AWSECommerceService/latest/DG/HMACSignatures.html
[2] http://boto.cloudhackers.com/en/latest/s3_tut.html
[3] https://github.com/ikreymer/pywb
[4] https://github.com/ikreymer/pywb/issues/219

Dockerfile

Erik Wickstrom

Jun 1, 2017, 12:57:36 PM
to Common Crawl
Hi Sebastian,

I just sent a pull request to the repo with the Dockerfile.  Glad it is helpful!

Removing the AWS environment variables from my container did the trick.

Thanks for your quick help!

Erik