Hi Erik,
First, thanks for the Dockerfile. Is it OK to add it to the project on GitHub
(and also push it upstream to
https://github.com/ikreymer/cc-index-server)?
It's pretty nice for debugging, thanks!
I was able to reproduce your problem (modified Dockerfile attached).
My first attempt succeeded, but when comparing the log output with what you sent,
I noticed that my cdx server does not send any AWS credentials (access key + signature [1]),
which appear in your log (redacted here):
AWS XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYY
'Authorization': u'AWS XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY', 'User-Agent': ...
I've added the following lines to the Dockerfile which make boto use authenticated requests [2]:
ENV AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXXXXXXX
ENV AWS_SECRET_ACCESS_KEY=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
If the access keys are valid, everything succeeds. If they are not (e.g., XXX + YYY), the error
is reproducible. It's a problem in pywb [3] when logging the error. I've opened a ticket and a PR to
fix it [4].
Now the error is properly logged:
AWS XXXXXXXXXXXXXXXXXXXX:YHgPFqt1y1Sti7SlunIjrbOQPXM=
2017-06-01 10:29:36,779: [DEBUG]: Final headers: {'Date': 'Thu, 01 Jun 2017 10:29:36 GMT',
'Content-Length': '0', 'Authorization': u'AWS XXXXXXXXXXXXXXXXXXXX:YHgPFqt1y1Sti7SlunIjrbOQPXM=',
'User-Agent': 'Boto/2.47.0 Python/2.7.13 Linux/4.10.0-20-generic'}
2017-06-01 10:29:37,229: [DEBUG]: Response headers: [('date', 'Thu, 01 Jun 2017 10:29:36 GMT'),
('x-amz-id-2', '7hSZwSMYjrG+ptmuFy+lnwDoKOdFSmW7FkldyLC4Fb4Z8F7qvZ8Gg8893SHPt4SmTYzik6RDul8='),
('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', 'F3ECD1302EC0287F'),
('x-amz-bucket-region', 'us-east-1'), ('content-type', 'application/xml')]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 65, in handle_methods
  ...
  File "/usr/local/lib/python2.7/site-packages/boto/s3/connection.py", line 539, in head_bucket
    raise err
S3ResponseError: S3ResponseError: 403 Forbidden
Of course, I don't know whether this was also the cause in your case. Please try the fix in
pywb by copying the changed wsgi_wrappers.py over to
/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py
in the Docker image/container.
However, the simplest solution is to avoid authentication. Make sure that no access keys
are exposed to boto via environment variables or boto config files.
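
To check the point above, here is a minimal sketch that lists places where boto might pick up credentials. It is only an illustration, not part of cc-index-server or pywb: the function name find_credential_sources is made up, and the file list (/etc/boto.cfg, ~/.boto, ~/.aws/credentials, plus a file named by BOTO_CONFIG) is my assumption about the locations boto 2 reads, so please verify it against the boto version in the image.

```python
import os

# Environment variables from which boto can take credentials.
CREDENTIAL_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def find_credential_sources(environ=None, home=None):
    """Return a list of places where boto might find AWS credentials.

    Hypothetical helper: the file locations below are assumed defaults
    for boto 2 and may differ between versions.
    """
    environ = os.environ if environ is None else environ
    home = os.path.expanduser("~") if home is None else home

    # Credentials set directly in the environment (e.g. via ENV in a Dockerfile).
    found = ["env:" + name for name in CREDENTIAL_ENV_VARS if environ.get(name)]

    # Config files that may contain credentials (assumed locations).
    candidates = [
        "/etc/boto.cfg",
        os.path.join(home, ".boto"),
        os.path.join(home, ".aws", "credentials"),
    ]
    if environ.get("BOTO_CONFIG"):
        candidates.append(environ["BOTO_CONFIG"])

    # Only report files that actually exist; their content is not inspected.
    found.extend("file:" + path for path in candidates if os.path.isfile(path))
    return found


if __name__ == "__main__":
    for source in find_credential_sources():
        print(source)
```

Run inside the container, an empty output suggests that boto has nothing to authenticate with. A listed file only means that the file exists, not that it really contains keys.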
Best,
Sebastian
[1] http://docs.aws.amazon.com/AWSECommerceService/latest/DG/HMACSignatures.html
[2] http://boto.cloudhackers.com/en/latest/s3_tut.html
[3] https://github.com/ikreymer/pywb
[4] https://github.com/ikreymer/pywb/issues/219
> cc-index-server │ AWS XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
> cc-index-server │ 2017-05-31 23:11:13,445: [DEBUG]: Final headers: {'Date': 'Wed, 31 May
> 2017 23:11:13 GMT', 'Content-Length': '0', 'Authorization': u'AWS
> XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY', 'User-Agent': 'Boto/2.47.0 Python/2.7.13
> FROM python:2.7
> RUN apt-get -qq update && apt-get -qqy install awscli
> # Install dependencies
> COPY ./requirements.txt /tmp/requirements.txt
> RUN pip install -r /tmp/requirements.txt
> # Add the cc-index-server code into the image
> COPY ./ /opt/webapp/
> WORKDIR /opt/webapp
> RUN ./install-collections.sh
> CMD ./cdx-server
>
>
> Thanks for your help!
>
> Erik