Hello,
I can confirm that the new server takes only a couple of seconds per query instead of 2 minutes with the old API. However, 5 seconds for an 8 MB request still seems like a long time to me just for querying the URLs. I wanted to set up my own instance according to the tutorial here
https://github.com/ikreymer/cc-index-server.
1) The first problem is with s3cmd itself. It requires the user to sign up for some bullshit Amazon program and to provide credit card information for registration. I found a workaround for this by manually downloading the latest index file from
https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2017-26/indexes/cluster.idx
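The same download can be scripted without s3cmd or an AWS account, since the file is served over plain HTTPS. A minimal sketch, assuming other crawls follow the same URL layout as the one above; `index_url` and `download_cluster_idx` are made-up helper names, not part of pywb or cc-index-server:

```python
import shutil

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7, as used in this post

# Prefix taken from the cluster.idx URL above.
BASE = "https://commoncrawl.s3.amazonaws.com/cc-index/collections"


def index_url(crawl, name="cluster.idx"):
    """Build the public HTTPS URL of an index file for one crawl."""
    return "{0}/{1}/indexes/{2}".format(BASE, crawl, name)


def download_cluster_idx(crawl, dest):
    """Stream cluster.idx for `crawl` to the local path `dest`."""
    response = urlopen(index_url(crawl))
    try:
        with open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
    finally:
        response.close()

# Example call (performs a network request):
# download_cluster_idx("CC-MAIN-2017-26", "cluster.idx")
```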
2) At this point I thought it would work. I installed pywb by issuing the command
pip install pywb
and to make the server work I had to create the following directory structure
cc-index-server/collections/CC-MAIN-2017-26/indexes
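For anyone reproducing this, the layout can be created with a few lines of Python. This is only a sketch; `make_index_dir` is a hypothetical helper name, and the layout is simply the one shown above:

```python
import os


def make_index_dir(root, crawl):
    """Create <root>/collections/<crawl>/indexes and return its path."""
    path = os.path.join(root, "collections", crawl, "indexes")
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

# Example call:
# make_index_dir("cc-index-server", "CC-MAIN-2017-26")
```

The downloaded cluster.idx then goes into the returned directory, which matches the path that shows up in the startup log.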
Then, from cc-index-server, I tried to run cdx-server, which gave the following output
/usr/local/lib/python2.7/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.21.1) or chardet (2.3.0) doesn't match a supported version!
  RequestsDependencyWarning)
2017-07-20 19:14:25,079: [DEBUG]:
2017-07-20 19:14:25,091: [DEBUG]: Adding query_html: query.html
2017-07-20 19:14:25,091: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:14:25,190: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:14:25,191: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:14:25,191: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2017-26/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f2a32735410>
2017-07-20 19:14:25,191: [DEBUG]: Adding CDX API Handler: CC-MAIN-2017-26-index
2017-07-20 19:14:25,192: [DEBUG]: *** pywb app inited with config from "create_cdx_server_app"!
2017-07-20 19:14:25,193: [INFO]: Starting pywb CDX Index Server on port 8080
All seemed fine. Unfortunately, after requesting
http://localhost:8080/CC-MAIN-2017-26-index?url=commoncrawl.org
the following error occurred:
2017-07-20 19:16:17,568: [DEBUG]: Loading 1 blocks from s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz:156436206+152458
2017-07-20 19:16:17,582: [DEBUG]: Retrieving credentials from metadata server.
2017-07-20 19:16:18,595: [ERROR]: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2017-07-20 19:16:18,596: [ERROR]: Unable to read instance data, giving up
2017-07-20 19:16:18,596: [DEBUG]: Retrieving credentials from metadata server.
2017-07-20 19:16:19,599: [ERROR]: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2017-07-20 19:16:19,599: [ERROR]: Unable to read instance data, giving up
2017-07-20 19:16:19,600: [DEBUG]: path=/
2017-07-20 19:16:19,600: [DEBUG]: auth_path=/aws-publicdatasets/
2017-07-20 19:16:19,600: [DEBUG]: Method: HEAD
2017-07-20 19:16:19,600: [DEBUG]: Path: /
2017-07-20 19:16:19,600: [DEBUG]: Data:
2017-07-20 19:16:19,600: [DEBUG]: Headers: {}
2017-07-20 19:16:19,600: [DEBUG]: Host: aws-publicdatasets.s3.amazonaws.com
2017-07-20 19:16:19,600: [DEBUG]: Port: 443
2017-07-20 19:16:19,600: [DEBUG]: Params: {}
2017-07-20 19:16:19,601: [DEBUG]: establishing HTTPS connection: host=aws-publicdatasets.s3.amazonaws.com, kwargs={'port': 443, 'timeout': 70}
2017-07-20 19:16:19,601: [DEBUG]: Token: None
2017-07-20 19:16:19,601: [DEBUG]: Final headers: {'Content-Length': '0', 'User-Agent': 'Boto/2.48.0 Python/2.7.13 Linux/4.10.0-28-generic'}
2017-07-20 19:16:20,101: [DEBUG]: Response headers: [('x-amz-bucket-region', 'us-east-1'), ('x-amz-id-2', 'Fh/ENQ30JsCVQrSWlDwyYWR4oKaF3+KCOHHySqGZrJ+rE4DvmiI/4R1nVI/OYXqIq7p3n/bBKCc='), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', '7A5362E8C681FBCB'), ('date', 'Thu, 20 Jul 2017 17:16:20 GMT'), ('content-type', 'application/xml')]
2017-07-20 19:16:20,101: [DEBUG]: path=//common-crawl/cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:16:20,101: [DEBUG]: auth_path=/aws-publicdatasets//common-crawl/cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:16:20,101: [DEBUG]: Method: HEAD
2017-07-20 19:16:20,101: [DEBUG]: Path: /common-crawl/cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:16:20,102: [DEBUG]: Data:
2017-07-20 19:16:20,102: [DEBUG]: Headers: {}
2017-07-20 19:16:20,102: [DEBUG]: Host: aws-publicdatasets.s3.amazonaws.com
2017-07-20 19:16:20,102: [DEBUG]: Port: 443
2017-07-20 19:16:20,102: [DEBUG]: Params: {}
2017-07-20 19:16:20,102: [DEBUG]: Token: None
2017-07-20 19:16:20,103: [DEBUG]: Final headers: {'Content-Length': '0', 'User-Agent': 'Boto/2.48.0 Python/2.7.13 Linux/4.10.0-28-generic'}
2017-07-20 19:16:20,227: [DEBUG]: Response headers: [('x-amz-id-2', 'j8dmt+Ny87VV5TF8B9CF8CmfB1XKNAHreaL1xwrE/QKjB634cofAPAZQysgawbptIpOA8fjxPgI='), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', 'AAD83B0EF362A91E'), ('date', 'Thu, 20 Jul 2017 17:16:19 GMT'), ('content-type', 'application/xml')]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/wsgi_wrappers.py", line 65, in handle_methods
    response = wb_router(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/archivalrouter.py", line 43, in __call__
    return route.handler(wbrequest)
  File "/usr/local/lib/python2.7/dist-packages/pywb/webapp/cdx_api_handler.py", line 27, in __call__
    cdx_iter = self.index_handler.load_cdx(wbrequest, params)
  File "/usr/local/lib/python2.7/dist-packages/pywb/webapp/query_handler.py", line 103, in load_cdx
    cdx_iter = self.cdx_server.load_cdx(**params)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxserver.py", line 79, in load_cdx
    return self._check_cdx_iter(cdx_iter, query)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxserver.py", line 44, in _check_cdx_iter
    cdx_iter = self.peek_iter(cdx_iter)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxserver.py", line 87, in peek_iter
    first = next(iterable)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxops.py", line 52, in cdx_to_text
    for cdx in cdx_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxops.py", line 131, in <genexpr>
    return (cdx for cdx, _ in zip(cdx_iter, range(limit)))
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxops.py", line 121, in <genexpr>
    return (cls(line) for line in text_iter)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxops.py", line 107, in create_merged_cdx_gen
    for cdx in cdx_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/zipnum.py", line 163, in gen_cdx
    for blk in blocks:
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/zipnum.py", line 299, in idx_to_cdx
    yield self.block_to_cdx_iter(blocks, ranges, query)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/zipnum.py", line 319, in block_to_cdx_iter
    six.reraise(Exception, last_exc, last_traceback)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/zipnum.py", line 312, in block_to_cdx_iter
    return self.load_blocks(location, blocks, ranges, query)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/zipnum.py", line 334, in load_blocks
    reader = self.blk_loader.load(location, blocks.offset, blocks.length)
  File "/usr/local/lib/python2.7/dist-packages/pywb/utils/loaders.py", line 264, in load
    return loader.load(url, offset, length)
  File "/usr/local/lib/python2.7/dist-packages/pywb/utils/loaders.py", line 439, in load
    key.open_read(headers=headers)
AttributeError: 'NoneType' object has no attribute 'open_read'
127.0.0.1 - - [20/Jul/2017 19:16:20] "GET /CC-MAIN-2017-26-index?url=commoncrawl.org HTTP/1.1" 500 79
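For reference, the "Loading 1 blocks" line in the log names the block as file:offset+length (cdx-00250.gz:156436206+152458), and the traceback shows pywb trying to load that byte range through boto from s3://aws-publicdatasets. The sketch below is not what pywb does; it only illustrates fetching the same byte range with an ordinary HTTP Range request. It assumes the cdx-*.gz files sit next to cluster.idx under the public HTTPS prefix used for the manual download, which I have not verified, and `parse_block_ref`, `range_header` and `fetch_block` are hypothetical names:

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2.7

# Assumption: the cdx-*.gz shards live under the same prefix as cluster.idx.
PREFIX = ("https://commoncrawl.s3.amazonaws.com/cc-index/collections/"
          "CC-MAIN-2017-26/indexes/")


def parse_block_ref(ref):
    """Split '<location>:<offset>+<length>' into (location, offset, length)."""
    location, _, span = ref.rpartition(":")
    offset, _, length = span.partition("+")
    return location, int(offset), int(length)


def range_header(offset, length):
    """HTTP Range value for `length` bytes starting at `offset` (end is inclusive)."""
    return "bytes={0}-{1}".format(offset, offset + length - 1)


def fetch_block(ref):
    """Fetch one block over HTTPS and return its raw (still compressed) bytes."""
    location, offset, length = parse_block_ref(ref)
    filename = location.rsplit("/", 1)[-1]  # keep only e.g. 'cdx-00250.gz'
    request = Request(PREFIX + filename,
                      headers={"Range": range_header(offset, length)})
    response = urlopen(request)
    try:
        return response.read()
    finally:
        response.close()

# Example call (performs a network request):
# data = fetch_block("cdx-00250.gz:156436206+152458")
```

If a request like this succeeds while the boto path fails, that would point at the S3 location or credentials lookup rather than at the index file itself.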
Can you tell me how you are running the server?