Upgrade to Common Crawl Index Server

Sebastian Nagel

Jul 20, 2017, 9:43:19 AM
to common...@googlegroups.com
Hi everyone,

it was a task long overdue: move the URL index "index.commoncrawl.org"
to a more powerful machine that is capable of processing the requests of
multiple users in parallel. It's now done. In theory, the new machine
should be faster by a factor of 5. In practice, it's hard to measure:
first, the old index server is under constantly high load, which makes it
respond slowly. Second, response time is mostly bound by S3 response
time because a chunk of the index is fetched from s3://commoncrawl/.

But try it yourself on
http://test-index.commoncrawl.org/
(or, for the 2012 index)
http://test-urlsearch.commoncrawl.org/

Testing is open for the next two weeks. If you detect any issues,
please report them on this group. Thanks!

The old servers
http://index.commoncrawl.org/
http://urlsearch.commoncrawl.org/
will be available for the next two weeks. If the new machine is stable,
we'll switch the DNS records after an evaluation period of at least two
weeks so that all requests are sent to the new machine. We'll announce
the final switch separately on this group.


The new index server also provides a JSON list of all available
indexes:
http://test-index.commoncrawl.org/collinfo.json

The list is sorted from newest to oldest, which makes it easy to find
the name of the latest crawl and index [1]. E.g.,

% wget -O - http://test-index.commoncrawl.org/collinfo.json 2>/dev/null \
| jq --raw-output '.[0]."id"'
CC-MAIN-2017-26
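
For readers without wget and jq at hand, the same lookup can be sketched in Python. This is only an illustration: the function names are made up for this example, and it assumes, as described above, that collinfo.json is a JSON list sorted newest first whose entries carry an "id" field.

```python
import json
from urllib.request import urlopen

# URL taken from the announcement above; it may change after the DNS switch.
COLLINFO_URL = "http://test-index.commoncrawl.org/collinfo.json"


def latest_crawl_id(collections):
    """Return the "id" of the first entry, relying on the newest-first order."""
    if not collections:
        raise ValueError("collinfo.json contained no collections")
    return collections[0]["id"]


def fetch_collections(url=COLLINFO_URL, timeout=30):
    """Download and parse the JSON list of available indexes."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling latest_crawl_id(fetch_collections()) corresponds to the wget/jq pipeline above.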


Best,
Sebastian

brano199

Jul 20, 2017, 1:17:27 PM
to Common Crawl
Hello,

I can confirm that the new server takes only a couple of seconds per query instead of 2 minutes with the old API. However, 5 seconds for an 8 MB request still seems like a long time just for querying URLs, so I wanted to set up my own instance following the tutorial at https://github.com/ikreymer/cc-index-server.

1) The first problem is with s3cmd itself. It requires the user to sign up for some bullshit Amazon program and to provide credit card information for registration. I worked around this by manually downloading the latest index file from
https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2017-26/indexes/cluster.idx

2) At this point I thought it would work. I installed pywb with
pip install pywb
and, to make the server work, I had to create the following directory structure:
cc-index-server/collections/CC-MAIN-2017-26/indexes

Then, from cc-index-server, I ran cdx-server, which printed:
/usr/local/lib/python2.7/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.21.1) or chardet (2.3.0) doesn't match a supported version!
  RequestsDependencyWarning)
2017-07-20 19:14:25,079: [DEBUG]:
2017-07-20 19:14:25,091: [DEBUG]: Adding query_html: query.html
2017-07-20 19:14:25,091: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:14:25,190: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:14:25,191: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:14:25,191: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2017-26/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f2a32735410>
2017-07-20 19:14:25,191: [DEBUG]: Adding CDX API Handler: CC-MAIN-2017-26-index
2017-07-20 19:14:25,192: [DEBUG]: *** pywb app inited with config from "create_cdx_server_app"!

2017-07-20 19:14:25,193: [INFO]: Starting pywb CDX Index Server on port 8080

All seemed fine. Unfortunately, after requesting
http://localhost:8080/CC-MAIN-2017-26-index?url=commoncrawl.org

the following error occurred:
2017-07-20 19:16:17,568: [DEBUG]: Loading 1 blocks from s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz:156436206+152458
2017-07-20 19:16:17,582: [DEBUG]: Retrieving credentials from metadata server.
2017-07-20 19:16:18,595: [ERROR]: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2017-07-20 19:16:18,596: [ERROR]: Unable to read instance data, giving up
2017-07-20 19:16:18,596: [DEBUG]: Retrieving credentials from metadata server.
2017-07-20 19:16:19,599: [ERROR]: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2017-07-20 19:16:19,599: [ERROR]: Unable to read instance data, giving up
2017-07-20 19:16:19,600: [DEBUG]: path=/
2017-07-20 19:16:19,600: [DEBUG]: auth_path=/aws-publicdatasets/
2017-07-20 19:16:19,600: [DEBUG]: Method: HEAD
2017-07-20 19:16:19,600: [DEBUG]: Path: /
2017-07-20 19:16:19,600: [DEBUG]: Data:
2017-07-20 19:16:19,600: [DEBUG]: Headers: {}
2017-07-20 19:16:19,600: [DEBUG]: Host: aws-publicdatasets.s3.amazonaws.com
2017-07-20 19:16:19,600: [DEBUG]: Port: 443
2017-07-20 19:16:19,600: [DEBUG]: Params: {}
2017-07-20 19:16:19,601: [DEBUG]: establishing HTTPS connection: host=aws-publicdatasets.s3.amazonaws.com, kwargs={'port': 443, 'timeout': 70}
2017-07-20 19:16:19,601: [DEBUG]: Token: None
2017-07-20 19:16:19,601: [DEBUG]: Final headers: {'Content-Length': '0', 'User-Agent': 'Boto/2.48.0 Python/2.7.13 Linux/4.10.0-28-generic'}
2017-07-20 19:16:20,101: [DEBUG]: Response headers: [('x-amz-bucket-region', 'us-east-1'), ('x-amz-id-2', 'Fh/ENQ30JsCVQrSWlDwyYWR4oKaF3+KCOHHySqGZrJ+rE4DvmiI/4R1nVI/OYXqIq7p3n/bBKCc='), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', '7A5362E8C681FBCB'), ('date', 'Thu, 20 Jul 2017 17:16:20 GMT'), ('content-type', 'application/xml')]
2017-07-20 19:16:20,101: [DEBUG]: path=//common-crawl/cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:16:20,101: [DEBUG]: auth_path=/aws-publicdatasets//common-crawl/cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:16:20,101: [DEBUG]: Method: HEAD
2017-07-20 19:16:20,101: [DEBUG]: Path: /common-crawl/cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:16:20,102: [DEBUG]: Data:
2017-07-20 19:16:20,102: [DEBUG]: Headers: {}
2017-07-20 19:16:20,102: [DEBUG]: Host: aws-publicdatasets.s3.amazonaws.com
2017-07-20 19:16:20,102: [DEBUG]: Port: 443
2017-07-20 19:16:20,102: [DEBUG]: Params: {}
2017-07-20 19:16:20,102: [DEBUG]: Token: None
2017-07-20 19:16:20,103: [DEBUG]: Final headers: {'Content-Length': '0', 'User-Agent': 'Boto/2.48.0 Python/2.7.13 Linux/4.10.0-28-generic'}
2017-07-20 19:16:20,227: [DEBUG]: Response headers: [('x-amz-id-2', 'j8dmt+Ny87VV5TF8B9CF8CmfB1XKNAHreaL1xwrE/QKjB634cofAPAZQysgawbptIpOA8fjxPgI='), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', 'AAD83B0EF362A91E'), ('date', 'Thu, 20 Jul 2017 17:16:19 GMT'), ('content-type', 'application/xml')]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/wsgi_wrappers.py", line 65, in handle_methods
    response = wb_router(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/archivalrouter.py", line 43, in __call__
    return route.handler(wbrequest)
  File "/usr/local/lib/python2.7/dist-packages/pywb/webapp/cdx_api_handler.py", line 27, in __call__
    cdx_iter = self.index_handler.load_cdx(wbrequest, params)
  File "/usr/local/lib/python2.7/dist-packages/pywb/webapp/query_handler.py", line 103, in load_cdx
    cdx_iter = self.cdx_server.load_cdx(**params)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxserver.py", line 79, in load_cdx
    return self._check_cdx_iter(cdx_iter, query)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxserver.py", line 44, in _check_cdx_iter
    cdx_iter = self.peek_iter(cdx_iter)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxserver.py", line 87, in peek_iter
    first = next(iterable)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxops.py", line 52, in cdx_to_text
    for cdx in cdx_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxops.py", line 131, in <genexpr>
    return (cdx for cdx, _ in zip(cdx_iter, range(limit)))
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxops.py", line 121, in <genexpr>
    return (cls(line) for line in text_iter)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/cdxops.py", line 107, in create_merged_cdx_gen
    for cdx in cdx_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/zipnum.py", line 163, in gen_cdx
    for blk in blocks:
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/zipnum.py", line 299, in idx_to_cdx
    yield self.block_to_cdx_iter(blocks, ranges, query)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/zipnum.py", line 319, in block_to_cdx_iter
    six.reraise(Exception, last_exc, last_traceback)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/zipnum.py", line 312, in block_to_cdx_iter
    return self.load_blocks(location, blocks, ranges, query)
  File "/usr/local/lib/python2.7/dist-packages/pywb/cdx/zipnum.py", line 334, in load_blocks
    reader = self.blk_loader.load(location, blocks.offset, blocks.length)
  File "/usr/local/lib/python2.7/dist-packages/pywb/utils/loaders.py", line 264, in load
    return loader.load(url, offset, length)
  File "/usr/local/lib/python2.7/dist-packages/pywb/utils/loaders.py", line 439, in load
    key.open_read(headers=headers)
AttributeError: 'NoneType' object has no attribute 'open_read'

127.0.0.1 - - [20/Jul/2017 19:16:20] "GET /CC-MAIN-2017-26-index?url=commoncrawl.org HTTP/1.1" 500 79

Can you tell me how you are running the server?
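
A side note on the log line "Loading 1 blocks from ...cdx-00250.gz:156436206+152458": the two numbers read as byte offset and length, matching the loader.load(url, offset, length) call in the traceback. A minimal sketch of fetching such a block over plain HTTPS without boto follows. It assumes that each block is an independent gzip member and that the cdx-*.gz files are publicly readable next to cluster.idx; neither is confirmed in this thread, and the function names are hypothetical.

```python
import gzip
from urllib.request import Request, urlopen


def range_header(offset, length):
    """HTTP Range value for `length` bytes starting at `offset` (end is inclusive)."""
    return "bytes={}-{}".format(offset, offset + length - 1)


def decompress_block(raw):
    """Decompress one gzip-compressed block and return its text lines."""
    return gzip.decompress(raw).decode("utf-8").splitlines()


def fetch_block(url, offset, length, timeout=30):
    """Fetch a single block with a Range request and return its index lines.

    `url` is assumed to point at a publicly readable cdx-NNNNN.gz file.
    """
    request = Request(url, headers={"Range": range_header(offset, length)})
    with urlopen(request, timeout=timeout) as response:
        return decompress_block(response.read())
```

For the block in the log above, range_header(156436206, 152458) yields "bytes=156436206-156588663".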

Sebastian Nagel

Jul 20, 2017, 1:33:44 PM
to common...@googlegroups.com
Hi,

please try the fork
https://github.com/commoncrawl/cc-index-server

A few small but important details have changed.
I'll ping Ilya Kreymer to pull the changes upstream.

Best,
Sebastian


brano199

Jul 20, 2017, 1:55:30 PM
to Common Crawl
We are moving forward, but it is not completely fixed.

1) You should add awscli to requirements.txt; then all the collections are downloaded successfully by install-collections.sh.

2) When I run and query the server, it returns some results. For instance, querying http://localhost:8080/CC-MAIN-2017-26-index?url=commoncrawl.org returns

org,commoncrawl)/ 20170624050557 {"url": "http://commoncrawl.org/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "PT2FUP6YUK56CZJV4SP72DU4TBDWBL6I", "length": "5336", "offset": "81962473", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320226.61/warc/CC-MAIN-20170624050312-20170624070312-00334.warc.gz"}
org,commoncrawl)/ 20170625120558 {"url": "http://commoncrawl.org/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "PT2FUP6YUK56CZJV4SP72DU4TBDWBL6I", "length": "5336", "offset": "79896488", "filename": "crawl-data/CC-MAIN-2017-26/segments/1498128320491.13/warc/CC-MAIN-20170625115717-20170625135717-00334.warc.gz"}
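
Each result line above consists of a SURT key, a 14-digit timestamp, and a JSON object. A minimal sketch of splitting such a line and deriving the byte range of the WARC record follows. The function names are made up for this example, and it assumes that "offset" and "length" describe the record's position inside the file named by "filename".

```python
import json


def parse_cdx_line(line):
    """Split 'surt-key timestamp {json}' into its three parts."""
    surt_key, timestamp, payload = line.split(" ", 2)
    return surt_key, timestamp, json.loads(payload)


def warc_range(record):
    """Return (filename, first_byte, last_byte) for the WARC record."""
    offset = int(record["offset"])
    length = int(record["length"])
    return record["filename"], offset, offset + length - 1
```

The numeric fields arrive as JSON strings, hence the int() conversions.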

However, there are still errors appearing in the error log. I don't even like the first message, which says something about a dependency mismatch. Oh, how I hate Python; that's why I never use it, because of the dependency hell - Python 3 libraries are just not made. OK, let's forget about Python for a while. I have downloaded the cluster.idx files for each of the crawl datasets. Can you summarize for me what the Python server is trying to do? I will try to do the same in some sane language - C++.
I see each line has a funny format:
0,124,148,146)/index.php 20170628134953    cdx-00000.gz    0    183419    1

What do we want to do with it? In your blog you mentioned some B-tree prefix index over this?
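
A rough sketch of how such a line can be used, based on the ZipNum layout as far as it can be inferred from the file and the server log, not on pywb's actual code: the fields look like key (SURT plus timestamp), part file, byte offset, byte length and a sequence number, and each line seems to name the first key of one compressed block. Finding the block where a key would start is then a binary search over the sorted keys. All names below are hypothetical.

```python
from bisect import bisect_left
from collections import namedtuple

IdxEntry = namedtuple("IdxEntry", "key filename offset length seq")


def parse_idx_line(line):
    """Parse one cluster.idx line; the last four fields are whitespace-separated."""
    key, filename, offset, length, seq = line.rstrip("\n").rsplit(None, 4)
    return IdxEntry(key, filename, int(offset), int(length), int(seq))


def find_start_block(entries, search_key):
    """Return the entry of the block in which `search_key` would first appear.

    `entries` must be sorted by key, as they are in cluster.idx. The block
    whose first key is the last one below `search_key` is chosen, because
    matching lines may begin in the middle of that block.
    """
    keys = [entry.key for entry in entries]
    position = bisect_left(keys, search_key) - 1
    return entries[max(position, 0)]
```

The sketch returns only the starting block. A full lookup would also read the following blocks for as long as their lines still match the query, and then scan the decompressed lines themselves.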

/usr/local/lib/python2.7/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.21.1) or chardet (2.3.0) doesn't match a supported version!
  RequestsDependencyWarning)
2017-07-20 19:51:12,829: [DEBUG]:
2017-07-20 19:51:12,860: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:12,861: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:12,960: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:12,960: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:12,961: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2017-17/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a5fc90>
2017-07-20 19:51:12,961: [DEBUG]: Adding CDX API Handler: CC-MAIN-2017-17-index
2017-07-20 19:51:12,962: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:12,962: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:13,054: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:13,054: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:13,055: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2017-13/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e44b8d90>
2017-07-20 19:51:13,055: [DEBUG]: Adding CDX API Handler: CC-MAIN-2017-13-index
2017-07-20 19:51:13,055: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:13,055: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:13,153: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:13,153: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:13,153: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2014-42/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a4ca10>
2017-07-20 19:51:13,154: [DEBUG]: Adding CDX API Handler: CC-MAIN-2014-42-index
2017-07-20 19:51:13,154: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:13,154: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:13,253: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:13,253: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:13,254: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2014-41/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a813d0>
2017-07-20 19:51:13,254: [DEBUG]: Adding CDX API Handler: CC-MAIN-2014-41-index
2017-07-20 19:51:13,254: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:13,254: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:13,368: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:13,368: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:13,368: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2015-14/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7ab2910>
2017-07-20 19:51:13,368: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-14-index
2017-07-20 19:51:13,368: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:13,369: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:13,460: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:13,460: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:13,460: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2015-18/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a43e10>
2017-07-20 19:51:13,460: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-18-index
2017-07-20 19:51:13,461: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:13,461: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:13,552: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:13,552: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:13,552: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2014-49/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e448a590>
2017-07-20 19:51:13,553: [DEBUG]: Adding CDX API Handler: CC-MAIN-2014-49-index
2017-07-20 19:51:13,553: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:13,553: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:13,645: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:13,645: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:13,645: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2013-20/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e44b6310>
2017-07-20 19:51:13,645: [DEBUG]: Adding CDX API Handler: CC-MAIN-2013-20-index
2017-07-20 19:51:13,646: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:13,646: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:13,736: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:13,737: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:13,737: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2013-48/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e4482b10>
2017-07-20 19:51:13,737: [DEBUG]: Adding CDX API Handler: CC-MAIN-2013-48-index
2017-07-20 19:51:13,737: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:13,737: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:13,829: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:13,829: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:13,829: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2016-22/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a702d0>
2017-07-20 19:51:13,829: [DEBUG]: Adding CDX API Handler: CC-MAIN-2016-22-index
2017-07-20 19:51:13,829: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:13,829: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:13,920: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:13,920: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:13,921: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2016-26/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a2fc10>
2017-07-20 19:51:13,921: [DEBUG]: Adding CDX API Handler: CC-MAIN-2016-26-index
2017-07-20 19:51:13,921: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:13,921: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:14,013: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:14,013: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:14,013: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2015-35/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a63350>
2017-07-20 19:51:14,013: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-35-index
2017-07-20 19:51:14,014: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:14,014: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:14,105: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:14,105: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:14,105: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2016-07/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e448d090>
2017-07-20 19:51:14,105: [DEBUG]: Adding CDX API Handler: CC-MAIN-2016-07-index
2017-07-20 19:51:14,105: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:14,105: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:14,197: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:14,197: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:14,197: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2016-44/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e44b3a50>
2017-07-20 19:51:14,197: [DEBUG]: Adding CDX API Handler: CC-MAIN-2016-44-index
2017-07-20 19:51:14,198: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:14,198: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:14,288: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:14,288: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:14,289: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2016-40/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a68290>
2017-07-20 19:51:14,289: [DEBUG]: Adding CDX API Handler: CC-MAIN-2016-40-index
2017-07-20 19:51:14,289: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:14,289: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:14,381: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:14,381: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:14,381: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2017-04/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a1b690>
2017-07-20 19:51:14,381: [DEBUG]: Adding CDX API Handler: CC-MAIN-2017-04-index
2017-07-20 19:51:14,382: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:14,382: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:14,474: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:14,474: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:14,474: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2017-09/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e449c150>
2017-07-20 19:51:14,475: [DEBUG]: Adding CDX API Handler: CC-MAIN-2017-09-index
2017-07-20 19:51:14,475: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:14,475: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:14,572: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:14,572: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:14,573: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2014-10/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a32b90>
2017-07-20 19:51:14,573: [DEBUG]: Adding CDX API Handler: CC-MAIN-2014-10-index
2017-07-20 19:51:14,573: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:14,574: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:14,673: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:14,673: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:14,674: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2017-22/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e44a46d0>
2017-07-20 19:51:14,674: [DEBUG]: Adding CDX API Handler: CC-MAIN-2017-22-index
2017-07-20 19:51:14,674: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:14,674: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:14,777: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:14,778: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:14,778: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2017-26/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e44b1b90>
2017-07-20 19:51:14,778: [DEBUG]: Adding CDX API Handler: CC-MAIN-2017-26-index
2017-07-20 19:51:14,779: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:14,779: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:14,876: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:14,876: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:14,876: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2014-23/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e44a3d90>
2017-07-20 19:51:14,876: [DEBUG]: Adding CDX API Handler: CC-MAIN-2014-23-index
2017-07-20 19:51:14,877: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:14,877: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:14,969: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:14,969: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:14,969: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2015-06/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a78e90>
2017-07-20 19:51:14,969: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-06-index
2017-07-20 19:51:14,970: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:14,970: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:15,061: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:15,061: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:15,061: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2014-52/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a84290>
2017-07-20 19:51:15,061: [DEBUG]: Adding CDX API Handler: CC-MAIN-2014-52-index
2017-07-20 19:51:15,062: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:15,062: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:15,152: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:15,152: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:15,152: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2015-22/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e446a310>
2017-07-20 19:51:15,153: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-22-index
2017-07-20 19:51:15,153: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:15,153: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:15,244: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:15,244: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:15,244: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2015-27/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e4469e50>
2017-07-20 19:51:15,245: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-27-index
2017-07-20 19:51:15,245: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:15,245: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:15,336: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:15,337: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:15,337: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2015-40/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a82750>
2017-07-20 19:51:15,337: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-40-index
2017-07-20 19:51:15,337: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:15,337: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:15,448: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:15,448: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:15,448: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2014-35/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a3e090>
2017-07-20 19:51:15,448: [DEBUG]: Adding CDX API Handler: CC-MAIN-2014-35-index
2017-07-20 19:51:15,449: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:15,449: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:15,541: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:15,541: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:15,542: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2014-15/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e447ccd0>
2017-07-20 19:51:15,542: [DEBUG]: Adding CDX API Handler: CC-MAIN-2014-15-index
2017-07-20 19:51:15,542: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:15,542: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:15,634: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:15,635: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:15,635: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2015-48/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e4479e50>
2017-07-20 19:51:15,635: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-48-index
2017-07-20 19:51:15,635: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:15,635: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:15,727: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:15,727: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:15,727: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2015-32/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a64f50>
2017-07-20 19:51:15,727: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-32-index
2017-07-20 19:51:15,728: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:15,728: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:15,819: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:15,819: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:15,819: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2016-36/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e4411950>
2017-07-20 19:51:15,819: [DEBUG]: Adding CDX API Handler: CC-MAIN-2016-36-index
2017-07-20 19:51:15,819: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:15,820: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:15,911: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:15,911: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:15,911: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2016-30/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e4488110>
2017-07-20 19:51:15,911: [DEBUG]: Adding CDX API Handler: CC-MAIN-2016-30-index
2017-07-20 19:51:15,912: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:15,912: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:16,003: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:16,003: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:16,004: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2015-11/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e7a68c10>
2017-07-20 19:51:16,004: [DEBUG]: Adding CDX API Handler: CC-MAIN-2015-11-index
2017-07-20 19:51:16,004: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:16,004: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:16,095: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:16,096: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:16,096: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2016-18/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e4485bd0>
2017-07-20 19:51:16,096: [DEBUG]: Adding CDX API Handler: CC-MAIN-2016-18-index
2017-07-20 19:51:16,096: [DEBUG]: Adding query_html: query.html
2017-07-20 19:51:16,096: [DEBUG]: CDX Surt-Ordered? True
2017-07-20 19:51:16,187: [DEBUG]: CustomCanonilizer? True
2017-07-20 19:51:16,187: [DEBUG]: FuzzyMatcher? True
2017-07-20 19:51:16,187: [DEBUG]: Adding CDX Source: ZipNum Cluster: /home/doma/Documents/cc-index-server/collections/CC-MAIN-2016-50/indexes/cluster.idx, <pywb.cdx.zipnum.LocPrefixResolver object at 0x7f94e4410e90>
2017-07-20 19:51:16,188: [DEBUG]: Adding CDX API Handler: CC-MAIN-2016-50-index
2017-07-20 19:51:16,188: [DEBUG]: *** pywb app inited with config from "create_cdx_server_app"!

2017-07-20 19:51:16,189: [INFO]: Starting pywb CDX Index Server on port 8080
2017-07-20 19:51:18,851: [DEBUG]: Loading 1 blocks from s3://commoncrawl/cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz:156436206+152458
2017-07-20 19:51:18,867: [DEBUG]: Retrieving credentials from metadata server.
2017-07-20 19:51:19,869: [ERROR]: Caught exception reading instance data

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2017-07-20 19:51:19,870: [ERROR]: Unable to read instance data, giving up
2017-07-20 19:51:19,871: [DEBUG]: Retrieving credentials from metadata server.
2017-07-20 19:51:20,873: [ERROR]: Caught exception reading instance data

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2017-07-20 19:51:20,873: [ERROR]: Unable to read instance data, giving up
2017-07-20 19:51:20,873: [DEBUG]: path=/
2017-07-20 19:51:20,873: [DEBUG]: auth_path=/commoncrawl/
2017-07-20 19:51:20,874: [DEBUG]: Method: HEAD
2017-07-20 19:51:20,874: [DEBUG]: Path: /
2017-07-20 19:51:20,874: [DEBUG]: Data:
2017-07-20 19:51:20,874: [DEBUG]: Headers: {}
2017-07-20 19:51:20,874: [DEBUG]: Host: commoncrawl.s3.amazonaws.com
2017-07-20 19:51:20,874: [DEBUG]: Port: 443
2017-07-20 19:51:20,875: [DEBUG]: Params: {}
2017-07-20 19:51:20,875: [DEBUG]: establishing HTTPS connection: host=commoncrawl.s3.amazonaws.com, kwargs={'port': 443, 'timeout': 70}
2017-07-20 19:51:20,875: [DEBUG]: Token: None
2017-07-20 19:51:20,876: [DEBUG]: Final headers: {'Content-Length': '0', 'User-Agent': 'Boto/2.48.0 Python/2.7.13 Linux/4.10.0-28-generic'}
2017-07-20 19:51:21,398: [DEBUG]: Response headers: [('x-amz-bucket-region', 'us-east-1'), ('x-amz-id-2', 'Dv/UXKw7KYZ4KUUFtc9H9dXdwggmkBdNj/QBSoaZFmEk+lDjQKDslNvmGDkD/zztBQ8NXHczJZA='), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', 'F7857E5F7035D7F0'), ('date', 'Thu, 20 Jul 2017 17:51:22 GMT'), ('content-type', 'application/xml')]
2017-07-20 19:51:21,398: [DEBUG]: path=//cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:51:21,398: [DEBUG]: auth_path=/commoncrawl//cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:51:21,399: [DEBUG]: Method: HEAD
2017-07-20 19:51:21,399: [DEBUG]: Path: /cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:51:21,399: [DEBUG]: Data:
2017-07-20 19:51:21,399: [DEBUG]: Headers: {}
2017-07-20 19:51:21,399: [DEBUG]: Host: commoncrawl.s3.amazonaws.com
2017-07-20 19:51:21,399: [DEBUG]: Port: 443
2017-07-20 19:51:21,399: [DEBUG]: Params: {}
2017-07-20 19:51:21,399: [DEBUG]: Token: None
2017-07-20 19:51:21,400: [DEBUG]: Final headers: {'Content-Length': '0', 'User-Agent': 'Boto/2.48.0 Python/2.7.13 Linux/4.10.0-28-generic'}
2017-07-20 19:51:21,535: [DEBUG]: Response headers: [('content-length', '829067432'), ('x-amz-id-2', 'ggiqZwCRLKEJNnT7QC4Fc0SydMeeumj4rLD2zXFaJPDuN24b4fPglmhdCA9rMg0mRIzvYdGk1ew='), ('accept-ranges', 'bytes'), ('server', 'AmazonS3'), ('last-modified', 'Thu, 29 Jun 2017 22:25:09 GMT'), ('x-amz-request-id', '1D50E91C1B3717FF'), ('etag', '"6a53ecc7d298b047806f13fd3732136e"'), ('date', 'Thu, 20 Jul 2017 17:51:22 GMT'), ('content-type', 'application/octet-stream')]
2017-07-20 19:51:21,535: [DEBUG]: path=//cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:51:21,535: [DEBUG]: auth_path=/commoncrawl//cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:51:21,535: [DEBUG]: Method: GET
2017-07-20 19:51:21,535: [DEBUG]: Path: /cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz
2017-07-20 19:51:21,535: [DEBUG]: Data:
2017-07-20 19:51:21,536: [DEBUG]: Headers: {'Range': 'bytes=156436206-156588663'}
2017-07-20 19:51:21,536: [DEBUG]: Host: commoncrawl.s3.amazonaws.com
2017-07-20 19:51:21,536: [DEBUG]: Port: 443
2017-07-20 19:51:21,536: [DEBUG]: Params: {}
2017-07-20 19:51:21,536: [DEBUG]: Token: None
2017-07-20 19:51:21,536: [DEBUG]: Final headers: {'Range': 'bytes=156436206-156588663', 'Content-Length': '0', 'User-Agent': 'Boto/2.48.0 Python/2.7.13 Linux/4.10.0-28-generic'}
2017-07-20 19:51:21,701: [DEBUG]: Response headers: [('content-length', '152458'), ('x-amz-id-2', 'rK9YH1gqh8eUB9EwbqijgxsAzdo+Yhk+vkF2B6kyutk1+di5GOG+yN/q+PCCm8e5GqrZCB/e75s='), ('accept-ranges', 'bytes'), ('server', 'AmazonS3'), ('last-modified', 'Thu, 29 Jun 2017 22:25:09 GMT'), ('content-range', 'bytes 156436206-156588663/829067432'), ('x-amz-request-id', 'DB8B705051D6E95B'), ('etag', '"6a53ecc7d298b047806f13fd3732136e"'), ('date', 'Thu, 20 Jul 2017 17:51:22 GMT'), ('content-type', 'application/octet-stream')]
127.0.0.1 - - [20/Jul/2017 19:51:22] "GET /CC-MAIN-2017-26-index?url=commoncrawl.org HTTP/1.1" 200 694





On Thursday, July 20, 2017 at 3:43:19 PM UTC+2, Sebastian Nagel wrote:

brano199

Jul 20, 2017, 2:21:34 PM
to Common Crawl
It is not related to the dependencies. I tried updating those libraries; the warning disappeared, but the error still persists.

Sebastian Nagel

Jul 20, 2017, 3:55:33 PM
to common...@googlegroups.com
Hi Brano,

just to make it clear: you don't have to pay to access the data from outside the AWS cloud,
it's also not required to own an AWS account. Boto may check for AWS credentials but everything
should work without them. Alternatively, it's possible to access the data via https://, see
cc-index-server/config.yaml.
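
To illustrate the https:// route: the log line `Loading 1 blocks from ...cdx-00250.gz:156436206+152458` and the later `Range: bytes=156436206-156588663` header describe the same block. Below is a minimal sketch of that translation and of an anonymous ranged fetch using only the Python 3 standard library. The function names are hypothetical, the URL form is the one quoted in this thread, and the assumption is that each block is an independent gzip member.

```python
import gzip
import urllib.request


def range_header(offset, length):
    """Turn an 'offset+length' block reference into an HTTP Range value."""
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return "bytes={}-{}".format(offset, offset + length - 1)


def fetch_block(url, offset, length, timeout=70):
    """Fetch one compressed index block over HTTPS, without AWS credentials.

    Assumes the block is a self-contained gzip member (hypothetical helper,
    not part of pywb or cc-index-server).
    """
    request = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return gzip.decompress(response.read()).decode("utf-8")
```

Calling `fetch_block("https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00250.gz", 156436206, 152458)` would request the same byte range as the GET in the log, with no credential lookup involved.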

You also don't have to pay when accessing the data from within the AWS us-east-1 (N. Virginia)
region. But be careful regarding your network setup, esp. when using Elastic IPs or Elastic Load
Balancer, which may cause data transfer costs. If the data is accessed from another AWS region
you have to pay for data transfer, afaik.

Afaics, the errors in the log output are just timeouts from Boto's credential lookup; the requests that follow then succeed.

My assumption, however, is that you can hardly beat the new URL index server using a machine
that runs outside the AWS us-east-1 region. The data transfer over the internet will be the
limiting factor. To be significantly faster, run the index server as close as possible to
the index data (that is, launch a small EC2 instance in the us-east-1 region). Alternatively,
for bulk processing, the index files (about 250 GB per monthly crawl) can be accessed directly.
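
For the bulk route, a downloaded cdx-*.gz file can be read line by line. The sketch below assumes each line consists of a SURT key, a timestamp and a JSON object separated by single spaces; the sample line is made up for illustration and the function names are hypothetical.

```python
import gzip
import json


def parse_cdx_line(line):
    """Split one index line into (SURT key, timestamp, JSON fields).

    Assumes the layout 'surt-key timestamp {json}'.
    """
    surt_key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return surt_key, timestamp, json.loads(payload)


def iter_cdx_file(path):
    """Yield parsed records from a locally downloaded cdx-*.gz file."""
    # gzip.open reads files made of several concatenated gzip members.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield parse_cdx_line(line)


# Illustrative record only, not copied from a real index file.
SAMPLE_LINE = (
    'org,commoncrawl)/ 20170620000000 '
    '{"url": "http://commoncrawl.org/", "status": "200"}'
)
```

Reading the file sequentially avoids one round trip per lookup, which is where the remote index server spends most of its time.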

Best,
Sebastian

On 07/20/2017 09:16 PM, brano199 wrote:
> I know what the problem is. I have looked into the Boto library, which pywb uses. It is checking
> for AWS credentials, which means you have to pay. But it's strange, because you fixed the
> install-collections script to not use a key and just download directly.
>
> On Thursday, July 20, 2017 at 8:21:34 PM UTC+2, brano199 wrote:
>
> It is not related to the dependencies. I tried updating those libraries; the warning
> disappeared, but the error still persists.
>

brano199

Jul 20, 2017, 6:06:19 PM
to Common Crawl
Your fix worked, and you are right, Sebastian. Running the index service myself is actually slower than the new URL index server, so I guess I will use your server, as long as it does not get overwhelmed. You did a great job here :)

I was just concerned about the speed yesterday and this morning, when one request literally took 2 minutes; now it's just a matter of seconds.
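
For reference, a query against the hosted server can be assembled as sketched below. The path pattern follows the access log line earlier in the thread (`/CC-MAIN-2017-26-index?url=commoncrawl.org`); the helper name is hypothetical, and `output=json` is an assumption about the pywb CDX API that should be checked against the running server.

```python
from urllib.parse import urlencode


def build_query_url(server, collection, url, **params):
    """Build a CDX query URL of the form <server>/<collection>-index?url=..."""
    query = {"url": url}
    query.update(params)  # extra parameters, e.g. output="json" (assumed)
    return "{}/{}-index?{}".format(server.rstrip("/"), collection, urlencode(query))


EXAMPLE_URL = build_query_url(
    "http://test-index.commoncrawl.org",
    "CC-MAIN-2017-26",
    "commoncrawl.org",
    output="json",
)
```

The resulting string can be passed to wget or `urllib.request.urlopen`.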


"Alternatively, for bulk processing, the index files (about 250 GB per monthly crawl) can be accessed directly. "
Yes, downloading the right index file directly indeed helps too, e.g.
https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2017-26/indexes/cdx-00000.gz
I got it working by changing config.yaml to
archive_paths: ./
shard_index_loc:
     match: '.*(collections/[^/]+/)'
     replace: './'
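
To see what this match/replace pair does to a path, here is a small sketch using Python's `re.sub`. It only demonstrates the regular expression; how pywb applies it internally is an assumption here, and the second replacement string is chosen to reproduce the S3 path seen in the log, not necessarily the value shipped in the repository.

```python
import re

# Pattern copied from the config.yaml fragment above.
MATCH = r".*(collections/[^/]+/)"


def rewrite_location(path, replace):
    """Apply the match/replace pair once, as a plain regex substitution."""
    return re.sub(MATCH, replace, path, count=1)


LOCAL_PATH = (
    "/home/doma/Documents/cc-index-server/"
    "collections/CC-MAIN-2017-26/indexes/cdx-00250.gz"
)

# With './' everything up to and including the collection directory is dropped.
local_result = rewrite_location(LOCAL_PATH, "./")

# With a back-reference the collection directory is kept under a new prefix.
remote_result = rewrite_location(LOCAL_PATH, r"s3://commoncrawl/cc-index/\1")
```

Here `local_result` is `./indexes/cdx-00250.gz`, so the shard is looked up relative to the working directory instead of on S3.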

"The new index server also allows to get a JSON list of all available
indexes:
  http://test-index.commoncrawl.org/collinfo.json"

Also a great feature; I was doing this manually by parsing the contents of the http://index.commoncrawl.org/ web page.
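
A Python equivalent of the wget/jq one-liner from the announcement could look like the sketch below. It relies on the list being sorted from new to old, as stated there; the function names are hypothetical and the sample list is abbreviated for illustration (the real entries may carry more fields than "id").

```python
import json
import urllib.request


def latest_index_id(collinfo):
    """Return the id of the newest index from a parsed collinfo.json list."""
    # The announcement states the list is sorted from new to old.
    return collinfo[0]["id"]


def fetch_collinfo(url="http://test-index.commoncrawl.org/collinfo.json"):
    """Download and parse collinfo.json (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Abbreviated, illustrative sample of the list structure.
SAMPLE = json.loads('[{"id": "CC-MAIN-2017-26"}, {"id": "CC-MAIN-2017-22"}]')
```

`latest_index_id(fetch_collinfo())` then replaces the manual parsing of the index page.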
