Get your sh*t together folks.


Squidblacklist org

Jun 21, 2017, 5:05:11 AM
to Common Crawl
Test your stuff. Fix your stuff. This is the most recent version of Debian, so what's the problem?

The example *.io query downloads just fine, but nothing else works, not a damn thing.

./cdx-index-client.py -c CC-MAIN-2017-17 *.com --fl url -z
2017-06-21 04:05:18,434: [INFO]: Getting Index From http://index.commoncrawl.org/CC-MAIN-2017-17-index
2017-06-21 04:05:18,442: [INFO]: Starting new HTTP connection (1): index.commoncrawl.org
Traceback (most recent call last):
  File "./cdx-index-client.py", line 403, in <module>
    main()
  File "./cdx-index-client.py", line 399, in main
    read_index(r)
  File "./cdx-index-client.py", line 314, in read_index
    num_pages = get_num_pages(api_url, r.url, r.page_size)
  File "./cdx-index-client.py", line 44, in get_num_pages
    pages_info = r.json()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 793, in json
    return json.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded


Sebastian Nagel

Jun 21, 2017, 8:22:09 AM
to Common Crawl

Hi,

what's the problem?

Please, read the docs first:

  It is often a good idea to check how big the dataset is:

     ./cdx-index-client.py -c CC-MAIN-2015-06 '*.io' --show-num-pages

  will print the number of pages that will be fetched to get a list of URLs in the '*.io' domain.
  This gives a rough sense of the size of the query. A query with thousands of pages may take a long time!

and think first!

A query for *.com will fetch more than 50% of the index, i.e. more than 1.5 billion records or 100 GB for CC-MAIN-2017-17.
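
You can check this yourself before starting any download. A rough sketch in Python, assuming the pywb-style CDX API that the index server runs (i.e. the 'showNumPages' parameter and a JSON response with a 'pages' field; double-check both against the server):

import requests

INDEX = 'http://index.commoncrawl.org/CC-MAIN-2017-17-index'

def num_pages(url_pattern):
    # Ask the CDX server how many result pages the query would span,
    # without fetching any of them.
    r = requests.get(INDEX, params={'url': url_pattern,
                                    'output': 'json',
                                    'showNumPages': 'true'})
    r.raise_for_status()
    return r.json()['pages']

print(num_pages('*.io'))   # small: a handful of pages
print(num_pages('*.com'))  # huge: thousands of pages, do not fetch them all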

It's faster to download the index files and do a grep locally, see
   https://groups.google.com/d/topic/common-crawl/KhtAjdcqUOc/discussion
   https://groups.google.com/d/msg/common-crawl/MnR7zxPDrt4/GtuZze9qGAAJ 
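
For example, a minimal sketch of the local approach in Python 3 (the shard URL follows the layout listed in the crawl's cc-index.paths.gz file; verify the exact paths there, and note that a crawl has a few hundred shards of several hundred MB each):

import gzip
import json
import shutil
import urllib.request

# One of the gzipped, SURT-sorted CDX shards of the CC-MAIN-2017-17 index.
SHARD_URL = ('https://commoncrawl.s3.amazonaws.com/cc-index/collections/'
             'CC-MAIN-2017-17/indexes/cdx-00000.gz')

# Download the shard once ...
with urllib.request.urlopen(SHARD_URL) as resp, open('cdx-00000.gz', 'wb') as out:
    shutil.copyfileobj(resp, out)

# ... then "grep" locally: because the keys are SURT-ordered,
# every *.com record starts with the prefix 'com,'.
with gzip.open('cdx-00000.gz', 'rt') as f:
    for line in f:
        if line.startswith('com,'):
            key, timestamp, payload = line.split(' ', 2)
            print(json.loads(payload)['url'])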

Grepping locally would also reduce the load on the server, which was probably the cause of your problem
(no result returned in time, or an HTML error page received from the index server instead of JSON).

I'm sure Ilya is happy to accept patches to improve the error handling; just file a bug report
and open a pull request at
   https://github.com/ikreymer/cdx-index-client
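
As an illustration only (an untested sketch, not the client's actual code), the get_num_pages that failed above could verify the response before parsing it:

import requests

def get_num_pages(api_url, url, page_size=None):
    # Hypothetical replacement for the function in the traceback above.
    params = {'url': url, 'output': 'json', 'showNumPages': 'true'}
    if page_size:
        params['pageSize'] = page_size
    r = requests.get(api_url, params=params, timeout=60)
    if r.status_code != 200:
        raise RuntimeError('index server returned HTTP %d (overloaded?): %s'
                           % (r.status_code, r.text[:200]))
    try:
        return r.json()['pages']
    except ValueError:
        # The server sent something other than JSON, e.g. an HTML error page.
        raise RuntimeError('unparseable response from index server: %r'
                           % r.text[:200])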

Please note that, as a non-profit with a limited budget and limited human resources, we cannot guarantee
that services such as the index server scale without limit and run with zero downtime.
We know that the index server needs more powerful hardware; the upgrade is already planned.
However, we expect users to behave cooperatively and try not to overload these resources.

Best and thanks,
Sebastian

Tom Morris

Jun 21, 2017, 2:15:28 PM
to common...@googlegroups.com
I thought anonymous trolls were to be ignored...

