Hi Nigel,
> Your assessment was correct. We were down to round trips of up to 13 secs/request.
Yeah, the index server is heavily loaded. I hope to get it moved to a more powerful
EC2 instance this spring.
You want to fetch a web page (WARC record) using the filename, offset, and length given in the cdx file?
It's possible without the index server, by accessing AWS S3 directly, see
https://groups.google.com/d/msg/common-crawl/8vnQnUA-0-0/TAb1LeNWFgAJ
Curl could also be used instead of the AWS CLI:
curl -s -r 375611148-$((375611148+17240-1)) \
  "https://commoncrawl.s3.amazonaws.com/crawl-data/..." | gzip -dc
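For scripting, the same range arithmetic can be sketched in a few lines of Python (the helper names `warc_byte_range` and `decompress_record` are hypothetical, not part of any Common Crawl tooling; offset and length would come from your cdx line):

```python
import gzip

def warc_byte_range(offset, length):
    # The HTTP Range header is inclusive on both ends, hence the -1
    # on the last byte -- same arithmetic as the curl example above.
    return "bytes=%d-%d" % (offset, offset + length - 1)

def decompress_record(raw):
    # Each WARC record in the fetched slice is an independent gzip
    # member, so the raw bytes can be decompressed on their own.
    return gzip.decompress(raw)

# Values from the cdx line used in the curl example:
print(warc_byte_range(375611148, 17240))  # -> bytes=375611148-375628387
```

The resulting string can be passed as a `Range:` request header to the S3 URL with any HTTP client.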
> We are considering obtaining
> permission to donate these to cc.
Great!
> If we can demonstrate we can cross reference to our own
> data we may have a concept.
Let us know. And if you have a good idea for how to simplify and speed up the task of sub-sampling
CC data, let us know as well. It's a frequent but non-trivial problem.
Best,
Sebastian
On 01/28/2017 08:05 PM, Nigel Vickers wrote:
>
>
> On Monday, 16 January 2017 18:24:03 UTC+1, Sebastian Nagel wrote:
> Hello Sebastian,
> Thanks for your suggestions.
>
>
> yes, that's ok. However, the server index.commoncrawl.org is