Failing to download .wet.gz files


Soham Tripathy

Oct 9, 2023, 11:11:39 PM
to Common Crawl
I am trying to reproduce a dataset as described in microsoft/biosbias: Code to reproduce data for Bias in Bios (https://github.com/microsoft/biosbias).

In download_bios.py the link to download is: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-13/wet.paths.gz
Why is it not giving the correct output?
Kindly help!
The authors of the paper seem to have generated the dataset using this very link.
Thanks

Greg Lindahl

Oct 11, 2023, 11:40:57 PM
to common...@googlegroups.com
Please see:

https://commoncrawl.org/blog/introducing-cloudfront-access-to-common-crawl-data

Also noted in https://github.com/microsoft/biosbias/issues/4 -- the fix
given there is correct.
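
For anyone hitting the same script, a minimal Python sketch of fetching that listing over HTTPS from data.commoncrawl.org (illustration only, not the exact patch from issue #4):

import gzip
import io

import requests

# HTTPS endpoint from the CloudFront announcement; the old
# https://commoncrawl.s3.amazonaws.com/... URL is what the script was failing on.
PATHS_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-13/wet.paths.gz"

resp = requests.get(PATHS_URL, timeout=60)
resp.raise_for_status()

with gzip.open(io.BytesIO(resp.content), mode="rt") as f:
    wet_paths = [line.strip() for line in f]

print(len(wet_paths), "WET files listed")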

-- greg


Soham Tripathy

Oct 12, 2023, 8:37:43 AM
to Common Crawl
Yes, the issue is resolved. Also, I wanted to know if there is a limit to the number of requests made to the Common Crawl website per IP address?

Greg Lindahl

Oct 13, 2023, 6:35:17 PM
to common...@googlegroups.com
Common Crawl's AWS bucket tends to be overloaded sometimes because it
is overly popular. I'm in the middle of trying to analyze and make
access more fair, but one of the most concrete pieces of advice I can
give everyone is to limit the number of IPs you use and the number of
requests per IP to something reasonable.

And at the very least, please slow down if you are getting SlowDown
(HTTP 503) responses.
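
To make that concrete, a rough sketch in Python of pacing requests and backing off on 503s might look like this (not an official client; the numbers are arbitrary placeholders):

import random
import time

import requests

MIN_GAP = 1.0  # minimum seconds between requests from this IP -- pick something modest

_last_request = 0.0

def polite_get(url, max_attempts=10):
    """Fetch a Common Crawl URL, pacing requests and backing off on 503 SlowDown."""
    global _last_request
    for attempt in range(max_attempts):
        # Keep the per-IP request rate low.
        wait = MIN_GAP - (time.monotonic() - _last_request)
        if wait > 0:
            time.sleep(wait)
        _last_request = time.monotonic()

        resp = requests.get(url, timeout=60)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.content

        # Got a SlowDown: back off exponentially, with jitter, capped at 5 minutes.
        time.sleep(min(2 ** attempt + random.uniform(0, 1), 300))
    raise RuntimeError(f"still throttled after {max_attempts} attempts: {url}")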

-- greg

Soham Tripathy

Oct 15, 2023, 3:21:42 AM
to common...@googlegroups.com
Thanks a lot!

Lawrence Stewart

Oct 18, 2023, 11:54:26 AM
to Common Crawl
I'm getting the "An error occurred (SlowDown) when calling the GetObject operation (reached max retries: 4): Please reduce your request rate" error on the first download request. 

Is cloudfront now the recommended way to download? I remember downloads used to be much slower than s3. 

Lawrence Stewart

Oct 18, 2023, 11:30:54 PM
to Common Crawl
I'd like to add that the blocking/throttling on S3 seems to be quite aggressive. I'm experiencing a failing node at my ISP that's causing a lot of packet loss and interrupting the S3 downloads. As a result I'm getting throttled even with boto3's built-in retries, and I can't download a single WARC from S3 without getting throttled.

I guess the popularity has exploded with LLMs over the past few months.

Is there anything we can do to get AWS to increase the resource allocation or bandwidth?

Henry S. Thompson

Oct 19, 2023, 3:33:22 AM
to common...@googlegroups.com
Lawrence Stewart writes:

> I'd like to add that the blocking/throttling on S3 seems to be quite
> aggressive.

Agreed. I had a bit of success (10 warc files over a few hours) using
the CLI and setting AWS_RETRY_MODE=adaptive, but now nothing is
getting through regardless of how long I wait.

I tried --debug, and here's the failure:

2023-10-19 08:22:31,033 - MainThread - botocore.utils - DEBUG - Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/api/token: Connect timeout on endpoint URL: "http://169.254.169.254/latest/api/token"
Traceback (most recent call last):
File "urllib3/connection.py", line 174, in _new_conn
File "urllib3/util/connection.py", line 95, in create_connection
File "urllib3/util/connection.py", line 85, in create_connection
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "awscli/botocore/httpsession.py", line 448, in send
File "urllib3/connectionpool.py", line 798, in urlopen
File "urllib3/util/retry.py", line 525, in increment
File "urllib3/packages/six.py", line 770, in reraise
File "urllib3/connectionpool.py", line 714, in urlopen
File "urllib3/connectionpool.py", line 415, in _make_request
File "urllib3/connection.py", line 244, in request
File "http/client.py", line 1286, in request
File "awscli/botocore/awsrequest.py", line 94, in _send_request
File "http/client.py", line 1332, in _send_request
File "http/client.py", line 1281, in endheaders
File "awscli/botocore/awsrequest.py", line 122, in _send_output
File "awscli/botocore/awsrequest.py", line 206, in send
File "http/client.py", line 979, in send
File "urllib3/connection.py", line 205, in connect
File "urllib3/connection.py", line 179, in _new_conn
urllib3.exceptions.ConnectTimeoutError: (<botocore.awsrequest.AWSHTTPConnection object at 0x7f92535dba50>, 'Connection to 169.254.169.254 timed out. (connect timeout=1)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "awscli/botocore/utils.py", line 383, in _fetch_metadata_token
File "awscli/botocore/httpsession.py", line 483, in send
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "http://169.254.169.254/latest/api/token"
2023-10-19 08:22:31,047 - MainThread - urllib3.connectionpool - DEBUG - Starting new HTTP connection (2): 169.254.169.254:80

No sign of adaptive retrying -- the retry happens within
milliseconds (the timestamps are mine, in BST):

2023-10-19 08:22:31,047 - MainThread - urllib3.connectionpool - DEBUG - Starting new HTTP connection (2): 169.254.169.254:80
2023-10-19 08:22:32,048 - MainThread - botocore.utils - DEBUG - Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/meta-data/placement/availability-zone/: Connect timeout on endpoint URL: "http://169.254.169.254/latest/meta-data/placement/availability-zone/"
Traceback (most recent call last):

...

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

Greg Lindahl

Oct 19, 2023, 7:50:56 AM
to common...@googlegroups.com
CloudFront is the recommended way to download. Its performance graphs
look pretty similar to directly accessing the S3 bucket.

What's been going on recently is that someone has been sending about
70,000 requests per second. Normal is about 1,000. This has been going
on for 36 hours. I have no way to find out who is doing it.

I may have another go at trying to turn on IP-based requests-per-second
limiting.

-- greg

Lawrence Stewart

Oct 19, 2023, 10:36:28 AM
to Common Crawl
> What's been going on recently is that someone has been sending about
> 70,000 requests per second. Normal is about 1,000. This has been going
> on for 36 hours. I have no way to find out who is doing it.

Could a contact at the AWS Open Data sponsorship program turn on CloudWatch for a few hours?

Roi Krakovski

Oct 20, 2023, 4:44:32 AM
to Common Crawl
I am also not able to download the file:
https://data.commoncrawl.org/crawl-data/CC-NEWS/2023/10/CC-NEWS-20231020064311-02094.warc.gz
Any idea how this can be fixed?
Thanks!
Roi

Henry S. Thompson

Oct 20, 2023, 8:47:00 AM
to common...@googlegroups.com
'Henry S. Thompson' via Common Crawl writes:

> Agreed. I had a bit of success (10 warc files over a few hours) using
> the CLI and setting AWS_RETRY_MODE=adaptive, but now nothing is
> getting through regardless of how long I wait.

A bit of progress. Using these two settings in my script, and a
single thread (i.e. trying only one "aws s3 cp s3:..." request at a
time):

export AWS_RETRY_MODE=adaptive
export AWS_MAX_ATTEMPTS=5

I've been able to get about 115 files, with 27 failures, in the last
80 minutes or so.
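
For anyone doing the same thing from boto3 rather than the CLI, I believe the equivalent configuration is roughly the following (untested sketch; anonymous access and an example object assumed):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Same idea as AWS_RETRY_MODE=adaptive / AWS_MAX_ATTEMPTS=5 above.
config = Config(
    retries={"mode": "adaptive", "max_attempts": 5},
    signature_version=UNSIGNED,  # anonymous reads of the public bucket
)
s3 = boto3.client("s3", config=config)

# Example object; substitute whatever you are fetching.
s3.download_file(
    "commoncrawl",
    "crawl-data/CC-MAIN-2018-13/wet.paths.gz",
    "wet.paths.gz",
)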

Greg Lindahl

Oct 20, 2023, 2:18:00 PM
to 'Henry S. Thompson' via Common Crawl
Henry,

I'd recommend infinite retries, actually -- as long as it's doing some
kind of back-off, eventually you'll either get lucky or the excessive
traffic against our S3 bucket will fall enough for you to get through.

The AWS CLI client always does back-off. For wget this sort of "infinite
retry with backoff" looks like this:

wget -c -t 0 --retry-on-http-error=503 https://data.commoncrawl.org/crawl-data/...

To get the same effect in the AWS CLI, I suppose you could set

export AWS_MAX_ATTEMPTS=9999
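
If you're scripting over HTTPS in Python instead of wget, a rough equivalent (just a sketch; the URL below is only an example) would be:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Big retry budget with backoff on 503, similar in spirit to
# `wget -c -t 0 --retry-on-http-error=503`.
retry = Retry(
    total=9999,
    backoff_factor=2,            # exponential backoff between attempts (capped by urllib3)
    status_forcelist=[503],
    allowed_methods=["GET"],
    respect_retry_after_header=True,
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-13/wet.paths.gz"
with open("wet.paths.gz", "wb") as out:
    out.write(session.get(url, timeout=60).content)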

I have a ticket open with the S3 folks to change the configuration of our
bucket. Hopefully that will get done during the US workday today and will
improve things.

-- greg

Henry S. Thompson

Oct 21, 2023, 2:52:49 PM
to common...@googlegroups.com
Greg Lindahl writes:

> I'd recommend infinite retries, actually -- as long as it's doing some
> kind of back-off, eventually you'll either get lucky or the excessive
> traffic against our S3 bucket will fall enough for you to get through.

I was thinking of increasing the count, thanks for the suggestion.

> The AWS CLI client always does back-off. For wget this sort of "infinite
> retry with backoff" looks like this:

Setting AWS_RETRY_MODE=adaptive was what seemed to make the difference
for me, although I also did have to increase the retry count.
According to the documentation [1] the default retry count for
'standard' retry mode is 5 (including the initial request), but for
'adaptive' it's only 3. Bumping it up to 5 appeared to help.

To be clear, _none_ of my experience was from controlled experiments,
as I made changes one after another, not running multiple simultaneous
alternatives.

> wget -c -t 0 --retry-on-http-error=503 https://data.commoncrawl.org/crawl-data/...

Although your earlier message implied the throttling was _not_ on a
per-client-IP basis, which I had sort of assumed must be the case,
I've never had any joy trying to do my own client-side retry, which is
where retrying with wget (or curl) happens.

> To get the same effect in the AWS CLI, I suppose you could set
>
> export AWS_MAX_ATTEMPTS=9999

Yes, I think that's the right way to go. One thing I observed from
the firehose you get from aws --debug is that the retry timeouts are
very small. Here's an ascii histogram of the sleep time in seconds:

n min max
35 0.050 1.857

0.140 4 ****
0.321 4 ****
0.501 9 *********
0.682 4 ****
0.863 4 ****
1.044 1 *
1.224 2 **
1.405 1 *
1.586 2 **
1.767 4 ****

> I have a ticket open with the S3 folks to change the configuration of our
> bucket. Hopefully that will get done during the US workday today and will
> improve things.

Thanks, let's hope so!

ht

[Late update -- running with AWS_MAX_ATTEMPTS=50 overnight, _no_
successful downloads. Each failed retrieval is taking around 4
minutes, so I've backed off to AWS_MAX_ATTEMPTS=5, but only had 3
successes in 10 hours... :-[

Amir Shukayev

Oct 25, 2023, 6:24:45 PM
to Common Crawl
Is it possible that the method of accessing WARC files using offsets from the index file in Athena is ballooning the number of requests to one request per record? Not sure how it works within Athena.

Greg Lindahl

Oct 26, 2023, 11:50:54 AM
to 'Amir Shukayev' via Common Crawl
On Wed, Oct 25, 2023 at 03:24:45PM -0700, 'Amir Shukayev' via Common Crawl wrote:
> Is it possible that the method of accessing WARC files using offsets from
> the index file in Athena is ballooning the number of requests to one request
> per record? Not sure how it works within Athena.

The parquet files that underlie the columnar index are column stores,
and when Athena decides that it needs to look at a group of records,
it will read and decompress the appropriate column(s) from that
group. I've never seen this result in something unnecessarily
inefficient.

With both indexes, if you are doing something like "let's extract all
of the WARC records for website foo.com" or "let's extract all of the
WARC records with a language code of French", then indeed you will be
reading 1 WARC record at a time using a range request.
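
For reference, such a single-record fetch looks roughly like this in Python (sketch only; the filename, offset, and length are placeholders you would take from the index):

import gzip

import requests

# Placeholder coordinates as they would appear in the index
# (warc_filename, warc_record_offset, warc_record_length).
filename = "crawl-data/CC-MAIN-.../segments/.../warc/....warc.gz"
offset, length = 123456789, 4321

resp = requests.get(
    "https://data.commoncrawl.org/" + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    timeout=60,
)
resp.raise_for_status()  # expect 206 Partial Content

# Each WARC record is an independent gzip member, so it decompresses on its own.
record = gzip.decompress(resp.content)
print(record[:200])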

This has always been true and is not itself a problem. The problem is
that someone is hitting us with too many requests per second, is
getting 99% 503s, and isn't slowing down.

-- greg

