Persistent 503s despite waiting between requests


Ben Durham

Oct 22, 2023, 8:16:21 AM
to Common Crawl
I have been intending to use boto3 to download some .warc files from the CC-NEWS crawls. I want enough of them that downloading them manually from the AWS S3 interface isn't feasible, which is what led me to boto3. However, for testing purposes, I did grab a few .warcs from the S3 interface to try out some of the later stages of my process. While I occasionally got 503s when downloading files from the S3 web interface, I was always able to get them eventually by trying again a few seconds later, which is part of why I'm so confused about why I can't get them with boto3, even when I set it up to retry indefinitely on a 503.

Now that I'm trying to gradually work my way through the .warcs I want using boto3, I have had no success retrieving any of them, despite leaving my script running for about an hour at a time, at different times of day on different days of the week. This could very easily be user error, as I am not an expert in using boto3.

My approach thus far has been to create an s3 resource and then use s3.meta.client.download_file, but even after setting it up to retry after some number of seconds whenever it gets a SlowDown response, I have yet to successfully download a single file this way.
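
Roughly, the core of what I'm doing looks like this (a simplified sketch rather than my exact code; the key, destination, and back-off time are placeholders):

import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.resource('s3')

def fetch(key, dest):
    # Retry indefinitely whenever S3 throttles us with a 503 SlowDown.
    while True:
        try:
            s3.meta.client.download_file('commoncrawl', key, dest)
            return
        except ClientError as e:
            if e.response.get('Error', {}).get('Code') == 'SlowDown':
                time.sleep(30)  # placeholder back-off
            else:
                raise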

Is there some option I might pass to download_file? Or am I using the wrong tool for the job? Or is this just an infrastructure problem that I can't do anything about?

Thank you in advance for any advice or suggestions.

Greg Lindahl

Oct 22, 2023, 2:07:00 PM
to common...@googlegroups.com
> My approach thus far has been to invoke s3 as a resource and then use
> s3.meta.client.download_file, but every time I use this, even after setting
> it up to try again after some number of seconds if it gets a SlowDown
> response, I have yet to successfully download a single file through this
> method.

Ben,

That sounds like a fine way to download, and I am not surprised that
(right now) you are getting nothing but 503s. I've been keeping an
eye on the performance graph, and the overly-aggressive downloader who
started wrecking performance for everyone on Oct 17 at 1900 UTC is still
doing it now, 5 days later.

My recommendation is to set up your download with infinite retries and
leave it running for days. Eventually either the aggressive downloader
will stop, or you'll hit the 1% success rate.

-- greg


Yossi Tamari

Oct 22, 2023, 5:33:26 PM
to common...@googlegroups.com
Hi Ben,

Note that download_file already retries 503 errors (5 times by default), but, again, by default, it also downloads the files in 10 parallel chunks, and if one chunk fails (after the retries), the whole download fails. That means that if there's a 1% chance of getting through the throttling, your chance is considerably lower, since all 10 chunks have to make it through.

I would suggest setting max_concurrency to 1 (by passing a TransferConfig object to the download_file method) and increasing the number of botocore retries (by passing a Config object to the client constructor), instead of retrying yourself, roughly as in the sketch below.
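
Something along these lines (an untested sketch; the retry count and object key are just examples):

import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

# Let botocore do the retrying, with a higher attempt cap than the default.
client = boto3.client('s3', config=Config(retries={'mode': 'standard', 'max_attempts': 20}))

# Single-stream download, so one throttled chunk can't fail the whole transfer.
transfer_config = TransferConfig(max_concurrency=1, use_threads=False)

client.download_file(
    'commoncrawl',
    'crawl-data/CC-NEWS/2023/10/CC-NEWS-example.warc.gz',  # example key, not a real object
    'CC-NEWS-example.warc.gz',
    Config=transfer_config,
)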

Relevant documentation:


Yossi.



Henry S. Thompson

Oct 23, 2023, 3:21:01 AM
to common...@googlegroups.com
Yossi Tamari writes:

> Note that download_file already retries 503 errors (5 times by
> default), but, again, by default, it also downloads the files in 10
> parallel chunks, and if one chunk fails (after the retries), the
> whole download fails. That means that if there's a 1% chance of
> getting through the throttling, your chance is considerably lower.

Ah, that's helpful. Looking carefully at the CLI output with --debug,
I can see the 10 threads being launched.

> I would suggest setting max_concurrency to 1 (by passing a
> TransferConfig object to the down_load file method)

Unfortunately I can't see any way of controlling concurrency when
using the CLI.

I've stopped my downloader pending further advice -- I've had no
success for 24 hours now.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

yo...@yossi.at

Oct 23, 2023, 4:03:42 AM
to common...@googlegroups.com
Hi Henry,

Setting multipart_threshold to a high enough value should work.

https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-threshold

Yossi.

Henry S. Thompson

Oct 24, 2023, 9:05:10 AM
to common...@googlegroups.com
yossi writes:

> Setting multipart_threshold to a high enough value should work.
>
> https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-threshold

Thanks! That does work better at the moment.

See also https://awscli.amazonaws.com/v2/documentation/api/latest/topic/config-vars.html#cli-aws-help-config-vars

It's _crucial_ to read the instructions above for information about parameter setting, rather than wherever else a search takes you. There is a lot of stale and/or confused 'help' out there.

What is working for me at the moment (including some extra insurance) is the following, using the Linux CLI:

Just once:

aws configure --profile [yourChoice] set s3.multipart_threshold 4GB
aws configure --profile [yourChoice] set s3.max_concurrent_requests 1
aws configure --profile [yourChoice] set s3.multipart_chunksize 32MB
aws configure --profile [yourChoice] set retry_mode adaptive
aws configure --profile [yourChoice] set max_attempts 100

Then, in your actual script for fetching:

aws s3 cp s3://commoncrawl/... ... --profile yourChoice --only-show-errors

This is currently giving me successful warc.gz retrievals at a rate of, at best, a bit under a file a minute, with longer waits of anywhere between 2 and 10 minutes.

To see what's happening in detail, add --debug.
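
For anyone doing this from boto3 rather than the CLI (as Ben is), I believe the rough equivalent of the above settings is something like the following untested sketch (the object key is a placeholder):

import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

client = boto3.client(
    's3',
    config=Config(retries={'mode': 'adaptive', 'max_attempts': 100}),
)

# Same idea as the CLI profile above: single-threaded transfers, with a
# multipart threshold high enough that a whole warc.gz comes down in one request.
transfer_config = TransferConfig(
    multipart_threshold=4 * 1024 ** 3,   # 4GB
    multipart_chunksize=32 * 1024 ** 2,  # 32MB
    max_concurrency=1,
    use_threads=False,
)

client.download_file('commoncrawl', 'crawl-data/...', 'local.warc.gz',  # placeholder key
                     Config=transfer_config)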

Daniel Cunha

Oct 26, 2023, 5:12:35 PM
to Common Crawl
Any chance of getting this "aggressive downloader" blocked? It seems unfair to have one person or entity lock out an entire community.

Cheers,
Daniel

Arthur Strong

Oct 26, 2023, 9:25:15 PM
to Common Crawl
Yes, a rate limit for each IP would be a great idea.

Gregor Kaczor

Oct 31, 2023, 11:51:55 AM
to Common Crawl
Hello everybody,

As of today I still get a 503:
INFO   | jvm 1    | 2023/10/31 16:39:06 | 503 Service Unavailable for GET /crawl-data/CC-MAIN-2023-06/segments/1674764494986.94/wat/CC-MAIN-20230127132641-20230127162641-00562.warc.wat.gz

This vandal has now been blocking downloads for the community for 14 days. That's 3 days longer than it takes me to download the WAT files. The community can hardly continue with their projects because someone is taking all the resources for themselves, and I'm not even sure they will ever stop this denial-of-service attack.
What are the ways out of this situation?
  • An IP rate limit was already suggested by Arthur.
  • I would like to suggest using torrent files. Are there torrents for the WARC & WAT files? That would provide some relief.
Any other suggestions?
Cheers,
Greg

Rich Skrenta

Oct 31, 2023, 1:03:40 PM
to common...@googlegroups.com
Apologies. We realize that the datasets are not useful if they can't be downloaded. We're actively working with AWS engineers to deploy fixes. Please be patient, and sorry for the issues.

Rich


--
Rich Skrenta
Executive Director, Common Crawl Foundation

Luke

Nov 3, 2023, 2:23:55 PM
to Common Crawl
I am also finding it incredibly slow to access Common Crawl files. I set up an EC2 instance in us-east-1. I have a list of all the specific warc_files, offsets, and lengths that I would like to access, and I'm trying to read only the parts of the files that are relevant, to be "efficient".

I found this thread, and setting adaptive retries with max_attempts=100 helped a small amount.

However, the code is slow: accessing 3147 bytes from a single S3 file took 7.29 minutes:

import boto3
import botocore

# One connection, adaptive retry mode, and a high attempt cap.
s3 = boto3.client(
    service_name='s3',
    region_name="us-east-1",
    config=botocore.client.Config(
        max_pool_connections=1,
        retries={'mode': 'adaptive', 'max_attempts': 100},
    ),
)

bucket_name = 'commoncrawl'
file_key = 'crawl-data/CC-MAIN-2023-40/segments/1695233510888.64/warc/CC-MAIN-20231001105617-20231001135617-00880.warc.gz'  # replace with your actual file key

# Ranged GET: fetch only the bytes of the record I need.
response = s3.get_object(Bucket=bucket_name, Key=file_key, Range='bytes=82652490-82655637')
file_content = response['Body'].read()

What's the recommended way to access Common Crawl data? Are we supposed to download all the WARC files we need to our own S3 bucket first, and then make s3.get_object requests with Range values?

Very Respectfully,

Lorenzo Simionato

Nov 3, 2023, 5:28:35 PM
to Common Crawl
Was creating a Requester Pays bucket considered, maybe in addition to the regular free bucket?
That would be a good workaround for cases like this; the costs should be low for people who need only a few files.

Carlos Baraza

Nov 3, 2023, 5:34:50 PM
to Common Crawl
That seems sensible, @Lorenzo. This is such a frustrating issue, the consequence of a selfish and disrespectful act.

I hope we can find a good long-term solution to stop people from abusing the system. I really wonder why anyone would need 70k+ requests per second for 3 weeks... It very much feels like a bug in whatever downloader they have developed, and it doesn't look like anyone is monitoring that system on their side.

Lawrence Stewart

Nov 7, 2023, 4:37:02 PM
to Common Crawl
2 ideas:

1 - Use an API gateway, and require auth and a rate limit to download from CloudFront? Hopefully AWS would sponsor the costs; I'm not sure how Common Crawl manages its finances.

2 - Would there be costs for Common Crawl to allow others to clone/sync the S3 buckets to their own S3? I think all that would be needed is for Common Crawl to set up cross-account IAM access, per <https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/copy-data-from-an-s3-bucket-to-another-account-and-region-by-using-the-aws-cli.html>; this would push the cost to the user.

Lawrence Stewart

Nov 7, 2023, 4:39:37 PM
to Common Crawl
Or kill CloudFront and require AWS credentials through S3, though I'm not sure the logs show which accounts are abusing it. Or perhaps there are other cost advantages to CloudFront.