yossi writes:
> Setting multipart_threshold to a high enough value should work.
>
> https://awscli.amazonaws.com/v2/documentation/api/latest/topic/config-vars.html#cli-aws-help-config-vars
Thanks! That does work better at the moment.
See also
https://awscli.amazonaws.com/v2/documentation/api/latest/topic/config-vars.html#cli-aws-help-config-vars
It's _crucial_ to read the instructions above, rather than wherever
else a search takes you, for information about setting these
parameters. There is a lot of stale and/or confused 'help' out there.
What is working for me at the moment (including some extra insurance)
is the following, using the Linux CLI:
Just once:
aws configure --profile [yourChoice] set s3.multipart_threshold 4GB
aws configure --profile [yourChoice] set s3.max_concurrent_requests 1
aws configure --profile [yourChoice] set s3.multipart_chunksize 32MB
aws configure --profile [yourChoice] set retry_mode adaptive
aws configure --profile [yourChoice] set max_attempts 100
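For reference, here is a sketch of what those commands end up writing
into ~/.aws/config (assuming the profile name yourChoice; the indented
s3 block is how the CLI stores the s3.* keys):

[profile yourChoice]
s3 =
    multipart_threshold = 4GB
    max_concurrent_requests = 1
    multipart_chunksize = 32MB
retry_mode = adaptive
max_attempts = 100

Running e.g. "aws configure --profile [yourChoice] get
s3.multipart_threshold" should let you double-check a value.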
Then, in your actual script for fetching:
aws s3 cp s3://commoncrawl/... ... --profile [yourChoice] --only-show-errors
This is currently giving me successful warc.gz retrievals at a rate
of, at best, a bit under one file per minute, with occasional longer
waits of anywhere between 2 and 10 minutes.
To see what's happening in detail, add --debug.
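If you are fetching in bulk, here is a minimal sketch of a retry
wrapper around that cp command. It assumes a local file warc.paths
listing one key per line (e.g. crawl-data/.../xxx.warc.gz); the file
name, retry count, and sleep interval are illustrative, not from this
thread, and you would substitute your own profile name:

#!/bin/bash
# Fetch each WARC listed in warc.paths, retrying failed copies.
while read -r key; do
    for attempt in 1 2 3; do
        if aws s3 cp "s3://commoncrawl/${key}" . \
              --profile yourChoice --only-show-errors; then
            break                        # success; move on to the next file
        fi
        echo "attempt ${attempt} failed for ${key}" >&2
        sleep 60                         # back off before retrying
    done
done < warc.paths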