If you're using Amazon SageMaker, it's strongly recommended to use the
S3 API and run the instance in the AWS region "us-east-1", where the
bucket s3://commoncrawl/ is located.
I just verified the download speed on an EC2 instance running in
us-east-1:
$> time aws s3 cp ...
Accessing the same file via CloudFront (https://data.commoncrawl.org/)
may take longer (right now it definitely takes longer). Could you try
using the S3 API?
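To make the two access paths explicit (assuming you were fetching over
HTTPS via data.commoncrawl.org, which is served through CloudFront; the
key below is just a placeholder):

key = 'crawl-data/CC-NEWS/...'                     # placeholder object key
https_url = f'https://data.commoncrawl.org/{key}'  # HTTPS via CloudFront
s3_uri = f's3://commoncrawl/{key}'                 # S3 API, fast from us-east-1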
From a Jupyter Python notebook, you would typically use boto3, e.g.

import boto3

s3client = boto3.client('s3', use_ssl=False)
# local_path: target file; warc_path: object key below s3://commoncrawl/
with open(local_path, 'wb') as data:
    s3client.download_fileobj('commoncrawl', warc_path, data)
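If the instance has no AWS credentials configured, an unsigned client
should also work, since the bucket is publicly readable. A minimal
sketch (the CC-NEWS prefix below is illustrative; adjust it to the
month you're processing):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# anonymous client: no AWS credentials required for the public bucket
s3client = boto3.client('s3', use_ssl=False,
                        config=Config(signature_version=UNSIGNED))
# list a few CC-NEWS WARC files for one month (prefix is illustrative)
resp = s3client.list_objects_v2(Bucket='commoncrawl',
                                Prefix='crawl-data/CC-NEWS/2022/08/',
                                MaxKeys=5)
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])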
On 8/23/22 16:08, Nikolay Kadochnikov wrote:
> Good day everyone.
> I am extracting data from Common Crawl News as part of my research
> project. It used to take about 6 seconds to download a single
> 1 GB WARC file, but starting from yesterday it is taking ~17 minutes.
> I thought my traffic from Google Cloud was getting throttled, so I
> decided to try running it from AWS SageMaker... but I am getting the
> same ~17 minutes per file. Can you please provide guidance on how to
> get the download speed back to normal?
> [screenshot attachment: 2022-08-23 09_04_02-JupyterLab — Mozilla Firefox.jpg]