Hi Nikolay,
if you're using Amazon SageMaker, it's strongly recommended
to use the S3 API and to run the instance in the AWS region "us-east-1",
where the bucket s3://commoncrawl/ is located.
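To double-check which region your notebook instance actually runs in,
boto3 can report the session region; a minimal sketch:

import boto3
# prints e.g. 'us-east-1' (None if no default region is configured)
print(boto3.session.Session().region_name)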
I just verified the download speed on an EC2 instance running
in us-east-1:
$> time aws s3 cp s3://commoncrawl/crawl-data/CC-NEWS/2022/08/CC-NEWS-20220815172849-00253.warc.gz .
download: s3://commoncrawl/crawl-data/CC-NEWS/2022/08/CC-NEWS-20220815172849-00253.warc.gz to ./CC-NEWS-20220815172849-00253.warc.gz
real 0m5.947s
user 0m4.668s
sys 0m4.262s
Accessing the same file via CloudFront may take longer (right now it
definitely takes longer). Could you try to use the S3 API?
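For comparison, you can time the same file over HTTP(S) yourself;
as far as I know, data.commoncrawl.org is the CloudFront distribution
from the announcement linked below:

$> time wget https://data.commoncrawl.org/crawl-data/CC-NEWS/2022/08/CC-NEWS-20220815172849-00253.warc.gz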
From a Jupyter Python notebook, you would typically use boto3, e.g.:

import boto3

s3client = boto3.client('s3', use_ssl=False)

# open in 'w+b' so the file can be read back after the download
with open('local_path', 'w+b') as data:
    s3client.download_fileobj(
        'commoncrawl',
        'crawl-data/CC-NEWS/2022/08/CC-NEWS-20220815172849-00253.warc.gz',
        data)
    data.seek(0)
    # process data
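If you ever run this outside AWS without credentials configured, an
unsigned (anonymous) client should also work for the public commoncrawl
bucket; a minimal sketch:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# anonymous client: no AWS credentials required for public buckets
s3client = boto3.client('s3', config=Config(signature_version=UNSIGNED))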
See also:
https://commoncrawl.org/access-the-data/
https://commoncrawl.org/2022/03/introducing-cloudfront-access-to-common-crawl-data/
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
Best,
Sebastian
On 8/23/22 16:08, Nikolay Kadochnikov wrote:
> Good day everyone.
>
> I am extracting data from Common Crawl news, as part of my research
> project. It used to take about 6 seconds to download a single
> 1 GB WARC file, but starting from yesterday it is taking ~17 minutes on
> average.
>
> I thought my traffic from Google Cloud was getting throttled, so I
> decided to try running it from AWS SageMaker... but I am getting the
> same ~17 minutes per file. Can you please provide guidance on how to
> get the download speed back to normal?
>