Using AWS cli vs. cURL to pull file segments (partial files)

Stephen K

Dec 22, 2022, 5:06:44 AM
to Common Crawl
Are there any benefits to using the AWS CLI:

aws s3 cp  s3://... 

vs. 

HTTP request:

curl -s -r$offset-$(($offset+$length-1)) "$warc_path" >  /output.gz


and separately is it possible to use offsets to pull partial file segments with the AWS cli like I'm doing with the above cURL request?

Thanks,

Stephen

Sebastian Nagel

Dec 22, 2022, 5:40:29 AM
to common...@googlegroups.com
Hi Stephen,

> Are there any benefits to using the AWS CLI:

Since April 2022, accessing Common Crawl data via the S3 API requires
authentication (see [1,2]), and the AWS CLI implements authentication.

> and separately is it possible to use offsets to pull partial file
> segments with the AWS cli like I'm doing with the above cURL request?

The command "aws s3api get-object" provides the option "--range", see
[3] or "aws s3api get-object help".

Best,
Sebastian

[1] https://commoncrawl.org/2022/03/introducing-cloudfront-access-to-common-crawl-data/
[2] https://commoncrawl.org/access-the-data/
[3] https://docs.aws.amazon.com/cli/latest/reference/s3api/get-object.html

Stephen Krings

Dec 22, 2022, 3:13:58 PM
to common...@googlegroups.com
Awesome, thanks Sebastian. For anyone else who might view this thread, I was able to get this to work with the AWS CLI in the following way:

aws s3api get-object --bucket commoncrawl --key crawl-data/CC-MAIN-2022-49/segments/1669446711003.56/warc/CC-MAIN-20221205032447-20221205062447-00645.warc.gz --range bytes=8112713-8115488 my_data_range.gz
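
For anyone starting from an offset and a length (as in the cURL command earlier in the thread), the --range value is the same inclusive byte range, just prefixed with "bytes=". A minimal sketch, assuming the AWS CLI is installed and configured with credentials; the function names byte_range and fetch_record are made up for illustration and are not part of any tool:

```shell
# Build the inclusive "bytes=START-END" string from a record's offset ($1)
# and length ($2). The end byte is offset + length - 1, the same arithmetic
# as in the cURL -r option.
byte_range() {
  printf 'bytes=%d-%d\n' "$1" "$(($1 + $2 - 1))"
}

# Hypothetical wrapper: fetch_record KEY OFFSET LENGTH OUTFILE
# Downloads one byte range of an object in the commoncrawl bucket.
fetch_record() {
  aws s3api get-object \
    --bucket commoncrawl \
    --key "$1" \
    --range "$(byte_range "$2" "$3")" \
    "$4"
}
```

With offset 8112713 and length 2776, byte_range yields the range used in the command above, so `fetch_record "$key" 8112713 2776 my_data_range.gz` would be the equivalent call once `key` holds the WARC path.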
