
Assistance Required – Unable to Access Common Crawl Files


Manmohan Nayak

Mar 31, 2025, 1:45:28 PM
to common...@googlegroups.com
Dear Common Crawl Team,

I am attempting to download specific files following the instructions provided in the links below:

While I was able to download the folder, I noticed that it contains no data and appears to be empty.

Additionally, I attempted to download the files using the following commands:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz <local_path> --no-sign-request

Unfortunately, both commands were unsuccessful. Could you please confirm if there are any restrictions on accessing these files? I am trying to access them from the UK.

I would appreciate any guidance you can provide.

Best regards,


Tom Morris

Mar 31, 2025, 2:54:39 PM
to common...@googlegroups.com
The page at the first URL you referenced says:

Please note, access to data from the Amazon cloud using the S3 API is only allowed for authenticated users. Please see our blog announcement for more information.

and none of the example commands include the --no-sign-request flag that you included, so that explains the likely problem with your first command.
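For readers without AWS credentials, the same object can usually be fetched over HTTPS rather than through the S3 API. Below is a minimal sketch of the path mapping. It assumes that keys in the `commoncrawl` bucket are served under https://data.commoncrawl.org/ with an unchanged key, which is the host that appears in the wget error later in this thread. The helper name `s3_to_https` is hypothetical.

```python
from urllib.parse import urlparse

# Assumption: objects in the "commoncrawl" bucket are also served from this host.
HTTPS_BASE = "https://data.commoncrawl.org/"


def s3_to_https(s3_uri: str) -> str:
    """Turn s3://commoncrawl/<key> into https://data.commoncrawl.org/<key>."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or parsed.netloc != "commoncrawl":
        raise ValueError(f"not a commoncrawl S3 URI: {s3_uri!r}")
    # parsed.path starts with "/", which HTTPS_BASE already provides.
    return HTTPS_BASE + parsed.path.lstrip("/")
```

The resulting URL can be passed to any HTTP client. No AWS credentials are involved on that route.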

The second command works for me. What error message are you getting?

Tom



Manmohan Nayak

Mar 31, 2025, 4:38:30 PM
to common...@googlegroups.com
Thanks Tom,
The blog post helps with S3 API connections.

With wget I get this error:

wget : Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
At line:1 char:1
+ wget "https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segment ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Invoke-WebRequest], IOException
    + FullyQualifiedErrorId : System.IO.IOException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand

Regards

Thom Vaughan

Mar 31, 2025, 5:24:32 PM
to Common Crawl
Hi Manmohan,

Thanks for the follow-up, and thanks Tom for helping debug the S3 issue.

As for the error you're seeing with wget, it's worth noting that data.commoncrawl.org is protected by Cloudflare’s Web Application Firewall, and your error may be caused by something Cloudflare is blocking. Unfortunately there's no visibility on our end into these specifics and the best advice we can offer here is simply to try to take it easy with the number of requests.
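One way to "take it easy" is to retry a failed download with increasing pauses instead of immediately reissuing the request. Below is a minimal sketch using only Python's standard library. The function names, retry count, and delay values are illustrative assumptions, not Common Crawl recommendations, and there is no guarantee that this avoids whatever is closing the connection.

```python
import time
import urllib.error
import urllib.request


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list:
    """Delays in seconds before each retry: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def download(url: str, dest: str, retries: int = 5) -> None:
    """Fetch url into dest, pausing between failed attempts."""
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return
        except (urllib.error.URLError, ConnectionError) as exc:
            if attempt == retries:
                raise  # out of retries: surface the last error
            wait = delays[attempt]
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {wait:.0f}s")
            time.sleep(wait)
```

With the defaults, a run that keeps failing waits 1, 2, 4, 8, and 16 seconds before giving up, so a single file accounts for at most six requests.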

Best regards,
TV

Greg Lindahl

Mar 31, 2025, 10:13:46 PM
to common...@googlegroups.com
*CloudFront, not Cloudflare

Also Manmohan Nayak, we removed the --no-sign-request flag from our get-started document in late 2023. I'm a little surprised that you saw it.

-- greg


