404 ERROR downloading Files


Jipeng ZHANG

Nov 12, 2023, 10:17:30 AM
to Common Crawl
Hello,

I am trying to download the files from https://data.commoncrawl.org/.

After downloading the '.wet' URL file for the corresponding snapshot, I tried prepending the prefix https://data.commoncrawl.org/ to each URL path.


I am wondering how I can deal with this 404 error.

Thanks.



Lorenzo Simionato

Nov 12, 2023, 2:09:31 PM
to Common Crawl
Where did you get that path from? I don't see it in the list of paths you can get from:
https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-40/wet.paths.gz
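
For reference, here is a minimal sketch of how the download URLs could be built from that listing. It assumes wet.paths.gz is a gzip-compressed text file with one relative path per line, and the names paths_to_urls and fetch_wet_urls are made up for illustration; they are not part of any Common Crawl tooling.

```python
import gzip
import urllib.request

# Prefix for HTTPS downloads, as used in this thread.
BASE_URL = "https://data.commoncrawl.org/"


def paths_to_urls(paths):
    """Prefix each non-empty relative path with the base URL."""
    return [BASE_URL + p.strip().lstrip("/") for p in paths if p.strip()]


def fetch_wet_urls(crawl_id="CC-MAIN-2023-40"):
    """Download the crawl's wet.paths.gz and return full download URLs.

    Assumption: the listing holds one relative path per line.
    """
    listing_url = f"{BASE_URL}crawl-data/{crawl_id}/wet.paths.gz"
    with urllib.request.urlopen(listing_url) as response:
        text = gzip.decompress(response.read()).decode("utf-8")
    return paths_to_urls(text.splitlines())
```

Calling fetch_wet_urls() needs network access. Taking the paths unchanged from the listing, rather than assembling segment and file name by hand, should avoid ending up with a path that does not exist.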

Greg Lindahl

Nov 12, 2023, 2:10:33 PM
to Common Crawl
This file is actually named 

CC-MAIN-20230921073711-20230921103711-00667.warc.wet.gz

$ aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2023-40/segments/1695233505362.29/wet/CC-MAIN-20230921073711-20230921103711-00667.warc.wet.gz

2023-10-05 06:30:31  112136888 CC-MAIN-20230921073711-20230921103711-00667.warc.wet.gz


The filename you give is in segment 1695233506429.78.
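
A mismatch like this can be spotted by comparing a path against the official listing. The following is a rough sketch, assuming the paths have the form .../segments/<segment id>/wet/<file name> and that listed_paths holds the lines of wet.paths.gz; the function names are hypothetical.

```python
import posixpath


def segment_of(path):
    """Return the segment ID from a path like .../segments/<id>/wet/<file>."""
    parts = path.strip().split("/")
    return parts[parts.index("segments") + 1]


def find_listed_paths(file_name, listed_paths):
    """Return every listed path whose final component equals file_name."""
    return [
        p.strip()
        for p in listed_paths
        if posixpath.basename(p.strip()) == file_name
    ]


def check_path(path, listed_paths):
    """Report whether a path is listed, or which segment holds its file name."""
    path = path.strip()
    listed = {p.strip() for p in listed_paths}
    if path in listed:
        return "listed"
    matches = find_listed_paths(posixpath.basename(path), listed)
    if matches:
        segments = sorted(segment_of(m) for m in matches)
        return "file name is listed under segment " + ", ".join(segments)
    return "not listed"
```

If check_path reports a different segment than the one in the URL that returned 404, the segment and the file name were most likely combined from two different listing entries.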


Hope this helps! -- greg


