404 ERROR downloading Files

Jipeng ZHANG

Nov 12, 2023, 10:17:30 AM

to Common Crawl

Hello,

I am trying to download the files from https://data.commoncrawl.org/.

After downloading the '.wet' URL list file for the corresponding snapshot, I tried prepending the prefix https://data.commoncrawl.org/ to each URL path.

However, many of the WET files return a 404 error when I download them, for example https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-40/segments/1695233505362.29/wet/CC-MAIN-20230922234442-20230923024442-00667.warc.wet.gz

How can I deal with this error?

Thanks.

Lorenzo Simionato

Nov 12, 2023, 2:09:31 PM

to Common Crawl

Where did you get that path from? I don't see it in the list of paths you can get from:
https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-40/wet.paths.gz
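
For anyone following along, here is a minimal sketch of turning that listing into download URLs. It assumes the listing is a gzip-compressed text file with one relative path per line, which is how it is used in this thread; the function names `build_urls` and `fetch_paths_listing` are made up for illustration and are not part of any Common Crawl tooling.

```python
import gzip
import urllib.request

# Prefix mentioned in the original post.
BASE_URL = "https://data.commoncrawl.org/"


def build_urls(paths_gz: bytes, base_url: str = BASE_URL) -> list[str]:
    """Decompress a paths listing and prepend the base URL to each line.

    Assumption: the listing is gzip-compressed UTF-8 text, one relative
    path per line. Blank lines are skipped.
    """
    text = gzip.decompress(paths_gz).decode("utf-8")
    urls = []
    for line in text.splitlines():
        path = line.strip()
        if path:
            # Avoid a double slash if a path happens to start with "/".
            urls.append(base_url + path.lstrip("/"))
    return urls


def fetch_paths_listing(crawl_id: str) -> bytes:
    """Hypothetical helper: download wet.paths.gz for one crawl.

    The URL layout follows the link given above; it is not called here,
    so the sketch can be read and tested without network access.
    """
    url = f"{BASE_URL}crawl-data/{crawl_id}/wet.paths.gz"
    with urllib.request.urlopen(url) as response:
        return response.read()
```

Usage would be something like `build_urls(fetch_paths_listing("CC-MAIN-2023-40"))`. The point is to take each path from the listing unchanged rather than assembling segment and file names by hand.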

Greg Lindahl

Nov 12, 2023, 2:10:33 PM

to Common Crawl

This file is actually named

CC-MAIN-20230921073711-20230921103711-00667.warc.wet.gz

$ aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2023-40/segments/1695233505362.29/wet/CC-MAIN-20230921073711-20230921103711-00667.warc.wet.gz

2023-10-05 06:30:31 112136888 CC-MAIN-20230921073711-20230921103711-00667.warc.wet.gz

The filename you give is in segment 1695233506429.78.

Hope this helps! -- greg
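
A segment/filename mismatch like this one can be checked against the paths listing before downloading. The sketch below is only illustrative: `locate` and `segment_of` are hypothetical helper names, and the second sample path is assembled from the segment number and filename quoted in this thread rather than copied from the listing itself.

```python
import posixpath


def locate(filename: str, paths: list[str]) -> list[str]:
    """Return every listed path whose final component equals `filename`."""
    return [p for p in paths if posixpath.basename(p) == filename]


def segment_of(path: str) -> str:
    """Return the path component that follows 'segments'.

    Assumption: paths look like crawl-data/<crawl>/segments/<segment>/wet/<file>.
    """
    parts = path.split("/")
    return parts[parts.index("segments") + 1]


# Sample entries: the first is the file shown by `aws s3 ls` above; the
# second is inferred from "The filename you give is in segment 1695233506429.78".
listing = [
    "crawl-data/CC-MAIN-2023-40/segments/1695233505362.29/wet/"
    "CC-MAIN-20230921073711-20230921103711-00667.warc.wet.gz",
    "crawl-data/CC-MAIN-2023-40/segments/1695233506429.78/wet/"
    "CC-MAIN-20230922234442-20230923024442-00667.warc.wet.gz",
]

wanted = "CC-MAIN-20230922234442-20230923024442-00667.warc.wet.gz"
matches = locate(wanted, listing)
segments = [segment_of(p) for p in matches]
print(segments)
```

If the segment printed here differs from the one in the URL being requested, the URL was built from mismatched pieces and a 404 is expected.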
