Questions about downloading news data

Bahar Zafer

Jan 23, 2024, 1:39:52 AM

to Common Crawl

Hi,

I have started downloading news from the main dataset (CC-MAIN) by matching the URLs of the records against several domains of interest. While looking for ways to cut the running time, I noticed that the CC-NEWS data is much smaller and contains fewer WARC files for each year.
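For reference, the matching step is roughly the following. This is a minimal sketch rather than my exact script: the domain list is a placeholder, and the name `matches_domain` is only illustrative. It compares the hostname of each record URL against the domains of interest and also accepts subdomains.

```python
from urllib.parse import urlsplit

# Placeholder domains of interest (assumption; replace with the real list).
DOMAINS = {"bbc.co.uk", "nytimes.com"}


def matches_domain(url, domains=DOMAINS):
    """Return True if the URL's host equals, or is a subdomain of, a listed domain."""
    # hostname is None for strings without a network location, so fall back to "".
    host = (urlsplit(url).hostname or "").lower()
    # The "." prefix check avoids false hits such as "fakenytimes.com".
    return any(host == d or host.endswith("." + d) for d in domains)
```

Each record's target URL is passed to `matches_domain`, and only the matching records are kept.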

I wonder if CC-NEWS is a subset of CC-MAIN. Will I end up with the same news content if I run my code on CC-NEWS instead of CC-MAIN?

Do WARC files have the same structure in both datasets? Is the WET format also available for CC-NEWS? Working with plain-text content is more convenient for me.

So far, I have tried two Python scripts: one uses boto3 to connect to and download from S3, and the other uses httpx to download directly from https://data.commoncrawl.org/. With both scripts, it took approximately 1 hour on my computer to iterate over 200 WET paths in CC-MAIN-2015-06, performing the same matching and downloading task. Would it be faster to download the files using the AWS command line or any other tool/package?
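One option I am considering is fetching several files concurrently instead of one after another. The sketch below shows the idea using only the standard library (not boto3 or httpx); `fetch` and `fetch_all` are illustrative names, and the worker count of 8 is an arbitrary assumption. I do not know whether this would actually be faster, since that depends on my bandwidth and on how the server handles many parallel requests.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"


def fetch(path):
    """Download one file and return its bytes (the whole file is held in memory)."""
    with urllib.request.urlopen(BASE_URL + path) as resp:
        return resp.read()


def fetch_all(paths, fetch_one=fetch, max_workers=8):
    """Fetch several paths concurrently; results are returned in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `paths` and re-raises any worker error.
        return list(pool.map(fetch_one, paths))
```

Here `paths` would be the entries read from the crawl's WET path listing, and `fetch_one` can be swapped for a function that streams to disk rather than keeping everything in memory.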

Thanks,

Bahar
