Hi,
I have started downloading news from the main dataset (CC-MAIN) by matching the URLs of the records against several domains of interest. Looking for ways to cut the running time, I noticed that the CC-NEWS data is much smaller and contains fewer WARC files for each year.
I wonder if CC-NEWS is a subset of CC-MAIN. Will I end up with the same news content if I run my code on CC-NEWS instead of CC-MAIN?
Do WARC files have the same structure in both datasets? Is the WET format also available for CC-NEWS? Working with plain-text content is more convenient for me.
So far, I have tried two Python scripts: one uses boto3 to connect to and download from S3, and the other uses httpx to download directly from https://data.commoncrawl.org/. With both scripts, it took approximately 1 hour on my computer to iterate over 200 WET paths in CC-MAIN-2015-06, performing the same matching and downloading task. Would it be faster to download the files using the AWS command line or some other tool or package?
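For illustration, the matching step described above could be sketched roughly as follows. This is not the actual script: the domain list, the function names, and the choice to read an already-downloaded .warc.wet.gz file with only the standard library are all assumptions made to keep the example self-contained.

```python
import gzip
import io
from urllib.parse import urlsplit

# Hypothetical domains of interest; replace with the real list.
DOMAINS = {"example.com", "news.example.org"}


def host_matches(url, domains=DOMAINS):
    """Return True if the URL's host is one of `domains` or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def iter_wet_records(stream):
    """Yield (target_uri, text) for each record in an uncompressed WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip blank separator lines between records
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break  # a blank line ends the header block
            name, _, value = raw.partition(b":")
            headers[name.strip().lower()] = value.strip()
        # Read the payload by its declared length rather than scanning for markers.
        body = stream.read(int(headers.get(b"content-length", b"0")))
        uri = headers.get(b"warc-target-uri")
        if uri is not None:  # e.g. the leading warcinfo record has no target URI
            yield uri.decode("utf-8", "replace"), body.decode("utf-8", "replace")


def matching_records(path, domains=DOMAINS):
    """Stream a local .warc.wet.gz file and keep only records from `domains`."""
    with gzip.open(path, "rb") as stream:
        for uri, text in iter_wet_records(stream):
            if host_matches(uri, domains):
                yield uri, text


if __name__ == "__main__":
    # Tiny in-memory record so the sketch runs without downloading anything.
    sample = (b"WARC/1.0\r\nWARC-Type: conversion\r\n"
              b"WARC-Target-URI: http://www.example.com/a\r\n"
              b"Content-Length: 5\r\n\r\nhello\r\n\r\n")
    for uri, text in iter_wet_records(io.BytesIO(sample)):
        print(uri, host_matches(uri), text)
```

In this form every WET file still has to be downloaded and decompressed in full before any record can be discarded, which is likely where most of the hour goes.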
Thanks,
Bahar