Getting the Common Crawl news URLs

Simon Burfield

Sep 23, 2018, 7:36:39 AM
to common...@googlegroups.com
Hi all

I want to start using the Common Crawl news feed, as it sounds awesome.

My usual process for getting Common Crawl data is to use the following URL to get a list of URLs to scan

--
Simon Burfield
iOS/Android Developer + LEGO MINDSTORMS / Robotics Builder

Tom Morris

Sep 23, 2018, 8:09:00 PM
to common...@googlegroups.com
Hi Simon,

I don't think the news collection has an index. You can get a list of all the news archive WARCs using the aws command line utility:

    aws s3 ls --recursive --no-sign-request s3://commoncrawl/crawl-data/CC-NEWS/

but it doesn't appear to be the production one.

I don't think there's any easy fixed HTTP access pattern, because the WARC names have variable timestamps in them, but you could use the prefix query parameter on the bucket listing. For example, to get a listing of just the news WARCs for today, you could request:


which will return an XML document listing the WARCs that match. You can then download one or more of them.
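
A minimal Python sketch of the prefix query and the XML parsing step. Several things here are assumptions not stated in this thread: the bucket endpoint `https://commoncrawl.s3.amazonaws.com/`, the `YYYY/MM/CC-NEWS-YYYYMMDD...` key layout under `crawl-data/CC-NEWS/`, and that anonymous listing over HTTP is permitted at all. `listing_url` and `warc_keys` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumption: public HTTP endpoint of the commoncrawl bucket. It may have
# changed, and anonymous listing may not be allowed.
BUCKET_URL = "https://commoncrawl.s3.amazonaws.com/"

# XML namespace used by S3 bucket listing (ListBucketResult) documents.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(day):
    """Build a bucket listing URL for one day, given as 'YYYYMMDD'.

    Assumption: news WARC keys look like
    crawl-data/CC-NEWS/YYYY/MM/CC-NEWS-YYYYMMDD<time and serial>.warc.gz
    """
    prefix = f"crawl-data/CC-NEWS/{day[:4]}/{day[4:6]}/CC-NEWS-{day}"
    return BUCKET_URL + "?" + urlencode({"prefix": prefix})


def warc_keys(listing_xml):
    """Return the .warc.gz object keys found in a listing XML document."""
    root = ET.fromstring(listing_xml)
    keys = [el.text or "" for el in root.iter(S3_NS + "Key")]
    return [key for key in keys if key.endswith(".warc.gz")]
```

Fetching `listing_url("20180923")` (for example with `urllib.request.urlopen`) and passing the response body to `warc_keys` would give the keys to download; appending a key to the bucket URL gives the download URL for that WARC, under the same endpoint assumption.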


The last couple of WARCs have ~2700-2800 domains represented. I've attached the domain page counts for the most recent one.
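
For anyone who wants to reproduce counts like these, here is a rough Python sketch that tallies pages per host from a downloaded WARC using only the standard library. It is not necessarily how the attached file was produced. It assumes each record starts with a `WARC/1.x` version line and carries `WARC-Type` and `WARC-Target-URI` headers, and it counts full host names rather than registered domains. `count_domains` and `count_domains_in_file` are hypothetical names.

```python
import gzip
from collections import Counter
from urllib.parse import urlsplit


def count_domains(lines):
    """Count response records per host from an iterable of raw WARC lines.

    Simplification: any line starting with b"WARC/1." is treated as the
    start of a record header, so a payload line that happens to look like
    that would be misread.
    """
    counts = Counter()
    in_header = False
    rec_type = uri = None
    for raw in lines:
        line = raw.rstrip(b"\r\n")
        if not in_header:
            if line.startswith(b"WARC/1."):
                in_header, rec_type, uri = True, None, None
            continue
        if line == b"":
            # A blank line ends the record header; count response records only.
            if rec_type == b"response" and uri:
                host = urlsplit(uri.decode("utf-8", "replace")).hostname
                if host:
                    counts[host] += 1
            in_header = False
            continue
        name, _, value = line.partition(b":")
        name = name.strip().lower()
        if name == b"warc-type":
            rec_type = value.strip()
        elif name == b"warc-target-uri":
            uri = value.strip().strip(b"<>")
    return counts


def count_domains_in_file(path):
    """Open a local .warc.gz file and count response records per host."""
    with gzip.open(path, "rb") as fh:
        return count_domains(fh)
```

A dedicated WARC parsing library would be more robust than this line scan; the sketch only illustrates the idea.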

Hope that helps!

Tom

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
news-domain-counts-20180923140500.txt

Simon Burfield

Sep 23, 2018, 11:16:43 PM
to common...@googlegroups.com
Hi Tom

Thank you for this, I will give it a try. All I want to do is run a Java app to stick them in Mongo. Someone may have already done this on GitHub :). Your query idea looks best :)