News Crawler seed list

KP

Oct 26, 2016, 6:19:41 AM
to Common Crawl
Hi all,

  I'm interested in the news crawler, but I would like a list of the sites that are crawled. I went through the source and found https://github.com/commoncrawl/news-crawl/blob/master/seeds/feeds.txt, which lists a few sites, but when I downloaded one of the WARC files I noticed that many other sources are also crawled. Is there a list of the news sites that are crawled?

Thank you

Sebastian Nagel

Oct 26, 2016, 7:01:09 AM
to common...@googlegroups.com
Hi,

the list in seeds/feeds.txt is a manual compilation of a couple of news feeds.

The actual seeds are mostly mined from dmoz.org, see
https://github.com/commoncrawl/news-crawl/issues/8
The linked script extracts 50,000 URLs from 40,000 news sites (or sites categorized as such).
Then I've used Nutch to fetch these URLs and extract RSS and Atom feeds announced in the HTML
content as alternate links, e.g.
<link rel="alternate" type="application/rss+xml" title="ABC Live &raquo; Feed"
href="http://abclive.in/feed/" />
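
A minimal stdlib sketch of that extraction step (this is illustrative, not the actual Nutch parse code) could look like:

```python
# Sketch (not the actual Nutch plugin): collect RSS/Atom feed URLs
# announced as <link rel="alternate"> in HTML, using only the stdlib.
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkExtractor(HTMLParser):
    """Collects href values of alternate links pointing to RSS/Atom feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = {k.lower(): (v or "") for k, v in attrs}
        if (tag == "link"
                and "alternate" in a.get("rel", "").lower().split()
                and a.get("type", "").lower().strip() in FEED_TYPES
                and a.get("href")):
            self.feeds.append(a["href"])

html_doc = """<html><head>
<link rel="alternate" type="application/rss+xml" title="ABC Live &raquo; Feed"
      href="http://abclive.in/feed/" />
</head><body></body></html>"""

extractor = FeedLinkExtractor()
extractor.feed(html_doc)
print(extractor.feeds)  # ['http://abclive.in/feed/']
```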

There are/were several challenges and improvements still to be done:
- use DMOZ category translations to get links from non-English-speaking countries
(this was only partially solved in an earlier version of the news link extraction script)
- also extract feeds announced as "ordinary" links, e.g.
<a href="index.xml" title="To subscribe to this feed, drag or copy/paste this link to an RSS feed
reader application.">Subscribe to RSS feed</a>
- possibly crawl deeper and (my stupid configuration fault) follow redirects
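
The "ordinary link" case could be handled with a simple heuristic. A rough sketch (the patterns and function name are my own, not from the news-crawl code):

```python
# Rough heuristic (illustrative only, not from news-crawl): decide whether
# an ordinary <a> link likely points to a feed, using URL and text clues.
import re

# URL clue: path ends in .xml/.rss/.atom or a /feed/-style segment.
URL_HINT = re.compile(r"(\.(xml|rss|atom)$|/(feed|rss|atom)s?/?$)", re.IGNORECASE)
# Text clue: anchor text or title mentions a feed or subscription.
TEXT_HINT = re.compile(r"\b(rss|atom|feed|subscribe)\b", re.IGNORECASE)

def looks_like_feed_link(href, anchor_text="", title=""):
    # Strip query string and fragment before matching on the path.
    path = href.split("?", 1)[0].split("#", 1)[0]
    return bool(URL_HINT.search(path)
                or TEXT_HINT.search(anchor_text)
                or TEXT_HINT.search(title))

print(looks_like_feed_link("index.xml",
                           anchor_text="Subscribe to RSS feed"))  # True
print(looks_like_feed_link("/about.html"))                        # False
```

Such heuristics over-match (e.g. sitemap.xml), so candidates would still need to be fetched and sniffed for actual RSS/Atom content.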

The list mined this way, starting from dmoz.org data, contains about 6,000 feed URLs from 3,000 hosts/sites.

Best,
Sebastian

KP

Oct 26, 2016, 8:00:24 AM
to Common Crawl
Hi Sebastian,
 
   That's great. I'll see if I can extract the exact list given this info; very useful information.

Thank you

kind regards
kP

Bogdan Metea

Jun 20, 2018, 11:18:54 AM
to Common Crawl
Hi Sebastian,

I am also trying to find the seed list for Common Crawl News. I tried to use the shell script from https://gist.github.com/sebastian-nagel/eee8ed036ee89b1ae09f1124ccfa06d7

Since dmoz.org is down, I didn't have any luck. Is there a way to get the seed list from somewhere else?

Thanks in advance!

Sebastian Nagel

Jun 20, 2018, 12:29:09 PM
to common...@googlegroups.com
Hi Bogdan,

indeed, the mirror dmoztools.net does not yet have the RDF dumps online, although that's planned:
https://www.resource-zone.com/forum/t/where-could-i-get-the-latest-data-dump-before-it-closed.53576/

The Internet Archive has copies of the RDF dumps:
https://web.archive.org/web/20170317132728/http://rdf.dmoz.org/rdf/
I've successfully tried
https://web.archive.org/web/20170317132727/http://rdf.dmoz.org/rdf/structure.rdf.u8.gz
https://web.archive.org/web/20170317132823/http://rdf.dmoz.org/rdf/categories.txt.gz
but it looks like the large content.rdf.u8.gz is truncated:
https://web.archive.org/web/20170312160530/http://rdf.dmoz.org/rdf/content.rdf.u8.gz

Maybe there are other copies out there, e.g.
http://download.huihoo.com/dmoz/

Let me know if you need anything. I have a copy on my laptop dating to 2016-12-16.

Best,
Sebastian