Hi,
The list in seeds/feeds.txt is a manual compilation of a couple of news feeds.
The actual seeds are mostly mined from dmoz.org, see
https://github.com/commoncrawl/news-crawl/issues/8
The linked script extracts 50,000 URLs from 40,000 news sites (or sites categorized as such).
I then used Nutch to fetch these URLs and to extract the RSS and Atom feeds announced in the HTML
content as alternate links, e.g.
<link rel="alternate" type="application/rss+xml" title="ABC Live » Feed" href="http://abclive.in/feed/" />
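The extraction step can be sketched roughly as follows. This is only an illustration in plain Python, not the Nutch configuration that was actually used; the names AlternateFeedParser and extract_feeds are made up for this example, and it assumes the page has already been fetched and decoded to a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that announce an RSS or Atom feed in a <link> element.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class AlternateFeedParser(HTMLParser):
    """Collects feed URLs announced via <link rel="alternate" ...> (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        href = attr.get("href", "").strip()
        if "alternate" in rels and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the URL of the fetched page.
            self.feeds.append(urljoin(self.base_url, href))


def extract_feeds(html, base_url):
    """Return the feed URLs announced as alternate links in one HTML page."""
    parser = AlternateFeedParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.feeds
```

Calling extract_feeds(page_html, page_url) for every fetched page and collecting the results gives a candidate feed list; each URL should still be fetched to verify that it really is a feed.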
There are/were several challenges and improvements still to be made:
- use DMOZ category translations to get links from non-English-speaking countries
  (this was only partially solved in an earlier version of the news link extraction script)
- also extract feeds announced as "ordinary" links, e.g.
  <a href="index.xml" title="To subscribe to this feed, drag or copy/paste this link to an RSS feed reader application.">Subscribe to RSS feed</a>
- possibly crawl deeper and follow redirects (not following them was a configuration mistake on my part)
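One possible way to pick up feeds announced as "ordinary" links is a simple keyword heuristic on the anchor text and title attribute. The sketch below is an assumption about how this could be done, not something that exists in the extraction script; the names AnchorFeedParser, find_feed_links and the FEED_HINT pattern are hypothetical, and the results are only candidates that still need to be fetched and checked.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed heuristic: anchor text or title mentions "rss", "atom" or "feed".
FEED_HINT = re.compile(r"\b(rss|atom|feed)\b", re.IGNORECASE)


class AnchorFeedParser(HTMLParser):
    """Collects <a href> targets whose text or title hints at a feed (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # title attribute plus text seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr = {name: (value or "") for name, value in attrs}
            self._href = attr.get("href", "").strip() or None
            self._text = [attr.get("title", "")]

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if FEED_HINT.search(" ".join(self._text)):
                # Resolve relative links such as "index.xml" against the page URL.
                self.candidates.append(urljoin(self.base_url, self._href))
            self._href = None


def find_feed_links(html, base_url):
    """Return candidate feed URLs found in ordinary anchor links."""
    parser = AnchorFeedParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.candidates
```

A heuristic like this will produce false positives (e.g. a link to a page that merely explains RSS), so checking the content type or the root element of the fetched document is advisable before adding a URL to the seed list.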
The list mined this way from the dmoz.org data contains about 6000 feed URLs from 3000 hosts/sites.
Best,
Sebastian