Common Crawl for News Articles


Charu Arora

Oct 4, 2016, 1:21:35 AM
to Common Crawl
Has anybody used Common Crawl to retrieve articles/webpages from a particular domain or data source such as The New York Times? If yes, how did you go about doing that?
 

Sebastian Nagel

Oct 13, 2016, 7:27:24 AM
to common...@googlegroups.com
Hi,

We released a dataset containing news articles a couple of weeks ago; see [1] for details.
Articles from the New York Times are missing, probably because of a bug [2]. We hope to get this
fixed in the coming weeks, together with a couple of other improvements.

For past years, the only way is to find news articles in the main crawl archives. The best way
to do that is via the URL indexes, e.g.

For crawls from 2013 and newer (here CC-MAIN-2014-52 = week 52 in 2014):
http://index.commoncrawl.org/CC-MAIN-2014-52-index?url=*.lemonde.fr&output=json
For the 2012 crawl archives:
http://urlsearch.commoncrawl.org/?q=www.lemonde.fr
or, temporarily for the next few days:
http://ec2-54-221-249-42.compute-1.amazonaws.com/?q=www.lemonde.fr

Both indexes also allow you to retrieve the page content:

- the 2012 index provides hyperlinks

- for index.commoncrawl.org this example shows how to build the URL:

http://index.commoncrawl.org/CC-MAIN-2014-52?http://www.lemonde.fr/1914-1918-90-ans-apres-l-armistice/article/2008/10/23/1914-la-guerre-de-mouvement_1107586_736535.html
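As a rough illustration of the index query above, here is a minimal Python sketch. It only builds the query URL in the same form as the lemonde.fr example and parses a response, assuming that `output=json` returns one JSON object per line. The function names and the sample record are made up for illustration and are not part of any Common Crawl tooling; the field names in the sample follow what the CDX API typically returns, so check them against a real response.

```python
import json
from urllib.parse import urlencode

# Host taken from the examples above; everything else here is an assumption.
INDEX_HOST = "http://index.commoncrawl.org"


def build_index_query(crawl_id, url_pattern, output="json"):
    """Build an index query URL such as .../CC-MAIN-2014-52-index?url=*.lemonde.fr&output=json."""
    # safe="*" keeps the wildcard readable instead of encoding it as %2A
    params = urlencode({"url": url_pattern, "output": output}, safe="*")
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_lines(text):
    """Parse a response body with one JSON record per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(build_index_query("CC-MAIN-2014-52", "*.lemonde.fr"))

    # Hypothetical sample record, NOT real index output.
    sample = '{"url": "http://www.lemonde.fr/", "status": "200", "filename": "example.warc.gz"}\n'
    for record in parse_index_lines(sample):
        print(record["url"], record["status"])
```

To actually fetch results, the URL from `build_index_query` can be passed to any HTTP client (for example `urllib.request.urlopen`) and the decoded body handed to `parse_index_lines`.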

Please also have a look at the index API [3], or use this list for further questions.

Best,
Sebastian

[1] http://commoncrawl.org/2016/10/news-dataset-available/
[2] https://github.com/commoncrawl/news-crawl/issues/3
[3] https://github.com/ikreymer/pywb/wiki/CDX-Server-API#api-reference


Ivan Habernal

Oct 13, 2016, 8:04:13 AM
to Common Crawl
Hi Sebastian,

Really appreciate your new dataset compilation efforts!

But at this point I'm getting really curious: how does Common Crawl handle copyright infringement? Especially for the New York Times, I remember negotiating with PARS International, which is responsible for content reuse of the NYT. They insisted on calculating the exact number of copies that would be reused, the duration, etc.; they didn't understand at all that I needed to make their content available to anyone interested purely in doing research. Maybe Common Crawl succeeded? I know crawling is a grey zone of the Internet, but Common Crawl is big and getting famous, so copyright questions will eventually become an issue, I guess.

Best,

Ivan

Sebastian Nagel

Oct 13, 2016, 9:53:32 AM
to common...@googlegroups.com
Hi Ivan,

As a freelance engineer I'm hardly the right person to answer your question. The short answer, from
what I know about it: the publication of the crawl archives is covered by the fair use principle [1].

From a technical perspective let's also emphasize that the crawler is polite and respects robots.txt
statements without exception. There are many newspapers which disallow crawling their site entirely
or exclude all news content. That's a matter of fact, but for text and data mining it's more
important to provide a diverse sample (not too small, of course). That's the target we want to
achieve, and that's not really different from the main crawl which also contains a significant
amount of news content.
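
To see whether a given site's robots.txt would exclude a page, a minimal sketch using Python's standard `urllib.robotparser` could look like the following. The user-agent name `CCBot`, the helper name and the sample rules are assumptions for illustration; this is not the crawler's own code, only a way to check rules locally.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="CCBot"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules of a site that excludes its news section for all crawlers.
    rules = "User-agent: *\nDisallow: /news/\n"
    print(is_allowed(rules, "http://example.com/news/article.html"))
    print(is_allowed(rules, "http://example.com/about.html"))
```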

Best,
Sebastian


[1] https://en.wikipedia.org/wiki/Fair_use

Ravi Ranjan

May 3, 2018, 8:31:35 AM
to Common Crawl
Where can we find a list of all news websites being crawled, or a list of those websites that are not being crawled?