Hi,
we released a dataset of news articles a couple of weeks ago; see [1] for details.
Articles from the New York Times are missing, probably because of a bug [2]. We hope to get this
fixed in the next few weeks, together with a couple of other improvements.
For earlier years, the only option is to find news articles in the main crawl archives. The best way
to do that is via the URL indexes, e.g.:
For crawls from 2013 and newer (here CC-MAIN-2014-52 = week 52 in 2014):
http://index.commoncrawl.org/CC-MAIN-2014-52-index?url=*.lemonde.fr&output=json
For the 2012 crawl archives:
http://urlsearch.commoncrawl.org/?q=www.lemonde.fr
or, temporarily for the next few days:
http://ec2-54-221-249-42.compute-1.amazonaws.com/?q=www.lemonde.fr
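If you want to script the index.commoncrawl.org query, here is a minimal Python sketch. It only mirrors the URL pattern from the example above; the function names are made up for illustration, and the assumption that the response carries one JSON object per line is based on the output=json parameter, so please check it against the index API documentation before relying on it.

```python
import json
from urllib.parse import urlencode

# Assumption: base URL copied from the example query above.
INDEX_BASE = "http://index.commoncrawl.org"


def build_index_query(crawl_id, url_pattern):
    """Return the index query URL for one crawl, e.g. CC-MAIN-2014-52."""
    # safe="*" keeps the wildcard readable instead of encoding it as %2A.
    params = urlencode({"url": url_pattern, "output": "json"}, safe="*")
    return f"{INDEX_BASE}/{crawl_id}-index?{params}"


def parse_index_response(body):
    """Parse a response body, assuming one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    # Prints the query URL; fetching it is left to the HTTP client of your choice.
    print(build_index_query("CC-MAIN-2014-52", "*.lemonde.fr"))
```

The sketch deliberately stops short of sending the request, so you can paste the printed URL into a browser first and look at what comes back.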
Both indexes also let you retrieve the page content:
- the 2012 index provides hyperlinks
- for index.commoncrawl.org, this example shows how to build the URL:
http://index.commoncrawl.org/CC-MAIN-2014-52?http://www.lemonde.fr/1914-1918-90-ans-apres-l-armistice/article/2008/10/23/1914-la-guerre-de-mouvement_1107586_736535.html
Please also have a look at the index API [3], or use this list for further questions.
Best,
Sebastian
[1] http://commoncrawl.org/2016/10/news-dataset-available/
[2] https://github.com/commoncrawl/news-crawl/issues/3
[3] https://github.com/ikreymer/pywb/wiki/CDX-Server-API#api-reference
On 10/04/2016 07:21 AM, Charu Arora wrote:
> Has anybody used Common Crawl to retrieve articles/webpages from a particular domain or data source
> such as *The New York Times*? If yes, how did you go about doing that?