Is it possible to crawl a set of URLs for scientific research purposes?


zhanghao...@gmail.com

Feb 12, 2018, 3:46:52 PM
to Common Crawl
Hi,

We want to create a corpus from snopes.com for the fake-news detection domain. To make this corpus reproducible, we want to use Common Crawl, since the snopes website is continuously updated. The problem is that Common Crawl doesn't cover some of these URLs. Is it possible to crawl these specific URLs?
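
For reference, this is roughly how we check whether a given URL appears in a crawl, using the public CDX index API at index.commoncrawl.org. The collection name CC-MAIN-2018-05 is just one example; see index.commoncrawl.org for the full list:

    import json
    import requests

    # Query the Common Crawl URL index (CDX API) for one URL.
    # "CC-MAIN-2018-05" is only an example collection.
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2018-05-index"

    def lookup(url):
        # The CDX server answers 404 when a URL has no captures in this crawl.
        resp = requests.get(INDEX, params={"url": url, "output": "json"})
        if resp.status_code == 404:
            return []
        resp.raise_for_status()
        # One JSON object per line, with fields such as filename, offset, length.
        return [json.loads(line) for line in resp.text.splitlines()]

    records = lookup("https://www.snopes.com/fact-check/")
    print("covered" if records else "not in this crawl")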


Best regards,
Hao

Tom Morris

Feb 12, 2018, 7:05:35 PM
to common...@googlegroups.com
On Mon, Feb 12, 2018 at 3:46 PM, <zhanghao...@gmail.com> wrote:

We want to create a corpus from snopes.com for the fake-news detection domain. To make this corpus reproducible, we want to use Common Crawl, since the snopes website is continuously updated. The problem is that Common Crawl doesn't cover some of these URLs. Is it possible to crawl these specific URLs?

snopes.com's content is copyrighted, so you'll need to get permission if you want to reuse their material. From their FAQ:


Q: May I reproduce your material on my web site if I operate a non-commercial site, and I give you credit?

A: No. Using our material without our permission is copyright infringement, even if your site is noncommercial, and even if you give us credit.


zhanghao...@gmail.com

Feb 13, 2018, 7:35:56 AM
to Common Crawl
I know that; what we want is to publish a crawler that lets others reproduce this corpus, rather than distributing the corpus itself. Is that still not allowed? The problem is that snopes is continuously updated: if we publish a crawler that fetches directly from snopes, others cannot obtain the same corpus.
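
To make this concrete, the idea is a script shipped with a fixed URL list, plus a per-page checksum, so others can at least detect when a page has changed since we built the corpus. A rough sketch (the file names here are placeholders):

    import hashlib
    import time
    import requests

    # Re-fetch a fixed, published list of URLs and record a checksum per page,
    # so a later run can detect which pages changed since the corpus was built.
    # "snopes_urls.txt" and "fetch_manifest.tsv" are placeholder names.
    def crawl(url_list_path, out_path, delay=2.0):
        with open(url_list_path) as urls, open(out_path, "w") as out:
            for url in (line.strip() for line in urls if line.strip()):
                resp = requests.get(url, headers={"User-Agent": "corpus-rebuilder"})
                digest = hashlib.sha256(resp.content).hexdigest()
                out.write(f"{url}\t{resp.status_code}\t{digest}\n")
                time.sleep(delay)  # be polite to the origin server

    crawl("snopes_urls.txt", "fetch_manifest.tsv")

But since the checksums drift as pages are edited, this only detects divergence; it doesn't recover the original text, which is exactly why we'd prefer Common Crawl's frozen snapshots.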

Best regards,
Hao

On Tuesday, February 13, 2018 at 1:05:35 AM UTC+1, Tom Morris wrote: