Website data

has....@gmail.com

May 25, 2017, 5:18:01 AM
to Common Crawl
Is it possible to download only the WARC or WET files that contain data for a single website?
For example, only the data related to cinemablend.com.
Any help will be appreciated.
Thanks,
Moid Hassan

Sebastian Nagel

May 26, 2017, 4:42:45 AM
to common...@googlegroups.com
Hi Moid,

The pages are shuffled before they are written to WARC and WET files, so every WARC/WAT/WET file
contains a (pseudo)random sample of web pages. No single file holds only one site's data.

The easiest way is to use the URL index to look up a domain, then iterate over the pages found
and fetch them from the WARC files. How to actually do it was discussed recently in this group [1].
Here is a short example:

For cinemablend.com you have to start with the queries
http://index.commoncrawl.org/CC-MAIN-2017-17-index?url=cinemablend.com&matchType=domain&page=0
...
http://index.commoncrawl.org/CC-MAIN-2017-17-index?url=cinemablend.com&matchType=domain&page=12
(have a look at the API [2] to figure out how many pages you can expect)
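
As a rough sketch of how the paged queries above could be put together in Python: the helper names `index_query_url` and `page_urls` are made up for illustration, and the `showNumPages` and `output=json` parameters are taken from the CDX Server API documentation [2] on the assumption that the Common Crawl index server supports them. The page count of 13 simply mirrors the page=0 ... page=12 range above.

```python
from urllib.parse import urlencode

# Index endpoint for the crawl used in the example queries above.
INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2017-17-index"


def index_query_url(domain, page=None, show_num_pages=False):
    """Build one index query URL for a domain (hypothetical helper).

    With show_num_pages=True the URL asks the server how many result
    pages exist (assumed CDX Server API parameter `showNumPages`).
    """
    params = {"url": domain, "matchType": "domain", "output": "json"}
    if show_num_pages:
        params["showNumPages"] = "true"
    elif page is not None:
        params["page"] = page
    return INDEX_ENDPOINT + "?" + urlencode(params)


def page_urls(domain, num_pages):
    """Return the query URLs for pages 0 .. num_pages-1."""
    return [index_query_url(domain, page=p) for p in range(num_pages)]


if __name__ == "__main__":
    # First ask for the page count, then walk every page.
    print(index_query_url("cinemablend.com", show_num_pages=True))
    for url in page_urls("cinemablend.com", 13):
        print(url)
```

Each URL can then be fetched with any HTTP client; with `output=json` the server is expected to return one JSON object per line.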

Clip out the URLs from the result and fetch them from the archives, e.g.

http://index.commoncrawl.org/CC-MAIN-2017-17/http://www.cinemablend.com/10-000-B-C-2378.html?tid=15047
(you can also use an HTTP range query, see [1])

Some programming is required to put the URLs together and run it until you have all desired content.
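
A minimal sketch of the HTTP range query variant, under several assumptions not stated in this thread: that each index result line is a JSON object with `filename`, `offset` and `length` fields, that the WARC files are reachable under the `https://data.commoncrawl.org/` prefix, and that every record is stored as its own gzip member. The function names are hypothetical.

```python
import gzip
import json
import urllib.request

# Assumed download prefix for WARC files; adjust if the data lives elsewhere.
DATA_PREFIX = "https://data.commoncrawl.org/"


def parse_index_line(line):
    """Extract (filename, offset, length) from one JSON index result line."""
    rec = json.loads(line)
    return rec["filename"], int(rec["offset"]), int(rec["length"])


def range_header(offset, length):
    """Build the Range header covering exactly one record (inclusive end)."""
    return {"Range": "bytes={}-{}".format(offset, offset + length - 1)}


def fetch_record(filename, offset, length):
    """Fetch a single WARC record and return its decompressed bytes."""
    req = urllib.request.Request(
        DATA_PREFIX + filename, headers=range_header(offset, length)
    )
    with urllib.request.urlopen(req) as resp:
        # Assumes the byte range is one complete gzip member.
        return gzip.decompress(resp.read())
```

Looping `parse_index_line` and `fetch_record` over every line of every index page would collect all captures for the domain; throttling and error handling are left out here.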

Thanks,
Sebastian

[1] https://groups.google.com/d/topic/common-crawl/pQ34q-_EARU/discussion
[2] https://github.com/ikreymer/pywb/wiki/CDX-Server-API

has....@gmail.com

May 26, 2017, 5:18:14 AM
to Common Crawl
Hi Sebastian,

Thanks for the help!
I will try the approaches mentioned here.
One more thing: can Common Crawl data also be used for commercial purposes?

Thanks & Regards,
Moid Hassan

Sebastian Nagel

May 26, 2017, 5:33:02 AM
to common...@googlegroups.com
> One more thing: can Common Crawl data also be used for commercial purposes?

Yes, it's not limited to non-commercial or academic use. Please also check the terms of use:
http://commoncrawl.org/terms-of-use/
http://commoncrawl.org/terms-of-use/full/

Best,
Sebastian
