Request for clarifications on commoncrawl


Srinath Achanta

Jun 14, 2015, 11:48:48 PM
to common...@googlegroups.com
Hi,

I am glad to have come across Common Crawl, and your vision of open data appeals to me.

I am specifically interested in eCommerce web pages and business/merchant web pages for my personal project.
I have been looking into Common Crawl for the past few days, and the data set looks promising.

However, I would like to request clarification on the following points:


1. Does each periodic crawl cover only pages that were not part of the previous crawl, or is it a re-crawl of the entire web?
2. I currently assume that each period is a re-crawl of the entire web, so I am wondering why the size of the crawled data differs so much from one period to the next.
   For instance, the gap between two different periods, 168 TB and 223 TB respectively, is quite big.
3. Are there any statistics available on the number of web pages crawled per website, or any statistics at all for that matter?
   I did look at the Webcommons data, but I am more interested in statistics for the Jan-2015 crawl, which is still unavailable.



Also, how can I help Common Crawl realize its vision of open data?


Thank you.

Wojciech Stokowiec

Oct 8, 2015, 6:16:07 AM
to Common Crawl
These are actually great questions. I have been asking myself the same ones :)

Kind regards, 
Wojciech Stokowiec