Winter 2015 Crawl


Robert Meusel

Nov 18, 2015, 11:37:01 AM
to Common Crawl
Hey there, 

We (the WebDataCommons project) are planning to run our annual extraction of RDFa, Microformats and Microdata from the crawl. Since no crawl has been released since July 2015, I was wondering whether a crawl will be available by the end of this year. We really hope to be able to continue our work on the data. Please let us know.

Thanks a lot,
Robert

Tom Morris

Nov 18, 2015, 2:44:33 PM
to common...@googlegroups.com
The August crawl was uploaded a little while ago, but hasn't been announced yet:

$ aws --no-sign-request s3 ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-40/
                           PRE segments/
2015-11-09 12:48:13        690 segment.paths.gz
2015-11-09 12:47:51      79908 warc.paths.gz
2015-11-09 12:48:00      79527 wat.paths.gz
2015-11-09 12:48:07      79527 wet.paths.gz

This seems to be about the normal delay, so I would expect the next month's crawl to show up in early December.  Depending on whether you want the December 2015 crawl or just the latest that's available at some particular date, you should be able to plan with this info (assuming they're consistent in their scheduling).
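
A listing like the one above can also be checked programmatically. The following is a minimal Python sketch, not part of the original message, that parses text in the `aws s3 ls` output format shown above and checks whether the three path-listing files are present. The names `Entry`, `parse_s3_ls` and `has_path_listings` are hypothetical, and treating the presence of those three files as a sign that a crawl has been uploaded is an assumption.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    modified: Optional[datetime]  # None for "PRE" (prefix) lines
    size: Optional[int]           # None for "PRE" (prefix) lines
    name: str


def parse_s3_ls(output: str) -> List[Entry]:
    """Parse text in the format printed by `aws s3 ls` (as in the listing above)."""
    entries = []
    for line in output.splitlines():
        parts = line.split(None, 3)
        if not parts:
            continue  # skip blank lines
        if parts[0] == "PRE" and len(parts) >= 2:
            # Prefix ("directory") line: no timestamp or size.
            entries.append(Entry(None, None, parts[1]))
        elif len(parts) == 4:
            modified = datetime.strptime(parts[0] + " " + parts[1], "%Y-%m-%d %H:%M:%S")
            entries.append(Entry(modified, int(parts[2]), parts[3]))
    return entries


def has_path_listings(entries: List[Entry]) -> bool:
    """Assumption: a crawl is usable once all three *.paths.gz listings exist."""
    wanted = {"warc.paths.gz", "wat.paths.gz", "wet.paths.gz"}
    return wanted.issubset({e.name for e in entries})


# Sample input copied from the listing in this message.
SAMPLE = """
                           PRE segments/
2015-11-09 12:48:13        690 segment.paths.gz
2015-11-09 12:47:51      79908 warc.paths.gz
2015-11-09 12:48:00      79527 wat.paths.gz
2015-11-09 12:48:07      79527 wet.paths.gz
"""

if __name__ == "__main__":
    listing = parse_s3_ls(SAMPLE)
    for entry in listing:
        print(entry.name, entry.size, entry.modified)
    print("path listings present:", has_path_listings(listing))
```

In practice the input would be the captured output of the `aws` command above rather than the hard-coded sample.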

Tom


Robert Meusel

Nov 19, 2015, 4:03:42 AM
to Common Crawl
Thanks, Tom, for pointing this out. Ideally, we would use data collected roughly one year after the previous extraction. So I will stay tuned and check the public dataset directly.

Robert

Robert Meusel

Feb 19, 2016, 5:29:45 AM
to Common Crawl
Seems like the November crawl is the last one of 2015, as I found a new folder in the public repository from 2016. 
So I might use the Nov 2015 crawl for the Structured Data Extraction of WDC. 

How was this crawl gathered? Is a fixed list of URLs still being recrawled, or does this crawl include URLs that were found during crawling?

Thanks

Tom Morris

Feb 21, 2016, 2:22:22 PM
to common...@googlegroups.com
On Fri, Feb 19, 2016 at 5:29 AM, Robert Meusel <robert...@gmail.com> wrote:
Seems like the November crawl is the last one of 2015, as I found a new folder in the public repository from 2016. 

Yup, looks like nothing between 2015-48 and 2016-07.
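
To put calendar dates on those two identifiers, here is a small Python sketch (not from the original message). It assumes that the suffix of a crawl ID such as `CC-MAIN-2015-48` is a year followed by an ISO week number; that reading is an assumption and is not confirmed in this thread. The helper name `crawl_week_start` is hypothetical.

```python
import re
from datetime import date


def crawl_week_start(crawl_id: str) -> date:
    """Return the Monday of the week named in a crawl ID.

    Assumption: IDs look like "CC-MAIN-YYYY-WW", where WW is an ISO week number.
    """
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError("unexpected crawl ID format: " + repr(crawl_id))
    year, week = int(match.group(1)), int(match.group(2))
    return date.fromisocalendar(year, week, 1)  # requires Python 3.8+


if __name__ == "__main__":
    last_2015 = crawl_week_start("CC-MAIN-2015-48")
    first_2016 = crawl_week_start("CC-MAIN-2016-07")
    print(last_2015)                           # 2015-11-23
    print(first_2016)                          # 2016-02-15
    print((first_2016 - last_2015).days // 7)  # 12 (weeks between the two)
```

Under that assumption, the gap between the two crawls is about twelve weeks.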
 
How was this crawl gathered? Is a fixed list of URLs still being recrawled, or does this crawl include URLs that were found during crawling?

If I had to guess, I'd assume the methodology was unchanged, but hopefully whoever ran the crawl will chime in with the real answer.

Hint, hint, ...

Tom 

Julien Nioche

Feb 22, 2016, 9:03:17 AM
to common...@googlegroups.com
Hi,

2016-07 is not yet complete; I'll send an announcement when it is.

Thanks

Julien
 

Tom Morris

Feb 22, 2016, 9:59:48 AM
to common...@googlegroups.com
Hi Julien, 

I think Robert's key question was concerning the crawl methodology for the 2015-48 crawl. Did it use a fixed list of URLs (which list?) or did it discover new URLs from the crawl frontier as it crawled?

I'm sure he'd also appreciate confirmation of his assumption that 2015-48 was the last crawl of calendar year 2015.

Tom

Julien Nioche

Feb 22, 2016, 4:05:12 PM
to common...@googlegroups.com
Hi Tom

Indeed, my previous comment was a quick note of warning more than anything else. I hadn't fully read the thread.

AFAIK the previous crawls used a fixed list which initially came from Blekko. That list might have been expanded at some point with URLs discovered during crawling, but I do not know for sure. The forthcoming release will also be based on the same list, but future ones are likely to be different.

2015-48 was indeed the last crawl of 2015 - the corresponding blog entry [http://blog.commoncrawl.org/2015/12/november-2015-crawl-archive-now-available/] has a few typos (40 mentioned instead of 48) but the links are correct.

HTH

Julien