Does the Common Crawl include SSL sites?

Colin Dellow

Feb 17, 2015, 6:50:48 PM
to common...@googlegroups.com
Does the Common Crawl include SSL sites?

I'm grepping the December 2014 WAT files to get this answer myself, but it'll take a few days since I'm just using excess capacity on my existing EC2 boxes.

So far I've only looked at 4% of the WAT files, and none contain any SSL sites. :( I'm still hopeful though, since this thread talks about using a custom Nutch with SSL support.
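For anyone wanting to repeat this kind of check, here is a minimal sketch in Python. It assumes each WAT record header carries a `WARC-Target-URI:` line and that the file is gzip-compressed. The function names and the file path are hypothetical, not part of any Common Crawl tooling.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_schemes(lines: Iterable[bytes]) -> Counter:
    """Count URL schemes seen in WARC-Target-URI header lines."""
    counts: Counter = Counter()
    prefix = b"WARC-Target-URI:"
    for line in lines:
        if line.startswith(prefix):
            uri = line[len(prefix):].strip()
            # Everything before the first ':' is the scheme (http, https, ...).
            scheme = uri.split(b":", 1)[0].lower().decode("ascii", "replace")
            counts[scheme] += 1
    return counts


def count_schemes_in_file(path: str) -> Counter:
    """Stream a gzip-compressed WAT file line by line (path is hypothetical)."""
    with gzip.open(path, "rb") as fh:
        return count_schemes(fh)
```

Calling `count_schemes_in_file("some-file.warc.wat.gz")` on a downloaded file would return a counter whose `"https"` entry is the number of HTTPS target URIs found. Streaming line by line keeps memory use flat even for large files.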

PS - this is a phenomenal resource. Thanks for making it available!

Stephen Merity

Feb 18, 2015, 1:50:59 PM
to common...@googlegroups.com
Hi Colin,

Thanks for bringing this up. I'd have expected to see an HTTPS URL within a single WAT file, let alone 4% of a crawl archive, so I went to investigate.

To clarify, HTTPS support was backported to the Nutch implementation that Common Crawl uses. I spent last night hunting down why HTTPS URLs were not being produced in the December crawl archives. I eventually found the cause: a single extra whitespace character in a config file that caused HTTPS URLs to be filtered out when they shouldn't have been.

As such, there won't be any HTTPS URLs in the most recent crawl archives but the error will be fixed for February 2015 and all future crawls.
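To illustrate how one stray space can have this effect, here is a hypothetical Python sketch of a simplified URL filter. It is not Nutch's actual parser, and the rule format and config contents are assumptions, since the thread does not say which file or rule was affected. Each rule is `+regex` (accept) or `-regex` (reject), the first matching rule wins, and unmatched URLs are rejected.

```python
import re
from typing import List, Tuple


def parse_rules(text: str, strip: bool) -> List[Tuple[bool, "re.Pattern[str]"]]:
    """Parse '+regex' / '-regex' lines; optionally strip surrounding whitespace."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip() if strip else raw
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules: List[Tuple[bool, "re.Pattern[str]"]], url: str) -> bool:
    """First matching rule decides; URLs matching no rule are rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Hypothetical config: note the trailing space after the https rule.
CONFIG = "+^http://\n+^https:// \n"

broken = parse_rules(CONFIG, strip=False)
fixed = parse_rules(CONFIG, strip=True)
print(accepts(broken, "https://example.com/"))  # space becomes part of the regex
print(accepts(fixed, "https://example.com/"))   # whitespace stripped before compiling
```

In the unstripped case the trailing space becomes part of the pattern, so no real HTTPS URL matches it and every such URL falls through to rejection, while HTTP URLs are unaffected.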


--
Regards,
Stephen Merity
Data Scientist @ Common Crawl

Colin Dellow

Feb 18, 2015, 1:53:53 PM
to common...@googlegroups.com
On Wednesday, 18 February 2015 13:50:59 UTC-5, Stephen Merity wrote:
Hi Colin,

Thanks for bringing this up. I'd have expected to see an HTTPS URL within a single WAT file, let alone 4% of a crawl archive, so I went to investigate.

To clarify, HTTPS support was backported to the Nutch implementation that Common Crawl uses. I spent last night hunting down why HTTPS URLs were not being produced in the December crawl archives. I eventually found the cause: a single extra whitespace character in a config file that caused HTTPS URLs to be filtered out when they shouldn't have been.

Ouch :)
 

As such, there won't be any HTTPS URLs in the most recent crawl archives but the error will be fixed for February 2015 and all future crawls.

Thanks for the fix and the extra info!

Colin Dellow

Feb 18, 2015, 2:50:36 PM
to common...@googlegroups.com
For anyone else who finds this thread: it looks like July 2014 was the last crawl with SSL URLs.