How much Porn is in the Common Crawl Corpus?


Alex Goretoy

Feb 21, 2014, 5:08:41 AM
to common...@googlegroups.com
Hello,

I have a few questions; if you could answer them, it would be greatly appreciated.

How much porn is there in the common crawl corpus data set?

Is there a ton of porn in there?

I am looking for recipe data, with name, description and images.

What were the root URLs that started the crawl, and what was the depth of the crawl?

On this page [1], I am not able to see whether there are images in the corpus at all, and I really need that porn, I mean recipe data with images.

Thank you for all the tremendous work and efforts on common crawl.

Hannes Mühleisen

Aug 26, 2014, 9:47:11 AM
to common...@googlegroups.com
Hello Alex,

According to my recent analysis of the 2014 Common Crawl data, about 4% of the URLs in the Common Crawl are porn.

Best,

Hannes

Robert Meusel

Aug 27, 2014, 7:52:10 AM
to common...@googlegroups.com
Hi Alex,

For the 2012 corpus we used DMOZ (www.dmoz.org) to classify the domains, and according to this list 2% of all domains are porn.

You can get a full list of the HTML pages of the 2012 crawl using:


If you want to run some classification (e.g. using keyword lists).
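
A minimal sketch of what such a keyword-list pass over a URL list could look like, assuming one URL per input item. The keyword set and the function names (`tokens`, `is_flagged`, `flagged_share`) are hypothetical and only for illustration; this is not the DMOZ-based method mentioned above, and a real keyword list would need to be much longer and curated.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list (assumption, for illustration only).
KEYWORDS = {"porn", "xxx", "adult"}


def tokens(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens."""
    # Add a scheme if it is missing so that urlparse fills in netloc.
    parsed = urlparse(url if "://" in url else "http://" + url)
    text = (parsed.netloc + " " + parsed.path).lower()
    return set(re.findall(r"[a-z0-9]+", text))


def is_flagged(url, keywords=KEYWORDS):
    """Return True if any whole token of the URL is in the keyword list."""
    return not tokens(url).isdisjoint(keywords)


def flagged_share(urls, keywords=KEYWORDS):
    """Return the fraction of URLs flagged by the keyword list."""
    urls = list(urls)
    if not urls:
        return 0.0
    return sum(is_flagged(u, keywords) for u in urls) / len(urls)
```

Matching whole tokens rather than substrings avoids some obvious false positives (a short keyword hidden inside a longer, unrelated word), but it also misses concatenated names, so the resulting share is only a rough estimate.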

Cheers,
Robert