--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at http://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.
depending on what you're trying to analyse, you probably _don't_ want to filter on 'big data' first, since that introduces a pretty big selection bias. i.e. the correlation between "big data" and some other meaningful term, measured only over docs that already contain "big data", is going to be very different from the same correlation measured over completely random pages.
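to make the selection bias point concrete, here is a rough sketch on a made-up toy corpus. the documents, the second term ("hadoop") and the helper name `term_rate` are all invented for illustration, nothing here comes from common crawl itself. it just compares how often a term shows up in the pre-filtered set versus in everything:

```python
def term_rate(docs, term, given=None):
    """Fraction of docs containing `term`.

    If `given` is set, only docs that contain `given` are considered,
    which mimics filtering on a phrase before doing the analysis.
    """
    pool = [d for d in docs if given is None or given in d.lower()]
    if not pool:
        return 0.0
    return sum(term in d.lower() for d in pool) / len(pool)


# Toy corpus, invented purely for illustration.
docs = [
    "big data pipelines with hadoop",
    "hadoop and big data at scale",
    "recipe for sourdough bread",
    "football scores from the weekend",
    "big data hype",
    "gardening tips for spring",
]

# Rate of "hadoop" among docs pre-filtered on "big data" (2 of 3 docs).
filtered = term_rate(docs, "hadoop", given="big data")
# Rate of "hadoop" across the whole corpus (2 of 6 docs).
overall = term_rate(docs, "hadoop")

print(f"filtered: {filtered:.2f}  overall: {overall:.2f}")
```

the filtered number only tells you about pages that already mention "big data", so it can't be read as a statement about the web in general.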
this is one of the nice things about common crawl: it's easy to randomly sample a small percentage of the data and have some faith that it's representative of the internet... content duplication aside :) note that it's a good idea to take 2 or 3 separate samples too, so you can check things like sample variance.
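a minimal sketch of the "take a few separate samples and compare them" idea, assuming you already have some list of items to sample from (file paths, page records, whatever). the names `draw_samples` and `estimate_spread`, the integer stand-in data and the predicate are hypothetical, and the 5% fraction is an arbitrary choice:

```python
import random
import statistics


def draw_samples(items, fraction, n_samples, seed=0):
    """Draw `n_samples` independent random samples, each `fraction` of `items`."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    k = max(1, int(len(items) * fraction))
    return [rng.sample(items, k) for _ in range(n_samples)]


def estimate_spread(samples, predicate):
    """Per-sample hit rate for `predicate`, plus mean and sample variance."""
    rates = [sum(predicate(x) for x in s) / len(s) for s in samples]
    return rates, statistics.mean(rates), statistics.variance(rates)


# Stand-in data: integers play the role of pages, and "divisible by 10"
# plays the role of "page contains the term of interest".
items = list(range(1000))
samples = draw_samples(items, fraction=0.05, n_samples=3)
rates, mean_rate, var_rate = estimate_spread(samples, lambda x: x % 10 == 0)

print("per-sample rates:", rates)
print("mean:", mean_rate, "variance:", var_rate)
```

if the per-sample rates disagree a lot, the samples are probably too small to trust any one of them. `statistics.variance` needs at least two samples, which is one more reason not to stop at a single draw.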
Is there any form of a visible text dump of the common crawl available these days?
Mat