Hi Scott, hi Tom,
for a couple of months now the overlap has been measured, going back as far as 2013. The counts are
public, and so is the code that calculates the overlap and other metrics. Have a look at
https://github.com/commoncrawl/cc-crawl-statistics/blob/master/plots/crawloverlap.md
The cumulative counts may be more interesting, though, because they capture the coverage of multiple crawls taken together:
https://github.com/commoncrawl/cc-crawl-statistics/blob/master/plots/crawlsize.md
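For anyone who wants to try the same kind of measurement on their own URL lists, here is a minimal sketch of the two ideas (pairwise overlap and cumulative coverage). It is not the code from the repository above: it assumes the URL sets fit in memory as exact Python sets, it uses Jaccard similarity as one possible overlap measure, and the function names and toy URLs are made up for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two URL sets as |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cumulative_sizes(crawls: list) -> list:
    """Number of distinct URLs seen so far after each crawl in the sequence."""
    seen = set()
    sizes = []
    for urls in crawls:
        seen |= urls          # add this crawl's URLs to everything seen so far
        sizes.append(len(seen))
    return sizes


# Hypothetical toy data: two "monthly crawls" sharing two URLs.
crawl_jan = {"http://example.com/a", "http://example.com/b", "http://example.com/c"}
crawl_feb = {"http://example.com/b", "http://example.com/c",
             "http://example.com/d", "http://example.com/e"}

print(jaccard(crawl_jan, crawl_feb))             # 2 shared / 5 distinct = 0.4
print(cumulative_sizes([crawl_jan, crawl_feb]))  # [3, 5]
```

A low pairwise overlap together with a steadily growing cumulative size is what you would expect if consecutive crawls are indeed becoming more diverse.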
> If they measure the overlap, they don't publish it, which I would hope would be something
> antithetical to their mission.
Well, no. Let me explain ...
First, it does not make sense to cover every metric that anyone could calculate from the
data (which is open).
Second, help in analyzing the data is welcome. I know many of you have much more experience in
analyzing and visualizing data. Take the counts from
s3://commoncrawl/crawl-analysis/
analyze them, and publish the results (cf. the analysis of the 2012 crawl [1]). You are welcome to do so!
On my side, it is only a lack of time that keeps me from preparing more plots and tables and finally
putting everything together. The most important thing is to have the metrics immediately after each
monthly crawl, to measure the impact of crawler configuration changes: the counts on S3 and the plots
on GitHub are updated every month.
Thanks and best,
Sebastian
[1] http://commoncrawl.org/2013/08/a-look-inside-common-crawls-210tb-2012-web-corpus/
On 02/21/2017 04:53 AM, Tom Morris wrote:
> If they measure the overlap, they don't publish it, which I would hope would be something
> antithetical to their mission.
>
> Tom
>
> On Mon, Feb 20, 2017 at 10:46 PM, Scott <scott....@gmail.com> wrote:
>
> Sebastian,
>
> Curious if overlap is currently being measured.
>
> I might attempt if not already.
>
> Scott
>
> On Thursday, January 19, 2017 at 6:06:50 AM UTC-5, Sebastian Nagel wrote:
>
> Hi Stephan,
>
> yes, that is correct: to provide a complete web crawl we need for sure more than 3 billion pages
> which is the average monthly crawl size at present.
>
> We try to make the crawls more diverse and reduce the overlap between monthly crawls, so that we get
> more different pages/URLs over time. So, there is a good chance that pages from
> jobs.theguardian.com
> > I wondered why jobs.theguardian.com is
>
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.