Page crawl depth


Petzl Stephan

Jan 19, 2017, 4:32:18 AM
to Common Crawl
Hi there!

I wondered why jobs.theguardian.com is basically not crawled at all.
See http://index.commoncrawl.org/CC-MAIN-2016-50-index?url=jobs.theguardian.com/*&output=json

Is there a reason? Am I missing something, or is it just a matter of resources?

Thanks!
Stephan

Petzl Stephan

Jan 19, 2017, 4:34:25 AM
to Common Crawl
Ah, I think I just found the answer:
https://groups.google.com/forum/#!topic/common-crawl/FWtuQuyVF7o

Sebastian Nagel

Jan 19, 2017, 6:06:50 AM
to common...@googlegroups.com
Hi Stephan,

Yes, that is correct: a complete web crawl would certainly require more than 3 billion pages,
which is the average monthly crawl size at present.

We try to make the crawls more diverse and reduce the overlap between monthly crawls, so that we get
more distinct pages/URLs over time. So there is a good chance that pages from jobs.theguardian.com
will be included in one of the upcoming monthly crawls. But we cannot guarantee
that a particular host or domain is crawled entirely in any of the monthly archives.

Best,
Sebastian



On 01/19/2017 10:34 AM, Petzl Stephan wrote:
> Ah, I think I just found the answer:
> https://groups.google.com/forum/#!topic/common-crawl/FWtuQuyVF7o
>
> On Thursday, January 19, 2017 at 10:32:18 AM UTC+1, Petzl Stephan wrote:
>
> Hi there!
>
> I wondered why jobs.theguardian.com is basically not crawled at all.
> See http://index.commoncrawl.org/CC-MAIN-2016-50-index?url=jobs.theguardian.com/*&output=json
>
> Is there a reason? Am I missing something, or is it just a matter of resources?
>
> Thanks!
> Stephan
>
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com.
> To post to this group, send email to common...@googlegroups.com.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.
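For readers who want to repeat the index lookup quoted above for another host or crawl, here is a minimal Python sketch. It only builds the query URL in the same form as the one in the thread and parses a response, assuming the index server returns one JSON object per line when `output=json` is set. The function names and the sample record are illustrative assumptions, not part of any Common Crawl tooling.

```python
import json
from urllib.parse import urlencode

# Base address taken from the URL quoted in the thread.
INDEX_BASE = "http://index.commoncrawl.org"


def build_index_query(crawl_id, url_pattern):
    """Build an index query URL like the one quoted in the thread.

    crawl_id is a crawl label such as "CC-MAIN-2016-50"; url_pattern may
    end in "/*" to match every captured URL under a host.
    """
    # Keep "/" and "*" unescaped so the result reads like the thread's URL.
    query = urlencode({"url": url_pattern, "output": "json"}, safe="/*")
    return f"{INDEX_BASE}/{crawl_id}-index?{query}"


def parse_index_lines(text):
    """Parse a response body, assumed to hold one JSON record per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(build_index_query("CC-MAIN-2016-50", "jobs.theguardian.com/*"))
    # Made-up sample record, only to show the parsing step.
    sample = '{"url": "http://jobs.theguardian.com/", "status": "200"}\n'
    print(len(parse_index_lines(sample)), "record(s) parsed")
```

Fetching the URL and counting the parsed records gives a rough idea of how many pages of a host ended up in one monthly crawl. The network call is left out here on purpose so the sketch runs offline.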

Scott

Feb 20, 2017, 10:46:53 PM
to Common Crawl
Sebastian,

I'm curious whether overlap is currently being measured.

If not, I might attempt it myself.

Scott



Tom Morris

Feb 20, 2017, 10:53:58 PM
to common...@googlegroups.com
If they measure the overlap, they don't publish it, which I would hope would be something antithetical to their mission.

Tom


Sebastian Nagel

Feb 21, 2017, 4:05:18 AM
to common...@googlegroups.com
Hi Scott, hi Tom,

The overlap has been measured for a couple of months now, going back as far as 2013. The counts are open, as is
the code to calculate the overlap and other metrics. Have a look at
https://github.com/commoncrawl/cc-crawl-statistics/blob/master/plots/crawloverlap.md
Cumulative counts may be more interesting, though, since they capture the coverage of multiple crawls taken together:
https://github.com/commoncrawl/cc-crawl-statistics/blob/master/plots/crawlsize.md

> If they measure the overlap, they don't publish it, which I would hope would be something
> antithetical to their mission.

Well, no. Let me explain ...

First, it would make no sense to hide metrics that anyone could calculate from the
data (which is open).

Second, help in analyzing the data is welcome. I know many of you have much more experience in
analyzing and visualizing data. Take the counts from
s3://commoncrawl/crawl-analysis/
analyze them, and publish the results (cf. the analysis of the 2012 crawl [1]). You are welcome to do so!
On my side, it is only a lack of time that keeps me from preparing more plots and tables and finally
putting everything together. The most important thing is to have the metrics immediately after a monthly
crawl, to measure the impact of crawler configuration changes: the counts on S3 and the plots on GitHub
are updated every month.

Thanks and best,
Sebastian

[1] http://commoncrawl.org/2013/08/a-look-inside-common-crawls-210tb-2012-web-corpus/
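To make the two notions discussed above concrete (overlap between crawls, and cumulative coverage of several crawls taken together), here is a small Python sketch on toy data. It assumes overlap is expressed as Jaccard similarity over exact URL sets, which is one common choice and is not claimed to be how the published statistics are computed. At the scale of billions of URLs, exact sets would presumably have to be replaced by approximate counting. All names and the example URLs are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two URL sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cumulative_unique(crawls):
    """Number of distinct URLs seen after each crawl, in crawl order."""
    seen = set()
    counts = []
    for urls in crawls:
        seen |= urls  # add this crawl's URLs to everything seen so far
        counts.append(len(seen))
    return counts


if __name__ == "__main__":
    # Toy URL sets standing in for three monthly crawls (made-up data).
    crawl_1 = {"example.com/a", "example.com/b", "example.com/c"}
    crawl_2 = {"example.com/b", "example.com/c", "example.com/d"}
    crawl_3 = {"example.com/a"}
    print(jaccard(crawl_1, crawl_2))
    print(cumulative_unique([crawl_1, crawl_2, crawl_3]))
```

A low Jaccard value between consecutive crawls together with a steadily rising cumulative count would indicate that each month adds mostly new pages, which is the stated goal of reducing overlap.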



Scott

Feb 21, 2017, 10:20:23 PM
to Common Crawl
Thanks Sebastian!

I like the plots; they show good trends too. Congratulations!

I haven't looked at this Google group in over a year and didn't think to check GitHub either.

Scott

Tom Morris

Feb 22, 2017, 12:14:21 AM
to common...@googlegroups.com
Thanks Sebastian. That looks like a great resource and I'm glad you published it. :-)

Tom
