How complete is CommonCrawl?


David Portabella

Sep 14, 2016, 8:27:13 AM
to Common Crawl
I find quite a lot of pages missing from the CommonCrawl dataset.

For instance, the home page of www.ipc.com has a link to an "About" page (a normal link, no JavaScript), but that page is missing from the dataset:
http://www.ipc.com/ -> click "About" -> http://www.ipc.com/about-us/mission-and-values

I queried www.ipc.com/* at http://index.commoncrawl.org/CC-MAIN-2016-36 and it returns pages such as
http://www.ipc.com/solutions/connecting-global-financial-community/simplified-market-access/connexus
http://www.ipc.com/solutions/exchanging-information/real-time-collaboration/communications-software
some of them duplicated (why?), but http://www.ipc.com/about-us/mission-and-values does not appear.

I find many missing examples like this one. Why is that?
Is the query "www.ipc.com/*" not correct?

How complete is the CommonCrawl dataset?

Regards,
David

Tom Morris

Sep 14, 2016, 8:42:31 AM
to common...@googlegroups.com
All crawls, even Google's, are by their nature incomplete. CommonCrawl's is much less complete than Google's, due both to the resources available to run it and to the technology. You could get a rough idea of how much less complete by comparing domain and page counts between the two, but the coverage isn't homogeneous, and usually any consumer of the crawl is only interested in a subset (a certain language, a certain subject such as travel, etc.), so it makes more sense to analyze coverage in the domain of interest.
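
A minimal sketch of such a per-domain coverage check against the public CommonCrawl URL index (this assumes the CDX endpoint at index.commoncrawl.org for the CC-MAIN-2016-36 crawl and the third-party requests library; result pagination is ignored):

import json
import requests

# Ask the CDX index of one monthly crawl for every capture under www.ipc.com/.
CDX = "http://index.commoncrawl.org/CC-MAIN-2016-36-index"
resp = requests.get(CDX, params={"url": "www.ipc.com/*", "output": "json"})

if resp.status_code == 404:
    # The index server answers 404 when nothing matches the query.
    records = []
else:
    resp.raise_for_status()
    # One JSON record per line, with fields such as "url", "timestamp", "status".
    records = [json.loads(line) for line in resp.text.splitlines()]

urls = {r["url"] for r in records}
print(len(records), "captures,", len(urls), "distinct URLs under www.ipc.com/")
print("About page indexed:",
      "http://www.ipc.com/about-us/mission-and-values" in urls)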

Tom


David Portabella

Sep 14, 2016, 9:21:25 AM
to Common Crawl
Thanks for the answer.

To understand the limitations of CommonCrawl, is it possible to focus on one example and work out why it is not crawled?

The example I gave is very simple. CommonCrawl has crawled the page http://www.ipc.com/, and this page contains a simple link (no JavaScript):
<a href="/about-us/mission-and-values">Mission and Values</a>

So it is a level-1 link (not level 100).
Why has CommonCrawl not crawled this page?

Is it a limitation of the technology, or of resources?

How can I learn more about the limitations?


Cheers,
David


Sebastian Nagel

Sep 14, 2016, 11:17:49 AM
to common...@googlegroups.com
Hi David,

At present the Common Crawl crawler does not follow links directly,
in order to avoid filling the data sets with duplicates and spam.

Right now there are more than 7 billion URLs in the URL database.
That's only a sample of the web, of course. Every month a subsample
is selected for fetching. The URLs are mostly donations from Blekko
and moz.com, and should be free from duplicates and spam.

Since the donations, and therefore the updates of the URL database,
don't happen on a regular basis at the moment, it's likely that the
URL you mention is simply missing from our URL database. We know that
we need more frequent updates and are working on it. But in any case,
there will never be a guarantee that any host or domain is crawled
entirely. We have to sample for every crawl simply because of limited
resources. Also, every monthly crawl data set should be a representative
sample of the web on its own. This may require taking only a sample of
the pages of a single host or domain.

Regarding the duplicate URLs (http://www.ipc.com/): that's probably
caused by outdated URLs which are redirected to the home page. The
crawler is distributed and, as a limitation of that design, is not able
to deduplicate redirect targets. We hope to get this fixed in the
future, too.

Best,
Sebastian



David Portabella

Sep 14, 2016, 12:04:33 PM
to Common Crawl
Thanks Sebastian for the answer.

> This may require taking only a sample of the pages of a single host or domain.
I see. This constraint might not fit our project requirements. :(

I see that the latest snapshot of CommonCrawl has 1.23 billion web pages,
and the latest snapshot of archive.org has 505 billion web pages.

I checked archive.org, and they do have the example page I mentioned: http://www.ipc.com/about-us/mission-and-values
It seems their dataset is more complete,
but as you said, there might be many duplicates in those 505 billion web pages.


Do you know if it is possible to download or run an extraction algorithm on the archive.org dataset?


Cheers,
David

Greg Lindahl

Sep 14, 2016, 1:04:10 PM
to common...@googlegroups.com
On Wed, Sep 14, 2016 at 09:04:32AM -0700, David Portabella wrote:

> I see that the latest snapshot of CommonCrawl
> <http://commoncrawl.org/2016/07/june-2016-crawl-archive-now-available/> has
> 1.23 billion web pages,
> and the latest snapshot of archive.org <https://archive.org/web/> has 505
> billion web pages.

Archive.org's 505 billion number is 'captures'; it counts every time
it downloaded something from the web, even if it was an exact repeat
of a previous capture.

> Do you know if it is possible to download or run an extraction algorithm on
> the archive.org dataset?

archive.org has a CDX index endpoint, very similar to what Common
Crawl has. But that's just the index; getting at the raw data in
archive.org is much more complicated.
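
As an illustration, the Wayback Machine's CDX endpoint can be queried in much the same way as CommonCrawl's; a minimal sketch (assuming the public web.archive.org/cdx/search/cdx API and the requests library):

import requests

# Look up captures of one page in the Internet Archive's CDX index.
# With output=json the endpoint returns a JSON array whose first row is the header.
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={
        "url": "ipc.com/about-us/mission-and-values",
        "output": "json",
        "limit": 10,
    },
)
rows = resp.json()
if not rows:
    print("no captures")
else:
    header, captures = rows[0], rows[1:]
    for row in captures:
        rec = dict(zip(header, row))
        print(rec["timestamp"], rec["statuscode"], rec["original"])

As noted above, this only reads the index; retrieving the archived content itself is a separate and more involved step.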

-- greg

Tom Morris

Sep 15, 2016, 1:21:12 AM
to common...@googlegroups.com
On Wed, Sep 14, 2016 at 12:04 PM, David Portabella <david.po...@gmail.com> wrote:

I see that the latest snapshot of CommonCrawl has 1.23 billion web pages,
and the latest snapshot of archive.org has 505 billion web pages.

As Greg mentioned, this isn't a snapshot count for archive.org, but a count over all time.

Over the most recent 18 months for which there are crawls, CommonCrawl captured 54.4 billion pages, which is still almost an order of magnitude difference, but not 2+ orders of magnitude. As I said before, though, raw numbers don't really tell much of the story. For your cohort of interest, the ipc.com About page, the score is archive.org 1 and CommonCrawl 0, so that's how you should judge things (assuming it's a representative sample).

Tom

Greg Lindahl

Sep 16, 2016, 1:25:14 AM
to common...@googlegroups.com
On Thu, Sep 15, 2016 at 01:21:08AM -0400, Tom Morris wrote:

> For your cohort of interest,
> the ipc.com About page, the score is archive.org 1 and CommonCrawl 0, so
> that's how you should judge things (assuming it's a representative sample).

And in all fairness, was blekko wrong? :-)

David Portabella

Sep 16, 2016, 6:59:18 AM
to Common Crawl
Thanks again for your answers.

> This may require taking only a sample of the pages of a single host or domain.

What strategy is used to make the sample?
Could you point me to where this is done in the source code (https://github.com/commoncrawl/commoncrawl-crawler)?


The CommonCrawl initiative is great, and it is in everybody's interest to have such an open archive be as complete as possible.
You mention that you take a sample because of limited resources.

What resources would CommonCrawl need in order to remove the sample restriction?


Cheers,
David

Sebastian Nagel

Sep 19, 2016, 10:14:02 AM
to common...@googlegroups.com
Hi David,

> What strategy is used to make the sample?
> Could you point me to where this is done in the source code?

The code for selecting the actual fetch list is here:
https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/Generator2.java

But don't overestimate the impact of the fetch list selection. There is a large overlap between
successive monthly crawls: about 0.65 in terms of Jaccard similarity; that is, roughly 1.25 bln. of
the 1.95 bln. unique URLs fetched in the July and August crawls are common to both crawls. It's our
goal to make the crawls more diverse and to reduce the overlap, to satisfy users interested in large
and diverse content for NLP or machine learning.
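
(As a concrete check of that figure, a tiny sketch of the overlap computation on two sets of fetched URLs, with the quoted counts plugged in:)

def jaccard(a, b):
    # Jaccard similarity between two URL sets: size of the intersection / size of the union.
    return len(a & b) / len(a | b)

# With the numbers above: 1.25 bln URLs common to both crawls out of
# 1.95 bln unique URLs across the July and August crawls combined.
print(1.25e9 / 1.95e9)   # ~0.64, i.e. roughly the 0.65 overlap mentioned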

> You mention that you take a sample because of limited resources.
> What resources would CommonCrawl need in order to remove the sample restriction?

Removing the "sample restriction" would mean to grow the crawls in orders of magnitude.
We'd definitely need more of the following resources:

1. storage of the crawl archives: we now have 1.3 Petabytes of released crawl archives and we add 40
TB new data every month. Amazon is generous in allowing us to save this data as a public data set
free of charge. This is also an optimal solution for our users because the resources to process
and/or download the data or the processing results are available on AWS.

2. computation: a monthly crawl now runs for 10-12 days on 100 EC2 instances.

3. development / operation: run and monitor the crawls, maintain and develop the crawler, ensure
quality of the crawled data, eliminate duplicates and spam, etc.

Of course, also the users would need more resources (of point 2 and 3) to process more data. And
there are many users with limited resources. That could be one argument to invest resources not only
in more but also in better data.

Best,
Sebastian


David Portabella

Sep 19, 2016, 10:53:40 AM
to Common Crawl
Hi Sebastian,

Thanks for this detailed explanation.
I'll see if the Scale Up project can help on this.


Cheers,
David

Christian Lund

Sep 25, 2016, 8:17:22 AM
to Common Crawl
We run the newly started Webxtrakt project. We collect information about domain names and present the results using filters, plus a variety of upcoming tools useful for web professionals. We are constantly expanding our domain name databases (at the moment we emphasise ccTLDs).

From a sample test on the CDX Server API it is clear that a lot of ccTLD domains are missing from the crawl index, and we would be happy to provide you with new seeds on a recurring basis. We could provide Common Crawl with access to XML feeds, or via requests to the CDX Server API (in case you log requests that return no results).

Feel free to get in touch with me to see how we can contribute.

Sebastian Nagel

Sep 27, 2016, 11:20:34 AM
to common...@googlegroups.com
Hi Christian,

Thanks for the offer to provide seeds. I'll get back to you off-list in one or two weeks.
For the September crawl it's too late now; it's already running.

> From a sample test on the CDX Server API it is clear that a lot of ccTLD domains are missing from
> the crawl index

Yes, we know about this and really want to improve the coverage of the crawls. Of course, we have to
find a good balance between broad coverage (as many hosts/domains as possible) and in-depth coverage
of popular domains, especially because the number of hosts on the web (take 1 bln. as an estimate) is
close to the size of one of our monthly crawls. But a ranked list of missing hosts/domains would
help us to improve the coverage. Thanks!

> ... or via requests to the CDX Server API (in case you log requests that return no results).
Better not: although the server logs are kept for a couple of months, it's a small server with
limited disk space, and at present we do not mine the logs.

Best,
Sebastian



Prashant Shiralkar

Oct 18, 2018, 3:41:50 PM
to Common Crawl
Hi Sebastian,

Sorry to reopen this old thread, but since I could not find an answer, I am asking here: is the URL database used to sample for the monthly crawls available? If so, where and how can I access it?

Waiting to hear. Thanks!

Prashant

Sebastian Nagel

Oct 19, 2018, 4:51:03 AM
to common...@googlegroups.com
Hi Prashant,

> this old thread

... and a lot has changed since 2016. The URL database has become
much more dynamic. It's updated every month by:

- adding status information (fetch time, HTTP status, checksum/signature)
for 4-5 billion page requests

- updating the URL list with 3-4 billion links, URLs from sitemaps and
seed donations; this adds between 500 and 1000 million new URLs

- flagging URLs as duplicates

- removing URLs that have not been seen as links or seeds for a longer time.
At the end of 2017 we reached 17.5 billion entries [1] in the URL database and
updates of the db became too expensive/slow. By removing obsolete URLs the
size of the db is kept manageable.


Technically, the URL "database" is an Apache Nutch CrawlDb: a set of Hadoop
map files stored on S3 in a private bucket. There is nothing secret in it;
most of the URLs are published anyway in the crawl archives and in the URL
indexes. The arguments against making it public are the format (bound
to a specific piece of software) and its dynamic nature (files may change or
disappear at any time without warning). Please let me know if you're still
interested and the URL indexes are not an alternative; we can then discuss
internally whether making the URL database public is possible.

Best,
Sebastian

[1] https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlermetrics
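
For many lookups the public URL indexes are indeed a workable substitute for the CrawlDb. A minimal sketch that checks a single URL against every monthly index (this assumes the collinfo.json listing and the per-crawl CDX endpoints on index.commoncrawl.org, plus the requests library; it issues one request per crawl, so keep the request rate polite):

import requests

# Enumerate all published index collections, then look one URL up in each.
collections = requests.get("http://index.commoncrawl.org/collinfo.json").json()

url = "www.ipc.com/about-us/mission-and-values"
for coll in collections:
    api = coll["cdx-api"]  # per-crawl CDX endpoint, e.g. .../CC-MAIN-2018-43-index
    r = requests.get(api, params={"url": url, "output": "json"})
    hits = len(r.text.splitlines()) if r.status_code == 200 else 0
    print(coll["id"], ":", hits, "capture(s)")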



Prashant Shiralkar

Oct 20, 2018, 11:20:48 PM
to Common Crawl
I see. That's helpful. Thanks a lot for the info, Sebastian!

Jayce Wong

Nov 6, 2018, 3:58:48 AM
to Common Crawl
Hi Sebastian,

Thanks for your effort in providing the data in Common Crawl. It helps a lot.

I'm trying to use URLs from Common Crawl in my research, and I want to verify the security of the URLs.

I noticed that you mentioned the URLs are from moz.com, so generally they should be legitimate.

However, when I used VirusTotal to verify the security of the URLs, I found that some of them were malicious.

Therefore, I'd like to know whether you have checked the security of the URLs (safe, malicious, etc.).

Sorry to bother you, but I don't have the resources to verify the security of all the URLs.

Best,
Jayce


Sebastian Nagel

Nov 6, 2018, 5:03:27 AM
to common...@googlegroups.com
Hi Jayce,

> I noticed that you mentioned the URLs are from moz.com

We got a seed donation of 300 million URLs from moz.com in May 2016;
that's only a small portion of the URLs, and I don't know how many of them
are still reachable two years later. Anyway, things have changed since then
and most of the URLs now come from other sources; see the earlier post in this
discussion.

> However, when I used VirusTotal to verify the security of the URLs, I found that some of them were malicious.
> Therefore, I'd like to know whether you have checked the security of the URLs (safe,
> malicious, etc.). Sorry to bother you, but I don't have the resources to verify the
> security of all the URLs.

No, we haven't and clearly do not have the resources to do this, especially
because the notion of "malicious" and "safe" changes over time and we would
need to rerun the analysis from time to time to guarantee the safety of
all archives.

Well, it's a good question whether a broad sample web crawl should exclude
spam, malicious sites and all the other kinds of garbage and trash pages
on the internet. There has always been a small amount of such content
in the Common Crawl archives.

Any exclusion of "malicious sites" would also make the crawl archives less
usable for web security research. That's a common research topic done on
the Common Crawl data, cf.
https://scholar.google.de/scholar?q=commoncrawl+vulnerability

If anybody has done a large-scale analysis of recent crawl archives, it
would be interesting to hear about it.

Thanks,
Sebastian



Jayce Wong

Nov 6, 2018, 7:23:43 AM
to Common Crawl
Hi Sebastian,

  Thanks for your reply. That helps a lot.

Best,
Jayce


Greg Lindahl

Nov 6, 2018, 10:10:28 PM
to common...@googlegroups.com
On Tue, Nov 06, 2018 at 11:03:23AM +0100, Sebastian Nagel wrote:

> No, we haven't and clearly do not have the resources to do this, especially
> because the notion of "malicious" and "safe" changes over time and we would
> need to rerun the analysis from time to time to guarantee the safety of
> all archives.

I don't know that much about malicious websites, but back when I spent
a lot of time hanging out with the engineers at Yandex, they mentioned
that most malicious URLs were on generally-legit websites and didn't
remain malicious for that long -- the websites were being temporarily
exploited, and would eventually clean out the bad stuff. I don't think
there's any way that CC could possibly label things that were
malicious at the time they were crawled.

-- greg

