Hi Jayce,
> I notice that you mention that the URLs are from moz.com <https://moz.com/>
We got a seed donation of 300 million URLs from moz.com in May 2016.
That's only a small portion of the URLs, and I don't know how many of them
are still reachable two years later. Anyway, things have changed since then
and most of the URLs now come from other sources; see another post in this
discussion.
> However, when I used VirusTotal <https://www.virustotal.com/#/home/upload>
> to verify the security of the URLs, I found some of them were malicious.
> Therefore, I'd like to know whether you have checked the security of the URLs (safe,
> malicious, etc.). Sorry to bother you, but I don't have the resource to verify the
> security of all the URLs.
No, we haven't, and we clearly do not have the resources to do this, especially
because the notion of "malicious" and "safe" changes over time and we would
need to rerun the analysis from time to time to guarantee the safety of
all archives.
Well, it's a good question whether a broad sample web crawl should exclude
spam, malicious sites and all the other kinds of garbage and trash pages
on the internet. There has always been a small amount of such content
in the Common Crawl archives.
Any exclusion of "malicious sites" would also make the crawl archives less
usable for web security research. That's a common research topic using
the Common Crawl data, cf.
https://scholar.google.de/scholar?q=commoncrawl+vulnerability
If anybody has done a large-scale analysis of recent crawl archives,
it would be interesting to hear about it.
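
For anyone who wants to try such an analysis on a small scale first, here is
a minimal sketch in Python. It assumes you already have a list of URLs and a
set of blocklisted host names from a source you trust; the blocklist, the
function names and the example URLs are all made up for illustration, and
none of this is something Common Crawl provides. It only draws a random
sample of URLs and counts how many point to a blocklisted host.

```python
import random
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host name of a URL, or None if it has none."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. broken IPv6 brackets
        return None


def flag_sample(urls, blocklist, sample_size, seed=0):
    """Sample URLs and count how many hit a blocklisted host.

    `blocklist` is a set of host names (hypothetical input, see above).
    Returns the number of URLs checked and a Counter of flagged hosts.
    """
    rng = random.Random(seed)  # fixed seed so a rerun checks the same sample
    sample = rng.sample(urls, min(sample_size, len(urls)))
    hits = Counter()
    for url in sample:
        host = host_of(url)
        if host in blocklist:
            hits[host] += 1
    return len(sample), hits


if __name__ == "__main__":
    urls = [
        "http://example.com/a",
        "http://example.com/b",
        "https://bad.example.net/login",
        "http://EXAMPLE.org/",
    ]
    blocklist = {"bad.example.net"}  # made-up entry for illustration
    checked, hits = flag_sample(urls, blocklist, sample_size=4)
    print(checked, dict(hits))
```

Because "malicious" changes over time, the same sample would have to be
rechecked against an updated blocklist from time to time; the fixed seed is
only there to make such reruns comparable.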
Thanks,
Sebastian
On 11/6/18 9:58 AM, Jayce Wong wrote:
> Hi Sebastian,
>
> Thanks for your effort in providing the data in Common Crawl. It helps a lot.
>
> I'm trying to use URLs from Common Crawl to carry out my research, and I want to verify the security
> of the URLs.
>
> I notice that you mention that the URLs are from moz.com <https://moz.com/>, so generally they
> should be legitimate.
>
> However, when I used VirusTotal <https://www.virustotal.com/#/home/upload> to verify the security
> > and moz.com <http://moz.com>, and should be free from duplicates and spam.
> >
> > Since the donations, and therefore the updates of the URL database, aren't
> > on a regular basis now, it's likely that the mentioned URL is
> > just missing in our URL database. We know that we need more frequent
> > updates and are working on it. But in any case, there will never be
> > a guarantee that any host or domain is crawled entirely. We
> > have to sample for every crawl simply because of limited resources.
> > Also, every monthly crawl data set should be a representative sample
> > of the web on its own. This may require taking only a sample of
> > the pages of one single host or domain.
> >
> > Regarding the duplicate URLs (http://www.ipc.com/):
> > That's probably because of outdated URLs which are redirected
> > to the home page. The crawler is distributed and, as a limitation,
> > is not able to deduplicate redirect targets. We also hope to
> > get this fixed in the future.
> >
> > Best,
> > Sebastian
> >
> >
> >
> > --
> > You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to
> > common-crawl...@googlegroups.com.
> > To post to this group, send email to common...@googlegroups.com.
> > Visit this group at https://groups.google.com/group/common-crawl.
> > For more options, visit https://groups.google.com/d/optout.