Mixture of questions


Hugo Stegrell

May 11, 2023, 3:53:22 AM
to Common Crawl
Hi all,

We are doing a thesis project analyzing the frequency of vulnerabilities online. We're using WAT+WET as our data source and would like to describe how Common Crawl works as accurately as possible. We've found most of the information in earlier topics here, in the FAQ, and in the source code; however, a few points are still unclear to us.
  1. How are the websites selected to be included in a crawl? What are the sources of URLs? From our understanding there is a database; is it public?

  2. Is the Mozilla Public Suffix List still used? Saw it mentioned in earlier threads but no specifics on how it is used. Perhaps only for the Web Graphs nowadays?

  3. Are URLs guaranteed to be crawled only once now? We read that previously one URL might appear in multiple segments, but that may have changed now that the selection works differently.

  4. What is WET short for? Multiple sources say WAT stands for Web Archive Transformation, but we found only one source (the Stanford Library) saying WET stands for "WARC Encapsulated Text".

Looking forward to your answers!

Best regards,
Hugo

Sebastian Nagel

May 11, 2023, 10:14:06 AM
to common...@googlegroups.com
Hi Hugo,

> 1. How are the websites selected to be included in a crawl?

Based on the domain-level harmonic centrality ranks calculated on the
latest available hyperlink graphs. The rank defines how many URLs are
sampled for this particular domain.
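The ranking idea can be sketched on a toy graph. The domains and links below are invented for illustration; the real computation runs on a domain-level web graph with hundreds of millions of nodes, but the definition of harmonic centrality is the same:

```python
from collections import deque

# Toy domain-level link graph: an edge points from the linking domain
# to the linked domain. All domain names here are made up.
GRAPH = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

def shortest_dists(graph, src):
    """BFS distances from src, following edge direction."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def harmonic_centrality(graph):
    """h(v) = sum over u != v of 1 / d(u, v), where d(u, v) is the
    shortest-path distance from u to v; unreachable pairs add 0."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    scores = {v: 0.0 for v in nodes}
    for u in nodes:
        for v, d in shortest_dists(graph, u).items():
            if v != u:
                scores[v] += 1.0 / d
    return scores

scores = harmonic_centrality(GRAPH)
# c.example is linked (directly or indirectly) by everyone, so it
# ranks highest; d.example has no incoming paths and scores 0.
```

In this sketch, higher-scoring domains would get more URLs sampled into the fetch lists, which matches the description above in spirit, though the actual sampling function is a Common Crawl implementation detail.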

> Sources of URLs?

Links on publicly visible pages and URLs from sitemaps
(https://sitemaps.org/).

> there is a database, is it public?

No. It's a Nutch CrawlDb which includes the URL, status information,
and some metadata. Most of the information is available in the URL
index. However, the CrawlDb also includes URLs that were never even
attempted, URLs excluded by robots.txt, URLs whose host name failed
to resolve, and URLs that were not sampled.


> 2. Is the Mozilla Public Suffix List still used?

Yes, it is, and it's a crucial component: it is called every time a URL
or host name is mapped to a domain (one level below the registry
suffix). See [1,2] for details.

> Perhaps only for the Web Graphs nowadays?

For the web graphs, but also by the crawler and for the statistics [3].


> 3. Are URLs guaranteed only to be crawled once now?

No, there is no guarantee. The share of URL-level duplicates is about
0.5% in recent crawls. Every URL is guaranteed to be unique within the
fetch lists of a single crawl, but URLs may redirect, and redirects
are deduplicated only per segment.
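How per-segment deduplication lets a duplicate slip through can be sketched as follows. The fetch lists and redirects below are invented; the point is only that two distinct URLs in different segments can redirect to the same target:

```python
# Two segments' fetch lists: URLs are unique across the whole crawl.
# All URLs here are invented for illustration.
segments = [
    ["http://a.example/old", "http://b.example/"],    # segment 1
    ["http://a.example/moved", "http://c.example/"],  # segment 2
]

# Hypothetical redirects discovered while fetching: two different
# source URLs lead to the same target.
redirects = {
    "http://a.example/old": "http://a.example/new",
    "http://a.example/moved": "http://a.example/new",
}

crawled = []
for fetch_list in segments:
    seen = set()  # dedup scope is a single segment, reset each time
    for url in fetch_list:
        target = redirects.get(url, url)
        if target not in seen:
            seen.add(target)
            crawled.append(target)

# http://a.example/new was fetched twice, once per segment: a
# URL-level duplicate despite unique fetch lists.
```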


> 4. What is WET short for?

> "WARC Encapsulated Text"

That's what Jordan Mendelson called it; see also [4,5,6].
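For reference, a WET file is itself a WARC file whose records carry the extracted plain text as `WARC-Type: conversion` records. A minimal sketch of what one record looks like and how its header splits from the payload (the URI, date, and text below are invented):

```python
# Minimal WET record, as found inside *.warc.wet.gz files. The text
# extraction is stored as a WARC record of type "conversion".
# URI, date, and payload are invented example values.
record = (
    "WARC/1.0\r\n"
    "WARC-Type: conversion\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "WARC-Date: 2023-05-11T07:53:22Z\r\n"
    "Content-Type: text/plain\r\n"
    "Content-Length: 12\r\n"
    "\r\n"
    "Hello, text!"
)

def parse_record(raw):
    """Split one WARC record into version line, header fields, payload."""
    head, _, payload = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version = lines[0]
    fields = dict(line.split(": ", 1) for line in lines[1:])
    return version, fields, payload

version, fields, text = parse_record(record)
```

This is only a toy parser for a single in-memory record; real WET processing should use a proper WARC library that handles gzip members, content lengths in bytes, and multi-record files.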


Best,
Sebastian


[1] https://www.publicsuffix.org/
[2] https://crawler-commons.github.io/crawler-commons/1.3/crawlercommons/domains/EffectiveTldFinder.html
[3] https://commoncrawl.github.io/cc-crawl-statistics/
[4] https://commoncrawl.org/2013/11/new-crawl-data-available/
[5] https://github.com/commoncrawl/ia-web-commons/blob/master/src/main/java/org/archive/extract/WETExtractorOutput.java
[6] https://github.com/Aloisius/ia-web-commons/commits/master/src/main/java/org/archive/extract/WETExtractorOutput.java

