Query for all domain names in .com TLD


Jason Boxman

Jan 16, 2024, 8:41:21 PM
to Common Crawl
Hi,

I haven't been able to track down the column sizes for recent indexes, so I can't guess the cost of this on AWS Athena. For `.no` it was 119.29 MB. `.com` is doubtless more data than this. I want to avoid a surprise high bill. Is there a really tiny TLD to test with?

Based on what I've found, I have this query I have not yet run:

SELECT DISTINCT url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2023-50'
  AND subset = 'warc'
  AND url_host_tld = 'com'
  AND content_languages LIKE 'eng%'
  AND REGEXP_LIKE(url_path, '^/$')
  AND fetch_status = 200

The goal is to get a distinct list of all the crawled .com domain names that successfully returned an index page.

Do crawls include "parked" domains? If so, is there any way to omit them? I've seen some papers that explain approaches using DNS registration details or parsing the HTML itself for markers, but this seems highly complex.

Thanks!

Greg Lindahl

Jan 16, 2024, 8:45:59 PM
to common...@googlegroups.com
Jason,

You can find how large every TLD is at
https://commoncrawl.github.io/cc-crawl-statistics/plots/tld/latestcrawl.html

We do not exclude parked domains. Parked-domain signatures change
rapidly, and we tend not to crawl them anyway because they are often
not linked to by anyone else.

-- greg

Jason Boxman

Jan 16, 2024, 10:06:41 PM
to Common Crawl
Hi Greg,

Athena charges based on the amount of data scanned. That page includes domain and page counts, but I can't find a per-column size in MB/GB/TB anywhere. Is there somewhere I can look this up, or a query I can run to get it?

Thanks!

Tom Morris

Jan 16, 2024, 10:45:15 PM
to common...@googlegroups.com
On Tue, Jan 16, 2024 at 10:06 PM Jason Boxman <jason....@gmail.com> wrote:

Athena charges based on amount of data read. That page includes counts, but I can't find any size in MB/GB/TB per column anywhere. Is there somewhere I can look for this or a query that I can run to get this?

If you look at the domain counts for .no and .com in the table that Greg pointed to, .com is approximately 144x the size. Multiplying your ~120 MB scan size for .no by that ratio gives roughly 17.25 GB, or about 0.02 TB, so you're not going to put much of a dent in your 1 TB/month free quota (but check the math).
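The back-of-the-envelope estimate above can be sketched in a few lines. The 119.29 MB scan size and the ~144x domain-count ratio come from this thread; the $5/TB price is an assumption based on Athena's list price in most regions, so check your own region's pricing:

```python
# Rough Athena cost estimate for the .com query, scaling the measured
# .no scan size by the approximate .com/.no domain-count ratio.
NO_SCAN_MB = 119.29       # measured scan size for the .no query (from the thread)
RATIO_COM_TO_NO = 144     # approximate .com / .no domain-count ratio (from the thread)
PRICE_PER_TB_USD = 5.0    # assumed Athena price per TB scanned

est_gb = NO_SCAN_MB * RATIO_COM_TO_NO / 1000   # decimal units, MB -> GB
est_tb = est_gb / 1000                         # GB -> TB
est_cost = est_tb * PRICE_PER_TB_USD

print(f"~{est_gb:.2f} GB (~{est_tb:.4f} TB), roughly ${est_cost:.2f}")
```

The exact GB figure shifts slightly depending on whether you use decimal or binary units and how you round the count ratio, but either way the scan stays far under a terabyte.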

BTW, I think the query below will give you slightly better results (doesn't require eng to be the first language and uses equals instead of regex for path):

SELECT DISTINCT url_host_registered_domain, content_languages
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2023-50'
  AND subset = 'warc'
  AND url_host_tld = 'no'
  AND content_languages LIKE '%eng%'
  AND url_path = '/'
  AND fetch_status = 200

Tom

Tom Morris

Jan 16, 2024, 10:50:04 PM
to common...@googlegroups.com
I decided to just run it for real on the .com TLD: it took 17.396 seconds, scanned 15.46 GB, and returned 9,443,803 domains with successfully fetched home pages.

Tom

Jason Boxman

Jan 17, 2024, 2:13:30 PM
to Common Crawl
Hi Tom,

How does this number relate to the total registered .com TLD domains, which as of earlier this month is ~157 million?

```
wc -l com.txt
 157793831 com.txt
```

If I ran the query against additional crawls, I imagine the number of distinct domains will increase?

Ultimately, my goal is to pull actual HTML pages for domains; I'm conducting a survey to see what portion of sites are using ad networks for monetization of content, and therefore I'll need to parse the actual HTML for the presence of ad network HTML snippets. I wanted to start with just a random sample of domains to assess feasibility, so the latest crawl is a good place to start. I might eventually want to expand to additional crawls to cover additional unique domain names in .com TLD.

I found the following query[0] as an approach for a random sampling, so perhaps I can combine it with the earlier query, and add in warc_filename, warc_record_offset, warc_record_length, warc_segment, content_mime_type (HTML-only). With additional columns, I expect this might be much more expensive than my original query.

```
SELECT url
FROM "ccindex"."ccindex"
TABLESAMPLE BERNOULLI (.5)
WHERE crawl = 'CC-MAIN-2020-34'
  AND (subset = 'warc' OR subset = 'crawldiagnostics')
```
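Once a query like this returns warc_filename, warc_record_offset, and warc_record_length, a single record can be pulled with an HTTP Range request against data.commoncrawl.org. A minimal sketch, assuming only the standard library (the helper names are mine, and the filename/offset values in real use would come from the query results):

```python
# Sketch: fetch one WARC record via an HTTP Range request.
# Each record in a Common Crawl WARC is an independent gzip member,
# so the returned byte range can be decompressed on its own.
import gzip
import urllib.request

CC_BASE = "https://data.commoncrawl.org/"

def build_url(warc_filename: str) -> str:
    # warc_filename from the index is a path relative to the data bucket.
    return CC_BASE + warc_filename

def build_range_header(offset: int, length: int) -> str:
    # HTTP Range is inclusive on both ends, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(warc_filename: str, offset: int, length: int) -> bytes:
    req = urllib.request.Request(
        build_url(warc_filename),
        headers={"Range": build_range_header(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

The decompressed bytes contain the WARC headers, the HTTP response headers, and the HTML body, which is what you'd scan for ad-network snippets.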

Thanks!

Tom Morris

Jan 17, 2024, 7:56:01 PM
to common...@googlegroups.com
On Wed, Jan 17, 2024 at 2:13 PM Jason Boxman <jason....@gmail.com> wrote:

How does this number relate to the total registered .com TLD domains, which as of earlier this month is ~ 157 million?

Is that a trick question? It's approximately 9.44/157 or ~6%.
 
If I ran the query against additional crawls, I imagine the number of distinct domains will increase?

Correct.
 
Ultimately, my goal is to pull actual HTML pages for domains; I'm conducting a survey to see what portion of sites are using ad networks for monetization of content, and therefore I'll need to parse the actual HTML for the presence of ad network HTML snippets.

I would definitely prototype your full pipeline to see what unexpected issues you encounter, but isn't the answer going to be, to the first approximation, "all"?
 
I wanted to start with just a random sample of domains to assess feasibility, so the latest crawl is a good place to start. I might eventually want to expand to additional crawls to cover additional unique domain names in .com TLD.

I found the following query[0] as an approach for a random sampling, so perhaps I can combine it with the earlier query, and add in warc_filename, warc_record_offset, warc_record_length, warc_segment, content_mime_type (HTML-only). With additional columns, I expect this might be much more expensive than my original query.

Reading the description of Bernoulli sampling[1] makes me think that it won't actually reduce the amount of data that is scanned -- and a little experimentation confirms that. I don't think you need the warc_segment in this query, unless you need it for other purposes, because the warc_filename includes the complete path with the segment ID in it. It looks like the additional columns approximately double the cost of the query.

Have you considered using the WAT files instead of the WARCs? They're 1/4 the size and have all the links pre-parsed out of the HTML, but they're not indexed so you'd need to read entire files and skip anything not of interest. I'm not sure the tradeoffs are in their favor, but it's worth considering.
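If the WAT route looks worth exploring, each WAT record's payload is a JSON document with the pre-parsed links nested under Envelope → Payload-Metadata → HTTP-Response-Metadata → HTML-Metadata. A hedged sketch of pulling them out (the function name is mine; a reader library such as warcio would supply the raw payload bytes per record):

```python
# Sketch: extract the pre-parsed link list from one WAT record payload.
# The nested key path follows the WAT JSON layout; any missing level
# yields an empty list rather than raising a KeyError.
import json

def links_from_wat_payload(raw: bytes) -> list:
    payload = json.loads(raw)
    return (
        payload.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
```

Note that WAT records carry metadata and links rather than the raw HTML, so for detecting ad-network snippets embedded in page markup you would still need the WARCs; the WAT link lists could at best serve as a cheap first-pass filter.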
 
Tom
