Hi Tom,
How does this number relate to the total count of registered .com TLD domains, which, as of earlier this month, stands at ~157 million?
```
wc -l com.txt
157793831 com.txt
```
If I ran the query against additional crawls, I imagine the number of distinct domains would increase?
Ultimately, my goal is to pull actual HTML pages for these domains. I'm conducting a survey of what portion of sites use ad networks to monetize content, so I'll need to parse the HTML for the presence of ad-network snippets. I wanted to start with just a random sample of domains to assess feasibility, so the latest crawl is a good place to start; I may eventually expand to additional crawls to cover more unique domain names in the .com TLD.
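For the parsing step, I'm thinking of something along these lines: scan each page's `<script>` tags for known ad-network hostnames. This is just a rough sketch with an illustrative (far from exhaustive) domain list:

```
# Sketch: detect common ad-network snippets in an HTML page.
# The domain list below is illustrative only, not exhaustive.
from html.parser import HTMLParser

AD_NETWORK_DOMAINS = (
    "googlesyndication.com",   # Google AdSense
    "doubleclick.net",         # Google Ad Manager
    "amazon-adsystem.com",     # Amazon ads
)

class AdScriptFinder(HTMLParser):
    """Collect ad-network domains referenced by <script src=...> tags."""

    def __init__(self):
        super().__init__()
        self.ad_networks_found = set()

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src") or ""
        for domain in AD_NETWORK_DOMAINS:
            if domain in src:
                self.ad_networks_found.add(domain)

def find_ad_networks(html: str) -> set:
    parser = AdScriptFinder()
    parser.feed(html)
    return parser.ad_networks_found
```

Inline snippets (ads loaded via inline JS rather than a `src` attribute) would slip past this, so I may also need a plain substring search over the raw HTML.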
I found the following query[0] as an approach to random sampling. Perhaps I can combine it with the earlier query, adding the warc_filename, warc_record_offset, warc_record_length, and warc_segment columns and filtering content_mime_type to HTML only. With the additional columns, I expect this could be much more expensive than my original query.
```
SELECT url
FROM "ccindex"."ccindex"
TABLESAMPLE BERNOULLI (.5)
WHERE crawl = 'CC-MAIN-2020-34'
AND (subset = 'warc' OR subset = 'crawldiagnostics')
```
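Untested, but I imagine the combined query would look something like this (the sample rate and the restriction to the 'warc' subset are my guesses; I'd drop 'crawldiagnostics' since I only want successfully fetched HTML):

```
SELECT url,
       warc_filename,
       warc_record_offset,
       warc_record_length,
       warc_segment
FROM "ccindex"."ccindex"
TABLESAMPLE BERNOULLI (.5)
WHERE crawl = 'CC-MAIN-2020-34'
  AND subset = 'warc'
  AND content_mime_type = 'text/html'
```

Does that look reasonable, or is there a cheaper way to get a random sample with the WARC locator columns?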
Thanks!