to Common Crawl
Hi again,
I am looking for a list of unique hosts in the Common Crawl crawls. I used Athena with this query:
SELECT DISTINCT url_host_name, content_languages
FROM "ccindex"."ccindex"
WHERE subset = 'warc'
This resulted in 379,756,278 hosts. Then I looked at the nodes file for Jul/Aug/Sep (thanks again to Sebastian Nagel for helping), and it has 538,570,861 hosts. My assumption is that the additional hosts were seen in crawls as link targets, but their pages were never crawled. Am I right?
I also downloaded all 300 URL index files of the last crawls and arrived at a different number, but I assume, based on what I have read here in this group, that a crawl will not include all known URLs.
Best
Tom
Tom Alby
Feb 3, 2021, 3:05:45 PM
to Common Crawl
Correction: the query should be without content_languages; with it, the result would actually contain more rows than distinct hosts :)
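To see why including content_languages inflates the count, here is a minimal sketch with made-up rows (the host names and language codes are hypothetical, not taken from the index). DISTINCT over two columns counts distinct pairs, so a host served in several languages shows up more than once:

```python
# Hypothetical sample rows shaped like (url_host_name, content_languages).
rows = [
    ("example.com", "eng"),
    ("example.com", "deu"),
    ("example.com", "eng"),   # exact duplicate pair, removed by DISTINCT
    ("example.org", "eng"),
    ("example.org", None),    # pages without a detected language
]

# SELECT DISTINCT url_host_name, content_languages -> distinct pairs
distinct_pairs = set(rows)

# SELECT DISTINCT url_host_name -> distinct hosts only
distinct_hosts = {host for host, _ in rows}

print(len(distinct_pairs), len(distinct_hosts))
```

Selecting only url_host_name in the DISTINCT clause gives the host count directly.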
Sebastian Nagel
Feb 3, 2021, 3:34:51 PM
to common...@googlegroups.com
Hi Tom,
the number of distinct host names in a monthly crawl is currently
around 50k. The webgraph includes all hosts of 3 crawls either
visited by the crawler or seen in outgoing links. As a further subtlety,
the hostnames in the graphs are normalized and cleaned up:
- www. prefix removed
- only host names with a valid TLD suffix are kept (e.g., IP addresses are removed)
while the column url_host_name in the index holds exactly the string
returned by the method java.net.URL.getHost().
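To compare the two host lists, the index host names would have to be normalized in roughly the same way first. The following is only a rough Python sketch of the two rules listed above, not the actual webgraph code: the function names and the tiny VALID_TLDS set are hypothetical stand-ins (a real run would need a full TLD list), and urlsplit().hostname lowercases the host, so it is not an exact equivalent of java.net.URL.getHost().

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for a complete list of valid TLDs.
VALID_TLDS = {"com", "org", "net", "de", "io"}


def raw_host(url):
    """Return the host part of a URL (lowercased), or None if there is none."""
    return urlsplit(url).hostname


def is_ip_address(host):
    """True if the host is an IPv4 or IPv6 literal."""
    try:
        ipaddress.ip_address(host.strip("[]"))
        return True
    except ValueError:
        return False


def normalize_host(host):
    """Apply the two rules above; return None if the host is dropped."""
    if not host:
        return None
    host = host.lower().rstrip(".")
    if is_ip_address(host):
        return None                      # IP addresses are removed
    if host.startswith("www."):
        host = host[4:]                  # www. prefix removed
    tld = host.rpartition(".")[2]
    if "." not in host or tld not in VALID_TLDS:
        return None                      # no valid TLD suffix
    return host


print(normalize_host(raw_host("https://www.example.com/page")))
print(normalize_host(raw_host("http://192.168.0.1/")))
```

After normalization, www.example.com and example.com collapse into a single host, which is one reason the counts from the index and from the graph are not directly comparable.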
> that a crawl will not include all known URLs.
Yes. No way, the web is too big, we need to sample.