List of root domains?

Aleks B

Apr 3, 2017, 9:35:58 AM
to Common Crawl
Hello all!

Could anyone please tell me when a list of root domains was last made public, either by Common Crawl or by someone using its data, and where I could find it?

Thank you!

Best,
Aleks

Sebastian Nagel

Apr 3, 2017, 2:45:25 PM
to common...@googlegroups.com
Hi Aleks,

you'll find the most recent list of TLDs (better "public suffixes") here:
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2017-13/stats/part-00000.gz
e.g.
["tld","college","CC-MAIN-2017-13"] [499,495,35,32]
["tld","cologne","CC-MAIN-2017-13"] [493,483,48,45]
["tld","com","CC-MAIN-2017-13"] [1705938827,1682767670,21694698,20589406]
["tld","com.af","CC-MAIN-2017-13"] [1699,1695,32,26]
["tld","com.ag","CC-MAIN-2017-13"] [183,180,14,13]
["tld","com.bm","CC-MAIN-2017-13"] [2,2,1]
["tld","com.ci","CC-MAIN-2017-13"] 1

Counts are by [page, URL, host, domain]; repeated trailing numbers are skipped, so [2,2,1] stands for [2,2,1,1] and a bare 1 for [1,1,1,1].

The list also contains the 500 most frequent pay-level domains, e.g.:
["domain","blogspot.ca","CC-MAIN-2017-13"] [1208865,1205653,64557]
["domain","diaperswappers.com","CC-MAIN-2017-13"] [592719,592716,2]
["domain","engadget.com","CC-MAIN-2017-13"] [434253,339742,13]
Counts include [page, URL, host].
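
A minimal sketch (not part of the original reply) of how such a line could be read in Python. It assumes each line holds a JSON key followed by a JSON value, separated by whitespace, and that a shortened value is expanded by repeating its last number. The function name parse_stats_line and the width parameter are made up for illustration.

```python
import json

_decoder = json.JSONDecoder()

def parse_stats_line(line, width):
    """Split one 'key value' line into (key, counts), padding counts to `width`."""
    line = line.strip()
    # Decode the leading JSON key; `end` is the offset where the key stops.
    key, end = _decoder.raw_decode(line)
    value = json.loads(line[end:].strip())
    # A value may be a bare number or a (possibly shortened) list.
    counts = value if isinstance(value, list) else [value]
    # Assumption: skipped trailing numbers equal the last number that is present.
    counts = counts + [counts[-1]] * (width - len(counts))
    return key, counts

key, counts = parse_stats_line('["tld","com.bm","CC-MAIN-2017-13"] [2,2,1]', width=4)
print(key, counts)
```

Use width=4 for the "tld" lines (page, URL, host, domain) and width=3 for the "domain" lines (page, URL, host).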

A full list of hosts and pay-level domains is available at
s3://commoncrawl/crawl-analysis/CC-MAIN-2017-13/count/
e.g.,
[3,"adobepress.com",22] [4491,4469,1]
[3,"adobeps.ru",22] [714,714,2]
[3,"adobepueblo.com",22] 2

For details about the data and how it was extracted, have a look at
https://github.com/commoncrawl/cc-crawl-statistics/

Best,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

Adarsh Samuels

Apr 24, 2017, 3:43:54 AM
to Common Crawl
Hi Sebastian,

I'm unable to access the link you provided that lists all domains crawled in a month. Can you please provide an alternative? Will it include when each website was crawled and whether it was revisited?
- As a follow-up, can you also provide some context on how page ranking is done? (I read on the forum that this determines the probability of a page being re-crawled.)
- How often will a website be crawled, how often will the same website be crawled in successive months, what rules dictate this, and what is the average overlap of the websites crawled in successive months?

Thank you for your time

Adarsh

Sebastian Nagel

Apr 24, 2017, 5:54:21 AM
to common...@googlegroups.com
Hi Adarsh,

> I'm unable to access the link that you have provided that lists all domains crawled in a month,
> can you please provide an alternative?

> https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2017-13/stats/part-00000.gz
I've double-checked that this link is valid.

> s3://commoncrawl/crawl-analysis/CC-MAIN-2017-13/count/
To list and fetch data on S3, it's best to install the AWS CLI from
https://aws.amazon.com/cli/
the listing is then shown by
aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-analysis/CC-MAIN-2017-13/count/
and files can be fetched via "aws s3 cp ..."

It's only 10 files numbered sequentially, so you may use these https links instead:
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2017-13/count/part-00000.bz2
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2017-13/count/part-00001.bz2
...
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2017-13/count/part-00009.bz2
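
The sequential names can also be generated rather than typed out. This is only a sketch: it assumes the ten files follow the part-00000 to part-00009 pattern shown above and are still served at these paths. The helper name count_file_urls is made up for illustration.

```python
BASE = "https://commoncrawl.s3.amazonaws.com/crawl-analysis"

def count_file_urls(crawl="CC-MAIN-2017-13", parts=10):
    """Build the https links for the sequentially numbered count files."""
    return [f"{BASE}/{crawl}/count/part-{i:05d}.bz2" for i in range(parts)]

for url in count_file_urls():
    print(url)
```

Each printed URL can then be fetched with any HTTP client, for example urllib.request.urlretrieve(url, filename) from the Python standard library.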

> how page ranking is done
Right now the page-level rankings are somewhat inconsistent, as they are based on various
sources, mainly
- the Blekko rankings for old (but still available) pages
- page-level OPIC, seeded with host-level harmonic centrality ranks, for newer pages
We plan to make the host-level rankings available within the next few weeks.

> - How often will a website be crawled, how often will the same website be crawled in successive
> months and what rules dictate this and what would be the average overlap of the websites crawled
> in successive months?

At present, we do not have site-level (host or domain) aggregations of re-crawl frequencies or
overlaps. Page-level overlaps are shown on
https://commoncrawl.github.io/cc-crawl-statistics/plots/crawloverlap
and the bare numbers of hosts, domains and TLDs (public suffixes) are shown on
https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize

Overlaps aggregated on host or domain level could be calculated from the host/domain counts.
You're welcome to do the calculations. It would be great if you could share the code used to
produce the metrics and plots!
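
As a starting point, here is a rough sketch of one way such an overlap could be computed. It uses Jaccard similarity over the sets of names found in two crawls' count files, which is just one possible measure chosen for illustration. It assumes the first key element (3 in the examples earlier in this thread) is a record-type code and the second is the host or domain name; that reading should be checked against the cc-crawl-statistics repository. The lines in crawl_b are hypothetical and only there to make the example run.

```python
import json

_decoder = json.JSONDecoder()

def names_from_lines(lines, type_code):
    """Collect the names (second key element) of all lines with the given type code."""
    names = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Only the leading JSON key is needed; the counts after it are ignored.
        key, _ = _decoder.raw_decode(line)
        if key[0] == type_code:
            names.add(key[1])
    return names

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Lines as quoted earlier in this thread for CC-MAIN-2017-13.
crawl_a = [
    '[3,"adobepress.com",22] [4491,4469,1]',
    '[3,"adobeps.ru",22] [714,714,2]',
    '[3,"adobepueblo.com",22] 2',
]
# Hypothetical lines standing in for a second crawl (not real data).
crawl_b = [
    '[3,"adobepress.com",23] [4000,3990,1]',
    '[3,"adobeps.ru",23] 700',
]

overlap = jaccard(names_from_lines(crawl_a, 3), names_from_lines(crawl_b, 3))
print(overlap)
```

For real data, the lines would come from the decompressed count files of two crawls, for example by iterating over files opened with bz2.open(path, "rt").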

Thanks,
Sebastian


