Hi Adarsh,
> I'm unable to access the link that you have provided that lists all domains crawled in a month,
> can you please provide an alternative?
>
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2017-13/stats/part-00000.gz
I've double-checked that this link is valid.
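The stats file consists of JSON-keyed lines; as described further down in the quoted message, the counts are [page, URL, host, domain] and repeated trailing numbers are skipped. A minimal parsing sketch (assuming key and counts are whitespace-separated, which matches the example lines quoted below):

```python
import json

def parse_stats_line(line, n_counts=4):
    """Parse one stats line, e.g.
        ["tld","com.bm","CC-MAIN-2017-13"] [2,2,1]
    Counts are [page, URL, host, domain]; repeated trailing
    values are omitted, so we pad by repeating the last one."""
    key_part, count_part = line.rsplit(maxsplit=1)
    key = json.loads(key_part)
    counts = json.loads(count_part)
    if isinstance(counts, int):  # a single number stands in for all counts
        counts = [counts]
    counts += [counts[-1]] * (n_counts - len(counts))
    return key, counts
```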
> s3://commoncrawl/crawl-analysis/CC-MAIN-2017-13/count/
To list and fetch data on S3, it's best to install the AWS CLI from
https://aws.amazon.com/cli/
You can then list the files with
aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-analysis/CC-MAIN-2017-13/count/
and fetch them via "aws s3 cp ...".
There are only 10 files, numbered sequentially, so you may use these HTTPS links instead:
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2017-13/count/part-00000.bz2
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2017-13/count/part-00001.bz2
...
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2017-13/count/part-00009.bz2
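For completeness, here's a small sketch of fetching and decompressing all ten part files using only Python's standard library (the URL pattern is taken from the links above):

```python
import bz2
import urllib.request

BASE = "https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2017-13/count/"

def part_urls(n=10):
    """Build the ten sequentially numbered part-file URLs."""
    return [f"{BASE}part-{i:05d}.bz2" for i in range(n)]

def fetch_part(url):
    """Download one part file and return its decompressed text."""
    with urllib.request.urlopen(url) as resp:
        return bz2.decompress(resp.read()).decode("utf-8")
```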
> how page ranking is done
Right now the page-level rankings are somewhat inconsistent, as they are based on various
sources, mainly
- the Blekko rankings for old (but still available) pages
- page-level OPIC seeded with host-level harmonic centrality ranks for newer pages
We plan to make the host-level rankings available in the coming weeks.
> - How often will a website be crawled, how often will the same website be crawled in successive
> months and what rules dictate this and what would be the average overlap of the websites crawled
> in successive months?
At present, we do not have site-level (host or domain) aggregations of re-crawl frequencies or
overlaps. Page-level overlaps are shown on
https://commoncrawl.github.io/cc-crawl-statistics/plots/crawloverlap
and the bare numbers of hosts, domains and TLDs (public suffixes) are shown on
https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize
Overlaps aggregated on host or domain level could be calculated from the host/domain lists.
You're welcome to do the calculations; it would be great if you could share the code used to
compute the metrics and plots!
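As a starting point, here is a rough sketch of what such an overlap calculation could look like, given the host or domain lists of two crawls (the function and metric names are my own choice, not anything existing in cc-crawl-statistics):

```python
def overlap_metrics(domains_a, domains_b):
    """Simple overlap metrics between two crawls' host/domain sets:
    the Jaccard similarity of the two sets, and the share of
    crawl A's sites that were seen again in crawl B."""
    a, b = set(domains_a), set(domains_b)
    inter = len(a & b)
    union = len(a | b)
    return {
        "jaccard": inter / union if union else 0.0,
        "recrawled_share": inter / len(a) if a else 0.0,
    }
```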
Thanks,
Sebastian
On 04/24/2017 09:43 AM, Adarsh Samuels wrote:
> Hi Sebastian,
>
> I'm unable to access the link that you have provided that lists all domains crawled in a month, can
> you please provide an alternative? Will it include when the website was crawled and if it were
> revisited?
> - As a follow up, can you also provide some context towards how page ranking is done (Read from the
> forum that this determines the probability of a page being re-crawled)
> - How often will a website be crawled, how often will the same website be crawled in successive
> months and what rules dictate this and what would be the average overlap of the websites crawled in
> successive months?
>
> Thank you for your time
>
> Adarsh
>
> On Tuesday, 4 April 2017 00:15:25 UTC+5:30, Sebastian Nagel wrote:
>
> Hi Aleks,
>
> you'll find the most recent list of TLDs (better "public suffixes") here:
>
https://commoncrawl.s3.amazonaws.com/crawl-analysis/CC-MAIN-2017-13/stats/part-00000.gz
> e.g.
> ["tld","college","CC-MAIN-2017-13"] [499,495,35,32]
> ["tld","cologne","CC-MAIN-2017-13"] [493,483,48,45]
> ["tld","com","CC-MAIN-2017-13"] [1705938827,1682767670,21694698,20589406]
> ["tld","com.af","CC-MAIN-2017-13"] [1699,1695,32,26]
> ["tld","com.ag","CC-MAIN-2017-13"] [183,180,14,13]
> ["tld","com.bm","CC-MAIN-2017-13"] [2,2,1]
> ["tld","com.ci","CC-MAIN-2017-13"] 1
>
> Counts are by [page, URL, host, domain], repeated trailing numbers are skipped.
>
> The list also contains the 500 most frequent pay-level domains, e.g.:
> ["domain","blogspot.ca","CC-MAIN-2017-13"] [1208865,1205653,64557]
> ["domain","diaperswappers.com","CC-MAIN-2017-13"] [592719,592716,2]
> ["domain","engadget.com","CC-MAIN-2017-13"] [434253,339742,13]
> Counts include [page, URL, host].
>
> A full list of hosts and pay-level domains is available at
> s3://commoncrawl/crawl-analysis/CC-MAIN-2017-13/count/
> e.g.,
> [3,"adobepress.com",22] [4491,4469,1]
> [3,"adobeps.ru",22] [714,714,2]
> [3,"adobepueblo.com",22] 2
>
> For details about the data and how it was extracted, have a look at
>
https://github.com/commoncrawl/cc-crawl-statistics/
>
> Best,
> Sebastian
>
> On 04/03/2017 03:35 PM, Aleks B wrote:
> > Hello all!
> >
> > Could anyone please tell me when is the last time a list of root domains have been made public in
> > CommonCrawl or someone using their data and perhaps where could I find it?
> >
> > Thank you!
> >
> > Best,
> > Aleks
> >
> > --
> > You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to
> > common-crawl...@googlegroups.com.
> > To post to this group, send email to common...@googlegroups.com.
> > Visit this group at https://groups.google.com/group/common-crawl.
> > For more options, visit https://groups.google.com/d/optout.
>