i want all domains and subdomains


miraz sarker

Apr 6, 2022, 4:46:49 AM
to Common Crawl
Hello sir, I want all the domain names and subdomains Common Crawl has, but I can't extract the data because I don't have enough storage or a machine powerful enough to extract it. Could you please help me get all the domain and subdomain names Common Crawl has?

Sebastian Nagel

Apr 6, 2022, 5:11:49 AM
to common...@googlegroups.com
Hi Miraz,

please have a look at our webgraph data sets - here the latest one:

https://commoncrawl.org/2022/03/host-and-domain-level-web-graphs-oct-nov-jan-2021-2022/

The *-ranks.txt.gz files (or, alternatively, the vertices files) include
all host and/or domain names the crawler has visited or seen in outlinks
during 3 monthly crawls. That's only a few gigabytes.

Best,
Sebastian

miraz sarker

Apr 6, 2022, 5:17:15 AM
to Common Crawl
Sir, I don't need hosts. Would you please send me a filtered list which only has domains and subdomains?

Sebastian Nagel

Apr 6, 2022, 5:39:41 AM
to common...@googlegroups.com
Hi Miraz,

could you give a couple of examples of what exactly you mean by "domain"
and "subdomain"?

The domain-level graph includes only registered domains including
the number of hosts for each domain.
A "domain" is everything one level below a registry suffix:
example.com
example.co.uk
Both "com" and "co.uk" are suffixes defined in the ICANN section
of the public suffix list.

The host-level webgraph includes all host names:
www1.example.com
www2.sub.example.co.uk
The term "host" is used following the URL syntax spec [1],
but IP addresses are removed. Also a leading "www." is stripped.
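
To make the "one level below a registry suffix" rule concrete, here is a minimal Python sketch. It is not how the webgraph itself is built: the function name registered_domain and the three-entry SUFFIXES set are made up for illustration, and a real implementation would load the full public suffix list.

```python
# Tiny illustrative subset of suffixes (an assumption for this sketch);
# the real public suffix list contains thousands of entries.
SUFFIXES = {"com", "uk", "co.uk"}

def registered_domain(host, suffixes=SUFFIXES):
    """Return the name one level below the longest matching suffix, or None."""
    labels = host.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest,
    # so that "co.uk" is matched before "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None  # no known suffix, or the host is itself a bare suffix

print(registered_domain("www1.example.com"))        # example.com
print(registered_domain("www2.sub.example.co.uk"))  # example.co.uk
```

Both host names from the examples above map back to the two domains listed earlier.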

Best,
Sebastian

[1] https://en.wikipedia.org/wiki/URL#Syntax
> --
> You received this message because you are subscribed to the Google
> Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to common-crawl...@googlegroups.com
> <mailto:common-crawl...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/d5fa0504-1192-4bdf-af73-55dc15735b9fn%40googlegroups.com
> <https://groups.google.com/d/msgid/common-crawl/d5fa0504-1192-4bdf-af73-55dc15735b9fn%40googlegroups.com?utm_medium=email&utm_source=footer>.

miraz sarker

Apr 6, 2022, 5:46:23 AM
to Common Crawl

No sir, I checked the list and found that the entries look like this:
ch.myspreadshop.allthecamo

but the main domain is
allthecamo.myspreadshop.ch

I need the main domain names, not names like ch.myspreadshop.allthecamo

Sebastian Nagel

Apr 6, 2022, 6:54:28 AM
to common...@googlegroups.com
Hi Miraz,

host and domain names are given in reverse domain name notation,
see the references below.

Here are a couple of "one-liners" to un-reverse them:

(Perl)
perl -lne 'print join(".", reverse split /\./)'

(Python 3)
python -c "import sys; [
print('.'.join(line.rstrip().split('.')[::-1])) for line in sys.stdin ]"

(Python 2)
python -c "import sys; print '\n'.join([
'.'.join(line.rstrip().split('.')[::-1]) for line in sys.stdin ])"
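
The same reversal as a small reusable Python function, in case it is easier to test than a one-liner. This is only a sketch; the name unreverse is made up here. Applying it twice gives back the original string, so the same function also converts a normal name into the reversed notation.

```python
def unreverse(name):
    """Flip the order of the dot-separated labels in a host or domain name."""
    return '.'.join(reversed(name.strip().split('.')))

print(unreverse('ch.myspreadshop.allthecamo'))  # allthecamo.myspreadshop.ch
print(unreverse('uk.co.example'))               # example.co.uk
```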


Putting everything together, the domain list is written by the
following Linux command:

zcat cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz \
| cut -f5 \
| tail -n+2 \
| perl -lne 'print join(".", reverse split /\./)' \
| gzip \
>cc-main-2021-22-oct-nov-jan-domains.txt.gz

If you could share your preferred programming language and environment
I might be able to give you more detailed support.

Best,
Sebastian

[1] https://en.wikipedia.org/wiki/Reverse_domain_name_notation
[2] https://groups.google.com/g/common-crawl/c/5WnRFlRoHco/m/XQllPrbRGQAJ

On 4/6/22 11:46, miraz sarker wrote:
>
> no sir i checked the list and i found that the list are like 
> ch.myspreadshop.allthecamo
>
> but the main domain is 
> allthecamo.myspreadshop.ch
>
> i need the main domain names not like this  ch.myspreadshop.allthecamo

miraz sarker

Apr 6, 2022, 9:27:57 AM
to Common Crawl
Dear sir, thanks for your help, but I need a little bit more help. Would you write a Python script for me where I give the Common Crawl list as input and it extracts the domains properly?

miraz sarker

Apr 7, 2022, 9:39:26 AM
to Common Crawl
hello sir please help me

Sebastian Nagel

Apr 8, 2022, 10:03:56 AM
to common...@googlegroups.com
Hi Miraz,

see below.

Best,
Sebastian


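import gzip

input_file = 'cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz'

with gzip.open(input_file, mode='rt', encoding='ascii') as f:
    # skip the first line, as "tail -n+2" does in the shell command
    next(f, None)
    for line in f:
        # the 5th tab-separated column holds the reversed domain name
        domain_reversed = line.rstrip('\n').split('\t')[4]
        # un-reverse: "ch.myspreadshop" -> "myspreadshop.ch"
        domain = '.'.join(domain_reversed.split('.')[::-1])
        print(domain)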
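
If storage is tight, a variant of the script can stream straight from the compressed input into a compressed output file, so nothing is ever unpacked on disk. The following is a sketch under assumptions: the function name write_domain_list is made up, the header line is assumed to start with "#", and the reversed name is assumed to sit in the 5th tab-separated column, as in the shell command earlier in this thread.

```python
import gzip

def write_domain_list(src_path, dst_path, column=4):
    """Stream a *-ranks.txt.gz file and write un-reversed names to dst_path.

    Returns the number of names written.
    """
    count = 0
    with gzip.open(src_path, mode='rt', encoding='ascii') as src, \
         gzip.open(dst_path, mode='wt', encoding='ascii') as dst:
        for line in src:
            if line.startswith('#'):
                continue  # assumed header line
            fields = line.rstrip('\n').split('\t')
            if len(fields) <= column:
                continue  # skip malformed or truncated lines
            # un-reverse the labels and write one name per line
            dst.write('.'.join(reversed(fields[column].split('.'))) + '\n')
            count += 1
    return count
```

Usage would look like write_domain_list('cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz', 'cc-main-2021-22-oct-nov-jan-domains.txt.gz'). Only one line is held in memory at a time.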

Stephane Coulondre

Apr 9, 2022, 5:50:28 AM
to Common Crawl
Sebastian is undoubtedly very patient ...

Bpm Tips

Jun 15, 2022, 2:24:10 PM
to Common Crawl

If you need to run certain Spark SQL queries on the columnar index, let us know; we can publicly post the query results in CSV format.

E.g. the list of all domains with the number of URLs for each is available at the following link.

Query used:

val sqlDF = sqlContext.sql("SELECT url_host_name AS domain, count(*) AS size FROM urls GROUP BY url_host_name ORDER BY size DESC")
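
To show what this kind of aggregation returns, here is a small self-contained Python sketch that runs a per-host count on an in-memory SQLite table instead of Spark. The table contents are made-up toy rows, and SQLite is only a stand-in for the columnar index. The query is written with an explicit GROUP BY, which the count needs, plus a secondary sort key so the output order is deterministic. Since the column is url_host_name, the counts are per host name.

```python
import sqlite3

# Toy stand-in for the "urls" table; the rows are invented for illustration.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE urls (url TEXT, url_host_name TEXT)')
conn.executemany('INSERT INTO urls VALUES (?, ?)', [
    ('https://example.com/', 'example.com'),
    ('https://example.com/a', 'example.com'),
    ('https://www1.example.com/', 'www1.example.com'),
])

# One row per host name with the number of URLs seen for it.
rows = conn.execute(
    'SELECT url_host_name AS domain, count(*) AS size '
    'FROM urls GROUP BY url_host_name '
    'ORDER BY size DESC, domain'
).fetchall()
print(rows)  # [('example.com', 2), ('www1.example.com', 1)]
```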

Ashish Rai

Jun 18, 2022, 5:42:09 PM
to Common Crawl
Hi, 
Does this list cover all indexes or only a certain index?

Sebastian Nagel

Jun 20, 2022, 2:32:53 PM
to common...@googlegroups.com
Hi,

given the numbers in the download link it's about a single crawl,
presumably the May crawl.

Best,
Sebastian

ibmbp...@gmail.com

Jun 21, 2022, 10:04:12 AM
to common...@googlegroups.com
Yes, it is the May 2022 crawl only.

Anil
