Hi Miraz,
Host and domain names are given in reverse domain name notation
(see the references below).
Here are a couple of "one-liners" to un-reverse them:
(Perl)
perl -lne 'print join(".", reverse split /\./)'
(Python 3)
python3 -c "import sys; [
print('.'.join(line.rstrip().split('.')[::-1])) for line in sys.stdin ]"
(Python 2)
python -c "import sys; print '\n'.join([
'.'.join(line.rstrip().split('.')[::-1]) for line in sys.stdin ])"
Putting everything together, the domain list is written by the
following Linux command:
zcat cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz \
| cut -f5 \
| tail -n+2 \
| perl -lne 'print join(".", reverse split /\./)' \
| gzip \
>cc-main-2021-22-oct-nov-jan-domains.txt.gz
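If a shell pipeline is not convenient, the same steps could be sketched in Python 3 roughly as follows. This is only a minimal sketch: the function names unreverse and write_domain_list are made up for illustration, and it assumes (as the cut and tail steps above imply) that the file is tab-separated, that the reversed name is in the fifth column, and that the first line is a header. Unlike cut, it skips lines with fewer than five fields.

```python
import gzip


def unreverse(name):
    """Turn 'ch.myspreadshop.allthecamo' into 'allthecamo.myspreadshop.ch'."""
    return ".".join(reversed(name.split(".")))


def write_domain_list(ranks_path, out_path, column=4):
    """Write the un-reversed names from a gzipped ranks file to a gzipped list.

    column=4 is the zero-based index of the 5th tab-separated field,
    i.e. the equivalent of `cut -f5` (an assumption about the file layout).
    """
    with gzip.open(ranks_path, "rt", encoding="utf-8") as src, \
         gzip.open(out_path, "wt", encoding="utf-8") as dst:
        next(src, None)  # skip the header line, like `tail -n+2`
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > column:  # skip short lines
                dst.write(unreverse(fields[column]) + "\n")
```

It would be called as write_domain_list("cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz", "cc-main-2021-22-oct-nov-jan-domains.txt.gz"). The file is read line by line, so it does not need to fit into memory.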
If you could share your preferred programming language and environment
I might be able to give you more detailed support.
Best,
Sebastian
[1] https://en.wikipedia.org/wiki/Reverse_domain_name_notation
[2] https://groups.google.com/g/common-crawl/c/5WnRFlRoHco/m/XQllPrbRGQAJ
On 4/6/22 11:46, miraz sarker wrote:
>
> no sir i checked the list and i found that the list are like
> ch.myspreadshop.allthecamo
>
> but the main domain is
>
> allthecamo.myspreadshop.ch
>
> i need the main domain names not like this ch.myspreadshop.allthecamo
>
>
> On Wednesday, April 6, 2022 at 3:39:41 PM UTC+6 Sebastian Nagel wrote:
>
> Hi Miraz,
>
> could you give a couple of examples of what exactly you mean by "domain"
> and "subdomain"?
>
> The domain-level graph includes only registered domains including
> the number of hosts for each domain.
> A "domain" is everything one level below a registry suffix:
>
> example.com
>
> example.co.uk
> Both "com" and "co.uk" are suffixes defined in the ICANN section
> of the public suffix list.
>
> The host-level webgraph includes all host names:
>
> www1.example.com
>
> www2.sub.example.co.uk