Why the reverse-order domain names?


Ruben Wolff

Feb 5, 2019, 6:07:13 PM
to Common Crawl
I was just wondering why you guys reversed the order of the domain names?

like in these files http://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/

P.S.
If anyone needs it, it's easy to put them back in normal order:


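import sys

# Read tab-separated rows from stdin and un-reverse the host name in the fifth column.
for line in sys.stdin:
    # Strip the newline first so it cannot end up inside the host name
    # when the host is the last column of the row.
    linearr = line.rstrip('\n').split('\t')
    rev_host = linearr[4]
    # e.g. "uk.ac.ox" -> "ox.ac.uk"
    host = ".".join(reversed(rev_host.split(".")))
    linearr[4] = host
    outline = '\t'.join(linearr)
    sys.stdout.write(outline + '\n')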



save that to revrev.py

then you can do

cat cc-main-2018-aug-sep-oct-domain-ranks.txt | python3 revrev.py > revrev-cc-main-2018-aug-sep-oct-domain-ranks.txt


Sebastian Nagel

Feb 6, 2019, 6:42:52 AM
to common...@googlegroups.com
Hi Ruben,

domain names in reversed order turn out to be quite practical.

First, if you sort a list of reversed-order domain names, the domains
within one top-level domain and the subdomains/hosts of one domain
are grouped together and are nicely aligned for better readability.

The reversed domain names may also help reading the webgraph
rankings when a filter is applied, e.g. search for
uk.ac.
in the ranking table on
http://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/
and the results are nicely aligned and you can focus on the
informative third "column":
uk.ac.ox
uk.ac.cam
uk.ac.ucl
uk.ac.ed
uk.ac.lse
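
A minimal sketch of this grouping effect. The `reverse_host` helper and the sample host list are made up for illustration and are not taken from the ranking files:

```python
def reverse_host(host):
    """Reverse the dot-separated labels: 'ox.ac.uk' -> 'uk.ac.ox'."""
    return ".".join(reversed(host.split(".")))

# Hypothetical sample hosts (assumption, for illustration only).
hosts = [
    "ox.ac.uk", "example.com", "cam.ac.uk",
    "bbc.co.uk", "www.example.com", "ucl.ac.uk",
]

# Plain sort: hosts of one domain or TLD are scattered across the list.
plain_sorted = sorted(hosts)

# Sort by reversed name: each TLD, then each domain, forms one contiguous run.
rev_sorted = sorted(reverse_host(h) for h in hosts)

for name in rev_sorted:
    print(name)
```

In the plain sort, `example.com` and `www.example.com` end up at opposite ends of the list; in the reversed sort they are adjacent, and all `uk.ac.` entries share a common prefix.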

I guess the better readability was also the reason why Java chose
reversed domain names for package names, cf.
https://en.wikipedia.org/wiki/Reverse_domain_name_notation

Btw., the Wikipedia article provides code for a longer list
of programming languages to (un)reverse domain names.


Second, if the nodes of the webgraph are sorted by reversed
domain name, the edges tend to be more "local" because hyperlinks
tend to link pages within one domain (e.g., products or
subdivisions of a company) or within one top-level domain
(sharing the same language or geographical region).

Locality is an important factor when storing graphs efficiently, see
https://www.ics.uci.edu/~djp3/classes/2008_01_01_INF141/Materials/p595-boldi.pdf
or
https://pdfs.semanticscholar.org/c1aa/08cb4a5f1311c945073b9c3c07a590a948d5.pdf
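
A toy illustration of the locality argument, not the method from the linked papers. Node ids are assigned by sort order, and the distance between the ids of linked nodes serves as a rough proxy for locality. The hosts, the edges and the `id_gaps` helper are all invented for this sketch:

```python
def reverse_host(host):
    """Reverse the dot-separated labels: 'ox.ac.uk' -> 'uk.ac.ox'."""
    return ".".join(reversed(host.split(".")))

def id_gaps(hosts, edges, key):
    """Number the nodes by sort order under `key`; return |id(src) - id(dst)| per edge."""
    ids = {h: i for i, h in enumerate(sorted(hosts, key=key))}
    return [abs(ids[src] - ids[dst]) for src, dst in edges]

# Hypothetical hosts and links (assumption): most links stay within one domain or TLD.
hosts = [
    "blog.example.com", "cam.ac.uk", "shop.example.com",
    "ox.ac.uk", "www.example.com", "ucl.ac.uk",
]
edges = [
    ("www.example.com", "shop.example.com"),
    ("www.example.com", "blog.example.com"),
    ("blog.example.com", "shop.example.com"),
    ("ox.ac.uk", "cam.ac.uk"),
    ("ucl.ac.uk", "ox.ac.uk"),
]

plain_gaps = id_gaps(hosts, edges, key=lambda h: h)
rev_gaps = id_gaps(hosts, edges, key=reverse_host)

print("total gap, plain order:   ", sum(plain_gaps))
print("total gap, reversed order:", sum(rev_gaps))
```

On this toy data the reversed ordering gives a smaller total gap, which is the kind of property that gap-based encodings of adjacency lists can benefit from. How large the effect is on a real webgraph depends on the data.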


Third, reversed domain names are frequently used as keys/ids when web data is stored.
The CDX index uses them, e.g.
org,commoncrawl)/the-data/tutorials
and Google's Bigtable paper also mentions "reversed URLs" as row ids:
https://static.googleusercontent.com/media/research.google.com/en/archive/bigtable-osdi06.pdf
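
A simplified sketch of how a key shaped like the CDX example above could be built from a URL. The function name `surt_like_key` is hypothetical, and real CDX tooling applies further canonicalization that is not reproduced here; this only shows the host reversal:

```python
from urllib.parse import urlsplit

def surt_like_key(url):
    """Build a key of the form 'org,commoncrawl)/path' (simplified assumption)."""
    parts = urlsplit(url)
    # urlsplit().hostname is already lower-cased; fall back to "" if absent.
    host = parts.hostname or ""
    # Reverse the host labels and join them with commas instead of dots.
    rev_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    key = f"{rev_host}){path}"
    if parts.query:
        key += "?" + parts.query
    return key

print(surt_like_key("https://commoncrawl.org/the-data/tutorials"))
```

Because the key starts with the reversed host, all pages of one domain (and all domains of one top-level domain) sort next to each other in the index.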


The real reason, of course, is laziness: we could have reversed the domain
names again before publishing the webgraphs and rankings, but this would
have meant an additional step.


Best,
Sebastian


On 2/6/19 12:07 AM, Ruben Wolff wrote:
> I was just wondering why you guys reversed the order of the domain names?
>
> like in these files http://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/
>
> P.S.
> If anyone needs it, it's easy to put them back in normal order:
>
> import sys
>
> # Read tab-separated rows from stdin and un-reverse the host name in the fifth column.
> for line in sys.stdin:
>     # Strip the newline first so it cannot end up inside the host name
>     # when the host is the last column of the row.
>     linearr = line.rstrip('\n').split('\t')
>     rev_host = linearr[4]
>     host = ".".join(reversed(rev_host.split(".")))
>     linearr[4] = host
>     outline = '\t'.join(linearr)
>     sys.stdout.write(outline + '\n')
>
> save that to revrev.py
>
> then you can do
>
> cat cc-main-2018-aug-sep-oct-domain-ranks.txt | python3 revrev.py > revrev-cc-main-2018-aug-sep-oct-domain-ranks.txt
