Hi Ruben,
domain names in reversed order turn out to be quite practical.
First, if you sort a list of reversed domain names, the domains
within one top-level domain and the subdomains/hosts of one domain
are grouped together and are nicely aligned for better readability.
The reversed domain names may also help reading the webgraph
rankings when a filter is applied, e.g. search for
uk.ac.
in the ranking table on
http://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/
and the results are nicely aligned and you can focus on the
informative third "column":
uk.ac.ox
uk.ac.cam
uk.ac.ucl
uk.ac.ed
uk.ac.lse
I guess the better readability was also the reason why Java has
chosen reversed domain names for package names, cf.
https://en.wikipedia.org/wiki/Reverse_domain_name_notation
Btw., the Wikipedia article provides code for a longer list
of programming languages to (un)reverse domain names.
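To illustrate the grouping effect of the first point, here is a minimal sketch. The helper name reverse_host and the sample host list are made up for this example and are not taken from the webgraph files:

```python
def reverse_host(name: str) -> str:
    """Reverse the dot-separated labels of a host or domain name."""
    return ".".join(reversed(name.split(".")))


# Hypothetical sample hosts, deliberately in mixed order.
hosts = [
    "www.ox.ac.uk",
    "example.com",
    "cs.ox.ac.uk",
    "www.example.com",
    "cam.ac.uk",
    "ox.ac.uk",
]

# Sorting the reversed names puts each TLD, then each domain,
# then each host of that domain next to each other.
for rev in sorted(reverse_host(h) for h in hosts):
    print(rev)
# com.example
# com.example.www
# uk.ac.cam
# uk.ac.ox
# uk.ac.ox.cs
# uk.ac.ox.www
```

Applying reverse_host a second time gives back the original name, so the same helper also works for un-reversing.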
Second, if the nodes of the webgraph are sorted by reversed
domain name, the edges tend to be more "local" because hyperlinks
tend to link pages within one domain (e.g., products or
subdivisions of a company) or within one top-level domain
(sharing the same language or geographical region).
Locality is an important factor when storing graphs efficiently, see
https://www.ics.uci.edu/~djp3/classes/2008_01_01_INF141/Materials/p595-boldi.pdf
or
https://pdfs.semanticscholar.org/c1aa/08cb4a5f1311c945073b9c3c07a590a948d5.pdf
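A toy sketch of what "local" means here: number the nodes by their position in the sorted list and look at the average distance between the ids of linked nodes. The hosts, the edges and the mean_gap helper are invented for illustration only. The assumption is that smaller id gaps are what gap-based graph encodings benefit from; this is not a measurement on the real webgraph.

```python
def reverse_host(name):
    """Reverse the dot-separated labels of a host name."""
    return ".".join(reversed(name.split(".")))


def mean_gap(hosts, edges, key):
    """Average |id(source) - id(target)| when ids follow the given sort order."""
    node_id = {h: i for i, h in enumerate(sorted(hosts, key=key))}
    return sum(abs(node_id[s] - node_id[t]) for s, t in edges) / len(edges)


# Hypothetical hosts and intra-domain links.
hosts = [
    "www.example.com", "shop.example.com", "blog.example.com",
    "www.other.org", "docs.other.org",
    "www.ox.ac.uk", "cs.ox.ac.uk",
]
edges = [
    ("www.example.com", "shop.example.com"),
    ("www.example.com", "blog.example.com"),
    ("shop.example.com", "blog.example.com"),
    ("www.other.org", "docs.other.org"),
    ("www.ox.ac.uk", "cs.ox.ac.uk"),
]

plain = mean_gap(hosts, edges, key=lambda h: h)        # sorted by normal name
by_reversed = mean_gap(hosts, edges, key=reverse_host)  # sorted by reversed name
print(plain, by_reversed)  # 3.2 1.2
```

In this toy data all links stay within one domain, so the effect is stronger than it would be on real data, where cross-domain links exist as well.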
Third, reversed domain names are frequently used as keys/ids when web data is stored.
The CDX index uses them, e.g.
org,commoncrawl)/the-data/tutorials
and Google's Bigtable paper also mentions "reversed URLs" as row ids:
https://static.googleusercontent.com/media/research.google.com/en/archive/bigtable-osdi06.pdf
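A rough sketch of how a key of the form shown above could be built from a URL. The function name surt_like_key is made up, and this is only a simplification: it reverses the host labels and appends the path, and it does not reproduce the full canonicalization applied to real CDX index keys.

```python
from urllib.parse import urlsplit


def surt_like_key(url: str) -> str:
    """Build a simplified reversed-host key: 'tld,domain,...)/path'."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Reverse the host labels and join them with commas.
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


print(surt_like_key("http://commoncrawl.org/the-data/tutorials"))
# org,commoncrawl)/the-data/tutorials
```

With keys like this, all pages of one domain are neighbors in a sorted index, so they can be fetched with a single prefix or range lookup.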
The real reason, of course, is laziness: we could have reversed the domain
names again before publishing the webgraphs and rankings, but this would
have meant an additional step.
Best,
Sebastian
On 2/6/19 12:07 AM, Ruben Wolff wrote:
> I was just wondering why you guys reversed the order of the domain names?
>
> like in these files http://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/
>
>
>
> P.S.
> if any one is here its easy to put them back in normal order
>
> import sys
>
> for line in iter(sys.stdin.readline, ""):
>     linearr = line.split('\t')
>     rev_host = linearr[4]
>     host = ".".join(list(reversed(rev_host.split("."))))
>     linearr[4] = host
>     outline = '\t'.join(linearr)
>     sys.stdout.write(outline)
>
>
>
> save that to revrev.py
>
> then you can do
>
> cat cc-main-2018-aug-sep-oct-domain-ranks.txt | python3 revrev.py >> revrev-cc-main-2018-aug-sep-oct-domain-ranks.txt
>
>