Hi Radek,
I ran a job on the October 2014 set in EC2 recently. This is what I did:
1. I processed all the .wat files, scanning for WARC-Target-URI. Whenever I found one, I parsed out the domain (or SLD, which I defined as the host part of the URL: http://<SLD>/) and added it to my collection.
2. My modified code rolled up all the SLDs found. The code in the repo has a minimum count threshold, which I set to 0.
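For anyone who wants to compare approaches, the two steps above can be sketched roughly as follows. This is a minimal illustration and not the code from the repo: the names `host_of` and `count_hosts` are hypothetical, it assumes the header appears at the start of a line as `WARC-Target-URI: <uri>`, and it treats the whole host as the SLD, as defined above.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: the header sits at the start of a line in the .wat file.
TARGET_PREFIX = "WARC-Target-URI:"


def host_of(uri):
    """Return the lowercased host of an http(s) URI, or None if unusable."""
    try:
        parts = urlsplit(uri.strip())
        host = parts.hostname
    except ValueError:
        # Malformed URI (for example a broken IPv6 literal): skip it.
        return None
    if parts.scheme not in ("http", "https") or not host:
        return None
    return host


def count_hosts(lines, min_count=0):
    """Step 1: collect the host of every target URI. Step 2: roll up the counts."""
    counts = Counter()
    for line in lines:
        if line.startswith(TARGET_PREFIX):
            host = host_of(line[len(TARGET_PREFIX):])
            if host is not None:
                counts[host] += 1
    # Hypothetical stand-in for the repo's minimum count threshold.
    return {host: n for host, n in counts.items() if n >= min_count}
```

To run it on a real file, something like `count_hosts(gzip.open(path, "rt", errors="replace"))` should work, since the function only needs an iterable of text lines. The number of distinct SLDs is then `len(result)`.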
Number of urls parsed (without errors):
This page states that the dataset contains 3.8 billion web pages, which is less than my 11 billion URL count.
I found 9 million unique SLDs. The 2012 paper found 41.4 million distinct SLDs from 3.8 billion pages, so I seem to be off by a factor of 4 to 5.
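As a rough sanity check on the figures quoted above, the ratios can be worked out directly. This only uses the rounded numbers from this post (9 million SLDs, 11 billion URLs, and the paper's 41.4 million SLDs from 3.8 billion pages), so the results are approximate.

```python
# Rounded figures quoted in this post.
paper_slds = 41.4e6   # distinct SLDs reported in the 2012 paper
paper_pages = 3.8e9   # pages in the 2012 paper
my_slds = 9e6         # unique SLDs found in this run
my_urls = 11e9        # URLs parsed in this run

# How far apart the two SLD counts are.
sld_ratio = paper_slds / my_slds

# Average number of pages (or URLs) per SLD in each case.
paper_pages_per_sld = paper_pages / paper_slds
my_urls_per_sld = my_urls / my_slds

print(f"SLD ratio (paper / this run): {sld_ratio:.1f}")
print(f"pages per SLD in the paper:   {paper_pages_per_sld:.0f}")
print(f"URLs per SLD in this run:     {my_urls_per_sld:.0f}")
```

The per-SLD averages differ by more than the SLD counts do, which suggests the URL count and the SLD count are both worth double-checking.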
I am very surprised by this result, as I was expecting a lot more unique SLDs. If anyone out there has a count of unique domains/SLDs for this dataset, please let me know what your numbers are.
In conclusion, I think that my numbers are off, and I would really like someone to check my numbers and/or my code. If someone is interested in the code, I can fork the repo.
Thanks,
Henrik