Unique URL count estimate (or duplication percentage) in Common Crawl Archive?


Radek Szamrej

unread,
Dec 15, 2014, 1:00:41 PM12/15/14
to common...@googlegroups.com
Does anyone know an estimated (or exact) number of *unique* URLs (or duplication percentage) in the Common Crawl preferably in  October 2014 Archive?

Henrik Kjallbring

unread,
Dec 15, 2014, 5:16:20 PM12/15/14
to common...@googlegroups.com
Hi Radek,

I ran a job on the October 2014 set in EC2 recently. This is what I did:

1. I processed all the .wat files, scanning for WARC-Target-URI headers. When I found one, I parsed out the domain (or SLD; I defined it as the host in http://<SLD>/) and added it to my collection.
2. My modified code rolled up all the SLDs found. The code in the repo has a minimum-count threshold, which I set to 0.
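For reference, the per-file scan described above can be sketched in Python like this. This is my own illustration, not the repo's code; extract_sld is a naive approximation that takes the last two host labels (a real job should use the Public Suffix List to handle suffixes like .co.uk):

```python
import gzip
from urllib.parse import urlparse

def extract_sld(uri):
    # Naive SLD: the last two labels of the hostname.
    # (Suffixes like .co.uk need the Public Suffix List; this is a sketch.)
    host = urlparse(uri).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

def tally_slds(wat_path, counts=None):
    # Scan one .wat.gz file for WARC-Target-URI headers and tally SLDs.
    counts = counts if counts is not None else {}
    with gzip.open(wat_path, "rt", errors="replace") as f:
        for line in f:
            if line.startswith("WARC-Target-URI:"):
                uri = line.split(":", 1)[1].strip()
                sld = extract_sld(uri)
                counts[sld] = counts.get(sld, 0) + 1
    return counts
```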

Number of URLs parsed (without errors):
11,174,689,482

This page states that the crawl contains 3.8 billion web pages, which is far less than my 11 billion URL count.

I found 9 million unique SLDs. The 2012 paper found 41.4 million distinct SLDs from 3.8 billion pages, so I seem to be off by a factor of 4 or 5.

Now, I am very surprised by this result; I was expecting a lot more unique SLDs. If anyone out there has a count of unique domains/SLDs for this dataset, please let me know what your numbers are.

I started with code from this github repo: https://github.com/AKSHAYUBHAT/CommonCrawl, and tweaked the code to make it work the way I needed it to.

In conclusion, I think that my numbers are off, and I would really like someone to check my numbers and/or my code. If someone is interested in the code, I can fork the repo.

Thanks,
Henrik

Stephen Merity

unread,
Dec 15, 2014, 6:47:10 PM12/15/14
to common...@googlegroups.com
Hi Radek,

I don't have exact figures for you, unfortunately, though someone else might have done work on this. I'm curious whether you mean unique URLs by considering only the URL string itself, or by considering the content that two different URLs might serve.

The duplication percentage for two exact URLs appearing in a single crawl archive should be quite low. The URL list is deduplicated in our preparation stage before crawling begins. The only situation in which the exact same URL should be crawled twice is if the crawler follows a redirect from a previous URL. Computing this should be a fairly trivial run over the data.
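As a sketch of that "trivial run" (my illustration, not Common Crawl's tooling): feed every WARC-Target-URI from the crawl into a counter and measure what fraction of fetches hit a URL that occurs more than once:

```python
from collections import Counter

def exact_url_duplication(urls):
    # `urls` is any iterable of WARC-Target-URI strings pulled from the crawl.
    # Returns the percentage of fetches whose exact URL occurs more than once.
    counts = Counter(urls)
    total = sum(counts.values())
    duplicated = sum(c for c in counts.values() if c > 1)
    return 100.0 * duplicated / total if total else 0.0
```

At full crawl scale an in-memory Counter won't fit, of course; a sort-and-count pass over the extracted URL list, or a probabilistic sketch for the unique count, is the practical route.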

Calculating such a duplication percentage when it comes to content however, rather than just the URL itself, is a far more involved question. For a good introduction to the challenges, I recommend reading "Do Not Crawl in the DUST: Different URLs with Similar Text", "Detecting Near-Duplicates for Web Crawling", and the numerous papers that have built on that work over the years. From that, there are numerous techniques and a wide range of definitions over duplicate content to consider.




--
Regards,
Stephen Merity
Data Scientist @ Common Crawl

Pavel Smrz

unread,
Dec 16, 2014, 3:35:33 AM12/16/14
to common...@googlegroups.com
Hi Radek,

The number of exact URL duplicates (actually, multiplicities) was pretty high in the August crawl: only about two thirds of the URLs were unique.
Pages such as http://adage.com/abstract?article_id=290167 appeared 112 times, http://9gag.com/ appeared in 111 files, etc.

Regards

Pavel

--
Pavel Smrz
Associate professor
Faculty of Information Technology
Brno University of Technology
Bozetechova 2, 61266 Brno
Czech Republic


Akshay Bhat

unread,
Dec 16, 2014, 4:58:30 AM12/16/14
to common...@googlegroups.com
Hi Henrik 
There was an error in my code: it counted every occurrence of WARC-Target-URI in the WAT files.
Because of the way each page is stored, this field appears three times, in the Request, Header, and Response records.
Counting all three would by itself explain a threefold inflation.
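A sketch of the fix: in WAT files each JSON payload records the original record's type under Envelope / WARC-Header-Metadata, so you can count WARC-Target-URI only where the described record is the HTTP response. (The exact field names here are my reading of the WAT format; treat them as an assumption.)

```python
import json

def count_response_uris(lines):
    # `lines` iterates over a decompressed .wat file.
    # Count WARC-Target-URI only for the original 'response' record,
    # skipping the matching 'request' and 'metadata' records.
    n = 0
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip WARC envelope headers; each JSON payload is one line
        header = json.loads(line).get("Envelope", {}).get("WARC-Header-Metadata", {})
        if header.get("WARC-Type") == "response" and "WARC-Target-URI" in header:
            n += 1
    return n
```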

Akshay Bhat

unread,
Dec 16, 2014, 5:01:06 AM12/16/14
to common...@googlegroups.com
Reading your post again, I am sure the threefold error is due to counting "WARC-Target-URI", which repeats in the Request, Response, and Header records.



Radek Szamrej

unread,
Dec 16, 2014, 5:11:46 AM12/16/14
to common...@googlegroups.com
Hi guys,

Thanks for the answers so far.

I have been processing the October 2014 WET files, and the job reported 2,805,571,803 WET items (pages) pushed to storage (I was throwing out pages that did not meet our filtering criteria based on text length).

For each document I generated a hashed ID using a Murmur3 (128-bit) hash of the document's URL and pushed it into our storage.

In the storage (which enforces uniqueness of those IDs) I ended up with only 1,203,794,016 documents.

I was wondering whether this low number of unique IDs was caused by hash collisions, or whether it actually reflects the true number of unique URLs in the Common Crawl data.
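A quick birthday-bound estimate suggests collisions cannot explain the gap. For n uniformly distributed b-bit hashes, the expected number of colliding pairs is roughly n(n-1)/2^(b+1):

```python
def expected_colliding_pairs(n, bits=128):
    # Birthday approximation: expected colliding pairs among n uniform
    # b-bit hash values is about n*(n-1) / 2^(b+1).
    return n * (n - 1) / 2 ** (bits + 1)

# ~2.8 billion URLs under a 128-bit hash:
print(expected_colliding_pairs(2_805_571_803))  # on the order of 1e-20
```

So, assuming Murmur3 output is reasonably uniform, collisions are effectively impossible at this scale, and the ~1.6 billion missing IDs would have to come from duplicate URLs in the input rather than from the hash.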

BTW, my tests on 20 and 381 WET files (containing 1.1M and 28M documents respectively) showed 0 hash collisions in our case.

Best regards,
Radek