Common Crawl now has a URL index!


Lisa Green

Jan 8, 2013, 5:53:38 PM
to common...@googlegroups.com
Thanks to Scott Robertson of triv.io, Common Crawl now has a URL index! Read all about it on Scott's guest blog post http://commoncrawl.org/common-crawl-url-index/

This is a very valuable tool and we are very grateful to Scott for donating his time and skill to create it!

Mat Kelcey

Jan 8, 2013, 6:11:06 PM
to common...@googlegroups.com
nice work scott!


On 8 January 2013 14:53, Lisa Green <li...@commoncrawl.org> wrote:
Thanks to Scott Robertson of triv.io, Common Crawl now has a URL index! Read all about it on Scott's guest blog post http://commoncrawl.org/common-crawl-url-index/

This is a very valuable tool and we are very grateful to Scott for donating his time and skill to create it!


Keiw Kw

Jan 10, 2013, 1:23:18 AM
to common...@googlegroups.com
That's a great job, thank you Scott!

Just a quick question about index size: the current file is 217 GB, but a back-of-the-envelope calculation says it should be around 437 GB.

The question is: does the file format use compression of any kind (it isn't mentioned explicitly in the docs), or is there another reason the index file is half the expected size?

Excuse me if I'm missing something obvious here.

Thanks in advance.

Scott Robertson

Jan 11, 2013, 11:24:25 PM
to common...@googlegroups.com

> That's a great job, thank you Scott!

You're most certainly welcome!

> Just a quick question about index size: the current file is 217 GB, but a back-of-the-envelope calculation says it should be around 437 GB.

Your calculations are correct; I only indexed about half of the corpus. I wanted to get it out and get some feedback, which has already been rolling in.

> The question is: does the file format use compression of any kind (it isn't mentioned explicitly in the docs), or is there another reason the index file is half the expected size?

No compression yet, though it's on my todo list. It'll be a fun challenge to figure out the best way to compress this file while maintaining fixed-size blocks for random access.


sup...@tabguitarlessons.com

Jan 13, 2013, 7:05:14 AM
to common...@googlegroups.com
Awesome work Scott, can't wait for the full index. I can't imagine any simple way of compressing the dataset that's used for random access, but how about providing a compressed downloadable version to save bandwidth when people take a full copy? There are bound to be quite a few of us who will.
Very grateful for this contribution; it's going to make the archive thousands of percent more accessible.
Rich.
 



Tom Morris

Jan 14, 2013, 5:47:26 PM
to common...@googlegroups.com
On Fri, Jan 11, 2013 at 11:24 PM, Scott Robertson <srobe...@codeit.com> wrote:

> No compression yet, though it's on my todo list. It'll be a fun challenge to
> figure out the best way to compress this file while maintaining fixed-size
> blocks for random access.

Since the records are already variable size and you have them
collated, it seems like you could easily implement a simple common
prefix compression scheme, i.e. a count of the number of characters shared
with the previous URL, followed by the remainder of the characters.
That could potentially save a ton of space.

Also, the segment date is redundant and doesn't really need to be
saved for each URL.
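
A minimal sketch of this front-coding idea in Python (illustrative only, not the index's actual on-disk format; a real implementation would presumably reset the encoding at each block boundary to preserve random access):

    import os

    def front_code(sorted_urls):
        # encode each URL as (shared_prefix_len, suffix) relative to the previous entry
        prev = ""
        for url in sorted_urls:
            shared = len(os.path.commonprefix([prev, url]))
            yield shared, url[shared:]
            prev = url

    def front_decode(pairs):
        # rebuild each URL by reusing the shared prefix of the previous entry
        prev = ""
        for shared, suffix in pairs:
            prev = prev[:shared] + suffix
            yield prev

    urls = ["com.example/a", "com.example/a/b", "com.example/c"]
    encoded = list(front_code(urls))   # [(0, 'com.example/a'), (13, '/b'), (12, 'c')]
    assert list(front_decode(encoded)) == urls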

Tom

Lisa Green

Jan 15, 2013, 7:12:36 PM
to common...@googlegroups.com
Check out this cool blog post by Jason Ronallo about how he used the index! http://jronallo.github.com/blog/common-crawl-url-index/

Scott Robertson

Jan 15, 2013, 7:48:23 PM
to common...@googlegroups.com

> Since the records are already variable size and you have them
> collated, it seems like you could easily implement a simple common
> prefix compression scheme, i.e. a count of the number of characters shared
> with the previous URL, followed by the remainder of the characters.
> That could potentially save a ton of space.
>
> Also, the segment date is redundant and doesn't really need to be
> saved for each URL.


Both very good suggestions; I'm itching to have some free time to take a stab at this.

Pablo Abbate

Jan 24, 2013, 4:12:22 PM
to common...@googlegroups.com
Great work!

I just have a question: when I read the header, I'm seeing

0000 0100 540a 0000

Does that mean the block size = 256 and the index block count = 1409941504?

Thanks!

Keiw Kw

Jan 27, 2013, 7:23:31 AM
to common...@googlegroups.com
Hi Pablo,
AFAIK, and as the docs state, these 8 bytes are encoded in little-endian order, which means the block size is 0x10000 = 65536 and the index block count is 0x0a54 = 2644.
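
A quick way to check this by hand, as a minimal Python sketch of the little-endian decoding described above:

    import struct

    header = bytes.fromhex("00000100540a0000")                     # the first 8 bytes shown above
    block_size, index_block_count = struct.unpack("<II", header)   # two little-endian uint32s
    print(block_size, index_block_count)                           # 65536 2644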

Amit Ambardekar

Jan 28, 2013, 7:10:28 AM
to common...@googlegroups.com
Here is a Ruby gem to access the common-crawl-index:


Please send feedback/pull requests on issues you find.

How can I find out how fresh the crawl used for the index is? What would be the typical timeframe for updates to the index? Is there any way a user can generate this index themselves using the latest crawl?

Also, thanks for this awesome addition. It will find many uses.

Regards
Amit


Scott Robertson

Jan 29, 2013, 8:08:38 PM
to common...@googlegroups.com
This is very cool! Thanks so much; hoping we can get a few more languages contributed.

I believe the segment id records the date the crawl started, and the arc date records when the page was archived. Fine-grained detail is captured with the page.

As for generating it yourself, stay tuned ;)


--
-- Scott

"There was a time when the internet answered all my questions. Now it just repeats them. - SDR"

Dan N

Apr 15, 2013, 1:30:03 PM
to common...@googlegroups.com
Hello,
Great to read this thread.  This URL index seems to show great promise in terms of making the Common Crawl data more approachable.

I am interested in URL data only. In other words, I probably won't need anything else BUT the URL index.
I would like to identify URLs containing certain patterns like /string/string2/ or /string?=string2, regardless of the domain, and also get a list of domains featuring such a string in any of their URLs. Any suggestions would be appreciated on what might be a good approach, or on whether the data schema lends itself nicely to this.

The Python and Ruby modules on GitHub don't seem to be built for that. All the examples seem to indicate they allow searching by prefix ("all URLs starting with").

Cheers,
Dan

Scott Robertson

Apr 15, 2013, 8:32:58 PM
to common...@googlegroups.com



> I am interested in URL data only. In other words, I probably won't need anything else BUT the URL index.
> I would like to identify URLs containing certain patterns like /string/string2/ or /string?=string2, regardless of the domain, and also get a list of domains featuring such a string in any of their URLs. Any suggestions would be appreciated on what might be a good approach, or on whether the data schema lends itself nicely to this.


The data is stored in sorted order, so as long as your subquery includes a common prefix you can use the pbtree library to filter the results down to the set of URLs with that prefix, then apply your regular expression or whatever.

If you're matching URLs without a common prefix, you're going to have to read through the whole darn index. Doing that sequentially on a single machine could take a very long time.
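
A rough sketch of that prefix-then-filter approach in Python; scan_urls_with_prefix() is a hypothetical stand-in for whatever prefix lookup your index client provides, not an actual API:

    import re

    def filter_urls(url_iter, pattern):
        # keep only the URLs from a prefix scan that match the regular expression
        rx = re.compile(pattern)
        return (url for url in url_iter if rx.search(url))

    # hypothetical prefix lookup; replace with your index client's call
    # urls = scan_urls_with_prefix("com.example")
    # for url in filter_urls(urls, r"/string[^/]*/string2/"):
    #     print(url)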






Jaime Sanchez

Oct 2, 2015, 3:07:08 PM
to Common Crawl
Hi Scott, amazing job! 

When do you think the index will contain the full corpus instead of half?

Tom Morris

Oct 2, 2015, 6:48:24 PM
to common...@googlegroups.com
On Fri, Oct 2, 2015 at 3:07 PM, Jaime Sanchez <jaime....@socrata.com> wrote:
> Hi Scott, amazing job!
>
> When do you think the index will contain the full corpus instead of half?

Read forward a couple of years in the archive (or check the Common Crawl blog posts from earlier in 2015).

The index in the attached announcement is obsolete and has been replaced with a new index which catalogs all the URLs in each month's crawl.

You can find it at: http://index.commoncrawl.org
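
For example, a minimal Python sketch of querying one of the monthly indexes over HTTP; the collection name CC-MAIN-2015-40 is just an example, pick a current one from the page above:

    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({"url": "commoncrawl.org/*", "output": "json"})
    query = "http://index.commoncrawl.org/CC-MAIN-2015-40-index?" + params
    with urllib.request.urlopen(query) as resp:
        for line in resp:
            record = json.loads(line)
            # each record points at a WARC file plus the offset and length within it
            print(record["url"], record["filename"], record["offset"], record["length"])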

Tom
 


srb...@gmail.com

Jul 6, 2017, 5:44:51 AM
to Common Crawl


Hi,
I need a dataset for a web crawler. How do I get it from Common Crawl?

Aigerim Serikbekova

Jul 10, 2017, 6:10:28 AM
to common...@googlegroups.com
Hey,
I need to write a program in JavaScript to analyse the data in Common Crawl (sensitive data like URLs, usernames, passwords, software programs, and personal information, if possible), which I have to get from Common Crawl. Do you know how I can do that? What kind of information can be leaked (even if you are protected)? Which methods can protect the data?
Thanks,
Aika

On 6 July 2017 at 10:44, <srb...@gmail.com> wrote:


Hi,
I need a dataset for a web crawler. How do I get it from Common Crawl?


Sebastian Nagel

Jul 10, 2017, 7:45:22 AM
to Common Crawl
Hi Aika,

Please start a "New Topic" and do not "hijack" existing topics/threads.

Thanks,
Sebastian