Meanpath Jan 2014 Torrent - 1.6TB of crawl data from 115m websites.


Adam Seabrook

Jan 7, 2014, 4:55:42 AM
to common...@googlegroups.com
Hi Data People :)

Yesterday we released a small 1.6 TB crawl covering just the front pages of 115 million websites. It may be useful to those of you who do not need the full Common Crawl index.

You can read more about it here:

We are huge fans of Common Crawl, so we want to support the mission where we can with occasional releases of crawl data.

Cheers,
Adam Seabrook
CEO

Pete Warden

Jan 7, 2014, 1:47:45 PM
to common...@googlegroups.com
That's fantastic, thanks Adam!


--
Check out Jetpac City Guides iPhone app - Just launched!

CTO Jetpac
Follow me on twitter @petewarden

Lisa Green

Jan 8, 2014, 4:04:55 PM
to common...@googlegroups.com
We are super excited about working with meanpath and excited about their donation! Look for a blog post on the Common Crawl website next week.
Lisa

fightsw...@gmail.com

Jan 16, 2014, 8:35:16 AM
to common...@googlegroups.com
Great job,
I did almost the same thing last year: I crawled the front pages of all .com, .net and .org domains, but it took me 2 months to crawl them all :). This is much appreciated.

Thanks again,
Michal 

fightsw...@gmail.com

Jan 16, 2014, 8:45:16 AM
to common...@googlegroups.com
OK, it seems I was too optimistic: the torrent appears to be dead. Is there any other way to download the source data?

Thanks in advance 

Adam Seabrook

Jan 16, 2014, 6:02:51 PM
to common...@googlegroups.com
A few people have told us that, but when we test it from a few remote computers it downloads fine. I did notice that it takes 10-15 minutes just to start downloading, which may be due to the size of the torrent.

fightsw...@gmail.com

Jan 17, 2014, 3:56:02 AM
to common...@googlegroups.com
May I ask which torrent client you used? I've tested 5. Three of them (Tixati, mtorrent, BitTorrent) can't even open the torrent file, claiming the metadata is corrupted or too large. The other 2, FDM and the ancient ABC, are able to open it but never find any seeds or peers, regardless of how long I keep the app open.
Thanks for the help

Adam Seabrook

Jan 17, 2014, 3:57:49 AM
to common...@googlegroups.com
I am using Transmission on OS X. I will reply to this thread when we have another download option, possibly on S3.

--
Cheers,
Adam
http://adamseabrook.com
http://au.linkedin.com/in/adamseabrook

Adam Gotterer

Mar 8, 2014, 7:26:51 PM
to common...@googlegroups.com
Any update on the file location? I've also tried to download the torrent, and when I opened it in my torrent app it said the file is too big (25MB).

Adam Seabrook

Mar 8, 2014, 9:50:00 PM
to common...@googlegroups.com
Hi Adam,

We gave up trying to torrent it, as it was running into too many weird client issues. A full crawl dump is being synced over to http://archive.meanpath.com, which should be complete in 4-5 hours at the current transfer speed.

Adam Seabrook

Mar 10, 2014, 3:55:49 AM
to common...@googlegroups.com
Upload is now complete:
and a more recent one which contains extra fields such as BGP number and the AS Path

Veaceslav Ustiugov

May 17, 2014, 2:32:27 AM
to common...@googlegroups.com, ad...@proshortlist.com
I want to download the torrent, but the link is not working. How can I download it? Please help me.

Adam Seabrook

Jan 6, 2016, 2:51:15 AM
to Common Crawl
Hi Veaceslav,

We did have the torrent and HTTP download live for a few months, but the bandwidth costs were becoming excessive, so we had to move it to a paid download. It does not look like anyone is seeding the torrent now.

If you still require this data, it is available, but we charge a small fee to recoup the download cost. Alternatively, you can get free access to Common Crawl, which has a much deeper crawl than we provide.

Lori Bell

Dec 1, 2016, 2:04:00 PM
to Common Crawl
meanpath.com is not available. What happened?