Any other WARC archives like Common Crawl?


Liz Ron

Aug 30, 2016, 6:43:10 AM
to Common Crawl
I've been using Common Crawl files for my research at the university and was wondering whether there are any other services, commercial or free, that provide crawl files like Common Crawl does, in WARC or any other format.

Thanks,
Liz.

Ivan Habernal

Aug 30, 2016, 6:47:23 AM
to Common Crawl
Hi Liz,

Have a look at ClueWeb; it has been used extensively in information retrieval research: http://www.lemurproject.org/clueweb12/

Best,

Ivan

Liz Ron

Aug 30, 2016, 7:00:27 AM
to Common Crawl
Thanks Ivan!

I'm looking for more recent data, from roughly the last two years. Any hints would be appreciated.
Liz.

OneSpeedFast

Aug 30, 2016, 3:08:31 PM
to common...@googlegroups.com
Hi Liz,

I'd be interested in hearing more about the sort of data you need. I have been releasing tools upon request on w3metrics.com.

Some of the big ones I have so far are:

  • Web host distribution - tracking the number of domains on a host
  • IP distribution - tracking the number of domains on an IP
  • SEO/HTML related metrics
  • Tracking domain squatters
  • Mobile vs Desktop sites
And a bunch of others, including some filters for people to use in their own spidering projects. If I have data that will work for your project, I'm happy to set up an API or feed of some sort.

Carlos

