I'm launching a startup that I think a lot of you will be interested
in. The company, Webscaled, is a marketplace for datasets from ongoing
Web crawls. Soon, I'll offer access to a diverse catalog of fresh,
regularly updated datasets.
One example dataset is the link graph, which I am selling in chunks
of 1 billion edges. Each edge record includes the source and
destination URLs, the anchor and title text of the link, any rel or
rev values, and so on.
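To give a rough idea of what one edge in such a link graph might look like, here is a minimal Python sketch. It is not the actual Webscaled schema or crawler: the LinkEdge record, its field names, and the extract_edges helper are all hypothetical, chosen only to mirror the fields listed above (source and destination URLs, anchor text, title, rel and rev values). It uses only the standard library.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class LinkEdge:
    """One edge of the link graph (hypothetical layout, not the real schema)."""
    source_url: str
    dest_url: str
    anchor_text: str = ""
    title: str = ""
    rel: list = field(default_factory=list)
    rev: list = field(default_factory=list)


class LinkExtractor(HTMLParser):
    """Collects a LinkEdge for every <a href=...> element in a page."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.edges = []
        self._current = None  # edge being built while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return  # anchors without a target are not edges
        self._current = LinkEdge(
            source_url=self.source_url,
            # Resolve relative links against the page URL.
            dest_url=urljoin(self.source_url, href),
            title=attrs.get("title") or "",
            # rel and rev are space-separated token lists.
            rel=(attrs.get("rel") or "").split(),
            rev=(attrs.get("rev") or "").split(),
        )

    def handle_data(self, data):
        if self._current is not None:
            self._current.anchor_text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse runs of whitespace in the anchor text.
            self._current.anchor_text = " ".join(self._current.anchor_text.split())
            self.edges.append(self._current)
            self._current = None


def extract_edges(source_url, html):
    """Return the list of LinkEdge records found in one HTML document."""
    parser = LinkExtractor(source_url)
    parser.feed(html)
    parser.close()
    return parser.edges
```

Calling extract_edges("http://example.com/a/", page_html) returns one record per link on the page. A real crawl would also need URL normalization, deduplication, and handling of badly broken markup, none of which is attempted here.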
Other datasets include:
-The top 1,000 [HTML editors, CMSs, forum and blog software, etc.]
-Lists of sites using X technology (AdSense, Feedburner chiclet, etc.)
-Frequency of namespace/URI pairings in XML and RDF/XML documents
-How many sites are using which advertising platform, widget, etc.
-Frequency of specific doctypes and HTML elements
-Lists of sites of X genre (forums, blogs, ecommerce, etc.)
-Social graph data
-Bigram and trigram datasets (and other NLP-related datasets) extracted
from sentences that appear in the content portions of Web documents
-Analysis of affiliate program usage
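For readers unfamiliar with the bigram and trigram datasets mentioned in the list, the following Python sketch shows one simple way such counts could be computed from sentence text. It is an illustration under stated assumptions, not the pipeline used for the actual datasets: the function names are hypothetical, sentences are split naively on ., ! and ?, and tokens are lowercased runs of letters, digits, and apostrophes.

```python
import re
from collections import Counter


def sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(text, n):
    """Count n-grams per sentence, so no n-gram spans a sentence boundary."""
    counts = Counter()
    for sentence in sentences(text):
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        counts.update(ngrams(tokens, n))
    return counts
```

Here count_ngrams(text, 2) gives bigram counts and count_ngrams(text, 3) gives trigram counts. Counting within each sentence separately matches the "extracted from sentences" wording above; extracting the content portion of a Web document in the first place is a separate problem that this sketch does not address.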
There are many other datasets, and they will be available soon. If you
want to learn more about what I'm doing, you can join the mailing list.
Thanks for your time,