Call for ideas: top million crawl


Greg Lindahl

Nov 1, 2016, 5:03:02 PM
to common...@googlegroups.com
I'm writing a new crawler based on Python's newish async/await
syntax. It's intended to be modular enough that it will do a good job
on a variety of crawl tasks, from "crawl this one site" to "crawl the
frontpage of 10 million sites" to (eventually) bulk crawling billions
of pages.
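(A minimal sketch of the async/await pattern described above — hypothetical names, not CoCrawler's actual code; a stub stands in for the network fetch, where a real crawler would use an HTTP client such as aiohttp:)

```python
import asyncio

# Hypothetical sketch of an async/await crawl loop: a Semaphore caps
# concurrency while asyncio.gather drives many frontpage fetches at once.
# fetch_frontpage is a stand-in for a real HTTP fetch.

async def fetch_frontpage(url):
    await asyncio.sleep(0)  # stand-in for the network round trip
    return "<html><!-- frontpage of %s --></html>" % url

async def crawl(urls, max_concurrency=100):
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:
            return url, await fetch_frontpage(url)

    # gather returns the (url, html) pairs, which we collect into a dict
    return dict(await asyncio.gather(*(bounded(u) for u in urls)))

pages = asyncio.run(crawl(["example.com", "example.org"]))
```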

Along the way, I thought it would be fun to produce some kind of
dataset on a monthly basis. Common Crawl already has a news crawl,
so... how about a monthly crawl of the frontpages of the top million
Alexa sites? You can see the kind of information I could collect at
http://buildwith.com/ -- a lot of this stuff is relatively easy to
collect, like "meta generator" or the presence of Open Graph or
Twitter cards markup.
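(A rough sketch of how those signals might be pulled out of frontpage HTML with Python's stdlib parser — not CoCrawler's actual code, just an illustration of the kind of per-page extraction meant here:)

```python
from html.parser import HTMLParser

# Sniff a few frontpage signals: the "meta generator" tag, and the
# presence of Open Graph (og:*) and Twitter card (twitter:*) markup.

class FeatureSniffer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.generator = None
        self.open_graph = False
        self.twitter_card = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name = (a.get("name") or "").lower()
        prop = (a.get("property") or "").lower()
        if name == "generator":
            self.generator = a.get("content")
        if prop.startswith("og:"):
            self.open_graph = True
        if name.startswith("twitter:"):
            self.twitter_card = True

sniffer = FeatureSniffer()
sniffer.feed('<meta name="generator" content="WordPress 4.6">'
             '<meta property="og:title" content="Hi">')
```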

I don't have warc output yet, but eventually I'd add the frontpage
html, robots.txt, and response headers to the dataset.

One thing that I know I'm not very good at is detecting "one page"
websites - does anyone have tips, or other suggestions?

The code is at: https://github.com/cocrawler/cocrawler

-- greg

Ken Krugler

Nov 1, 2016, 6:02:28 PM
to common...@googlegroups.com
Hi Greg,

On Nov 1, 2016, at 2:02pm, Greg Lindahl <lin...@pbm.com> wrote:

> I'm writing a new crawler based on Python's newish async/await
> syntax. It's intended to be modular enough that it will do a good job
> on a variety of crawl tasks, from "crawl this one site" to "crawl the
> frontpage of 10 million sites" to (eventually) bulk crawling billions
> of pages.
>
> Along the way, I thought it would be fun to produce some kind of
> dataset on a monthly basis. Common Crawl already has a news crawl,
> so... how about a monthly crawl of the frontpages of the top million
> Alexa sites? You can see the kind of information I could collect at
> http://buildwith.com/

I think you meant http://builtwith.com

> -- a lot of this stuff is relatively easy to
> collect, like "meta generator" or the presence of Open Graph or
> Twitter cards markup.

One interesting metric is how often the page changes (after applying your favorite get-rid-of-cruft algorithm).
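(One simple way to track that change-rate metric, assuming the cruft-removal step has already run — a hedged sketch, not anyone's actual implementation: normalize the text, hash it, and compare fingerprints between monthly crawls.)

```python
import hashlib
import re

# Fingerprint page content so trivial whitespace/case differences
# don't count as a change between monthly crawls.

def content_fingerprint(text):
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def changed(old_text, new_text):
    return content_fingerprint(old_text) != content_fingerprint(new_text)
```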

— Ken


--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr



Christian Lund

Nov 2, 2016, 6:18:00 AM
to Common Crawl
Hi Greg,

Looks like an interesting project.

Ranking and Categorisation are two data sources at the very top of our wish list.

Common Search are doing some interesting things on the Ranking part (https://about.commonsearch.org/2016/07/our-first-public-datasets-host-level-webgraph-and-pagerank/). We will be integrating their results in one of the next updates and I am looking forward to seeing what you come up with.

Categorisation, on the other hand, is a very scarce resource, and DMOZ is (in my opinion) so outdated it has very little relevance. A few of the large SEO service providers offer this (most notably SimilarWeb), but for our purposes the hefty cost of these solutions outweighs the utility. So if a simpler, open-source project were available, I think it would gain a foothold very quickly, not least amongst the SEO community.

Greg Lindahl

Nov 2, 2016, 11:07:55 PM
to common...@googlegroups.com
Since this is just a monthly crawl of 1 page per site, I'm not going
to help with either ranking or categorization.

-- greg
