Top level Domains / Sub domains with links to the corpus data


Simon Burfield

May 8, 2018, 5:53:40 PM
to Common Crawl
Hi all :)  Great work on the CommonCrawl!

I have been looking at the main index and the 3 billion URLs (or is it 32 billion?), and downloading the raw HTML from the corpus. I am currently building a search engine as a fun project (Burf.co).

I see there is a domain-level graph etc. Does Common Crawl have a list of domains with corresponding links to the raw page data (filename, offset, length, etc.)?

Currently I am downloading as much data as I can, but what I want to do is just take the domains and then navigate their structure myself.
 

Greg Lindahl

May 8, 2018, 7:01:41 PM
to common...@googlegroups.com
On Tue, May 08, 2018 at 02:53:40PM -0700, Simon Burfield wrote:
> Hi all :) Great work on the CommonCrawl!
>
> I have been looking at the main index and the 3 billion urls (or is 32
> billion), and downloading the raw HTML from the corpus. I am currently
> building a Search Engine as a fun project (Burf.co)
>
> I see there is a Domain level graph etc, does CommonCrawl have a list of
> domains with corresponding links to the raw page data (filename, offset,
> length etc)

Hi, Simon!

The main index has what you're looking for. You can use it to retrieve
all of the captures for a given host, and there's enough information
to fetch the raw html data.
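
For example, a minimal Java sketch of such a host lookup might look like the following (the index name CC-MAIN-2018-17 and the matchType/output query parameters are assumptions about the public CDX server, and the JSON field names are the usual CDX ones):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class CdxHostLookup {
    public static void main(String[] args) throws Exception {
        // Query one monthly index (CC-MAIN-2018-17 is just an example) for all
        // captures of a single host. Each response line is a JSON object that
        // contains, among other fields, "filename", "offset" and "length",
        // i.e. everything needed to fetch the raw record afterwards.
        String host = args.length > 0 ? args[0] : "example.com";
        String query = "https://index.commoncrawl.org/CC-MAIN-2018-17-index"
                + "?url=" + URLEncoder.encode(host, StandardCharsets.UTF_8)
                + "&matchType=host&output=json";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(query)).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // One capture per line; print them as-is (a JSON parser would go here).
        response.body().lines().forEach(System.out::println);
    }
}

For hosts with many captures the results come back in pages; that paging is one of the things cdx_toolkit takes care of.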

If you'd like to see a worked out example (in Python) for iterating
over the monthly indexes and fetching raw data, check out:

https://github.com/cocrawler/cdx_toolkit

I haven't finished the API for fetching raw data yet, but the code to
do the actual fetching is present.
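
To sketch what the raw fetch itself looks like (this is not cdx_toolkit's API, just an illustration): given the filename, offset and length of an index entry, a single gzipped WARC record can be pulled out with an HTTP Range request. The values passed in main() are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class WarcRecordFetch {
    /**
     * Fetch one gzipped WARC record given the "filename", "offset" and
     * "length" values of a CDX index entry and return it as text
     * (WARC header + HTTP header + payload). For binary payloads such as
     * PDFs you would keep the raw bytes instead of decoding to text.
     */
    static String fetchRecord(String filename, long offset, long length) throws Exception {
        String url = "https://commoncrawl.s3.amazonaws.com/" + filename;
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                // WARC files are concatenations of individually gzipped records,
                // so the byte range (offset, offset + length - 1) is a complete,
                // self-contained gzip member.
                .header("Range", "bytes=" + offset + "-" + (offset + length - 1))
                .build();
        HttpResponse<java.io.InputStream> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofInputStream());

        StringBuilder record = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(response.body()), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                record.append(line).append('\n');
            }
        }
        return record.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical values copied from an index entry:
        System.out.println(fetchRecord(
                "crawl-data/CC-MAIN-2018-17/segments/.../warc/....warc.gz",
                1234567L, 8901L));
    }
}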

-- greg

Simon Burfield

May 8, 2018, 7:22:28 PM
to Common Crawl
Hi Greg

Nice to meet you. The problem with the main index is that it has everything in it; I was hoping there was a smaller version with just the top-level domains. I am currently scanning it for .pdf and it's been running for days!

I guess in Java I could just extract the domain from the URL and check whether I have seen it before.

Thanks Simon

Sebastian Nagel

May 9, 2018, 7:11:32 AM
to common...@googlegroups.com
Hi Simon,

there is a columnar index [1] which allows you to access all fields of the index
(e.g. TLD and MIME type) as columns. A query to get PDF URLs (plus WARC filename, offset,
and length) will run in less than one minute. Filtering by MIME type takes a long time with the
"main" index (index.commoncrawl.org); however, looking up a single URL or domain
(or even a smaller TLD such as .no) is fast. Greg's tool is perfect for
running such a query over multiple indexes.

> does CommonCrawl have a list of domains with corresponding links
> to the raw page data (filename, offset, length etc)
No; rather, the main index is exactly what you need: it's sorted by domain,
which means that all captures of one domain are easy to retrieve. Also,
iterating over the domains would be sufficiently fast (given that you need
all captures/URLs). There are 30 million domains every month; it wouldn't
be efficient to split the index into 30 million parts.
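
To illustrate (a sketch of mine, not official documentation): each index shard is a sorted, gzipped text file whose lines start with the SURT key, so hosts can be collected while streaming a shard. The shard location and file naming below are assumptions based on the cc-index layout.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.zip.GZIPInputStream;

public class HostsFromCdxShard {
    public static void main(String[] args) throws Exception {
        // One of the sorted index shards of a monthly crawl (path assumed).
        String shard = "https://commoncrawl.s3.amazonaws.com/"
                + "cc-index/collections/CC-MAIN-2018-17/indexes/cdx-00000.gz";

        HttpResponse<java.io.InputStream> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(shard)).build(),
                HttpResponse.BodyHandlers.ofInputStream());

        // Hosts in SURT form; collapsing them to registered domains would
        // additionally require a public-suffix lookup.
        Set<String> hosts = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(response.body()), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Line format: "<SURT key> <timestamp> <JSON>"; the host part
                // of the SURT key is everything before the first ')'.
                int end = line.indexOf(')');
                if (end > 0) {
                    hosts.add(line.substring(0, end));
                }
            }
        }
        hosts.forEach(System.out::println);
    }
}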

> I guess in java I could just extract the domain from the url and see if I see it before
Take the SURT key in the index; the domain is a prefix:
http://subdomain.example.com/index.html
as a SURT key:
com,example,subdomain)/index.html
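
For example, a rough Java sketch of turning a URL into its SURT-style host prefix (simplified; the real SURT canonicalization, e.g. in the webarchive-commons library, handles more cases such as ports, user info and query normalization):

import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SurtPrefix {
    /** Very simplified SURT-style key: reversed host labels joined by commas. */
    static String surtHostPrefix(String url) {
        String host = URI.create(url).getHost().toLowerCase();
        List<String> labels = Arrays.asList(host.split("\\."));
        Collections.reverse(labels);
        return String.join(",", labels) + ")";
    }

    public static void main(String[] args) {
        // http://subdomain.example.com/index.html -> com,example,subdomain)
        System.out.println(surtHostPrefix("http://subdomain.example.com/index.html"));
        // All captures under example.com share the prefix "com,example" in the
        // sorted index, so a simple startsWith check finds them.
        System.out.println(surtHostPrefix("http://subdomain.example.com/index.html")
                .startsWith("com,example"));
    }
}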


> what I want to do is just take the domains, and then navigate their structure myself.
You want to crawl the domains yourself?

Lists of domains are easy to extract:
- from the columnar index
- from the statistics counts (cf. [2]):
  s3://commoncrawl/crawl-analysis/CC-MAIN-*/count/part-*.bz2
- from the domain-level web graph ([3], which you mentioned)


> I am currently building a Search Engine as a fun project (Burf.co)
You may have a look at Common Search [4] (on hold now) and ChatNoir [5].


Best,
Sebastian

[1] http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
[2] https://github.com/commoncrawl/cc-crawl-statistics
[3] http://commoncrawl.org/2018/05/webgraphs-feb-mar-apr-2018/
[4] https://web.archive.org/web/20171020165245/https://about.commonsearch.org/
[5] https://www.chatnoir.eu/



Simon Burfield

May 9, 2018, 7:36:39 AM
to Common Crawl
Hi Sebastian

Thank you for replying!

Do you have any Java examples of using the columnar index [1]? I am not using AWS or any cloud provider; I am doing it with my own servers.

I will investigate the SURT key; I'm not sure what it is yet. I have been hacking this project (successfully) to get the data I want: https://github.com/centic9/CommonCrawlDocumentDownload

If the main index is in order, e.g. the root homepage is listed first, then I should be able to process the main list fairly easily.

Thanks
Simon 

Sebastian Nagel

May 9, 2018, 9:58:38 AM
to common...@googlegroups.com
> Do you have any Java examples of using the columnar index [1]?

If your servers are able to run a Spark job, there is an example which exports
a view, given as an SQL query, to a new Parquet table:

https://github.com/commoncrawl/cc-index-table/blob/master/src/main/java/org/commoncrawl/spark/examples/CCIndexExport.java
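
Along the same lines, a stripped-down sketch of such a Spark job in Java; the column names and the Parquet location follow the cc-index-table schema as far as I know, and the output path is just a placeholder:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PdfIndexQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cc-index-pdf-query")
                .getOrCreate();

        // Load the columnar index (Parquet); path assumed from cc-index-table.
        Dataset<Row> ccindex = spark.read()
                .parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/");
        ccindex.createOrReplaceTempView("ccindex");

        // Select PDF captures of a single monthly crawl together with the
        // WARC file name, offset and length needed to fetch the raw records.
        Dataset<Row> pdfs = spark.sql(
                "SELECT url, warc_filename, warc_record_offset, warc_record_length "
              + "FROM ccindex "
              + "WHERE crawl = 'CC-MAIN-2018-17' AND subset = 'warc' "
              + "  AND content_mime_type = 'application/pdf'");

        // Write the result as a new (much smaller) Parquet table.
        pdfs.write().format("parquet").save("/data/cc-pdf-index/");
        spark.stop();
    }
}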

> https://github.com/centic9/CommonCrawlDocumentDownload

Ok, that's why you've started with PDF pages. :)

> If the main index is in order e.g the root homepage is listed first

Yes, the homepage comes first if its path is "/",
but not if it is "/index.php" or similar.

Best,
Sebastian


Greg Lindahl

May 9, 2018, 4:43:01 PM
to common...@googlegroups.com
On Wed, May 09, 2018 at 03:58:33PM +0200, Sebastian Nagel wrote:

> > https://github.com/centic9/CommonCrawlDocumentDownload
>
> Ok, that's why you've started with PDF pages. :)

This project reminds me of an extraction I did back in 2010 or so for
a security researcher, who wanted a random sample of Microsoft
document types from the web to measure how many were infected with
known viruses and how many used obsolete versions which were being
EOLed by Microsoft.

Great use for Common Crawl data!

-- greg
