Hi Simon,
There is a columnar index [1] which allows you to access all fields of the index
(e.g. TLD and MIME type) as columns. A query to get PDF URLs (plus location, offset,
length) will run in less than one minute. Filtering by MIME type takes a long time
with the "main" index (index.commoncrawl.org). However, looking up a single URL or
domain (or even a smaller TLD such as .no) is fast. Greg's tool (cdx_toolkit) is
perfect for running such a lookup over multiple indexes.
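As a rough sketch of what such a columnar-index query could look like: the snippet below only assembles the SQL text. The table name `ccindex`, the column names and the crawl label `CC-MAIN-2018-17` are assumptions based on the schema described in [1]; please check them against your own table definition before running the query.

```python
def pdf_query(crawl: str, table: str = "ccindex") -> str:
    """Build SQL that lists PDF captures together with their WARC location.

    Table and column names are assumptions taken from the schema in [1].
    """
    # Reject anything that does not look like a plain crawl label.
    if not crawl.replace("-", "").isalnum():
        raise ValueError(f"unexpected crawl label: {crawl!r}")
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        f"FROM {table}\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        "  AND content_mime_type = 'application/pdf'"
    )


print(pdf_query("CC-MAIN-2018-17"))
```

Restricting the query to one crawl and one subset keeps the amount of scanned data small, which is what makes the MIME type filter cheap here.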
> does CommonCrawl have a list of domains with corresponding links
> to the raw page data (filename, offset, length etc)
No, or rather: the main index is exactly what you need. It's sorted by domain,
which means that all captures of one domain are easy to retrieve. Iterating
over the domains would also be sufficiently fast (given that you need
all captures/URLs). There are 30 million domains every month, so it wouldn't
be efficient to split the index into 30 million parts.
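Since every index entry points at one record via filename, offset and length, the raw page data can be pulled with a single HTTP range request. A minimal sketch, assuming the files are served under `https://data.commoncrawl.org/` (an assumption, adjust the base URL as needed) and that each record is its own gzip member:

```python
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"  # assumed download prefix


def range_header(offset: int, length: int) -> str:
    """HTTP Range value covering exactly `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Byte ranges are inclusive on both ends, hence the "- 1".
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(filename: str, offset: int, length: int) -> bytes:
    """Download one WARC record and return it decompressed."""
    request = urllib.request.Request(
        BASE_URL + filename,
        headers={"Range": range_header(offset, length)},
    )
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())
```

The three arguments of `fetch_record` are exactly the filename, offset and length fields of an index entry.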
> I guess in java I could just extract the domain from the url and see if I see it before
Take the SURT key in the index; the domain is a prefix of it:
http://subdomain.example.com/index.html
as SURT:
com,example,subdomain)/index.html
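To illustrate the prefix property, here is a simplified sketch. `simple_surt`, `in_domain` and `group_by_host` are hypothetical helper names, and the conversion is not a full SURT canonicalization (it ignores ports, query strings and similar details). Note that the keys in the index separate the reversed host labels with commas.

```python
from itertools import groupby
from urllib.parse import urlsplit


def simple_surt(url: str) -> str:
    """Simplified SURT key: reversed host labels, then ')' and the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")


def host_of(surt_key: str) -> str:
    """Reversed host part of a SURT key (everything before the ')')."""
    return surt_key.split(")", 1)[0]


def in_domain(surt_key: str, domain: str) -> bool:
    """True if the key belongs to `domain` or to one of its subdomains."""
    prefix = ",".join(reversed(domain.lower().split(".")))
    host = host_of(surt_key)
    return host == prefix or host.startswith(prefix + ",")


def group_by_host(sorted_keys):
    """Yield (reversed_host, keys) pairs from keys that are already sorted."""
    for host, keys in groupby(sorted_keys, key=host_of):
        yield host, list(keys)
```

Because the index is sorted, there is no need to remember every domain seen so far: comparing each key with the previous one (which is what `groupby` does) is enough to notice when a new host starts.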
> what I want to do is just take the domains, and then navigate their structure myself.
You want to crawl the domains yourself?
Lists of domains are easy to extract from
- the columnar index
- the statistics counts (cf. [2]):
  s3://commoncrawl/crawl-analysis/CC-MAIN-*/count/part-*.bz2
- the domain-level web graph ([3], you mentioned it)
> I am currently building a Search Engine as a fun project (Burf.co)
You may have a look at Common Search [4] (on hold now) and ChatNoir [5].
Best,
Sebastian
[1] http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
[2] https://github.com/commoncrawl/cc-crawl-statistics
[3] http://commoncrawl.org/2018/05/webgraphs-feb-mar-apr-2018/
[4] https://web.archive.org/web/20171020165245/https://about.commonsearch.org/
[5] https://www.chatnoir.eu/
On 05/09/2018 01:22 AM, Simon Burfield wrote:
> Hi Greg
>
> https://github.com/cocrawler/cdx_toolkit