Need tech help validating if Common Crawl will work for our project


J Curry

Jul 13, 2016, 1:01:53 PM
to Common Crawl
Hello all,

I have a project that could use some contract help. We are looking to study the changing landscape of the financial services industry and have about 12,000 root domains. We have a current crawl application, but need the historic data to support our analysis.

I can't spare internal resources, so is there someone who can help us validate whether this data has the domain coverage and content we need?

We are primarily interested in knowing: 1) What % of our domain list is covered in Common Crawl; 2) How deep the crawl is for each domain; and 3) History for each domain.

Our plan would be to extract the WARC data for the relevant content into our MySQL DB so we can process it through our existing search application.
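(For reference, a minimal sketch of what that extraction step might look like, assuming the WARC filename, byte offset and record length of a capture are already known, e.g. from the Common Crawl URL index; the path, offset and length below are made-up placeholders, and loading into MySQL is only indicated in a comment.)

import gzip
import requests

def fetch_warc_record(filename, offset, length):
    """Fetch one WARC record from the Common Crawl bucket via an HTTP range request."""
    url = "https://commoncrawl.s3.amazonaws.com/" + filename
    byte_range = "bytes={}-{}".format(offset, offset + length - 1)
    resp = requests.get(url, headers={"Range": byte_range})
    resp.raise_for_status()
    # Every record is a separate gzip member, so it can be decompressed on its own.
    return gzip.decompress(resp.content)

# Hypothetical values; real ones come from the URL index (filename, offset, length fields).
record = fetch_warc_record("crawl-data/CC-MAIN-2016-26/segments/0/warc/0.warc.gz", 0, 1024)
# record now holds the WARC header, HTTP headers and HTML body as bytes; from here it
# would be parsed and INSERTed into the MySQL tables that feed the search application.
print(record[:200])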

Let me know what you think. Thanks!

Curry

Ivan Habernal

Jul 13, 2016, 1:12:40 PM
to Common Crawl
Hi Curry,
 
> I can't spare internal resources, so is there someone who can help us validate whether this data has the domain coverage and content we need?

I'm afraid that unless someone has a copy of Common Crawl on a local cluster, processing the data always costs money (spinning up a Hadoop cluster, for example, plus some transfer costs if you need the data locally in your DB).
 
> We are primarily interested in knowing: 1) What % of our domain list is covered in Common Crawl; 2) How deep the crawl is for each domain; and 3) History for each domain.

You can start by having a look at the list of domains we extracted from a late-2015 crawl for our DKPro C4Corpus:

https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/#_list_of_urls_from_commoncrawl

Hope it helps,

Ivan

J Curry

Jul 13, 2016, 1:47:35 PM
to Common Crawl
Thanks Ivan. Just to clarify, I don't have human resources to put on this...but I would be willing to pay someone to help us.  

Thanks, 

Curry

Sebastian Nagel

Jul 14, 2016, 10:19:41 AM
to common...@googlegroups.com
Hi Curry,

You could also use the Common Crawl index; it supports
- domain queries
   http://index.commoncrawl.org/CC-MAIN-2016-26-index?url=nasdaq.com&matchType=domain
- and wild-cards
   http://index.commoncrawl.org/CC-MAIN-2016-26-index?url=*.ecb.europa.eu
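For a first coverage estimate over a whole domain list, those queries can be scripted against the index API, roughly like this (a sketch only; showNumPages just returns the number of result pages per query, which is enough to see whether a domain shows up at all, and the crawls and domains below are examples):

import requests

# Examples only - the real list would be the ~12,000 root domains.
domains = ["nasdaq.com", "ecb.europa.eu", "bankofengland.co.uk", "federalreserve.gov"]
# Two monthly indexes as examples; the full list is linked from http://index.commoncrawl.org/
crawls = ["CC-MAIN-2016-26", "CC-MAIN-2016-22"]

for crawl in crawls:
    api = "http://index.commoncrawl.org/{}-index".format(crawl)
    for domain in domains:
        resp = requests.get(api, params={
            "url": domain,
            "matchType": "domain",
            "output": "json",
            "showNumPages": "true",
        })
        # The response is a small JSON blob with the number of result pages for this query.
        print(crawl, domain, resp.status_code, resp.text.strip())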

Sebastian


J Curry

Jul 14, 2016, 10:50:25 AM
to Common Crawl
Thanks Sebastian. That may be the route we take.

Curry

Tom Morris

Jul 14, 2016, 11:07:39 AM
to common...@googlegroups.com
On Thu, Jul 14, 2016 at 10:19 AM, Sebastian Nagel <seba...@commoncrawl.org> wrote:

> You could also use the Common Crawl index; it supports


While this would work from a functional point of view, it'd be a pretty big load on the (not very beefy) API server to query all URLs for 12,000 root domains across dozens of crawls.

These types of requests come up often enough that a merged secondary index by hostname & crawl date, with a count of the number of unique URLs, seems like it would be a useful secondary data product. That would capture both the domain-specific and the longitudinal aspects that we see from people who are just starting out. It could be built starting from the current set of crawl-specific indexes, so it wouldn't be that expensive from a processing point of view.
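(A rough sketch of what building such a product could look like, starting from the per-crawl CDXJ index lines; exact sets are used here for clarity, whereas real data volumes would call for a streaming/approximate counter.)

import json
from collections import defaultdict
from urllib.parse import urlsplit

# (host, crawl) -> set of unique URLs; the merged secondary index would
# store only the host, the crawl date and the resulting count.
unique_urls = defaultdict(set)

def add_index_line(crawl, line):
    # A CDXJ index line is "<SURT key> <timestamp> <JSON>", e.g.
    # com,nasdaq)/quotes 20160628000000 {"url": "http://www.nasdaq.com/quotes", ...}
    surt_key, timestamp, payload = line.split(" ", 2)
    record = json.loads(payload)
    host = urlsplit(record["url"]).hostname
    unique_urls[(host, crawl)].add(record["url"])

# After feeding in every line of every crawl's index:
for (host, crawl), urls in sorted(unique_urls.items()):
    print(host, crawl, len(urls))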

Tom

Sebastian Nagel

Jul 14, 2016, 11:30:11 AM
to common...@googlegroups.com
Hi Tom,

You are definitely right, and we also felt the need to have some basic metrics available without crunching large amounts of data.
Over the last few days I've run a count and statistics job over all indexes (from 15 monthly crawl archives since the end of 2014).
The output is here:
  s3://commoncrawl/crawl-analysis/
Source code for counting:
Thanks to Christian Buck for the pointer [1] to HyperLogLog, which made counting unique URLs and content digests much
faster (2 hours for one index on a 4-node cluster). The error is negligible (<1%).
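
For illustration, the counting part with a HyperLogLog sketch looks roughly like this (using the Python datasketch library purely as an example here, not necessarily what the job itself uses):

from datasketch import HyperLogLog

hll = HyperLogLog(p=14)  # 2**14 registers, roughly 0.8% relative error

# Each worker feeds the URLs (or content digests) it sees into its own sketch...
for url in ("http://www.nasdaq.com/",
            "http://www.nasdaq.com/quotes",
            "http://www.ecb.europa.eu/"):
    hll.update(url.encode("utf-8"))

# ...and the per-worker sketches can be merged cheaply into one.
other = HyperLogLog(p=14)
other.update("http://www.federalreserve.gov/".encode("utf-8"))
hll.merge(other)

print(int(hll.count()))  # approximate number of distinct URLs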

Comments on the data and ideas what metrics to include are always welcome!
Please note that this data is not "officially released" - location and format may change.

Best,
Sebastian

Sebastian Nagel

Jul 14, 2016, 11:45:34 AM
to common...@googlegroups.com
Hi Curry, hi Tom,


> it'd be a pretty big load on the (not very beefy) API server to query all URLs for 12,000 root domains across dozens of crawls.

Over the last 7 days we had 250,000 requests on the index server.
So I wouldn't worry about a few tens of thousands extra to get a first estimate.

But yes, the faster approach could be a simple "grep" over the condensed host and domain counts.
Below is a draft for the June crawl.

Best,
Sebastian

$ cat financial_domains.txt
nasdaq.com
ecb.europa.eu
bankofengland.co.uk
federalreserve.gov

$ # keep the count records starting with "[2," (host-level counts) for hosts matching the domain list
$ for i in `seq 0 9`; do
    aws s3 cp s3://commoncrawl/crawl-analysis/CC-MAIN-2016-26/count/part-0000$i.bz2 - \
      | bzip2 -dc | grep '^\[2,' | grep -Ff financial_domains.txt;
  done
[2, "articlefeeds.nasdaq.com", 14]      4172
[2, "y10online.federalreserve.gov", 14] 1
[2, "fundamentals.nasdaq.com", 14]      7
[2, "www.bankofengland.co.uk", 14]      1661
[2, "ir.nasdaq.com", 14]        27
[2, "m.nasdaq.com", 14] 2690
[2, "mktvideo.nasdaq.com", 14]  2
[2, "www.nasdaq.com", 14]       32852
[2, "www.oldbankofengland.co.uk", 14]   1
[2, "www.hknasdaq.com", 14]     1
[2, "community.nasdaq.com", 14] 11686
[2, "hknasdaq.com", 14] 1
[2, "oig.federalreserve.gov", 14]       28
[2, "business.nasdaq.com", 14]  8
[2, "structurelists.federalreserve.gov", 14]    1
[2, "m.nasdaq.com.ipaddress.com", 14]   1
[2, "sdw.ecb.europa.eu", 14]    295
[2, "www.federalreserve.gov", 14]       83823
[2, "data.nasdaq.com", 14]      20
[2, "www.ecb.europa.eu", 14]    3314


J Curry

Jul 14, 2016, 12:12:28 PM
to Common Crawl
Sebastian / Tom -

Thanks, guys! I'll try this out.

Curry