Need tech help validating if Common Crawl will work for our project


J Curry

Jul 13, 2016, 1:01:53 PM
to Common Crawl
Hello all,

I have a project that could use some contract help. We are looking to study the changing landscape of the financial services industry and have about 12,000 root domains. We have a current crawl application, but need the historic data to support our analysis.

I can't spare internal resources, so is there someone who can help us validate whether this data has the domain coverage and content we need?

We are primarily interested in knowing: 1) What % of our domain list is covered in Common Crawl; 2) How deep the crawl is for each domain; and 3) History for each domain.

Our plan would be to extract the WARC data for the relevant content into our MySQL DB so we can process it through our existing search application.
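(For reference, a minimal sketch of what that extraction step might look like, assuming the WARC filename, byte offset and record length of a capture are already known, e.g. from the Common Crawl URL index; the path, offset and length below are made-up placeholders, and loading into MySQL is only indicated in a comment.)

import gzip
import requests

def fetch_warc_record(filename, offset, length):
    """Fetch one WARC record from the Common Crawl bucket via an HTTP range request."""
    url = "https://commoncrawl.s3.amazonaws.com/" + filename
    byte_range = "bytes={}-{}".format(offset, offset + length - 1)
    resp = requests.get(url, headers={"Range": byte_range})
    resp.raise_for_status()
    # Every record is a separate gzip member, so it can be decompressed on its own.
    return gzip.decompress(resp.content)

# Hypothetical values; real ones come from the URL index (filename, offset, length fields).
record = fetch_warc_record("crawl-data/CC-MAIN-2016-26/segments/0/warc/0.warc.gz", 0, 1024)
# record now holds the WARC header, HTTP headers and HTML body as bytes; from here it
# would be parsed and INSERTed into the MySQL tables that feed the search application.
print(record[:200])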

Let me know what you think. Thanks!

Curry

Ivan Habernal

Jul 13, 2016, 1:12:40 PM
to Common Crawl
Hi Curry,
 
> I can't spare internal resources, so is there someone who can help us validate whether this data has the domain coverage and content we need?

I'm afraid that unless someone has a copy of Common Crawl on a local cluster, processing the data always costs money (spinning up a Hadoop cluster, for example, plus some transfer costs if you need the data locally in your DB).
 
> We are primarily interested in knowing: 1) What % of our domain list is covered in Common Crawl; 2) How deep the crawl is for each domain; and 3) History for each domain.

You can start by having a look at the list of domains we extracted from a late-2015 crawl for our DKPro C4Corpus:

https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/#_list_of_urls_from_commoncrawl

Hope it helps,

Ivan

J Curry

Jul 13, 2016, 1:47:35 PM
to Common Crawl
Thanks Ivan. Just to clarify, I don't have human resources to put on this...but I would be willing to pay someone to help us.  

Thanks, 

Curry

Sebastian Nagel

Jul 14, 2016, 10:19:41 AM
to common...@googlegroups.com
Hi Curry,

You could also use the Common Crawl index; it supports
- domain queries
   http://index.commoncrawl.org/CC-MAIN-2016-26-index?url=nasdaq.com&matchType=domain
- and wild-cards
   http://index.commoncrawl.org/CC-MAIN-2016-26-index?url=*.ecb.europa.eu
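For a first coverage estimate over a whole domain list, those queries can be scripted against the index API, roughly like this (a sketch only; showNumPages just returns the number of result pages per query, which is enough to see whether a domain shows up at all, and the crawls and domains below are examples):

import requests

# Examples only - the real list would be the ~12,000 root domains.
domains = ["nasdaq.com", "ecb.europa.eu", "bankofengland.co.uk", "federalreserve.gov"]
# Two monthly indexes as examples; the full list is linked from http://index.commoncrawl.org/
crawls = ["CC-MAIN-2016-26", "CC-MAIN-2016-22"]

for crawl in crawls:
    api = "http://index.commoncrawl.org/{}-index".format(crawl)
    for domain in domains:
        resp = requests.get(api, params={
            "url": domain,
            "matchType": "domain",
            "output": "json",
            "showNumPages": "true",
        })
        # The response is a small JSON blob with the number of result pages for this query.
        print(crawl, domain, resp.status_code, resp.text.strip())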

Sebastian


J Curry

Jul 14, 2016, 10:50:25 AM
to Common Crawl
Thanks Sebastian. That may be the route we take.

Curry

Tom Morris

Jul 14, 2016, 11:07:39 AM
to common...@googlegroups.com
On Thu, Jul 14, 2016 at 10:19 AM, Sebastian Nagel <seba...@commoncrawl.org> wrote:

> You could also use the Common Crawl index; it supports


While this would work from a functional point of view, it'd be a pretty big load on the (not very beefy) API server to query all URLs for 12,000 root domains across dozens of crawls.

These types of requests come up often enough that a merged secondary index by hostname & crawl date, with a count of the number of unique URLs, seems like it would be a useful secondary data product. That would capture both the domain-specific and the longitudinal aspects that we see from people who are just starting out. It could be built starting from the current set of crawl-specific indexes, so it wouldn't be that expensive from a processing point of view.
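(A rough sketch of what building such a product could look like, starting from the per-crawl CDXJ index lines; exact sets are used here for clarity, whereas real data volumes would call for a streaming/approximate counter.)

import json
from collections import defaultdict
from urllib.parse import urlsplit

# (host, crawl) -> set of unique URLs; the merged secondary index would
# store only the host, the crawl date and the resulting count.
unique_urls = defaultdict(set)

def add_index_line(crawl, line):
    # A CDXJ index line is "<SURT key> <timestamp> <JSON>", e.g.
    # com,nasdaq)/quotes 20160628000000 {"url": "http://www.nasdaq.com/quotes", ...}
    surt_key, timestamp, payload = line.split(" ", 2)
    record = json.loads(payload)
    host = urlsplit(record["url"]).hostname
    unique_urls[(host, crawl)].add(record["url"])

# After feeding in every line of every crawl's index:
for (host, crawl), urls in sorted(unique_urls.items()):
    print(host, crawl, len(urls))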

Tom

Sebastian Nagel

Jul 14, 2016, 11:30:11 AM
to common...@googlegroups.com
Hi Tom,

You are definitely right, and we also felt the need to have some basic metrics available without crunching large amounts of data.
Over the last few days I've run a count and statistics job over all indexes (from 15 monthly crawl archives since the end of 2014).
The output is here:
  s3://commoncrawl/crawl-analysis/
Source code for counting:
Thanks to Christian Buck for the pointer [1] to HyperLogLog, which made counting unique URLs and content digests much
faster (2 hours for one index on a 4-node cluster). The error is negligible (<1%).
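
For illustration, the counting part with a HyperLogLog sketch looks roughly like this (using the Python datasketch library purely as an example here, not necessarily what the job itself uses):

from datasketch import HyperLogLog

hll = HyperLogLog(p=14)  # 2**14 registers, roughly 0.8% relative error

# Each worker feeds the URLs (or content digests) it sees into its own sketch...
for url in ("http://www.nasdaq.com/",
            "http://www.nasdaq.com/quotes",
            "http://www.ecb.europa.eu/"):
    hll.update(url.encode("utf-8"))

# ...and the per-worker sketches can be merged cheaply into one.
other = HyperLogLog(p=14)
other.update("http://www.federalreserve.gov/".encode("utf-8"))
hll.merge(other)

print(int(hll.count()))  # approximate number of distinct URLs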

Comments on the data and ideas what metrics to include are always welcome!
Please note that this data is not "officially released" - location and format may change.

Best,
Sebastian

Sebastian Nagel

Jul 14, 2016, 11:45:34 AM
to common...@googlegroups.com
Hi Curry, hi Tom,


> it'd be a pretty big load on the (not very beefy) API server to query all URLs for 12,000 root domains across dozens of crawls.

Over the last 7 days we had 250,000 requests on the index server.
So I wouldn't worry about a few tens of thousands extra to get a first estimate.

But yes, the faster approach could be a simple "grep" over the condensed host and domain counts.
Below is a draft for the June crawl.

Best,
Sebastian

$ cat financial_domains.txt
nasdaq.com
ecb.europa.eu
bankofengland.co.uk
federalreserve.gov

$ # keep the count records starting with "[2," (host-level counts) for hosts matching the domain list
$ for i in `seq 0 9`; do
    aws s3 cp s3://commoncrawl/crawl-analysis/CC-MAIN-2016-26/count/part-0000$i.bz2 - \
      | bzip2 -dc | grep '^\[2,' | grep -Ff financial_domains.txt;
  done
[2, "articlefeeds.nasdaq.com", 14]      4172
[2, "y10online.federalreserve.gov", 14] 1
[2, "fundamentals.nasdaq.com", 14]      7
[2, "www.bankofengland.co.uk", 14]      1661
[2, "ir.nasdaq.com", 14]        27
[2, "m.nasdaq.com", 14] 2690
[2, "mktvideo.nasdaq.com", 14]  2
[2, "www.nasdaq.com", 14]       32852
[2, "www.oldbankofengland.co.uk", 14]   1
[2, "www.hknasdaq.com", 14]     1
[2, "community.nasdaq.com", 14] 11686
[2, "hknasdaq.com", 14] 1
[2, "oig.federalreserve.gov", 14]       28
[2, "business.nasdaq.com", 14]  8
[2, "structurelists.federalreserve.gov", 14]    1
[2, "m.nasdaq.com.ipaddress.com", 14]   1
[2, "sdw.ecb.europa.eu", 14]    295
[2, "www.federalreserve.gov", 14]       83823
[2, "data.nasdaq.com", 14]      20
[2, "www.ecb.europa.eu", 14]    3314


J Curry

Jul 14, 2016, 12:12:28 PM
to Common Crawl
Sebastian / Tom -

Thanks, guys! I'll try this out.

Curry