I have a project that could use some contract help. We are looking to study the changing landscape of the financial services industry and have a list of about 12,000 root domains. We have an application that handles current crawls, but we need historic data to support our analysis.
I can't spare internal resources, so is there someone who can help us validate whether this data will have the domain coverage and content we need?
We are primarily interested in knowing: 1) What % of our domain list is covered in Common Crawl; 2) How deep the crawl is for each domain; and 3) The crawl history for each domain.
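For anyone sketching out what that validation might look like: the public CDX index at index.commoncrawl.org can answer all three questions per crawl (coverage, capture depth, and history across crawl snapshots). A minimal, stdlib-only sketch is below. The crawl label `CC-MAIN-2024-10` is just a placeholder; the real study would loop over every crawl listed in the collection info endpoint to build the history.

```python
import json
import urllib.parse
import urllib.request

# Placeholder: a single crawl's CDX endpoint. A full history check would
# iterate over all crawls advertised at index.commoncrawl.org/collinfo.json.
CDX_API = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def build_query(domain: str, limit: int = 50) -> str:
    """Build a CDX query URL matching all captures under a root domain."""
    params = urllib.parse.urlencode({
        "url": f"{domain}/*",   # prefix match: every page under the domain
        "output": "json",       # one JSON record per line
        "limit": limit,
    })
    return f"{CDX_API}?{params}"

def captures(domain: str) -> list[dict]:
    """Fetch capture records for a domain; an empty list means no coverage
    in this crawl. len() of the (unlimited) result approximates depth."""
    with urllib.request.urlopen(build_query(domain)) as resp:
        return [json.loads(line) for line in resp.read().splitlines() if line]
```

Running `captures("example.com")` over the 12,000-domain list and counting non-empty results would give the coverage %; record counts per domain give a rough depth measure.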
Our plan would be to extract the relevant content from the WARC data into our MySQL DB so we can process it through our existing search application.
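On the extraction side, each CDX record carries a `filename`, `offset`, and `length`, which let you pull just the single gzipped WARC record you need from the public data bucket via an HTTP Range request, rather than downloading whole WARC files. A stdlib-only sketch (the MySQL insert step is omitted; the decompressed bytes would feed whatever loader the search application expects):

```python
import gzip
import urllib.request

DATA_HOST = "https://data.commoncrawl.org/"  # public bucket serving WARC files

def byte_range(offset: int, length: int) -> str:
    """HTTP Range header value for one WARC record (end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(filename: str, offset: int, length: int) -> bytes:
    """Fetch and decompress a single WARC record identified by a CDX hit.
    filename/offset/length come straight from the index record."""
    req = urllib.request.Request(
        DATA_HOST + filename,
        headers={"Range": byte_range(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

Each WARC record is an independent gzip member, which is why a byte-range slice decompresses cleanly on its own.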
Let me know what you think. Thanks!
Curry