Hi Vincent,
Common Crawl is just "an open repository of web crawl data".
Because the crawler only visits openly accessible sites
and respects robots.txt rules, no intranet content is
included.
What you're looking for is web crawler software. Common
Crawl itself uses Apache Nutch [1] and StormCrawler [2].
Both would fit your use case:
- crawl web pages accessible via an intranet
  (whether the crawl covers the open web or an intranet
  depends only on the configured inclusion/exclusion
  rules, a.k.a. URL filters in Nutch/StormCrawler
  terminology)
- index content into Elasticsearch, Solr, or other indexing backends
- Docker images or a Dockerfile to build an image are available
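For illustration, restricting a Nutch crawl to an intranet
host could look roughly like this in the regex URL filter
rules (intranet.example.com is a placeholder hostname, not
a real site; StormCrawler has an equivalent URL filter
mechanism with its own configuration format):

```
# sketch of Nutch regex-urlfilter.txt rules; rules are
# applied top-down, first match wins
# accept URLs on the intranet host only
+^https?://intranet\.example\.com/
# reject everything else
-.
```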
Best,
Sebastian
[1] https://nutch.apache.org/
[2] http://stormcrawler.net/