New and looking for information on Common Crawl



Aug 19, 2022, 5:45:33 AM
to Common Crawl
Hello everyone,

I have just joined the group, and I am looking for information on using Common Crawl to study different approaches to web scraping.

I would like to collect data from an intranet website and keep it privately in an ES or Solr index, without using AWS. Is that possible?
I couldn't find any app or Docker image that addresses this.

Can I use Common Crawl for this, or is that not possible?

Kind regards


Sebastian Nagel

Aug 19, 2022, 8:43:11 AM
Hi Vincent,

Common Crawl is just "an open repository of web crawl data".
Because the crawler only visits openly accessible sites and
also respects robots.txt rules, there will be no content
included from any intranets.
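
To illustrate what respecting robots.txt means in practice, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt content and the example.com URLs are made up for illustration, and this is not Common Crawl's actual crawler code; CCBot is the user-agent name of Common Crawl's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/page.html",
                "https://example.com/private/report.html"):
        print(url, may_fetch(EXAMPLE_ROBOTS_TXT, "CCBot", url))
```

A polite crawler runs a check like this before every fetch and skips the URLs that are disallowed.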

What you're looking for is web crawler software. Common Crawl
uses Apache Nutch [1] and StormCrawler [2], and both would fit
your use case:
- crawl web pages accessible via the intranet
(whether a crawl covers the open web or an intranet depends
only on the configuration of inclusion/exclusion rules, a.k.a.
URL filters in Nutch/StormCrawler terminology)
- index content into ES, Solr or other indexing backends
- Docker images, or a Dockerfile to build an image, are available
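
The idea behind inclusion/exclusion rules can be sketched in a few lines of Python. This is only an illustration of the concept, not the Nutch or StormCrawler implementation: the rule list, the intranet.example.com host and the function name are assumptions. It loosely follows the convention of Nutch's regex URL filter, where rules are checked in order, the first matching rule decides, and a URL matching no rule is dropped.

```python
import re

# Hypothetical rules as (accept, pattern) pairs, checked in order.
# The host name intranet.example.com is a placeholder.
RULES = [
    # Exclude common static assets.
    (False, re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    # Include everything on the intranet host and its subdomains.
    (True, re.compile(r"^https?://([a-z0-9-]+\.)*intranet\.example\.com/")),
]


def accept_url(url: str, rules=RULES) -> bool:
    """Return the decision of the first matching rule; reject if none matches."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    for url in ("https://wiki.intranet.example.com/page",
                "https://intranet.example.com/logo.png",
                "https://commoncrawl.org/"):
        print(url, accept_url(url))
```

Restricting a crawl to an intranet then comes down to an include rule for the intranet hosts, with everything else rejected by default.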



Ed Summers

Aug 19, 2022, 9:14:46 AM
(shameless & slightly off-topic plug)

If a browser-based crawler is of interest, you might also want to check out browsertrix-crawler [1] from the Webrecorder project [2]. It can be especially helpful when archiving sites that use JavaScript to dynamically pull in content.

browsertrix-crawler is open source and is designed to be run via Docker. It supports “profiles” for logging in to particular sites, which can be handy depending on what you are crawling (e.g. an intranet).

The web component [3], also from Webrecorder, makes embedding your web archive in a website a snap, without requiring server-side software (just JavaScript and your archived content). For some examples, check out the Stanford Digital Publications Web Archives [4].

