New and looking for information on Common Crawl


Vincent

Aug 19, 2022, 5:45:33 AM
to Common Crawl
Hello everyone,

I have just joined the group, and I am looking for information on using Common Crawl to study web scraping.

If I want to take data from an intranet website and keep it just for myself in Elasticsearch or Solr, without using AWS, is that possible?
I couldn't find any app or Docker image for this.

Can I use Common Crawl for this, or is it not possible?
Thanks.

Kind regards,

Vincent

Sebastian Nagel

Aug 19, 2022, 8:43:11 AM
to common...@googlegroups.com
Hi Vincent,

Common Crawl is just "an open repository of web crawl data".
Because the crawler only visits openly accessible sites and
also respects robots.txt rules, there will be no content
included from any intranets.

What you're looking for is web crawler software. Common Crawl
itself uses Apache Nutch [1] and StormCrawler [2].
Both would fit your use case:
- crawl web pages accessible via intranet
  (whether open web or intranet depends only on the
  configuration of inclusion/exclusion rules, i.e.
  URL filters in Nutch/StormCrawler terminology)
- index content into ES, Solr or other indexing backends
- Docker images or a Dockerfile to build an image are available
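To make the Nutch route concrete, here is a rough, untested sketch of a small crawl indexed into a local Solr core, following the pattern from the Nutch tutorial. The image tag, seed URL, core name ("nutch") and round count are all placeholders — adapt them to your setup:

```shell
# Start an interactive shell in the official Nutch Docker image
docker run -it apache/nutch /bin/bash

# Inside the container: create a seed list with your intranet start URL
# (placeholder hostname -- replace with your own)
mkdir -p urls
echo "http://intranet.example.com/" > urls/seed.txt

# Run two rounds of crawling and index the results into Solr.
# Which hosts are included or excluded is controlled by the URL
# filters (e.g. conf/regex-urlfilter.txt) mentioned above.
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch \
    -s urls crawl 2
```

This assumes a Solr instance is already running and reachable from the container; the URL filter configuration is where you restrict the crawl to your intranet hosts.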

Best,
Sebastian

[1] https://nutch.apache.org/
[2] http://stormcrawler.net/

Ed Summers

Aug 19, 2022, 9:14:46 AM
to common...@googlegroups.com
(shameless & slightly off-topic plug)

If a browser-based crawler is of interest you might also want to check out browsertrix-crawler [1] from the Webrecorder project [2]. It can be especially helpful when archiving sites that use JavaScript to dynamically pull in content.

browsertrix-crawler is open source and is designed to be run via Docker. It supports “profiles” for logging in to particular sites, which can be handy depending on what you are crawling (e.g. an intranet).
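As a rough sketch (untested here), a typical invocation via Docker looks something like the following; the collection name and URL are placeholders, and the project README documents the full set of options, including creating and passing login profiles:

```shell
# Crawl a site with browsertrix-crawler and write the result as a
# WACZ archive into ./crawls (URL and collection name are examples)
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler \
    crawl --url https://example.com/ --generateWACZ --collection mycrawl
```

The resulting WACZ file under ./crawls can then be replayed or embedded as described below.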

The replayweb.page web component [3], also from Webrecorder, can make embedding your web archive in a website a snap, without requiring server-side software (just JavaScript and your archived content). For some examples of that check out the Stanford Digital Publications Web Archives [4].

//Ed

[1] https://github.com/webrecorder/browsertrix-crawler
[2] https://webrecorder.net
[3] https://replayweb.page
[4] https://archive.supdigital.org/
