New and looking for information on Common Crawl


Vincent

Aug 19, 2022, 5:45:33 AM
to Common Crawl
Hello everyone,

I have just joined the group, and I am looking for information on using Common Crawl to study web scraping.

If I want to take data from an intranet website and keep it just for myself in Elasticsearch or Solr, without using AWS, is that possible?
I couldn't find any app or Docker image for this.

Can I use Common Crawl for this, or is it not possible?
Thanks.

Kind regards,

Vincent

Sebastian Nagel

Aug 19, 2022, 8:43:11 AM
to common...@googlegroups.com
Hi Vincent,

Common Crawl is just "an open repository of web crawl data".
Because the crawler only visits openly accessible sites and
also respects robots.txt rules, there will be no content
included from any intranets.

What you're looking for is web crawler software. Common Crawl
itself uses Apache Nutch [1] and StormCrawler [2].
Both would fit your use case:
- crawl web pages accessible via intranet
  (whether open web or intranet depends only on the
  configuration of inclusion/exclusion rules, i.e.
  URL filters in Nutch/StormCrawler terminology)
- index content into ES, Solr or other indexing backends
- Docker images or a Dockerfile to build an image are available
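To make the Nutch route concrete, here is a rough, untested sketch of a small crawl indexed into a local Solr core, following the pattern from the Nutch tutorial. The image tag, seed URL, core name ("nutch") and round count are all placeholders — adapt them to your setup:

```shell
# Start an interactive shell in the official Nutch Docker image
docker run -it apache/nutch /bin/bash

# Inside the container: create a seed list with your intranet start URL
# (placeholder hostname -- replace with your own)
mkdir -p urls
echo "http://intranet.example.com/" > urls/seed.txt

# Run two rounds of crawling and index the results into Solr.
# Which hosts are included or excluded is controlled by the URL
# filters (e.g. conf/regex-urlfilter.txt) mentioned above.
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch \
    -s urls crawl 2
```

This assumes a Solr instance is already running and reachable from the container; the URL filter configuration is where you restrict the crawl to your intranet hosts.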

Best,
Sebastian

[1] https://nutch.apache.org/
[2] http://stormcrawler.net/

Ed Summers

Aug 19, 2022, 9:14:46 AM
to common...@googlegroups.com
(shameless & slightly off-topic plug)

If a browser-based crawler is of interest you might also want to check out browsertrix-crawler [1] from the Webrecorder project [2]. It can be especially helpful when archiving sites that use JavaScript to dynamically pull in content.

browsertrix-crawler is open source and is designed to be run via Docker. It supports “profiles” for logging in to particular sites, which can be handy depending on what you are crawling (e.g. an intranet).
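As a rough sketch (untested here), a typical invocation via Docker looks something like the following; the collection name and URL are placeholders, and the project README documents the full set of options, including creating and passing login profiles:

```shell
# Crawl a site with browsertrix-crawler and write the result as a
# WACZ archive into ./crawls (URL and collection name are examples)
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler \
    crawl --url https://example.com/ --generateWACZ --collection mycrawl
```

The resulting WACZ file under ./crawls can then be replayed or embedded as described below.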

The replayweb.page web component [3], also from Webrecorder, can make embedding your web archive in a website a snap, without requiring server-side software (just JavaScript and your archived content). For some examples of that check out the Stanford Digital Publications Web Archives [4].

//Ed

[1] https://github.com/webrecorder/browsertrix-crawler
[2] https://webrecorder.net
[3] https://replayweb.page
[4] https://archive.supdigital.org/
