commoncrawl vs etc


Nov 19, 2019, 12:37:25 PM
to Common Crawl
As a beginner to this, what's the key difference between common-crawl and are there any other similar projects?

There are quite a few archiving initiatives: but is the one I knew trying to archive the whole internet, until I found common-crawl a few days ago, which seems to be doing similar things.


Sebastian Nagel

Nov 20, 2019, 7:06:58 AM
Hi Shawn,

Common Crawl is targeted at programmers, data scientists, and researchers working with web data.
The focus on "web data" explains why Common Crawl archives only the HTML. Without the page
dependencies (images, videos, JavaScript, CSS) which determine the visual appearance, the page
captures are not really useful for "end users" browsing the archives. Instead we try to support
"data users" and provide software libraries and code examples to process the data inside the page
captures. Web archiving in the sense of preserving cultural heritage wasn't the initial objective
of Common Crawl. However, we share a lot with the web archiving community, esp. the WARC format
and all tools around it.
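
To give a rough idea of what "processing the data inside the page captures" can look like, here is a
minimal sketch that walks the records of an uncompressed WARC stream using only the Python standard
library. It is an illustration under simplifying assumptions, not Common Crawl's own tooling: the
helper names `read_warc_records` and `make_record` are hypothetical, and the sample data is built
in memory rather than taken from a real crawl.

```python
import io


def read_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream.

    Illustrative helper (hypothetical name); assumes well-formed records.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        # Named header fields follow the version line, up to an empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length gives the size of the record block in bytes.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


def make_record(uri, http_payload):
    """Build a tiny 'response' record for the demo (hypothetical helper)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + http_payload + b"\r\n\r\n"


# Made-up sample data: two captures of a trivial HTML page.
http = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    b"<html><title>Hi</title></html>"
)
sample = make_record("http://example.com/", http) + make_record("http://example.org/", http)

records = list(read_warc_records(io.BytesIO(sample)))
for headers, block in records:
    # For a response record, the block is the HTTP response: headers, empty line, body.
    http_headers, _, html = block.partition(b"\r\n\r\n")
    print(headers["WARC-Target-URI"], len(html))
```

Real WARC files are normally gzip-compressed and contain further record types and header fields, so
in practice an existing WARC library (for example warcio in Python) is the safer choice than
hand-written parsing like the above.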
