commoncrawl vs archive.org etc


Shawn

Nov 19, 2019, 12:37:25 PM
to Common Crawl
As a beginner to this, what's the key difference between Common Crawl and archive.org? Are there any other similar projects?

There are quite a few archiving initiatives: https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives but archive.org was the only one I knew of that tries to archive the whole internet, until I found Common Crawl a few days ago, which seems to be doing something similar.

Thanks,
Shawn

Sebastian Nagel

Nov 20, 2019, 7:06:58 AM
to common...@googlegroups.com
Hi Shawn,

Common Crawl is aimed at programmers, data scientists, and researchers working with web data.
This focus on "web data" explains why Common Crawl archives only the HTML. Without the page
dependencies (images, videos, JavaScript, CSS) that determine the visual appearance, the page
captures are not really useful for "end users" browsing the archives. Instead, we try to support
"data users" and provide software libraries and code examples for processing the data inside the
page captures. Web archiving in the sense of preserving cultural heritage wasn't the initial
objective of Common Crawl. However, we share a lot with the web archiving community, especially
the WARC format and all the tools around it.
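
To give a rough idea of what "processing the data inside the page captures" can look like, here is a
minimal sketch in Python that reads records from a WARC stream using only the standard library. It is
an illustration, not Common Crawl's own tooling: the function name iter_warc_records and the sample
record are made up for this example, and it assumes uncompressed input, whereas the published WARC
files are gzip-compressed. For real data an existing WARC library is the better choice.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration only; it handles just the basic
    record layout: version line, named headers, blank line, content block.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length gives the size of the content block in bytes.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# A hand-built sample "response" record (not real crawl data).
http_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><title>Hi</title></html>"
)
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"Content-Length: " + str(len(http_response)).encode("ascii") + b"\r\n",
    b"\r\n",
    http_response,
    b"\r\n\r\n",
])

records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    if headers.get("WARC-Type") == "response":
        # The content block holds the HTTP headers, a blank line, then the HTML.
        http_head, _, html = payload.partition(b"\r\n\r\n")
        print(headers["WARC-Target-URI"], len(html), "bytes of HTML")
```

The sketch only shows the shape of the format: each capture carries its own metadata (such as the
target URI) next to the raw HTTP response, which is what makes the data convenient for bulk processing.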

Best,
Sebastian
