The construction of a web archive begins with defining what the archive’s purpose should be. Some web archives have very specific inclusion criteria and focus on narrow topics for which there is a known and limited universe of content to preserve. Other web archiving initiatives simply set out to archive what they can through a vast web of sources and donors, without any overarching collection strategy or user community to guide them, in marked contrast to the approach the library and archival communities typically take to physical collections.
Some of the largest collections of archived web content are thus multi-petabyte datasets compiled over years or even decades through criteria, seed lists, crawler designs, and design decisions, both explicit and inadvertent, that have either long since been lost to time or are considered proprietary and cannot be shared.