A reason to preserve the physical book that has been digitized is that it is the authentic and original version that can be used as a reference in the future. If there is ever a controversy about the digital version, the original can be examined. A seed bank such as the Svalbard Global Seed Vault is seen as an authoritative and safe version of crops we are growing. Saving physical copies of digitized books might at least be seen in a similar light as an authoritative and safe copy that may be called upon in the future.
There is also a connection between digitized collections and physical collections. The libraries we scan in rarely want more digital books than the digital versions that we scan from their collections. This struck us as strange until we better understood the craftsmanship required in putting together great collections of books, whether physical or digital. As we archive the books, we carefully record with the physical book the identifier for the virtual version, and attach to the digital version information about where the physical version resides.
Therefore we have determined that we will keep a copy of the books we digitize if they are not returned to another library. Since we are interested in scanning one copy of every book ever published, we are starting to collect as many books as we can.
We hope that there will be many archives of physical books and other materials, as they will be used and preserved in different ways based on the organizations they reside in. Universities will have different access policies from national libraries, say, and most likely different access policies from the Internet Archive. With many copies in diverse organizations and locations we are more likely to serve different communities over time.
Internet Archive is building a physical archive for the long-term preservation of one copy of every book, record, and movie we are able to attract or acquire. Because we expect day-to-day access to these materials to occur through digital means, our physical archive is designed for long-term preservation of materials with only occasional, collection-scale retrieval. Because of this, we can create optimized environments for physical preservation and organizational structures that facilitate appropriate access. A seed bank might be conceptually closest to what we have in mind: storing important objects in safe ways to be used for redundancy, authority, and in case of catastrophe.
The goal is to preserve one copy of every published work. The universe of unique titles has been estimated at close to one hundred million items. Many of these are rare or unique, so we do not expect most of these to come to the Internet Archive; they will instead remain in their current libraries. But preserving over ten million items is possible, so we have designed a system that will expand to this level. Ten million books is approximately the size of a world-class university library or public library, so we see this as a worthwhile goal. If we are successful, then this set of cultural materials will last for centuries and could be beneficial in ways that we cannot predict.
To start this project, the Internet Archive solicited donations of several hundred thousand books in dozens of languages in subjects such as history, literature, science, and engineering. Working with donors of books has been rewarding because an alternative for many of these books was the used book market or being destroyed. We have found everyone involved has a visceral repulsion to destroying books. The Internet Archive staff helped some donors with packing and transportation, which sped projects and decreased wear and tear on the materials.
To link the digital version of a book to the physical version, care is taken to catalog each book and note its physical location so that future access can be enabled. Most books are cataloged by finding a record in existing library catalogs for the same edition. If no such catalog record can be found, then the book is cataloged briefly in the Open Library. Links are made from the paper version to the digital version by printing identifying and catalog data on a slip of acid-free paper that is inserted in the book. Linking from the digital version to the paper version is done by encoding the location into the database records and identifiers into the resulting digital book versions. The digital versions have been replicated and the catalog data has been shared.
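As a rough illustration of this two-way link, here is a minimal Python sketch. It is not the Internet Archive's actual schema: the class names, the container/pallet/box location fields, and the sample identifier are all assumptions made up for the example. It only shows the idea that the slip in the book carries the digital identifier, while the database record carries the physical location.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PhysicalLocation:
    # Hypothetical location fields; the real archive's scheme is not described.
    container: str
    pallet: str
    box: str


@dataclass
class DigitalRecord:
    identifier: str             # identifier of the digital version
    title: str
    location: PhysicalLocation  # where the paper copy resides


def slip_text(record: DigitalRecord) -> str:
    """Text for the acid-free slip inserted in the book (paper -> digital)."""
    return "\n".join([
        f"Identifier: {record.identifier}",
        f"Title: {record.title}",
    ])


class Catalog:
    """Database side of the link (digital -> paper)."""

    def __init__(self) -> None:
        self._by_identifier: Dict[str, DigitalRecord] = {}

    def register(self, record: DigitalRecord) -> None:
        if record.identifier in self._by_identifier:
            raise ValueError(f"identifier already in use: {record.identifier}")
        self._by_identifier[record.identifier] = record

    def locate(self, identifier: str) -> PhysicalLocation:
        # Raises KeyError if the identifier was never registered.
        return self._by_identifier[identifier].location
```

With this layout, someone holding the book can read the identifier off the slip, and someone looking at the digital record can call `locate` to find the box that holds the paper copy.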
Most of these first books have been digitized with funding from stimulus money for jobs programs and funding from the Kahle/Austin Foundation. This served to build the core collection of modern books for the blind and dyslexic. Many of these digital books are also available to be digitally borrowed through the Open Library website.
This was a change from our previous mass digitization procedures, in which a library would deliver and retrieve books from our scanning centers. Where the libraries would have already done the sorting and de-duplication of books, we now need to do these functions ourselves. The process to identify titles that have not been preserved already is now in place, but is in active development to improve efficiency. The thorough work of libraries in cataloging materials is key in this process because we can reuse their catalog records for these books. Identifiers such as ISBN, LCCN, and OCLC IDs have helped determine which books are duplicates.
This physical archive is designed to help resist insects and rodents, control temperature and humidity, slow acidification of the paper, protect against fire, water, and intrusion, contain possible contamination, and endure possible uneven maintenance over time. For these reasons the books are stored in isolated environments with a regulated airflow that depends on few active components.
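The identifier-based duplicate check can be sketched as follows. This is only an illustration under simple assumptions, not the process we actually run: the record fields, the normalization rules, and the `PreservedIndex` class are hypothetical, and the sketch treats two books as duplicates whenever any one normalized identifier matches. It does not reconcile ISBN-10 with ISBN-13 forms of the same number.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class BookRecord:
    title: str
    isbn: Optional[str] = None
    lccn: Optional[str] = None
    oclc: Optional[str] = None


def normalize_isbn(raw: str) -> str:
    # Keep digits and the ISBN-10 check character X; drop hyphens and spaces.
    return "".join(ch for ch in raw if ch.isdigit() or ch in "xX").upper()


def identifier_keys(book: BookRecord) -> List[Tuple[str, str]]:
    """Return (kind, normalized value) pairs for every identifier present."""
    keys = []
    if book.isbn:
        keys.append(("isbn", normalize_isbn(book.isbn)))
    if book.lccn:
        keys.append(("lccn", book.lccn.strip().lower()))
    if book.oclc:
        # Assumed rule: leading zeros are not significant in OCLC numbers.
        keys.append(("oclc", book.oclc.strip().lstrip("0")))
    return keys


class PreservedIndex:
    """Tracks which identifiers belong to books already preserved."""

    def __init__(self) -> None:
        self._seen: Dict[Tuple[str, str], BookRecord] = {}

    def find_duplicate(self, book: BookRecord) -> Optional[BookRecord]:
        for key in identifier_keys(book):
            if key in self._seen:
                return self._seen[key]
        return None

    def add(self, book: BookRecord) -> None:
        for key in identifier_keys(book):
            self._seen.setdefault(key, book)
```

A donated book would be passed to `find_duplicate` first and added to the index only when nothing matches. Books with no identifiers at all never match here and would still need a catalog search by title and edition.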
Thank you to Tom McCarty, Robert Miller, Sean Fagan, Internet Archive staff, San Francisco Public Library leadership, Alibris, HHS of the City of San Francisco, and the Kahle/Austin Foundation for being leaders on this project.
Yes, the Library of Congress is doing a great job. We believe we have a role by being a very different organization. We believe we will provide access to different groups and different approaches to preservation which can be valuable.
I have valued books since my mother taught me to read long before I started school, and I am so grateful that you are preserving the physical objects that seem to be getting less and less important to so many people.
But if it were possible to a) personally donate small sets of or individual books and b) have a system by which it could be determined if you already had the book, I think this project would become very rich in books indeed.
Good idea. Reducing the relative humidity that far has been recommended. We are going to be working on cost-effective ways of dropping the humidity and will try to get that low. As we understand the Library of Congress does, when we want to access the books again after they have been in such a dry environment, they will need to slowly re-hydrate. Pretty nifty.
In regard to the June 7th comments referencing the Library of Congress, we at the Library would like to provide the following point of clarification on our temperature and relative humidity (RH) controls: The Library of Congress has a state-of-the-art, specialized facility in Fort Meade, Maryland, designed to efficiently hold library collections at a controlled temperature and relative humidity optimized for the long-term preservation of the collections. The specific temperature and relative humidity (RH) for the storage of different materials (50 degrees F and 30% RH for books and paper materials; 35 degrees F and 30% RH for black and white photographs, microfilm, and microfiche; 25 degrees F and 25% RH for photographic negatives, transparencies, and color prints) effectively prolong the useful life of the collections (by reducing the rate of chemical degradation) and follow existing ISO standards. Collections are removed from the cool module and from the cold and freezing vaults to a staging area, which allows the materials to acclimate to a warmer temperature without the formation of condensation.
There are abandoned limestone mines in the Bluffs of the Mississippi River not far from St Louis. These would be better than abandoned coal or metal mines (no toxic vapours or waters). They also have lower seismic risk than Richmond.
Important works could and should be stored redundantly in places that vary in climate and politics, but for the great mass of literature in an archive that large, it might be sufficient to just store parts of it in different locations so risk was statistically distributed. In a sense, there is already a great deal of redundancy built into the stuff mankind has written even without duplicating individual works.
Jim A: The brittle book problem is a result of residual acid in the wood pulp paper left as part of the manufacturing process. The Internet Archive is specifically using acid-free paper to avoid that problem.
Yes, they use acid-free paper for the cataloging process, but the books themselves are still acid paper. That is why they must keep them in a place with tightly controlled temperature and humidity to help prevent the paper from breaking down.
As someone else mentioned, having one copy of all items is a single point of failure. Have you considered having 2 or 3 copies (assuming the items are not so rare as to be unique, or at least not have too many unique items)? And making sure the copies are distributed across, say, continents?
I wish there was a similar project to collect and preserve film content. Nitrate and celluloid film deteriorate very quickly compared to paper, and we are losing our original cultural, social and historical resources daily. In a few decades, a digital copy may be the only access we have to those resources.
Thank you for sharing your costs. Are they for renting space or do you own the salt mine? As for our costs, we do not know yet. We are spending on the upfront design and build in order to try to decrease the ongoing costs, but this is still just hypothetical.