Cia Internet Archive

Stella Kreuter

Aug 5, 2024, 9:02:29 AM

to lethethobound

Heritrix (sometimes spelled heretrix, or misspelled or mis-said as heratrix/heritix/heretix/heratix) is an archaic word for heiress (a woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.

(If you see a different User-Agent in your logs that still says 'heritrix', it may be someone else using this open-source software. In such a case, even if we can't directly change how your site is crawled, we are happy to help you interpret your logs and identify, contact, or block the source of any troublesome crawling.)
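As a rough illustration of the log check described above, here is a minimal sketch that counts requests whose User-Agent mentions 'heritrix', grouped by client IP and agent string. It assumes the server writes the common "combined" access-log format, where the User-Agent is the last double-quoted field; the function name and the sample log lines are made up for the example.

```python
import re
from collections import Counter

# Assumption: "combined" access-log format, where the User-Agent is the
# last double-quoted field on each line.
LINE_RE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"\s*$')

def heritrix_hits(lines):
    """Count requests per (client IP, User-Agent) whose agent mentions heritrix."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and "heritrix" in m.group("agent").lower():
            hits[(m.group("ip"), m.group("agent"))] += 1
    return hits

# Hypothetical sample lines; real logs would be read from a file.
sample = [
    '192.0.2.10 - - [05/Aug/2024:09:02:29 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; heritrix/3.0.0 +http://example.org/bot)"',
    '192.0.2.11 - - [05/Aug/2024:09:02:30 +0000] "GET /a HTTP/1.1" 200 128 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"',
    '192.0.2.10 - - [05/Aug/2024:09:02:31 +0000] "GET /b HTTP/1.1" 200 256 "-" '
    '"Mozilla/5.0 (compatible; heritrix/3.0.0 +http://example.org/bot)"',
]

for (ip, agent), count in heritrix_hits(sample).items():
    print(count, ip, agent)
```

The contact URL that an operator places in the agent string is usually the quickest way to identify who is running a given crawl.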

Release 3.0.0 is a major release, the first of the Heritrix3 ("H3") series. It includes new features and issue fixes, and a significant reworking of the configuration system and user interface based on current and expected needs.

Heritrix3 is currently suitable for advanced users and projects that are either customizing Heritrix (with Java or other scripting code) or embedding Heritrix in a larger system. Please review the Current Limitations to help determine if Heritrix3 or a current Heritrix1 (1.14.4 or later) release is best suited for your needs.

The next major release will be 2.2 in 2009, which is planned to include updates to the Heritrix 2 configuration system and checkpointing functionality, and tools easing transition from 1.14.x to Heritrix 2.2.

Release 1.14.0 adds a number of small features to the Heritrix 1.x line, most notably upgrading support for the WARC archived-web-content format to version 0.17 (ISO Committee Draft). This release also includes 41 bug fixes or other incremental improvements, including several based on community contributions or requests.

Release 1.12.0 is the first of several planned releases enhancing Heritrix with "smart crawler" functionality. In this release, the theme has been offering new options to reduce the amount of duplicate content crawled and stored when recrawling sites at regular intervals. A number of other enhancements and bug fixes are also included. See the Release Notes for details.

This is primarily a bug-fix release, with a couple of new features, provided before a number of significant changes to the Heritrix project that will require developer and crawl operator adjustments. Post-1.10.2, Heritrix source code control, issue tracking, and build process will migrate to new systems. Also, updates to core classes, especially with regard to the settings architecture, will noticeably break backward compatibility with 1.10.2 and prior crawler settings files and formats. See Release Notes for details.

The Deduplicator is an add-on module for Heritrix that allows sequential snapshot crawls to leverage information about previous iterations to avoid storing (or even downloading) duplicate data. See the mailing list announcement for details.
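The general idea of skipping unchanged content between snapshot crawls can be sketched as follows. This is not the Deduplicator's actual index or API, only an illustration under the assumption that the previous crawl left behind a mapping from URL to content digest; it covers the "avoid storing" case, not the "avoid downloading" case. All names here are hypothetical.

```python
import hashlib

def content_digest(body: bytes) -> str:
    """Digest of the fetched payload, used to compare crawl iterations."""
    return "sha1:" + hashlib.sha1(body).hexdigest()

def should_store(url: str, body: bytes, previous_digests: dict) -> bool:
    """Return True when the body differs from what the previous crawl recorded."""
    return previous_digests.get(url) != content_digest(body)

# Hypothetical record left by an earlier crawl iteration.
previous = {"http://example.org/": content_digest(b"<html>same</html>")}

print(should_store("http://example.org/", b"<html>same</html>", previous))  # unchanged page
print(should_store("http://example.org/", b"<html>new</html>", previous))   # changed page
```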

Release 1.10.0 adds new configuration options, experimental new protocol and format support, and lots of fixes (43 tracked bugs have been fixed and 35 feature requests added). Requires JDK 1.5.x. See Release Notes for detail.

Release 1.6.0 offers improved remote control and monitoring via JMX, a crawl-checkpointing facility, and experimental support for bloom filter already-included testing, partitioning a crawl across multiple independent crawlers, and per-host/domain/queue-grouping collection quotas. Performance and stability in large crawls are also improved. Among tracked issues, it includes 39 requested enhancements and fixes 96 reported bugs. See Heritrix Release Notes for detail and Known Limitations: for example, you will again need to tweak your old order files to make them work with the new release.
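For readers unfamiliar with bloom-filter "already-included" testing, here is a minimal sketch of the technique. It is not Heritrix's implementation; the class, its default sizes, and the double-hashing scheme are assumptions chosen for brevity. A Bloom filter can report a URI as seen when it was not (a false positive) but never the reverse, which is the trade-off a crawler accepts in exchange for a small, fixed memory footprint.

```python
import hashlib

class BloomFilter:
    """Fixed-size set membership test with possible false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("http://example.org/")
print("http://example.org/" in seen)       # an added URI always tests as seen
print("http://example.org/other" in seen)  # almost certainly not seen
```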

Much improved memory usage, new experimental scoping/filter model, and a new revisiting frontier. Over 90 bugs fixed. See Heritrix Release Notes for detail and Known Limitations: for example, you cannot use your old order files with the new release.

Added IP-based politeness, configurable URI-canonicalization, and mid-fetch abort. Many bug fixes. See Heritrix Release Notes for detail and Known Limitations (in particular, HTTPS fetching requires the Sun JDK, and the UI throws an OutOfMemoryError (OOME) if jobs run in series).

Added new prefix ('SURT') scope and filter, compression of recovery log, mass adding of URIs to running crawler, crawling via an HTTP proxy, adding of headers to request, improved out-of-the-box defaults, hash of content to crawl log and to arcreader output, and many bug fixes. See Heritrix Release Notes for detail and known limitations.
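A SURT rewrites a URI so that the host name reads from the top-level domain downward, which lets a plain string-prefix comparison express "this domain and everything under it". The sketch below is a simplified approximation, not Heritrix's own conversion: it ignores ports, user info, and the other normalization rules, and the function names are hypothetical.

```python
from urllib.parse import urlsplit

def to_surt(url: str) -> str:
    """Simplified SURT form: scheme://(reversed,host,labels,)/path?query."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split("."))) + ","
    surt = f"{parts.scheme}://({reversed_host}){parts.path}"
    if parts.query:
        surt += "?" + parts.query
    return surt

def in_scope(url: str, surt_prefixes) -> bool:
    """A URI is in scope when its SURT form starts with any accepted prefix."""
    surt = to_surt(url)
    return any(surt.startswith(prefix) for prefix in surt_prefixes)

print(to_surt("http://www.archive.org/movies/"))

# One prefix covers archive.org and all of its subdomains.
prefixes = ["http://(org,archive,"]
print(in_scope("http://web.archive.org/x", prefixes))
print(in_scope("http://example.com/", prefixes))
```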

Release for the second Heritrix workshop, Copenhagen, 06/2004 (1.0.0 first release candidate). Added site-first prioritization, fixed link extraction of multibyte URIs, added metadata to ARCs as XML, changed ARC naming template, new user and developer manuals, added basic/digest auth and HTTP POST/GET login facility, and added help to UI. Bug fixes. See Heritrix Release Notes for detail and known limitations.

Release (and branch heritrix-0_8 made at the heritrix-0_7_1 tag) because of ConcurrentModificationExceptions if tens of seeds are supplied, and to fix domain-scope leakage. Also, made the continuous build publicly available, incorporated the integration selftest into the build, made it a Maven-only build (the Ant build is no longer supported), added day/night configurations (refinements), ameliorated too-many-open-files errors, added use of the HTTP-header content-type charset when creating character streams, and Heritrix now crawls SSL sites. UI improvements include red start by bad configuration, precompilation, and delineation of advanced settings. See Heritrix Release Notes for detail.

Release made in advance of radical frontier changes. Added bandwidth throttle, operator 'diary', settable robots expiration, crawler cookie pre-population, and changing of certain options mid-crawl. Many UI improvements, including UI display of critical exceptions, UI description of job-order options, and improved reporting. Optimizations. Updated httpclient lib to 2.0 release and JMX libs to 1.2.1. See Heritrix Release Notes for detail.

Release made for the Heritrix workshop, San Francisco, 02/2004. New MBean-based configuration, extensive UI revamp, first unit tests and integration selftest framework added, pooling of ARCWriters, new command-line start scripts, httpclient lib update (2.0RC3), and bug fixes. See Heritrix Release Notes for detail.

The Internet Archive is an American nonprofit digital library founded in 1996 by Brewster Kahle.[1][2][4] It provides free access to collections of digitized materials including websites, software applications, music, audiovisual, and print materials. The Archive also advocates for a free and open Internet. As of February 4, 2024, the Internet Archive held more than 44 million print materials, 10.6 million videos, 1 million software programs, 15 million audio files, 4.8 million images, 255,000 concerts, and over 835 billion web pages in its Wayback Machine.[5] Its stated mission is to provide "universal access to all knowledge".[5]

The Internet Archive allows the public to upload and download digital material to its data cluster, but the bulk of its data is collected automatically by its web crawlers, which work to preserve as much of the public web as possible. Its web archive, the Wayback Machine, contains hundreds of billions of web captures.[6][7] The Archive also oversees numerous book digitization projects, collectively one of the world's largest book digitization efforts.

Brewster Kahle founded the Archive in May 1996, around the same time that he began the for-profit web crawling company Alexa Internet.[8][9] The earliest known archived page on the site was saved on May 10, 1996, at 2:42 pm UTC (7:42 am PDT). By October of that year, the Internet Archive had begun to archive and preserve the World Wide Web in large amounts.[10][11][12][13][14] The archived content became more easily available to the general public in 2001, through the Wayback Machine.

In late 1999, the Archive expanded its collections beyond the web archive, beginning with the Prelinger Archives. Now, the Internet Archive includes texts, audio, moving images, and software. It hosts a number of other projects: the NASA Images Archive, the contract crawling service Archive-It, and the wiki-editable library catalog and book information site Open Library. Soon after that, the Archive began working to provide specialized services relating to the information access needs of the print-disabled; publicly accessible books were made available in a protected Digital Accessible Information System (DAISY) format.[15]

Most societies place importance on preserving artifacts of their culture and heritage. Without such artifacts, civilization has no memory and no mechanism to learn from its successes and failures. Our culture now produces more and more artifacts in digital form. The Archive's mission is to help preserve those artifacts and create an Internet library for researchers, historians, and scholars.

In August 2012, the Archive announced[17] that it had added BitTorrent to its file download options for more than 1.3 million existing files, and all newly uploaded files.[18][19] This method is the fastest means of downloading media from the Archive, as files are served from two Archive data centers, in addition to other torrent clients which have downloaded and continue to serve the files.[18][20] On November 6, 2013, the Internet Archive's headquarters in San Francisco's Richmond District caught fire,[21] destroying equipment and damaging some nearby apartments.[22] According to the Archive, it lost a side-building housing one of its 30 scanning centers; cameras, lights, and scanning equipment worth hundreds of thousands of dollars; and "maybe 20 boxes of books and film, some irreplaceable, most already digitized, and some replaceable".[23] The nonprofit Archive sought donations to cover the estimated $600,000 in damage.[24]

In November 2016, Kahle announced that the Internet Archive was building the Internet Archive of Canada, a copy of the Archive to be based somewhere in Canada. The announcement received widespread coverage due to the implication that the decision to build a backup archive in a foreign country was because of the upcoming presidency of Donald Trump.[27][28][29] Kahle was quoted as saying: