Hi Elise -
Checking the integrity of archives is important and, as you
probably know, requires retrieving the stored files and
verifying their contents independently of the storage solution.
AWS, like other "cloud" providers, uses the "Hotel California"
model: it is "free" to upload, but costly to store and
retrieve files. In this model, you can check out (or quit) any
time you like, but you can never leave, especially when you are
constrained by small or inflexible budgets and are maintaining
large archives.
I maintain archives on the order of 1 TB and just recently
moved out of AWS because of concerns about vendor lock-in
and related budget issues. Now I am using a custom solution (with
integrity checks) in combination with georedundant mirrors and
offline backups on a hard disk I can put in a drawer.
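For a custom setup like the one above, the integrity check can be as simple as a checksum manifest built once on ingest and re-verified on each mirror. A minimal sketch in Python (the function names and manifest shape are illustrative, not the actual tooling described here):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large archives don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths whose current digest no longer matches."""
    return [rel for rel, expected in manifest.items()
            if sha256_of(root / rel) != expected]
```

The same manifest can travel with the archive to each mirror, so every copy can be checked against the same known-good digests.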
How did you factor in the cost of periodic data retrieval to
independently verify the integrity of your archives? How large is
your archive? 1 TB? 100 TB?
Curious to hear your thoughts,
--
You received this message because you are subscribed to the Google Groups "Digital Curation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digital-curati...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/digital-curation/ff4e5322-6d2a-4afa-bd39-220bfde7d3f7n%40googlegroups.com.
Hey Elise -
I am using rsync for mirroring and a specialized home-grown tool
called "Preston" (https://preston.guoda.bio) to reliably
reference, clone, and verify a large, ever-growing collection of
biodiversity datasets and their origins.
Hosting is via https://hetzner.de, where I rent a 2x1.5 TB
server with unlimited bandwidth for ~$20/mo. Mirrors include
the Internet Archive, Zenodo, a server at an academic institution,
and periodic incremental downloads to local hard disks (bought
around the corner at a local retailer) via a consumer-grade
internet connection for offline access.
Happy to share more info if you'd like.
thx,
-jorrit
Hi Elise,
With regard to S3 and Glacier, AWS provides entity tags (eTags) to help with fixity checking via the AWS CLI. They're somewhat tricky to use, as the hash returned may not always be in the form of an MD5 digest. This depends on how the object was added to the S3 bucket, the type of encryption used, and whether a multipart upload took place (objects over 5 GB).
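For what it's worth, the multipart eTag can often be reproduced locally: it is commonly observed to be the MD5 of the concatenated per-part MD5 digests, suffixed with the part count. AWS does not guarantee this format (and it does not hold for some encryption settings), so treat the sketch below as a best-effort check; the part size must match the one used at upload time:

```python
import hashlib

def multipart_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    """Reproduce the eTag S3 is commonly observed to report:
    for multipart uploads, MD5 over the concatenated binary MD5
    digests of each part, followed by '-<number of parts>'."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        # Single-part uploads get a plain MD5 eTag.
        return hashlib.md5(data).hexdigest()
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"
```

If the locally computed value matches the eTag reported by `aws s3api head-object`, the copies very likely agree; if it does not, that alone is not proof of corruption, given the caveats above.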
Because of this added complexity, the service we've built in-house to perform fixity checking does not depend on eTags. It instead works with a SHA-256 digest that is stored as metadata alongside the object on ingest. We've also done this because of a need to work across cloud storage providers (AWS, Qumulo, and Wasabi). So regardless of where an object is pulled from for checking, the same implementation can be used to fetch the object plus its metadata and compare the digest value to that of the local copy as well as to a known truth in a central database.
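The scheme described here (store a SHA-256 digest as object metadata on ingest, then recompute and compare on retrieval) might look roughly like the following; the metadata key name and the streaming interface are assumptions for illustration, not CDL's actual implementation:

```python
import hashlib
from typing import BinaryIO

METADATA_KEY = "x-amz-meta-sha256"  # hypothetical metadata key name

def recompute_digest(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Recompute SHA-256 over the object's bytes as they stream back."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def check_fixity(stream: BinaryIO, metadata: dict[str, str],
                 known_truth: str) -> bool:
    """Compare the recomputed digest against both the digest stored
    alongside the object and the value recorded in a central database."""
    actual = recompute_digest(stream)
    return actual == metadata.get(METADATA_KEY) == known_truth
```

Because only the stream and metadata dictionary are provider-specific, the same comparison logic can back fixity checks against any of the storage backends mentioned above.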
Hope this info is helpful,
Best,
Eric
–
Eric Lopatin | Product Manager, Digital Preservation
California Digital Library | University of California Curation Center (UC3)