Integrity Checks in AWS

Elise Tanner

Feb 1, 2021, 10:35:16 AM
to Digital Curation
We have been using DuraCloud services for a number of years now. I'm currently looking into our options to reduce costs as our storage needs grow. I'm new to AWS and am finding the wealth of AWS information online quite overwhelming. 

I'm trying to find out whether AWS S3 and Glacier can provide data integrity reports the same way DuraCloud does. If so, does it require a special utility or use of the command line? Is the process automated in any way or completely manual?

Would using an external packaging tool that integrates integrity checking, like Bagger, work with AWS?

Any insight about AWS and integrity checking/monitoring would be very helpful. Thanks!

Best,
Elise

  --
Elise Tanner | Director of Digital Projects and Initiatives
Center for Arkansas History and Culture
University of Arkansas at Little Rock
501-320-5770 | emta...@ualr.edu | ualr.edu/cahc
facebook.com/ualrcahc | twitter.com/ualrcahc
she/her/hers

Jorrit Poelen

Feb 1, 2021, 11:09:45 AM
to digital-...@googlegroups.com, Elise Tanner

Hi Elise -

Checking the integrity of archives is important and, as you probably know, requires retrieving the stored files and verifying their contents independently of the storage solution.

AWS, like other "cloud" providers, uses the "Hotel California" model - making it "free" to upload, but costly to store and retrieve files. In this model, you can check out (or quit) anytime you'd like, but you can never leave, especially when you are constrained by small or inflexible budgets and are maintaining large archives.

I am maintaining archives on the order of 1TB and have just recently moved out of AWS because of concerns about vendor lock-in and related budget issues. Now, I am using a custom solution (with integrity checks) in combination with georedundant mirrors and offline backups on a hard disk I can put in a drawer.

How did you factor in the cost of periodic data retrieval to independently verify the integrity of your archives? How large is your archive? 1TB? 100TB?

Curious to hear your thoughts,

-jorrit
--
You received this message because you are subscribed to the Google Groups "Digital Curation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digital-curati...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/digital-curation/ff4e5322-6d2a-4afa-bd39-220bfde7d3f7n%40googlegroups.com.

Elise Tanner

Feb 1, 2021, 12:08:10 PM
to Jorrit Poelen, digital-...@googlegroups.com
Thanks for the info, Jorrit.

I didn't factor in the retrieval costs related to verifying the integrity, so it's good to know that I need to do that. We are looking for a combination of S3 and Glacier storage. I'm not sure how much we will need in S3, but a minimum of 6TB to start. We would ideally like to get everything into dark storage, which I estimate at 20TB-30TB right now.

Would you mind sharing the custom solution you use now? Thanks again.

Best,
Elise
 

Jorrit Poelen

Feb 1, 2021, 12:10:44 PM
to Elise Tanner, digital-...@googlegroups.com

Hey Elise -

I am using rsync for mirroring and a specialized homegrown tool called "Preston" (https://preston.guoda.bio) to reliably reference, clone, and verify a large, ever-growing collection of biodiversity datasets and their origins.

Hosting is via https://hetzner.de where I rented a 2x1.5 TB server with unlimited bandwidth for ~ $20 / mo. Mirrors include the Internet Archive, Zenodo, a server at an academic institution and periodic incremental downloads to local hard disks (bought around the corner at a local retailer) via a consumer-grade internet connection for offline access.

Happy to share more info if you'd like.

thx,

-jorrit

Eric Lopatin

Feb 1, 2021, 6:27:26 PM
to digital-...@googlegroups.com, Elise Tanner

Hi Elise,

 

With regard to S3 and Glacier, AWS provides entity tags (ETags) to help with fixity checking via the AWS CLI. They’re somewhat tricky to use, as the hash returned may not always be in the form of an MD5 digest. This depends on how the object was added to the S3 bucket, the type of encryption used, and whether a multipart upload took place (required for objects over 5GB, and often used by upload tools for smaller ones).
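To make the ETag difficulty concrete, here is a minimal sketch of how the two most common ETag shapes can be reproduced locally. It assumes the widely observed behaviour (not a documented guarantee) that a plain single-request upload gets the hex MD5 of the content, while a multipart upload gets the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. The function names and the 8 MiB default part size are illustrative assumptions; the part size has to match whatever the uploading tool actually used, and objects encrypted with KMS keys will not match either shape.

```python
import hashlib


def single_part_etag(data: bytes) -> str:
    """ETag shape commonly seen for a plain single-request upload: hex MD5 of the bytes."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    """ETag shape commonly seen for multipart uploads.

    The MD5 digest of each part is computed, the raw digests are concatenated,
    and the MD5 of that concatenation is suffixed with "-<number of parts>".
    `part_size` is an assumption and must equal the size used at upload time.
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    if not data:
        raise ValueError("empty objects are outside the scope of this sketch")
    part_digests = [
        hashlib.md5(data[offset:offset + part_size]).digest()
        for offset in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Because the result depends on an upload detail (the part size) that is not stored with the object, a locally computed value can only confirm a match; a mismatch may simply mean the guess about the part size was wrong.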

 

Because of this added complexity, the service we’ve built in-house to perform fixity checking does not depend on eTags. It instead works with a SHA-256 digest that is stored as metadata alongside the object on ingest. We’ve also done this because of a need to work across cloud storage providers (AWS, Qumulo, and Wasabi). So regardless of where an object is pulled from for checking, the same implementation can be used to fetch the object + metadata and compare the digest value to that of the local copy as well as a known truth in a central database.
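As a rough illustration of the approach described above (not the actual UC3 service), the sketch below streams an object's bytes, computes a SHA-256 digest, and compares it with two reference values: the digest stored as object metadata and the digest recorded in a central database. The function names and the shape of the returned dictionary are assumptions made for the example; how the bytes and the two reference digests are fetched from a given provider is deliberately left out.

```python
import hashlib
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Compute a SHA-256 hex digest without loading the whole object into memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def check_fixity(stream: BinaryIO, metadata_digest: str, database_digest: str) -> dict:
    """Compare the freshly computed digest with both stored reference digests."""
    actual = sha256_of_stream(stream)
    return {
        "actual": actual,
        "matches_metadata": actual == metadata_digest.lower(),
        "matches_database": actual == database_digest.lower(),
    }
```

Reporting the two comparisons separately helps distinguish a damaged object (both fail) from damaged or stale metadata (only one fails).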

 

Hope this info is helpful,

 

Best,

Eric

 

Eric Lopatin | Product Manager, Digital Preservation

California Digital Library | University of California Curation Center (UC3)


andrew....@aptrust.org

Feb 2, 2021, 8:21:40 AM
to Digital Curation
Hi Elise,

APTrust maintains about 150TB of storage on AWS, and we implement a practice similar to what Eric describes: calculate our own fixity, tag each object with our calculated fixity, and keep a copy of all known fixities in our database.

We do this by running a relatively cheap AWS server in the same region as our S3 storage. That lets us avoid most of the data egress charges, since the S3 data never leaves the region in which it's stored.

We do regular checks every 90 days on S3 items, but there's no feasible way to do as many checks on our Glacier storage. Simply accessing Glacier data involves extra steps and extra costs.
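A fixed re-check interval like the 90-day cycle mentioned above can be sketched as a simple selection over last-checked timestamps. This is an illustrative assumption about how such a schedule might be expressed, not APTrust's implementation; the function name and the dictionary of timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def due_for_check(
    last_checked: Dict[str, Optional[datetime]],
    now: datetime,
    interval: timedelta = timedelta(days=90),
) -> List[str]:
    """Return keys never checked, or last checked at least `interval` ago, oldest first."""
    # Objects that were never checked sort before everything else.
    never = datetime.min.replace(tzinfo=timezone.utc)
    due = [
        key
        for key, checked in last_checked.items()
        if checked is None or now - checked >= interval
    ]
    return sorted(due, key=lambda key: last_checked[key] or never)
```

All timestamps are assumed to be timezone-aware; mixing naive and aware values would raise an error in the subtraction.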

For your second copies, you should look into Wasabi in addition to AWS. It has two big advantages: it's about 80% cheaper than S3, and they do not charge you to access your data. That makes it easy to check fixity.

Also, BTW, a few months ago, Amazon introduced Serverless Fixity for Digital Preservation Compliance. This was in response to requests from the preservation community. AWS sent reps to a meeting at the Library of Congress. They asked lots of questions and took lots of notes. They seem to understand the need in the DP community, but I can't vouch for their solution, because I haven't tried it.

One thing I do know about their solution is that the costs are unpredictable. Each time you run a fixity check, you pay for pulling the S3 data and for each second the fixity checker runs. If you have a lot of files, that can really add up. We decided to stick with our little server because its costs are predictable and easy to budget.

Elise Tanner

Feb 2, 2021, 1:58:27 PM
to Digital Curation
Thank you, everyone, very much. This is extremely helpful. The digipres community is good people.

Esmé Cowles

Feb 15, 2021, 10:30:31 AM
to digital-...@googlegroups.com
Princeton keeps a third copy of our digital objects in cloud storage. We do continuous fixity checking on local copies, but take a slightly different approach to deal with cloud fixity checking.

We use Google Coldline because we thought it offered the best tradeoff: very cheap storage, but very fast retrieval (at a price) if/when we want to retrieve the data. And when we do fixity checking, we run the checks in a serverless Google Cloud Function[1] to avoid egress charges and the overhead of running a server all the time.

I agree that retrieving and fixity checking everything in cold storage on a routine basis is cost-prohibitive (it's been a while since we looked at the numbers closely, but I believe that retrieving content from Glacier or Coldline yearly essentially negated any cost savings of the cheaper storage). So the compromise we arrived at was retrieving a sample of our data and checking fixity every day, aiming to fixity check about 10% of the repository content every year. The downside is that we don't check all of our cloud data on any routine basis. But on the upside, it's much cheaper, and we are checking content all the time. So we feel like this would catch widespread corruption, or any problem with the retrieval/fixity-checking code that would prevent us from retrieving objects to restore them.
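The arithmetic behind a daily sample that adds up to roughly 10% of a repository per year can be sketched as follows. This is not Princeton's actual code; the function names, the use of integer percentages, and the choice of uniform random sampling are assumptions made for illustration.

```python
import random
from typing import List


def daily_sample_size(total_objects: int, annual_percent: int = 10, days_per_year: int = 365) -> int:
    """Number of objects to check per day so the yearly total reaches `annual_percent`."""
    if total_objects < 0 or not 0 <= annual_percent <= 100 or days_per_year <= 0:
        raise ValueError("invalid arguments")
    numerator = total_objects * annual_percent
    denominator = 100 * days_per_year
    # Integer ceiling division, so small repositories still check at least one object.
    return -(-numerator // denominator)


def pick_daily_sample(keys: List[str], sample_size: int, rng: random.Random) -> List[str]:
    """Draw a uniform random sample of object keys without replacement."""
    return rng.sample(keys, min(sample_size, len(keys)))
```

For example, a repository of 365,000 objects would need 100 checks a day. Since each day's sample is drawn independently, some objects will be picked more than once in a year, so the number of distinct objects checked will be somewhat below 10% unless previously checked keys are excluded.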

-Esmé
Esmé Cowles, <esco...@princeton.edu>
Asst. Director for Library IT
Princeton University Library


Jason Casden

Feb 15, 2021, 1:32:33 PM
to digital-...@googlegroups.com
This is a great question! At UNC at Chapel Hill, we started storing a third/fourth (depending on how you count) copy of our preservation assets on Glacier last year. We are using the open-source Longleaf preservation utility[0][1] that we developed to transfer files from multiple management systems into Glacier and our other storage systems. We are still treating this as a somewhat experimental project and haven't sufficiently answered the fixity/data integrity question yet. I can say something about our thinking so far, though.

As others have mentioned, recurring fixity checks on cold cloud storage can be quite expensive. Since we are currently limiting our use of S3 to a kind of off-site emergency backup, we chose S3 Glacier (through a campus cloud services pilot) to reduce storage costs in exchange for (unlikely) expensive read operations. Since we regularly check fixity of our on-site copies of files and use the S3 utilities to store initial checksums alongside copies in Glacier, we think that risks to data integrity are fairly well managed right now. We're definitely curious about the Serverless Fixity for Digital Preservation Compliance service and, before we settled on Glacier, also considered deploying some kind of fixity checker to an AWS server.

We've had some conversations with AWS reps about data integrity concerns, but the S3 internal integrity check processes are still pretty opaque. One AWS rep recommended that we look into their compliance programs[2], certificates, and attestations [3] to see if we could find alternate assurances to our fixity checks. We haven't made it very far down this path, yet (what exactly does "Unlike traditional systems, that can require laborious data verification and manual repair, Amazon S3 Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing" [4] mean?). I am very interested in exploring whether we can select different data integrity checking mechanisms for different types of storage platforms. Could we step back a bit from requiring specific methods of checking fixity to identify how different integrity assurance methods interact with other storage qualities (e.g. delayed writes, mutability, versioning, relationship to other storage platforms)? How can we provide acceptable integrity assurance for particular data on a specific kind of platform, relative to the entire collection of copies (e.g. how do our practices change if the cloud vendor is only holding a tertiary copy)? The sampling that Esmé described seems like a promising approach. We've also discussed whether faster, non-cryptographic hashes (e.g. xxHash [5]) could be used to provide adequate integrity checks much more quickly.
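One way to experiment with the fast-hash idea is to compute a cheap non-cryptographic checksum and a cryptographic digest in the same pass over the data, then decide later which one to rely on for routine checks. The sketch below uses CRC-32 from the Python standard library purely as a stand-in for a non-cryptographic hash; xxHash itself requires a third-party package and is not used here. The function name and chunk size are assumptions.

```python
import hashlib
import zlib
from typing import BinaryIO, Tuple


def fast_and_strong_digests(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Tuple[str, str]:
    """Return (crc32_hex, sha256_hex) computed in a single read of the stream."""
    crc = 0
    sha = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        crc = zlib.crc32(chunk, crc)  # running CRC-32, carried across chunks
        sha.update(chunk)
    return f"{crc:08x}", sha.hexdigest()
```

A non-cryptographic checksum can detect accidental corruption but offers no protection against deliberate tampering, which is the main trade-off to weigh against the speed gain.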

[0] https://unc-libraries.github.io/longleaf-preservation/
[1] https://lecture2go.uni-hamburg.de/l2go/-/get/v/24793
[2] https://aws.amazon.com/compliance/resources/
[3] https://aws.amazon.com/compliance/programs/
[4] https://aws.amazon.com/glacier/features/
[5] https://cyan4973.github.io/xxHash/

Thanks to everyone for sharing your work!

Jason

Jason Casden | he/him/his
Head, Software Development
The University of North Carolina at Chapel Hill University Libraries

Sarah Gentile

Feb 16, 2021, 5:13:22 PM
to Digital Curation
Hello all,

Thanks for sharing this detailed information. I just want to give a plug for the NDSA Infrastructure Interest Group, held monthly. 

I will let their About page explain the purpose of the group, but note that their next meeting will be held Monday, Feb 22nd at 3pm. They will be discussing fixity in the cloud in depth.

Many thanks,

Sarah Gentile - she/her/hers
Assistant Digital Preservation Specialist
The David Booth Conservation Department

The Museum of Modern Art
11 West 53 Street
New York, NY 10019
MoMA is now open with new hours and safety protocols. We look forward to seeing you at the Museum. For more information, please visit our website at moma.org

Alex Chan

Feb 16, 2021, 5:13:26 PM
to Digital Curation
Hi Elise!

Wellcome Collection stores ~61TB of digital collections in the cloud, replicated across AWS and Azure. (A warm copy in S3 Standard-IA, a cold copy in S3 Glacier Deep Archive, another cold copy in Azure Blob Storage in a different geographic region.) I've written a bunch of blog posts about it [1–5]; in particular [1] and [5] are relevant to cloud storage and verification of files therein.

Some excerpts/notes related to your questions below.

Everything we store is packed as BagIt bags, and we use SHA-256 checksums in the manifest. These bags are created by workflow tools – e.g. Archivematica or Goobi – so they're independent of any cloud storage. We can track these checksums from the moment a file is created.

We verify data whenever we copy it, but not at rest. Copying is an opportunity to introduce errors (e.g. we copy the wrong file), but I trust data to be safe at rest. So we verify a bag when it's first ingested, and every time we replicate it to a new location, but we're not running any continuous checking. We trust AWS and Azure to keep our files safe, but that's only useful if we write the correct files! We compare using the SHA-256 checksums, and we run the verification from inside AWS to avoid paying for egress data transfer.

[Sidebar: how do we do this with Glacier? When you write to Glacier, you have two options: (1) write to Glacier immediately (2) write to standard storage, with a rule that moves it to Glacier after N days. We use option (2), so we can retrieve the files after writing and check they were written correctly.]
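Option (2) in the sidebar corresponds to an S3 lifecycle rule. As a hedged sketch, the helper below builds the kind of configuration dictionary that boto3's `put_bucket_lifecycle_configuration` accepts. The helper name, the rule ID format, the 30-day value, and the bucket name in the comment are all placeholders; the thread does not say which values Wellcome uses.

```python
def glacier_transition_rule(days: int, prefix: str = "", storage_class: str = "GLACIER") -> dict:
    """Build a lifecycle configuration that moves objects to a colder class after `days`.

    `storage_class` could also be "DEEP_ARCHIVE" for S3 Glacier Deep Archive.
    """
    if days < 0:
        raise ValueError("days must be zero or positive")
    return {
        "Rules": [
            {
                "ID": f"transition-to-{storage_class.lower()}-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "Transitions": [{"Days": days, "StorageClass": storage_class}],
            }
        ]
    }


# With boto3 installed and credentials configured, the dictionary could be
# applied like this (the bucket name is a placeholder):
#
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",
#       LifecycleConfiguration=glacier_transition_rule(days=30),
#   )
```

The delay before the transition is what leaves a window in which the freshly written object can still be read back cheaply and verified.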

The externally-created SHA-256 checksums allow us to check our files were copied to the cloud correctly, and will allow us to check they're still correct in the far future when we move to something new. (Anyone fancy “nebula” computing?)
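Checking a bag against its externally created checksums can be sketched in a few lines. The example below reads `manifest-sha256.txt` from a bag directory and reports every payload file that is missing or whose SHA-256 digest differs. It is a simplified illustration, not Wellcome's storage service: the function name is hypothetical, and it ignores tag manifests, percent-encoded paths, and payload files that are absent from the manifest, all of which a real BagIt validator would handle.

```python
import hashlib
from pathlib import Path
from typing import List


def verify_bag_manifest(bag_dir: Path) -> List[str]:
    """Return manifest paths that are missing or fail their SHA-256 check."""
    failures = []
    manifest = Path(bag_dir) / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <path relative to the bag root>".
        expected, relative_path = line.split(maxsplit=1)
        target = Path(bag_dir) / relative_path
        if not target.is_file():
            failures.append(relative_path)
            continue
        digest = hashlib.sha256()
        with target.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            failures.append(relative_path)
    return failures
```

An empty list means every file named in the manifest is present and matches; running the same check after each copy is what ties the cloud replica back to the checksum created at the start of the workflow.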

For additional protection, you can look at S3 Object Versioning, S3 Object Lock, and S3 MFA Delete. Respectively:

* S3 Object Versioning stores every version of an object that gets stored, so if a file is overwritten accidentally, you can roll back to the correct version. This includes rolling back to deleted versions.
* S3 Object Lock allows you to enforce a Write-Once, Read-Many (WORM) model on your objects. This is designed for things like SEC compliance, if you need to guarantee that files won't be modified for N years.
* S3 MFA Delete means that somebody can't delete an object without a hardware MFA device.

Those three features provide an additional level of reassurance "yes, this object really is going to stay safe for a long time".

I don't think it's worth running continuous verification of objects in S3/Azure Blob Storage, and especially not Glacier/Archive tiers – the continuous verification of objects is part of the service you're paying for, and Amazon and Microsoft have thrown a lot of smart people at this problem. They're both certified for compliance with a variety of international, independent standards for information security, and the likelihood of me finding an issue that they missed is so small that it's not worth my time.

Corruption will happen in the underlying storage (thanks, physics), but part of the service you pay for is that the providers store multiple copies for you, and repair corruption without you having to worry about it. Both providers store huge volumes of data (think petabytes, not terabytes) for a lot of different use cases, and I've never heard of a single object that was corrupted at rest. If this did happen, that's a Cloud Armageddon event.

Costs-wise, we're spending roughly $2–2.5k a month in our AWS account, which includes the storage for three replicas (~$1k/month), the compute for our storage service, and a handful of other things (e.g. staging resources) that we use in the same account.

To keep costs manageable, make sure you look at different S3 storage tiers [3]. Glacier is the one everyone remembers, but there are others – including Standard-IA and Deep Archive. The cost of cloud storage comes down to three things:

1) How much stuff do you have? Storing more stuff costs more money.
2) How much redundancy do you want? More redundancy/protection costs more money.
3) How quickly do you need to get stuff out? If you need to be able to retrieve things immediately, it’s quite expensive. If you’re willing to wait, your files can be pushed into cheaper storage that’s slower to access.

Obviously you're not going to compromise on (1), but you might be able to find cheaper storage based on (2). We keep the “warm” copy of our data in the Standard-IA tier, which is about $600/mo cheaper than the Standard tier for our volume of data.

Hope that's useful!

Cheers,
~ Alex (they/she)
