AWS Serverless Fixity

Tallman, Nathan

Sep 1, 2020, 1:13:30 PM
to lib.di...@lists.btaa.org, community, ma_tec...@metaarchive.org

Good afternoon,

This was demoed at the 2019 AWS Digital Preservation Summit, but it wasn’t public then. It looks like they have full documentation for it now, so I’m passing this along for any AWS users.

I haven’t fully dug into these docs yet, but I remember there being a gotcha with large files. The Lambda functions have time limits, so for very large files there might not be enough time to compute checksums. Maybe they’ve overcome that now. Also, this is not the same as AWS’s own internal fixity processes; it’s essentially an add-on so digipres folks can get their MD5 or SHA1 checksums. I don’t think SHA2 is supported.

https://docs.aws.amazon.com/solutions/latest/serverless-fixity-for-digital-preservation-compliance/welcome.html

Best health,

Nathan

-- 

Nathan Tallman

Digital Preservation Librarian

Penn State University Libraries

(814) 865-0860

nt...@psu.edu

Andrew Diamond

Sep 1, 2020, 1:52:26 PM
to Tallman, Nathan, lib.di...@lists.btaa.org, community, ma_tec...@metaarchive.org
Interesting. It looks like they have solved the Lambda limitation that previously made it impossible to calculate fixity on large files.

The problem had been that AWS Lambda functions, which are used to implement "serverless" tasks, had a time limit of 5 minutes. If they didn't complete their work in 5 minutes, they failed. This is no problem when calculating fixity on smaller files. However, it's impossible to calculate checksums on very large files in five minutes. You can't even stream a 500 GB file from S3 to the Lambda function in that short a time.
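To put a rough number on the 500 GB example, here is a back-of-the-envelope calculation. It is only a sketch: it assumes decimal gigabytes, a 5 minute limit, and that the whole window is spent streaming with no time for hashing. The function name is made up for illustration.

```python
def required_throughput(size_bytes, time_limit_seconds):
    """Return (bytes per second, gigabits per second) needed to move
    size_bytes within time_limit_seconds."""
    bytes_per_second = size_bytes / time_limit_seconds
    gigabits_per_second = bytes_per_second * 8 / 1e9
    return bytes_per_second, gigabits_per_second


# 500 GB (decimal) in 5 minutes, ignoring any time spent hashing.
bps, gbit = required_throughput(500 * 10**9, 5 * 60)
print(f"{bps / 1e9:.2f} GB/s, {gbit:.1f} Gbit/s")  # 1.67 GB/s, 13.3 Gbit/s
```

A single function would need to sustain that rate for the entire window just to read the file once, before any checksum work.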

AWS's solution is described in the "Compute Checksum" section of their Solution Components page. Basically, they calculate checksums on 20 GB of data at a time. If the file has more data, a new Lambda function picks up where the old one left off. That's a brilliantly simple solution, allowing them to run fixity checks on multi-terabyte files without any single worker ever violating the five-minute timeout.
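The control flow of that hand-off can be sketched roughly as below. This is not AWS's code. It simulates each "invocation" as a function call that hashes at most a fixed byte budget and then stops, and the names (`process_slice`, `chunked_checksum`, `budget_bytes`) are invented for illustration. One big assumption: the hash object is simply passed along in memory. Python's hashlib offers no documented way to serialize a partial hash, so a real hand-off between separate functions would need a hash implementation whose intermediate state can be saved and restored.

```python
import hashlib
import io


def process_slice(stream, hasher, budget_bytes, read_size=64 * 1024):
    """One simulated invocation: hash at most budget_bytes from stream.

    Returns (bytes_consumed, finished). finished is True only when the
    end of the stream was reached during this slice.
    """
    consumed = 0
    while consumed < budget_bytes:
        block = stream.read(min(read_size, budget_bytes - consumed))
        if not block:
            return consumed, True  # end of file reached
        hasher.update(block)
        consumed += len(block)
    return consumed, False  # budget spent; the next invocation continues


def chunked_checksum(stream, algorithm="md5", budget_bytes=20 * 1024**3):
    """Drive successive slices until the stream is exhausted.

    Returns (hex digest, number of simulated invocations).
    """
    hasher = hashlib.new(algorithm)
    invocations = 0
    finished = False
    while not finished:
        _, finished = process_slice(stream, hasher, budget_bytes)
        invocations += 1
    return hasher.hexdigest(), invocations


if __name__ == "__main__":
    data = b"x" * 1000
    # A tiny budget stands in for the 20 GB slice size.
    digest, count = chunked_checksum(io.BytesIO(data), "md5", budget_bytes=300)
    print(digest == hashlib.md5(data).hexdigest(), count)
```

Because the hash is updated incrementally, the final digest is the same as hashing the file in one pass. The slice size only controls how much work each worker does.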

The checksum algorithms are currently limited to md5 and sha1, as Nathan pointed out. I'm sure they can add others without much trouble.

I figured AWS and other storage providers would get to this point someday. The closer they are to having their own native implementations of services offered by specialty providers like APTrust and DuraCloud, the easier it will be for them to recruit new customers directly and disintermediate the specialty provider.

On a technical level, it forces DDPs like us to periodically reevaluate whether our own implementations are still useful, reliable, and cost-effective. Services like AWS's new fixity checker have their upsides (easy deployments, low maintenance, no infrastructure requirements) and their downsides (very unpredictable costs, reliance on opaque, closed-source third-party technologies). They also have some complexities that might not be apparent from Amazon's technical overview. For example, you still have to keep a database of all your checksums somewhere, and you still have to run some system to record the results of Amazon's fixity checks.
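As a rough idea of what that extra bookkeeping involves, here is a minimal sketch using SQLite. Everything in it is hypothetical: the table names, columns, and function names are made up for illustration, and it says nothing about how the AWS solution reports its results. It only shows the two pieces mentioned above: a store of expected checksums and a log of each check's outcome.

```python
import sqlite3
from datetime import datetime, timezone


def open_registry(path=":memory:"):
    """Create (or open) a small checksum registry. Schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS expected "
        "(object_key TEXT PRIMARY KEY, algorithm TEXT, checksum TEXT)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS fixity_events "
        "(object_key TEXT, checked_at TEXT, computed TEXT, outcome TEXT)"
    )
    return db


def register(db, object_key, algorithm, checksum):
    """Store the checksum we expect an object to have."""
    db.execute(
        "INSERT OR REPLACE INTO expected VALUES (?, ?, ?)",
        (object_key, algorithm, checksum),
    )
    db.commit()


def record_check(db, object_key, computed):
    """Compare a freshly computed checksum with the stored one and log it."""
    row = db.execute(
        "SELECT checksum FROM expected WHERE object_key = ?", (object_key,)
    ).fetchone()
    if row is None:
        outcome = "unknown_object"
    elif row[0] == computed:
        outcome = "match"
    else:
        outcome = "mismatch"
    checked_at = datetime.now(timezone.utc).isoformat()
    db.execute(
        "INSERT INTO fixity_events VALUES (?, ?, ?, ?)",
        (object_key, checked_at, computed, outcome),
    )
    db.commit()
    return outcome
```

Whatever computes the checksum, something like `record_check` still has to run somewhere and be maintained, which is the hidden cost the paragraph above points to.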

If people are interested, we can discuss this in a future tech call, partner meeting, NDSA meeting, or wherever makes sense. To me, the existential questions this new service raises are more interesting than the technical questions. What value do DDPs provide over basic storage services like S3 and Glacier as they accumulate new add-ons like this? Can the preservation community explain the value of DDPs in this evolving technical landscape to the people who decide their budgets?

Andrew Diamond
Lead Developer, APTrust



