Hi Adam,
A follow-up on DSpace + S3: last week, the code adding an S3 storage layer was merged into DSpace's master codebase, for the upcoming DSpace 6 release.
S3 Pull Request on GitHub
DSpace 6.x Storage Layer documentation
Here are the changes to the S3 code since your last email.
For running checksums, DSpace can now ask the S3 service to compute the object's checksum remotely on S3, instead of having to GET each file, read through the bits, and compute the hash locally. Previously, a 100GB repository meant downloading 100GB of assets just to run the checksum checker; with S3 computing the hash, it's just a request/response with metadata. (Normal filesystem-based assetstores still have to read all content to do the checksum.)
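For simple (non-multipart) uploads, S3's ETag is the object's MD5 digest, so a checker can compare a stored digest against the ETag from a metadata request without downloading the bits. Here's a minimal sketch of that comparison; the hard-coded ETag stands in for a real HEAD response, and `md5Hex` is an illustrative helper, not DSpace code:

```java
import java.security.MessageDigest;

public class ChecksumCheck {
    // Compute an MD5 hex digest locally -- the same value S3 reports as
    // the ETag for simple (non-multipart) uploads -- so the two can be
    // compared without re-downloading the object.
    static String md5Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] content = "hello".getBytes("UTF-8");
        String local = md5Hex(content);
        // ETag as it might come back from a HEAD request (S3 quotes it)
        String etag = "\"5d41402abc4b2a76b9719d911017c592\"";
        System.out.println(local.equals(etag.replace("\"", "")) ? "MATCH" : "MISMATCH");
    }
}
```

Note this shortcut only holds for objects uploaded in a single PUT; multipart uploads produce a composite ETag that is not a plain MD5.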
I've also added a bitstore migrate command that lets your DSpace site move all of your assets from one bitstore (assetstore) to another, e.g. from a local disk assetstore to an S3 bucket such as s3://dspace-nist-prod. The transfer operation is fairly robust: I've disconnected the network mid-transfer, closed the lid on my MBP, and sent an exit signal to the operation, and it all recovered just fine. Since this is new code for an upcoming version of DSpace, I'd encourage early adopters, and especially large ones, to test it out and share any issues you hit, so that things can be fixed early in the cycle.
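As a rough illustration of what a migration pass does, here's a hypothetical sketch (in-memory maps stand in for the real assetstore and S3 backends; this is not the actual DSpace command's code): copy each asset to the destination store, then update the record of which store holds it.

```java
import java.util.HashMap;
import java.util.Map;

public class BitstoreMigrateSketch {
    // Hypothetical migration loop: copy every asset from the source
    // store to the destination store, point each bitstream record at
    // the new store number, then clear the source.
    static int migrate(Map<String, byte[]> source, Map<String, byte[]> dest,
                       Map<String, Integer> storeOf, int destStoreNumber) {
        int moved = 0;
        for (Map.Entry<String, byte[]> e : source.entrySet()) {
            dest.put(e.getKey(), e.getValue());       // copy the bits
            storeOf.put(e.getKey(), destStoreNumber); // update the record
            moved++;
        }
        source.clear();                               // remove the originals
        return moved;
    }

    public static void main(String[] args) {
        Map<String, byte[]> local = new HashMap<>();
        local.put("internal-id-1", new byte[]{1, 2, 3});
        Map<String, byte[]> s3 = new HashMap<>();
        Map<String, Integer> storeOf = new HashMap<>();
        System.out.println(migrate(local, s3, storeOf, 1) + " bitstream(s) migrated");
    }
}
```

The real command is restartable because the store-number update is per-bitstream: anything already moved stays moved, and a rerun only picks up what's left.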
A known issue at the moment is that DSpace's S3 code only supports a maximum single-file size of 5GB. If larger files matter to your site, feel free to submit a contribution to raise the limit: S3's theoretical maximum for a single object is 5 terabytes, but the code we use to interface with it would need to transfer content in chunks (multipart upload).
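To give a feel for the chunking involved, here's the part arithmetic behind a multipart-style upload. The 100MB part size is an arbitrary illustrative choice, and `partCount` is a hypothetical helper, not DSpace code; S3 itself caps an upload at 10,000 parts, which is why the part size has to grow with the object:

```java
public class MultipartMath {
    static final long PART_SIZE = 100L * 1024 * 1024; // e.g. 100 MB parts

    // How many parts a multipart upload of this object would need,
    // using ceiling division so a partial final chunk still counts.
    static long partCount(long objectSize) {
        return (objectSize + PART_SIZE - 1) / PART_SIZE;
    }

    public static void main(String[] args) {
        long size = 50L * 1024 * 1024 * 1024; // a 50 GB file
        System.out.println(partCount(size) + " parts"); // 512 parts
    }
}
```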
Performance is great: we have several sites running an earlier version of the S3 code in production, and we haven't run into anything concerning.
We can also foresee future optimizations. For example, when a user requests an object, instead of S3 sending the binary to your DSpace application server, then through your web server, then on to the end user, you could hand the user a one-time signed URL to fetch the file directly from S3. We might then run into issues with crawlers indexing our content: if the URL for a piece of content keeps changing, they might have trouble indexing it. Perhaps an HTTP redirect (or 305 Use Proxy) would address that; honestly I'm not sure, but it seems solvable, and it would eke out a bit more performance and reduce load on your servers.
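As a toy illustration of the signed-URL idea (this is not S3's actual SigV4 presigning, which the AWS SDK handles for you; the URL scheme and names here are invented for illustration): embed an expiry timestamp and an HMAC over the path plus expiry, so the link only validates until the deadline.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class PresignSketch {
    // Build a URL carrying an expiry and an HMAC-SHA256 signature over
    // (path, expiry). The server (or S3, in the real scheme) recomputes
    // the HMAC and rejects the request if it differs or the time passed.
    static String sign(String path, long expiresEpoch, String secret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret.getBytes("UTF-8"), "HmacSHA256"));
        byte[] sig = mac.doFinal((path + "\n" + expiresEpoch).getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : sig) hex.append(String.format("%02x", b & 0xff));
        return path + "?expires=" + expiresEpoch + "&sig=" + hex;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sign("/bitstream/internal-id-1", 1700000000L, "demo-secret"));
    }
}
```

Because the signature covers the expiry, every link a crawler sees is different and short-lived, which is exactly the indexing concern above.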
Also, adding S3 as a storage layer implementation meant cleaning up and refactoring the BitStore interface, so if you have some other cloud storage service in mind, another implementation could fit in easily.
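To sketch why new backends slot in easily (the interface below is deliberately simplified and illustrative, not the exact DSpace 6 signatures): a backend only has to supply a few stream-oriented operations. Here an in-memory map stands in for a cloud SDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

public class InMemoryBitStore {
    // Hypothetical, simplified shape of a pluggable bitstore.
    interface SimpleBitStore {
        void put(String internalId, InputStream in) throws Exception;
        InputStream get(String internalId) throws Exception;
        void remove(String internalId) throws Exception;
    }

    // Trivial backend: a map of id -> bytes, standing in for S3 or disk.
    static class MapStore implements SimpleBitStore {
        private final Map<String, byte[]> blobs = new HashMap<>();
        public void put(String id, InputStream in) throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            blobs.put(id, out.toByteArray());
        }
        public InputStream get(String id) {
            return new ByteArrayInputStream(blobs.get(id));
        }
        public void remove(String id) {
            blobs.remove(id);
        }
    }

    public static void main(String[] args) throws Exception {
        SimpleBitStore store = new MapStore();
        store.put("id-1", new ByteArrayInputStream("bits".getBytes("UTF-8")));
        System.out.println((char) store.get("id-1").read()); // prints 'b'
    }
}
```

The point of the refactor is that the rest of DSpace talks to this interface and never cares whether the bytes live on disk, in S3, or somewhere else.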