DSpace and S3


Adam Morey

Sep 3, 2015, 4:09:17 PM
to DSpace Community
I'm trying to learn about incorporating AWS S3 with DSpace and would like to contact anyone using this in their current configuration. I've read the Longsight blog post on contributing an S3 plugin to DSpace 6 (http://duraspace.org/articles/2583). I've also seen a few links to folks using DuraSpace with JungleDisk; has anyone tried that with DSpace?

We're contemplating migrating our repository (materialsdata.nist.gov) to Amazon AWS, but the costs only scale well if we store the data in S3.

Any help would be appreciated.  

Is this a "not ready for production" configuration?
What are the limitations of using S3 with DSpace? File upload size restrictions? Performance? Indexing?

Adam

Peter Dietz

Sep 3, 2015, 4:42:28 PM
to Adam Morey, DSpace Community
Hi Adam,

S3 is going to be a great asset for DSpace, easing the march to the cloud: effectively unlimited, painless storage. We've launched new production sites (starting fresh with zero content) using the S3 storage, and haven't run into any issues.

The only gotchas at this time are:
- There's no migration script yet to move your existing assetstore to S3. So we have launched new production sites with S3 storage, but haven't yet migrated our existing sites. We (Longsight) are developing a migration script; if anyone wants to contribute here, feel free to jump in.

- If you are hosting DSpace outside of EC2, you'll notice network latency as data has to go in and out of AWS. (I noticed this when I ran a batch import from my laptop: writing to local disk is very fast, but sending everything over the wire, through a VPN, etc. was slower.)

It would be nice to do a proper benchmark comparison between DSpace architectures: 1M writes / 1M reads with S3 vs. EBS (a traditional file-based assetstore), to see if there is any noticeable difference.

I haven't yet run into an issue, but I've been wondering if media-filter and indexing make too many calls to the assetstore.

We developed this first on the 5.x branch, then ported it to the 6.x branch.

Another random thought: S3 allows you to write "metadata" when you write a file, so we could store additional information about an asset (filename, filetype, parent item, ...) when we send it to S3.
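As a sketch of what that could look like with the AWS SDK for Java: each entry in a user-metadata map gets attached to the object at PUT time (S3 stores it under an `x-amz-meta-` header prefix). The key names below are illustrative, not part of any existing DSpace schema.

```java
import java.util.HashMap;
import java.util.Map;

// Build the user-defined metadata we could attach to an S3 object at PUT time.
// With the AWS SDK for Java, this map would be copied into an ObjectMetadata
// via addUserMetadata(key, value) before calling putObject; S3 then stores each
// entry under an "x-amz-meta-" prefixed header on the object.
// The key names here are illustrative, not an existing DSpace convention.
public class AssetMetadata {
    public static Map<String, String> forBitstream(String filename,
                                                   String mimeType,
                                                   String parentHandle) {
        Map<String, String> userMeta = new HashMap<>();
        userMeta.put("filename", filename);
        userMeta.put("filetype", mimeType);
        userMeta.put("parent-item", parentHandle);
        return userMeta;
    }
}
```

One caveat worth noting: S3 user metadata can only be replaced by rewriting the whole object, so it suits write-once provenance data better than anything DSpace would update in place.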

Yet another random thought: since S3 offers the cheaper, offline Glacier storage class, perhaps in the future DSpace could mark certain types of content as either infrequently accessed, or stored for archival purposes rather than end-user access. I'm thinking of a site that uploads large video master copies and serves a YouTube version instead of the MPEG, or large datasets. But that would require some type of workflow to deal with the storage lifecycle (8 hours to retrieve, or a policy that moves files to cold storage).
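On the S3 side, the "policy that moves files to cold storage" piece already exists as a bucket lifecycle configuration; the rule below is a minimal sketch that would transition anything under an assumed `masters/` key prefix to Glacier after 90 days (both the prefix and the day count are illustrative, not anything DSpace defines):

```xml
<LifecycleConfiguration>
  <Rule>
    <ID>archive-masters</ID>
    <Prefix>masters/</Prefix>
    <Status>Enabled</Status>
    <Transition>
      <Days>90</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
  </Rule>
</LifecycleConfiguration>
```

The missing half is the DSpace-side workflow: the retrieval delay means a bitstream in Glacier can't be streamed on demand, so DSpace would still need to track which objects are cold and queue restore requests.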

But overall, S3 is a very stable product. The storage interface in DSpace is pretty clean / abstracted / minimal, and we're using the AWS SDK to talk to S3, as opposed to some third-party wrapper.

I hope all of this helps, and feel free to reach out if you need more information or assistance.

________________
Peter Dietz
Longsight
www.longsight.com
pe...@longsight.com
p: 740-599-5005 x809

--
You received this message because you are subscribed to the Google Groups "DSpace Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-communi...@googlegroups.com.
To post to this group, send email to dspace-c...@googlegroups.com.
Visit this group at http://groups.google.com/group/dspace-community.
For more options, visit https://groups.google.com/d/optout.

Peter Dietz

Jan 11, 2016, 4:59:04 PM
to Adam Morey, DSpace Community
Hi Adam,

A followup to DSpace + S3. Last week, the code adding the S3 storage layer was merged into DSpace's master codebase (for the upcoming DSpace 6 release).

Since your last email, here are some changes that have happened to the S3 code. 

For running checksums, DSpace can now have the S3 service compute the checksum of an object remotely, as opposed to having to GET each file, read through the bits, and compute the hash locally. Previously, a 100GB repository would have had to GET 100GB of assets just to run the checksum checker; with S3 computing the hash, it's just a request/response with metadata. (Normal filesystem-based assetstores still have to read all content to compute checksums.)
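Some context on why the remote hash works: for an object uploaded with a single (non-multipart) PUT, S3's ETag header is the hex MD5 of the content, which is the same digest DSpace's checksum checker records per bitstream. A minimal stdlib sketch of the local side of that comparison:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hex MD5 of a byte array, the digest DSpace's checksum checker stores per
// bitstream. For an object uploaded with a single (non-multipart) PUT, S3's
// ETag header is this same hex string, so the checker can compare against a
// HEAD response instead of downloading the bitstream.
public class Md5 {
    public static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff)); // unsigned byte -> 2 hex chars
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The ETag shortcut does break down for multipart uploads, where S3's ETag is a hash-of-hashes rather than a plain MD5, so a checker relying on it has to know how the object was uploaded.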

There is a bitstore migrate command that I've added to allow your DSpace site to move all of your assets from one bitstore (assetstore) to another, e.g. from a local-disk assetstore to the S3 bucket s3://dspace-nist-prod. The transfer operation is fairly robust: I've disconnected the network mid-transfer, closed the lid on my MBP, sent an exit code to the operation, and it all seemed to work just fine. Since this is new code for an upcoming version of DSpace, I would encourage early adopters, and especially large adopters, to test it out and share feedback on anything you encounter, so that things can be fixed early in the cycle.

A known issue at the moment is that DSpace's S3 code only supports a maximum single-file size of 5GB. If this is important to your site, feel free to submit a contribution to increase it. S3's theoretical maximum size for a single object is 5 terabytes; you just need to change the code we use to interface with S3 so that it transfers content in chunks (multipart upload).
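Lifting that 5GB ceiling means switching from a single PUT to S3's multipart upload API (at most 10,000 parts per object, 5TB total). A minimal sketch of the part arithmetic involved; the 100MB part size is just an illustrative choice, and in practice a TransferManager-style helper in the AWS SDK does this chunking for you:

```java
// Number of parts needed to upload `size` bytes in `partSize` chunks via S3's
// multipart upload API. S3's limits: at most 10,000 parts per object, each
// part 5MB-5GB (except the last), and 5TB per object overall. The AWS SDK's
// transfer helpers perform this split automatically; the part size used here
// is an illustrative choice, not a DSpace setting.
public class Multipart {
    public static long partCount(long size, long partSize) {
        return (size + partSize - 1) / partSize; // ceiling division
    }
}
```

For example, a 12GB video master in 100MB parts comes out to 123 parts, comfortably inside the 10,000-part limit.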

Performance is great; we have several sites in production using an earlier version of the S3 code, and haven't run into anything concerning.

We can foresee future optimizations. For example, when a user requests an object, instead of S3 sending the binary to your DSpace application server, then through your web server, then on to the end user, you could instead send the user a one-time signed URL to access the file directly from S3. But then we might run into issues with crawlers indexing our content: if the URL for a piece of content keeps changing, they might have trouble indexing it. I suppose you could throw an HTTP 305 Use Proxy, though honestly I'm not sure; in any case this could probably be solved, to eke out a bit more performance and reduce load on your servers.
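A sketch of the signed-URL idea, assuming the AWS SDK for Java's generatePresignedUrl(bucket, key, expiration) call; the 15-minute window is an arbitrary choice, not anything DSpace prescribes, and only the expiry arithmetic is shown as runnable code:

```java
import java.util.Date;

// Expiry timestamp for a time-limited S3 pre-signed URL. The AWS SDK for Java
// call that would consume it looks roughly like:
//   URL link = s3.generatePresignedUrl(bucketName, key, expiration);
// after which DSpace could answer a bitstream request with a redirect to
// `link` instead of proxying the bytes itself. The 15-minute window used in
// the usage example is an illustrative choice.
public class Presign {
    public static Date expiry(long nowMillis, long minutes) {
        return new Date(nowMillis + minutes * 60_000L);
    }
}
```

Usage would be `Date expiration = Presign.expiry(System.currentTimeMillis(), 15);` followed by the SDK call above; since each generated URL embeds its own signature and expiry, the crawler-indexing concern in the message is real, which is why the redirect would have to stay on a stable DSpace URL.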

Also, adding S3 as a storage-layer implementation meant cleaning up / refactoring the BitStore interface, so if you have some other desired cloud storage service, other implementations can fit in easily.

________________
Peter Dietz
Longsight
www.longsight.com
pe...@longsight.com
p: 740-599-5005 x809

Javier Távara

Jul 10, 2016, 4:28:18 PM
to DSpace Community, adamn...@gmail.com
Hi Peter.
I'm interested in running the application server on a separate machine from the assets (more space, faster updates). I think using S3 (or something like Minio.io) could be a good start.
Do you have more information about S3 and DSpace 5? I think DSpace 6 isn't ready for production yet.
Is there another way besides the S3 protocol?

Thank you very much.

Peter Dietz

Jul 11, 2016, 7:05:58 AM
to Javier Távara, Adam Morey, DSpace Community

Hello Javier,

At Longsight, we developed the S3 storage layer for DSpace, originally for 5.x, and use it for many of our clients in production. Having limitless storage, without having to manage disks, is the best part.

Our 5x fork of DSpace (direct link to S3BitStore) is: https://github.com/LongsightGroup/DSpace/blob/longsight-5.4-x/dspace-api/src/main/java/org/dspace/storage/bitstore/impl/S3BitStore.java

We don't use a third-party implementation of S3; we stick with AWS S3. If you were to use something that is not AWS but is compatible with it, you would need to alter the code to specify a different endpoint.
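For what it's worth, against the v1 AWS SDK that change is roughly a one-liner on the client, shown in the comments below (the localhost:9000 URL is an assumption for a local Minio instance); the runnable part is just a small helper that tidies the endpoint string first:

```java
// Trim a user-supplied endpoint before handing it to an AWS SDK v1 client:
//   AmazonS3Client s3 = new AmazonS3Client(credentials);
//   s3.setEndpoint(Endpoint.normalizeEndpoint("http://localhost:9000/"));
// S3-compatible servers such as Minio typically also need path-style access
// (bucket name in the URL path rather than the hostname) enabled on the
// client. The localhost:9000 endpoint is an assumption for a local Minio
// instance, not anything DSpace configures out of the box.
public class Endpoint {
    public static String normalizeEndpoint(String endpoint) {
        String e = endpoint.trim();
        while (e.endsWith("/")) {
            e = e.substring(0, e.length() - 1); // drop trailing slashes
        }
        return e;
    }
}
```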

Javier Távara

Jul 11, 2016, 11:24:44 AM
to DSpace Community, ja.t...@gmail.com, adamn...@gmail.com
Thank you for your answer, Peter. Changing the endpoint doesn't seem very difficult.

Do you think the S3 layer is fully functional in DSpace 6? I see commented-out code in https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java
I'm sorry, I'm not an expert Java developer and I haven't set up a UNIX environment yet to run tests.

I have to upgrade my DSpace 3 to get the REST API and the S3 layer. I'm thinking of going directly to DSpace 6 now (the stable release will be launched soon anyway).

Regards.