OSF's newest add-on: Amazon S3


Chris Seto

Feb 26, 2014, 12:16:36 PM
to openscienc...@googlegroups.com
Greetings -

The Open Science Framework (http://osf.io/) has just introduced a new add-on: Amazon's Simple Storage Service, otherwise known as S3.  This is the first of many data repositories that will be integrated with OSF.

What is S3? 

S3 is a simple, cheap, yet robust cloud storage service offered by Amazon. It can be used to store anything from text notes to multi-gigabyte files, and it can be particularly useful for researchers who work with big data and need a cost-efficient storage and archiving solution.


Why have I not heard of this service before?

S3 is aimed mostly at web developers, as using it has required a bit of technical know-how. However, many popular services, such as Dropbox and DuraSpace, use S3 as the back end for their storage services.
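For a sense of the "tech know-how" involved: before tools like the OSF add-on, talking to S3's REST interface meant signing each request yourself. A rough sketch, assuming the HMAC-SHA1 Signature Version 2 scheme S3 supported at the time (the credentials below are fake):

```python
import base64
import hashlib
import hmac

def sign_s3_request_v2(secret_key, verb, resource, date,
                       content_md5="", content_type="", amz_headers=""):
    """Build an AWS Signature Version 2 for an S3 REST request."""
    # StringToSign = verb, MD5, content type, date, canonicalized
    # amz headers (if any) plus the canonicalized resource path.
    string_to_sign = "\n".join(
        [verb, content_md5, content_type, date, amz_headers + resource])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1)
    return base64.b64encode(digest.digest()).decode()

# Fake credentials, for illustration only.
access_key = "AKIDEXAMPLE"
secret_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
signature = sign_s3_request_v2(
    secret_key, "GET", "/mybucket/results.nc",
    "Thu, 27 Feb 2014 10:00:00 GMT")
auth_header = "AWS %s:%s" % (access_key, signature)
```

Every raw request needed a header like this; the add-on hides all of it behind the web interface.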


How does the Open Science Framework help me make use of S3?

The OSF's S3 add-on makes Amazon's service easy to use: you can see what files you have stored, view or download them, and upload directly to your S3 account, all through a user-friendly drag-and-drop interface.


That still did not answer my question.

Giant files no longer have to take up local space, and you won't have to worry about making backups to ensure data safety. Peers you add as project collaborators get easy access to your data and can contribute to it directly. Further, with the Amazon S3 add-on, you can integrate data management with the rest of your workflow: OSF tools interact with Amazon S3 and with all of the other services connected to OSF. If you are a GitHub user, for example, adding both GitHub and S3 to a project makes all three services work together to support your research.


How can I get started?

Get an Amazon S3 account. Create and record your Access Key ID and Secret Access Key from here. Then, in the OSF, go to the settings page and select Amazon S3. Add the service by entering the credentials. S3 will be integrated with your project! See an example here.


Enjoy!

Chris Seto

Junior Developer
Center for Open Science

Tom Roche

Feb 27, 2014, 10:15:37 AM
to openscienc...@googlegroups.com

https://groups.google.com/d/msg/openscienceframework/2nnzi5nqYTA/8t6Cmruk5cYJ
> [Chris Seto Wed, 26 Feb 2014 09:16:36 -0800 (PST)]
> S3 is a simple and cheap

... depending on your needs. S3's free storage is 5 GB, which is nice, but that's the size of five days' land->air emissions for my North America simulation (of 2008). Inputs for the whole thing ~= 4.5 TB (and that's with the 24-layer meteorology; add another 1 TB for 35-layer met), for which Amazon charges ~$200/mo (IIUC).
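For scale, a back-of-the-envelope check of that figure (assuming a flat rate of about $0.043 per GB-month, a simplification of S3's tiered pricing at the time; actual tiers differ):

```python
def monthly_cost_usd(terabytes, rate_per_gb_month=0.043):
    """Rough S3 storage cost: convert TB to GB at an assumed flat rate."""
    return terabytes * 1024 * rate_per_gb_month

print(round(monthly_cost_usd(4.5)))  # ballpark of the ~$200/mo figure above
```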

> This is the first of many data repositories that will be integrated with OSF.

Bring them on!

FWIW, Tom Roche <Tom_...@pobox.com>

Andrew Sallans

Feb 27, 2014, 10:21:36 AM
to openscienc...@googlegroups.com, Tom Roche
Thanks for the feedback, Tom.  Out of curiosity, what are the main services that you use for data storage?

Andrew Sallans

Tom Roche

Feb 27, 2014, 10:48:23 AM
to openscienc...@googlegroups.com

https://groups.google.com/d/msg/openscienceframework/2nnzi5nqYTA/0_LAdCEvY_QJ
>> [Tom Roche Thu, 27 Feb 2014 10:15:37 -0500]
>> The S3 free storage is 5GB, which is nice, but [inputs for my simulation] ~= 4.5 TB
>> (and that's with the 24-layer meteorology--add another 1 TB for 35-layer met)
>> for which Amazon charges ~200 $/mo (IIUC).

https://groups.google.com/d/msg/openscienceframework/2nnzi5nqYTA/3b3d7W7L_KwJ
> [Andrew Sallans Thu, 27 Feb 2014 07:21:36 -0800 (PST)]
> what are the main services that you use for data storage?

Currently, big disks @ EPA (aka "the boss"). Very un-open (can't even SSH *out*) but free to me.

Tire-kicking-ly I put some data on KNB, and may put more there, but Morpho (at least, the previous version) was fairly time-consuming (manually inputting metadata, not to mention data transfer), so I'm looking @ other options (e.g., Dataverse), or just procrastinating :-(

FWIW, Tom Roche <Tom_...@pobox.com>

Andrew Sallans

Feb 27, 2014, 11:07:51 AM
to openscienc...@googlegroups.com, Tom Roche
Thanks for these additional details, Tom.  While closed-off storage options are obviously tough, we have been looking at some of the other options you mention.  We're well aware of KNB, Morpho, and more earth science oriented tools via relationships with DataONE (http://dataone.org) and will aim for some connections there in the future. If you know any students who would like to help make that happen, you might encourage them to apply for the DataONE internship to work with us on such issues this summer (#9 on this list, http://www.dataone.org/internships).  

We have Dataverse on the list as well, and appreciate hearing that it's of interest to you.

Best,
Andrew

Tom Roche

Feb 27, 2014, 12:21:16 PM
to openscienc...@googlegroups.com

https://groups.google.com/d/msg/openscienceframework/2nnzi5nqYTA/f3wohCyIGGEJ
> [Tom Roche Thu, 27 Feb 2014 10:48:23 -0500]
> Morpho (at least, the previous version) was fairly time-consuming (manually inputing metadata, not to mention data transfer)

Dunno if this is already in-plan, but one thing I'd like to see OSF tool up (working with providers to enable it as necessary) is CLI/scriptable data transfer, especially metadata transfer, to repositories. When attempting to repositorize hundreds (daily for a year, plus spinup) of often multi-GB netCDF files:

1. interacting with a GUI or web UI is painful and slow.

2. `tar` seems unattractive, since (I suspect)

* probability of transfer abend grows with {file size, transfer time}, for both uploaders (i.e., me) and downloaders (i.e., collaborators, replicators).

* downloaders will likely want subsets of the data

3. .tar.gz does not help here, since netCDF files are already fairly compact binaries.

Implementation-wise, I'd favor HTTP APIs similar to those already used by Bitbucket and GitHub, but only because the clusters on which I work allow only HTTP and SSL out.
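For what it's worth, the subset and abend concerns above are exactly what ranged, chunked HTTP transfer addresses. A minimal sketch (a hypothetical helper, not an existing OSF or repository API) of splitting a large file into byte ranges that map onto HTTP Range headers or multipart-upload parts:

```python
def byte_ranges(total_size, chunk_size):
    """Yield (start, end) byte offsets, inclusive, per HTTP Range semantics."""
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1
        yield (start, end)

# A 2.5 GB netCDF file split into 1 GiB chunks:
gib = 1024 ** 3
ranges = list(byte_ranges(int(2.5 * gib), gib))
# Each tuple maps to a header like "Range: bytes=start-end";
# a failed transfer resumes from the first incomplete chunk,
# and downloaders can fetch just the subset they need.
```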

Again, this may require work with the repos to provide the necessary plumbing on their side. Along those lines (dunno if this is too off-topic), if anyone has pointers to currently-transfer-scriptable repositories, please pass them along. I have a proposed question about this @ the proposed Open Science Stack Exchange:

http://area51.stackexchange.com/proposals/65426/open-science/

FWIW, Tom Roche <Tom_...@pobox.com>

Philip Durbin

Feb 27, 2014, 1:36:23 PM
to openscienc...@googlegroups.com, dataverse...@googlegroups.com
Hi Tom,

Dataverse provides a scriptable "Data Deposit API" based on the
SWORDv2 protocol. Here are some examples with curl:

http://thedata.harvard.edu/guides/dataverse-api-main.html#data-deposit-api
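For flavor (this mirrors, but is not copied from, those docs): the metadata half of a SWORDv2 deposit is just an Atom entry with dcterms fields, which you can generate in a few lines. The study title and description below are made up:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
DCTERMS = "http://purl.org/dc/terms/"

def atom_entry(title, description):
    """Build a minimal Atom entry with dcterms metadata for a SWORDv2 deposit."""
    ET.register_namespace("", ATOM)
    ET.register_namespace("dcterms", DCTERMS)
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}title" % ATOM).text = title
    ET.SubElement(entry, "{%s}title" % DCTERMS).text = title
    ET.SubElement(entry, "{%s}description" % DCTERMS).text = description
    return ET.tostring(entry, encoding="unicode")

xml_body = atom_entry("Example study", "Simulation inputs, 2008")
# POST this (Content-Type: application/atom+xml;type=entry) to a
# collection IRI to create a study; the data files follow separately.
```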

I wrote the API implementation so any bugs are my fault. :)

Our primary use case when developing the API was integration between
Open Journal Systems (OJS) and Dataverse:
http://projects.iq.harvard.edu/ojs-dvn

COS generously hosted me and my boss back in September (hi, everyone!)
and we're working on an integration between OSF and Dataverse that
makes use of the API:
https://github.com/CenterForOpenScience/openscienceframework.org/issues/112

Actually, COS is even helping to develop a Python library to talk to
the Dataverse API (which I really, really appreciate, not being much
of a Pythonista): https://github.com/IQSS/dvn-client-python

But enough about Dataverse. Lots of other repositories support SWORD.
There's an official list at
http://swordapp.org/sword-v2/sword-v2-implementations/ and my
(slightly longer) list at
https://github.com/dvn/dvn-devguide-src/blob/master/features/api/data-deposit.mdwn#sword-v2-server-implementations

But enough about SWORD. Are there other protocols for this? Let me
know! Because some of the stuff we want to do is not covered by the
spec: http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html

Hope this helps,

Phil

p.s. Cc'ing the Dataverse community list on this.



--
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin

Ruben Arslan

Feb 27, 2014, 2:20:02 PM
to openscienc...@googlegroups.com
Hi,

While you're at it: is there also a developed and used protocol for record-wise (i.e., live) deposition and retrieval of data and metadata? I'd consider implementing it to deposit data in my survey organiser, formr.org. I'll soon write more about why I would prefer this to a local DB in some cases.

Best wishes,

Ruben