On Tue, Apr 27, 2021 at 10:32:11AM +0300, Alan Orth wrote:
> Dear Philipp,
>
> That sounds painful! I don't know how to register the data directly in S3.
> Perhaps you could just upload it to an S3 bucket and add a link in the
> metadata rather than using some tight DSpace–S3 integration. At our
> institute we upload the data somewhere else and make a metadata-only
> accession to our DSpace repository, for example *The genome of
> Caenorhabditis bovis*:
I know of no way to do what was asked, using stock DSpace. I think
that it would not be difficult to extend or adapt the item importer's
"register" function to register data which reside in S3.
But I think that, once you have a way to get these large datasets
*into* DSpace, your users will face related issues when trying to get
those data *out of* DSpace. I would seriously consider Alan Orth's
advice above.
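For illustration, the metadata-only route Alan describes could be as
small as this sketch (the bucket, key, and the choice of
dc.relation.uri as the link field are my assumptions, not a settled
convention):

    import boto3

    bucket = "my-institute-datasets"   # assumed bucket name
    key = "hpc/dataset-2021.tar"       # assumed object key

    # boto3 switches to multipart transfers automatically for large
    # files; note that a single S3 object tops out at 5 TB, so a
    # 100 TB dataset would have to be split into several objects.
    s3 = boto3.client("s3")
    s3.upload_file("dataset-2021.tar", bucket, key)

    # Record this URL in the item's metadata (e.g. dc.relation.uri)
    # so the DSpace item itself stays metadata-only.
    print(f"https://{bucket}.s3.amazonaws.com/{key}")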
It would be good to think about how the consumers of these datasets
will process them, and specifically how processing facilities prefer
to access large volumes of data. Is S3 the best place from which to
share them? I believe that there are networks designed for and dedicated
to sharing large-scale research data at high speed.
> On Thu, Apr 22, 2021 at 6:36 PM Philipp Rehs <phh...@gmail.com> wrote:
>
> > Hello,
> >
> > we are planning to publish some data that is processed on our HPC
> > system. The datasets are up to 100 TB (already packed) and need to be
> > stored and published with DSpace.
> >
> > I know it is possible to upload the file to the filesystem and assign it
> > to an item, but this does not scale well with huge datasets.
> >
> > Is there any way to upload the data directly to S3 storage and assign
> > the object later, without uploading to DSpace first?
--
Mark H. Wood
Lead Technology Analyst
University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu