On Tue, Apr 27, 2021 at 10:32:11AM +0300, Alan Orth wrote:
> Dear Philipp,
>
> That sounds painful! I don't know how to register the data directly in S3.
> Perhaps you could just upload it to an S3 bucket and add a link in the
> metadata rather than using some tight DSpace–S3 integration. At our
> institute we upload the data somewhere else and make a metadata-only
> accession to our DSpace repository, for example *The genome of
> Caenorhabditis bovis*:
I know of no way to do what was asked, using stock DSpace. I think
that it would not be difficult to extend or adapt the item importer's
"register" function to register data which reside in S3.
But I think that, once you have a way to get these large datasets
*into* DSpace, your users will face related issues when trying to get
those data *out of* DSpace. I would seriously consider Alan Orth's
advice above.
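For illustration, the metadata-only route Alan describes could be as
small as this sketch (the bucket, key, and the choice of
dc.relation.uri as the link field are my assumptions, not a settled
convention):

    import boto3

    bucket = "my-institute-datasets"   # assumed bucket name
    key = "hpc/dataset-2021.tar"       # assumed object key

    # boto3 switches to multipart transfers automatically for large
    # files; note that a single S3 object tops out at 5 TB, so a
    # 100 TB dataset would have to be split into several objects.
    s3 = boto3.client("s3")
    s3.upload_file("dataset-2021.tar", bucket, key)

    # Record this URL in the item's metadata (e.g. dc.relation.uri)
    # so the DSpace item itself stays metadata-only.
    print(f"https://{bucket}.s3.amazonaws.com/{key}")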
It would be good to think about how the consumers of these datasets
will process them, and specifically how processing facilities prefer
to access large volumes of data. Is S3 the best place from which to
share them? I believe that there are networks designed for and dedicated
to sharing large-scale research data at high speed.
> On Thu, Apr 22, 2021 at 6:36 PM Philipp Rehs <phh...@gmail.com> wrote:
>
> > Hello,
> >
> > we are planning to publish some data that is processed on our HPC
> > system. The datasets are up to 100 TB (already packed) and need to be
> > stored and published with DSpace.
> >
> > I know it is possible to upload the file to the filesystem and assign it
> > to an item, but this does not scale well with huge datasets.
> >
> > Is there any way to upload the data directly to S3 storage and assign
> > the object later, without uploading to DSpace first?
--
Mark H. Wood
Lead Technology Analyst
University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu