Uploading Large Files (bigger than MaxFileUploadSize)


Sherry Lake

unread,
Mar 11, 2016, 9:56:53 AM3/11/16
to Dataverse Users Community
How do we (would we) handle datafiles larger than 2GB (our MaxFileUploadSize setting)?

Thanks.
Sherry Lake

Philip Durbin

unread,
Mar 11, 2016, 10:11:25 AM3/11/16
to dataverse...@googlegroups.com
You're talking about the ":MaxFileUploadSizeInBytes" setting I assume. It's documented at http://guides.dataverse.org/en/4.2.4/installation/config.html#maxfileuploadsizeinbytes as follows...

Set MaxFileUploadSizeInBytes to “2147483648”, for example, to limit the size of files uploaded to 2 GB. Notes: - For SWORD, this size is limited by the Java Integer.MAX_VALUE of 2,147,483,647. (see: https://github.com/IQSS/dataverse/issues/2169 ) - If MaxFileUploadSizeInBytes is NOT set, uploads, including SWORD, may be of unlimited size.

curl -X PUT -d 2147483648 http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes

... does this help? I'm not sure what your question is. There's definitely a bug, mentioned above, but I don't want to assume that's what you're talking about.
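As a quick sanity check on the example value above: 2147483648 is exactly 2 GiB in bytes, which is one more than Java's Integer.MAX_VALUE, hence the SWORD limitation mentioned in the docs. This is just arithmetic, no Dataverse calls involved:

```shell
# 2 GiB expressed in bytes, next to Java's Integer.MAX_VALUE (2^31 - 1).
TWO_GIB=$((2 * 1024 * 1024 * 1024))
INT_MAX=2147483647
echo "$TWO_GIB"   # prints 2147483648
echo "$INT_MAX"   # prints 2147483647
# A 2 GiB limit is therefore one byte past what SWORD can represent.
```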


--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To post to this group, send email to dataverse...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/a56c0985-2c7b-46fa-8b4c-d3acc8d89e56%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Sherry Lake

unread,
Mar 11, 2016, 10:31:04 AM3/11/16
to Dataverse Users Community, philip...@harvard.edu
Thanks, Phil.

Yes, we have set MaxFileUploadSizeInBytes to 2 GB. But I was wondering: if a researcher had a file larger than 2 GB, how would they get their file into our Dataverse? The Harvard User Guides have this line:

Please contact support@dataverse.org if you need to upload a file that is larger than 2GB.

I assume it would be some database entry we would have to make.

--
Sherry





Philip Durbin

unread,
Mar 11, 2016, 10:54:23 AM3/11/16
to dataverse...@googlegroups.com
They wouldn't. They can't. The researcher would have to figure out how to break up their larger-than-2 GB files into smaller pieces, or something similar. Obviously, this is a policy decision about how large a file you want to support.

Oh! It's actually a little weird for non-Harvard installations that the guides suggest emailing sup...@dataverse.org about big files since that email address goes to a ticketing system at Harvard. Please feel free to open an issue about this.

Now that I think about this some more, the actual answer to your question, if you're wondering what in the world the support team at Harvard would do when a researcher sends an email about a big file, is that I believe someone with root access to the server can overwrite a small placeholder file with the actual file from the researcher. This is a hack, of course, but I think this is how it happens in practice for some of the big files in the Harvard Dataverse. Here's a search for files that are bigger than 30 GB, for example: https://dataverse.harvard.edu/dataverse/harvard?q=fileSizeInBytes%3A[32212254720+TO+*] ... I'm not sure if the replace-the-file hack was used for these specifically, though.

Is this helping? :)



Sherry Lake

unread,
Mar 11, 2016, 11:55:36 AM3/11/16
to dataverse...@googlegroups.com
Yes, the answer I was looking for is in your last paragraph, describing "what in the world the support team at Harvard would do".

I have changed our non-Harvard Dataverse guides to point to our support team, not Harvard's. I just included that text from the Harvard Guide because it alluded to there being a way (with Harvard's Dataverse) to get larger files into the database.

Here is our UVa Dataverse User Guide: http://uvalib.github.io/dataverse-docs/

I update them via sphinx and then move them to github.

And we have opened our Dataverse site to those outside UVa, in anticipation of our March 15th opening: https://dataverse.lib.virginia.edu/

--
Sherry



Condon, Kevin M

unread,
Mar 11, 2016, 12:00:17 PM3/11/16
to dataverse...@googlegroups.com

One additional step to the placeholder file hack: you need to manually generate an MD5 for the bigger file (md5sum?) and then update the md5 column in the datafile table for that file.
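Putting the placeholder hack and this checksum step together, here is a minimal sketch. The file paths and the row id 1234 are illustrative assumptions, and the md5 column in the datafile table is as described above; check your own schema (and back up the database) before running anything like this:

```shell
#!/bin/sh
# Hypothetical sketch of the placeholder-file workaround.
# All paths and the example id are made up for illustration.

REAL_FILE=/tmp/real-big-file.dat          # the researcher's >2 GB file
STORED_FILE=/tmp/stored-placeholder.dat   # where Dataverse stored the placeholder

# 1. Overwrite the placeholder on disk with the real file.
cp "$REAL_FILE" "$STORED_FILE"

# 2. Recompute the checksum so the database matches what is now on disk.
MD5=$(md5sum "$STORED_FILE" | awk '{print $1}')

# 3. Emit the SQL to run (e.g. via psql) against the datafile row.
echo "UPDATE datafile SET md5 = '${MD5}' WHERE id = 1234;"
```

Note that the file size recorded in the database would also still be the placeholder's; whether that needs a matching update is not covered in this thread.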



Philip Durbin

unread,
Mar 11, 2016, 12:42:08 PM3/11/16
to dataverse...@googlegroups.com
Gotcha. Your Dataverse installation is looking good!

I did go ahead and create this issue, by the way: User Guide contains Harvard-specific information such as 2 GB file upload limit  - https://github.com/IQSS/dataverse/issues/3015

Kevin, thanks for pointing out that with the placeholder hack you also have to update the MD5 in the database (after calculating it by hand).



Hanieh Rajabi

unread,
Mar 15, 2017, 11:26:52 AM3/15/17
to Dataverse Users Community
Hello all,
I would like to add a new question to this thread. We are going to test Dataverse at Zurich University, where users are allowed to upload files of up to 50 GB.
By changing MaxFileUploadSize, users can now do the upload, but the MD5 calculation takes so long that it sometimes times out.
Do you have any suggestions for how to deal with this issue?

Thanks
Hanieh

Philip Durbin

unread,
Mar 15, 2017, 12:01:06 PM3/15/17
to dataverse...@googlegroups.com
Unfortunately, we don't have a good answer for 50 GB files right now but please keep an eye on https://github.com/IQSS/dataverse/issues/3145 and what we call "4.8 Large Data Upload Integration" on the roadmap at http://dataverse.org/goals-roadmap-and-releases

If you are interested in beta testing this software, please let us know! Some of the code has already been written but it's quite rough at the moment.

I hope this helps,

Phil


Anders Conrad

unread,
Mar 16, 2017, 8:13:01 AM3/16/17
to Dataverse Users Community, philip...@harvard.edu
Regarding this issue as well as the thread about changing the location of the file store, I was wondering if it has ever been considered to abstract storage from Dataverse and refer to files, rather than actually manage them?

I am asking because we currently run a test setup with Dataverse on EC2, with file storage supplied by Copenhagen University. We have mapped the storage into Dataverse through davfs (very inefficient), are going to try sshfs shortly, and may eventually do NFS over SSH. But if this ever moves to a production setup, we are likely to get in trouble when ingesting 110 TB of astrophysical data... It could probably be done faster by writing directly to storage and linking from Dataverse.

I was just curious whether such an architecture, similar to e.g. Fedora Commons, has ever been considered, or whether it would break essential integrity, such as versioning?

Cheers,
Anders



Mercè Crosas

unread,
Mar 16, 2017, 8:24:28 AM3/16/17
to dataverse...@googlegroups.com, Philip Durbin
This is very important for Dataverse - I agree that it needs to expand storage options in this fashion to continue to grow. The work that we are doing to integrate with Swift object storage is a step in this direction, but not yet sufficient. Anders, could you write a short use case describing what you would need? Is this something that you (or others you could collaborate with from the Dataverse community) would be interested in contributing code to?

Merce


----------
Mercè Crosas, Ph.D., Chief Data Science and Technology Officer, IQSS, Harvard University
@mercecrosas mercecrosas.com



Sherry Lake

unread,
Mar 16, 2017, 8:33:23 AM3/16/17
to Dataverse Users Community, philip...@harvard.edu
Hi Mercè,

UVa would be interested in how to use cloud storage or some other storage not connected directly to Dataverse. We have use cases at UVa where researchers are really worried about reproducibility, so they want to keep "everything" and give it to the Library to archive, but only share (via Dataverse) a certain portion of those files.

Unfortunately, I can't offer code contribution, just use cases, testing, and usability.

Perfect timing for this discussion.

--
Sherry

Sherry Lake | Scholarly Repository Librarian | University of Virginia Library | shL...@virginia.edu | 434.924.6730 | @shLakeUVA | Alderman Library, 160 N. McCormick Road, Charlottesville, VA 22903 | Alderman 563 | LinkedIn Profile | “Keeper of the Dataverse" 

Mercè Crosas

unread,
Mar 16, 2017, 8:39:15 AM3/16/17
to dataverse...@googlegroups.com, Philip Durbin
Great - use cases and testing/usability help are welcome!


----------
Mercè Crosas, Ph.D., Chief Data Science and Technology Officer, IQSS, Harvard University
@mercecrosas mercecrosas.com


Philip Durbin

unread,
Mar 16, 2017, 8:42:50 PM3/16/17
to dataverse...@googlegroups.com
Hmm, from a quick glance at a Fedora Commons architecture diagram* they seem to be using ModeShape.

Internally we've talked a tiny bit about ModeShape[1] and JackRabbit and JCR (JSR-283) but it's never been completely clear to me if it handles versioning very well, which is a critical feature of Dataverse. If you need to see and download the exact files as of an old dataset version in Dataverse, you just navigate to that version and start downloading. I can't imagine Dataverse without support for this. We expect many users to update their datasets over time and publish new versions, but the consumers of the data should be able to go back to previous versions if they need to.

As Merce mentioned, we could probably use more in terms of use cases and user stories. Code would help too so we can see a prototype of how it would all fit together.

Keep up the chatter. Thanks for everyone's thoughts.

Phil

1. Jon Crabtree from Odum mentioned ModeShape in his talk "Odum Institute iRODS Policies to Support Preservation" at the 2016 Dataverse Community meeting: http://projects.iq.harvard.edu/files/dcm2016/files/dataversecommunitymeeting2016odum-dfc-dataverse-irods-preservation_0.pptx via http://projects.iq.harvard.edu/dcm2016/meeting-agenda

* https://wiki.duraspace.org/download/attachments/79793442/f4-arch.png via https://wiki.duraspace.org/display/FEDORA471/Fedora+4.7.1+Documentation


Anders Conrad

unread,
Mar 17, 2017, 7:48:36 AM3/17/17
to Dataverse Users Community, philip...@harvard.edu
I will be happy to write a use case - but not today as we are in the middle of a large project delivery :-) Just took the opportunity as I saw this thread unfolding.
I do appreciate Phil's concerns regarding versioning.

Coming back to this next week!
Anders