S3 Direct Uploads and Max Upload Filesize


Sherry Lake

Nov 28, 2023, 9:55:32 AM
to Dataverse Users Community
Hello,

Since we have S3 direct uploads configured, should I increase our :MaxFileUploadSizeInBytes setting? It is currently set to 6 GB. According to this page, direct upload can handle more: https://guides.dataverse.org/en/latest/developers/big-data-support.html

Are there API upload commands or upload features that bypass the :MaxFileUploadSizeInBytes limit?

For those using S3 direct uploads, what is your max upload file size?

I can see a scenario where we keep our limit "small" but also have the ability to bypass that limit on a case-by-case basis.

Discussion, advice welcome.

Thanks,
Sherry Lake

Jim Myers

Nov 28, 2023, 10:29:20 AM
to Dataverse Users Community

Sherry,

You can limit the max size per store (see the JSON example at https://guides.dataverse.org/en/latest/installation/config.html?highlight=maxfileuploadsizeinbytes#maxfileuploadsizeinbytes) and then assign stores per collection or per dataset, so different projects/groups can have different limits. Harvard is doing this (with stores pointing to the same bucket - not sure what their max is). QDR sets 2 GB, plus a higher max (~20 GB) on a store that uses a cheaper S3 option (Storj).
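
For reference, the per-store form from that guide looks roughly like this (a sketch only; "s3cheap" stands in for whatever store ID your installation defines, and "default" covers any store without its own entry):

# Set a 2 GB default limit plus a ~20 GB limit on a second store via the database settings API
curl -X PUT -d '{"default":"2147483648","s3cheap":"21474836480"}' http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes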

That said, the max file size is currently a somewhat crude way to limit overall dataset size, and Leonid is working on actual quotas right now. With those in place, a better approach would be to set the max file size to something very large - limited only by the max the store allows (5 TB on AWS S3, I think) and/or by reasonable upload times for your users given their average bandwidth - and rely on quotas instead, which would also reduce the need for stores with different max sizes.


-- Jim

Philip Durbin

Nov 29, 2023, 3:35:46 PM
to dataverse...@googlegroups.com
Here's the pull request Leonid is working on for per-collection storage quotas: https://github.com/IQSS/dataverse/pull/10144


Sherry Lake

Feb 23, 2024, 8:30:41 AM
to Dataverse Users Community
Going back to this discussion about max file sizes.

I'm experimenting on our test server, v5.14 (using S3 with direct upload enabled).

This is our setting: ":MaxFileUploadSizeInBytes": "2147483648"
When uploading a larger file via the UI, I get this warning:
[attachment: Screenshot 2024-02-23 at 7.38.25 AM.png - the UI's file size limit warning]

But I was able to upload a 6.2 GB file using DVUploader, with this command:
java -jar DVUploader-v1.2.0beta3.jar -key=$API_TOKEN  -did=doi:10.80100/FK2/LJMVKJ -server=https://dataversedev.internal.lib.virginia.edu rf-model-large.joblib

How do I set limits on DVUploader, or on the API "add" call (which didn't work for this file - I got space and timeout errors in the log)?
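
For context, the limit in question is the database setting shown above; a minimal sketch of inspecting or changing it with the admin settings API (assuming the admin endpoints are reachable, e.g. from localhost):

# List all database settings, including :MaxFileUploadSizeInBytes
curl http://localhost:8080/api/admin/settings
# Raise the limit (plain bytes, or the per-store JSON shown earlier in the thread)
curl -X PUT -d 8589934592 http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes
# Or remove the limit entirely
curl -X DELETE http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes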

Thanks,
Sherry

Philip Durbin

Feb 23, 2024, 11:21:05 AM
to dataverse...@googlegroups.com
Hi Sherry,

Good catch. Please feel free to open an issue about this. At the very least, we'd be happy to investigate and confirm that the latest release isn't affected. (We're pretty sure it isn't, judging from the code, which was refactored after 5.14.)

Thanks,

Phil




Sherry Lake

Feb 23, 2024, 11:38:42 AM
to dataverse...@googlegroups.com
Hi Phil,

So you are saying that the max file size setting is checked on DVUploader uploads for Dataverse versions after 5.14?

Then would we need to set a max file size for S3 (command-line direct upload) that is different from the UI's? Or does the max file size apply only to S3?

I thought the UI created other bottlenecks for uploads of "large" files, but maybe not if direct upload is set?

Hmm... I think I will need a whiteboard, and you and Jim at the Dataverse Community Meeting, to talk me through the scenarios.

Thanks,
Sherry

Jacek Chudzik

Nov 26, 2025, 5:37:41 AM
to Dataverse Users Community
Hi,

I'm struggling with a similar topic.

I'm on Dataverse v6.4 and trying to upload a large file (10 GB) to S3 storage. The page https://github.com/gdcc/python-dvuploader mentions that dvuploader uses direct upload, but from my tests it seems that Dataverse is still limited by :MaxFileUploadSizeInBytes. Only removing it, or setting it to a big number, allows me to upload files with dvuploader.

The perfect solution would be to have a limit in the UI and no limit (multipart/chunked upload) via dvuploader. Is that possible? Am I missing something? Is there a way to force direct upload to S3 in dvuploader?

My jvm-options settings are as follows (a quick way to double-check what the server actually picked up is sketched after the list):
-Ddataverse.files.storage-driver-id=s3
-Ddataverse.files.s3.type=s3
-Ddataverse.files.s3.label=prod
-Ddataverse.files.s3.bucket-name=demo
-Ddataverse.files.s3.download-redirect=true
-Ddataverse.files.s3.upload-redirect=true
-Ddataverse.files.s3.path-style-access=true
-Ddataverse.files.s3.min-part-size=1073741824
-Ddataverse.files.s3.url-expiration-minutes=180
-Ddataverse.files.s3.url-expiration-minutes=360
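
A sketch of that check, assuming a standard Payara layout ($PAYARA_HOME is a placeholder for your install path); note that url-expiration-minutes appears twice in the list above:

# List the JVM options the running server actually has and filter for the S3 store settings
$PAYARA_HOME/bin/asadmin list-jvm-options | grep 'dataverse\.files\.s3'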

Thanks for any suggestions.
Jacek

Range, Jan

Nov 26, 2025, 5:56:24 AM
to dataverse...@googlegroups.com

Hi Jacek,


thanks for the feedback.


Could you share the specific error message you’re encountering?


Python dvuploader defaults to direct uploads to S3 and checks if S3 upload is enabled for a collection. If it’s not enabled, it falls back to the regular upload and prints a log message to the console (see screenshot). Do you see this log message when you use it? This would help ensure there’s no bug and that the code isn’t using the regular upload.


All the best,

Jan


[attachment: PastedGraphic-1.png - the fallback log message printed to the console]


———————————

Jan Range
Research Data Software Engineer

University of Stuttgart
Stuttgart Center for Simulation Science (SC SimTech)
Cluster of Excellence EXC 2075 „Data-Integrated Simulation Science“ (SimTech)

Pfaffenwaldring 5a | Room 01.013 | 70569 Stuttgart Germany

Phone: 0049 711 685 60095
E-Mail: jan....@simtech.uni-stuttgart.de

——— Meet me ———

https://calendly.com/jan-range/meeting


Jacek Chudzik

Nov 26, 2025, 6:15:25 AM
to Dataverse Users Community
Hi,

I do not see this log message.
[attachment: Zrzut ekranu z 2025-11-26 12-02-14.png (screenshot)]

I've attached the full error message. It points to an HTTP 400 when trying to get the upload URLs. If I change :MaxFileUploadSizeInBytes to a number larger than my file, the upload works fine (but it seems to transfer the whole file at once).

The API token is valid, the user I'm using has admin privileges, and the only difference I see between a working and a non-working python dvuploader run is the value of :MaxFileUploadSizeInBytes.

Regards and thanks for any tips,
Jacek
[attachment: error_message.txt]

Range, Jan

Nov 26, 2025, 7:21:26 AM
to dataverse...@googlegroups.com

Thanks for the prompt response :)


Okay, the direct upload is definitely triggered. 


> I've attached the full error message. It points to an HTTP 400 when trying to get the upload URLs. If I change :MaxFileUploadSizeInBytes to a number larger than my file, the upload works fine (but it seems to transfer the whole file at once).


That part should be working fine: the progress bar tracks the combined progress of all concurrent upload tasks for a single file, so it can look like one big transfer. Additionally, since your file size exceeds the minimum part size, the file will be split into the appropriate number of parts. However, I should log which type of upload is used, and I'll create a pull request to add a message indicating this.
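
As a rough back-of-the-envelope check (a sketch using the 10 GB figure from this thread and the min-part-size from the jvm-options above; the exact chunking is up to the uploader):

# Approximate part count for a ~10 GB file with a 1 GiB minimum part size
FILE_SIZE=$((10 * 1000 * 1000 * 1000))   # ~10 GB, as in the example above
PART_SIZE=1073741824                     # dataverse.files.s3.min-part-size (1 GiB)
echo $(( (FILE_SIZE + PART_SIZE - 1) / PART_SIZE ))   # ceiling division -> 10 parts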


I am not sure if the S3 upload can bypass the maximum file upload size. Maybe @qqmyers or @pdurbin could help?


All the best,

Jan




James Myers

Nov 26, 2025, 8:59:51 AM
to dataverse...@googlegroups.com

A quick answer: :MaxFileUploadSizeInBytes was intended to limit file size in general (admins may not want larger files). Per-store limits, or different limits for normal and direct uploads, could make sense. An API-vs-UI split isn't so clear-cut, since the upload-a-folder capability and the SPA both do API uploads.

 

-- Jim

Jacek Chudzik

Nov 27, 2025, 1:43:30 AM
to Dataverse Users Community
So if you don't mind, I'll open an issue for such a feature. It would be nice to have something like

curl -X PUT -d '{"default":"2147483648","s3":"2147483648"}' http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes

but with separate options for UI/normal uploads and direct uploads.

Limiting larger files is OK, but as you mentioned, it is not clear to users. Increasing the :MaxFileUploadSizeInBytes value can give the false impression that this is the maximum amount of data that can be uploaded through the UI. In my experience, users don't understand nuances like HTTP upload limits.

Jim, Jan: Thank you for your clarification.
Have a nice day,
Jacek

Jacek Chudzik

Nov 27, 2025, 3:08:58 AM
to Dataverse Users Community