Maximum number of files per dataset


Dario Basset

Sep 18, 2025, 9:58:53 AM (12 days ago) Sep 18
to Dataverse Users Community
Is there a configurable maximum number of files per dataset? 

I have Dataverse 6.3 and am having problems with datasets that have more than 20,000 files. I was wondering if there is a specific limit and if that limit is configurable.

Philip Durbin

Sep 18, 2025, 10:57:30 AM (12 days ago) Sep 18
to dataverse...@googlegroups.com
Hi Dario,

Yes, limiting the number of files per dataset is a new feature as of Dataverse 6.7. Please see the release notes at https://github.com/IQSS/dataverse/releases/tag/v6.7 and the docs at https://guides.dataverse.org/en/6.7.1/api/native-api.html#imposing-a-limit-to-the-number-of-files-allowed-to-be-uploaded-to-a-dataset

There is no default limit. You have to turn this feature on globally, per collection, or per dataset.
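If it helps, here is a minimal sketch of what turning on the global limit could look like through the admin settings API. The setting name ":MaxDatasetFileCount" below is an assumption on my part, so please take the real setting and endpoint names from the release notes and guide linked above; the per-collection and per-dataset limits are set through the native API endpoints described there.

# Minimal sketch, not a verified recipe: the global limit is assumed here to be
# a database setting named ":MaxDatasetFileCount" (hypothetical name).
# The admin API is normally reachable only from localhost.
import requests

ADMIN_SETTINGS = "http://localhost:8080/api/admin/settings"

resp = requests.put(f"{ADMIN_SETTINGS}/:MaxDatasetFileCount", data="5000")
resp.raise_for_status()
print(resp.json())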

I hope this helps!

Phil


Alfredo Cosco

Sep 18, 2025, 3:48:29 PM (12 days ago) Sep 18
to Dataverse Users Community

Hi Philip,

I'm aware of the configuration options available with the new version of Dataverse, but I'm not entirely sure about the real-world capabilities and the performance/infrastructure impact that such a large dataset could have.

I came across this thread: https://groups.google.com/g/dataverse-community/c/pi4A-_D-yrQ/m/hLhDZDQhBAAJ

Are the concerns raised there still relevant, or has the new version introduced significant performance improvements that would allow us to manage this scenario without issues? What would be your recommendation for dealing with this kind of dataset?

A bit more context: Dario is working with over 20,000 image files in a single collection, totaling around 12 GB. As the collection has grown, we've experienced several problems: the collection UI has become very slow (e.g. if you click the paginator, you often see the loader but not the result page), and we're currently unable to upload additional files.

Any insights or suggestions would be really appreciated!

Thanks,

Alfredo

Philip Durbin

Sep 18, 2025, 4:36:23 PM (12 days ago) Sep 18
to dataverse...@googlegroups.com
Hi Alfredo,

Yes, the concerns of having too many files in a single dataset are still absolutely relevant! That's why we added the option to limit the number of files in Dataverse 6.7.

I'm not aware of any performance fixes we've put in to help support datasets with lots of files. I believe our advice continues to be "don't do it!"

I hope this helps,

Phil

Dario Basset

Sep 18, 2025, 4:57:39 PM (12 days ago) Sep 18
to dataverse...@googlegroups.com
Thank you Philip & Alfredo. 
I understand. 

Then, is there a recommended number of files per dataset that we should not exceed? 2000?  5000?



Philip Durbin

Sep 18, 2025, 5:05:29 PM (12 days ago) Sep 18
to dataverse...@googlegroups.com
It's a good question. I'd like to hear what the community suggests!

James Myers

Sep 18, 2025, 5:12:39 PM (12 days ago) Sep 18
to dataverse...@googlegroups.com

As you've seen, there is no strict cutoff after which things just won't work – the dataset gets slower to load and to edit, but you can potentially increase hardware and memory to make it work. (There are a range of settings that make large datasets more efficient, e.g. using an S3 store with direct upload/download.) That said, my guess is that most people would recommend fewer than a few thousand, and some (at larger Dataverse installations) only hundreds. Conversely, QDR has accepted a dataset with over 10K files that works well enough, and I think others have used the more efficient APIs to create even larger datasets.

 

The guidance usually given when there are too many files is to zip the files up in some logical way. If you use S3 storage with direct upload, zip files won't be unpacked and it's straightforward. With other storage you may need to put the zip in another zip file (double zip) or use a program that produces a file extension other than .zip (e.g. .tar, .gz, etc.) so Dataverse's automatic unzipping won't be triggered.
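As a concrete (hedged) example of the packaging step, a small script along these lines would build either a double zip or a .tar.gz so the archive survives upload as a single file on non-S3 storage. The paths below are placeholders.

# Sketch: package many small files so Dataverse keeps them as one archive.
import tarfile
import zipfile
from pathlib import Path

src = Path("images")                 # directory with the many small files
inner = Path("images.zip")           # inner zip holding the actual content
outer = Path("images_upload.zip")    # outer zip: one level gets unpacked, the inner zip stays intact

# Option A: double zip
with zipfile.ZipFile(inner, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for f in src.rglob("*"):
        if f.is_file():
            zf.write(f, f.relative_to(src))
with zipfile.ZipFile(outer, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.write(inner, inner.name)

# Option B: a .tar.gz, which the automatic unzipping leaves alone
with tarfile.open("images.tar.gz", "w:gz") as tf:
    tf.add(src, arcname=src.name)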

 

If you do use zip files, one thing to look into is the Zip Previewer – it allows users to see the contents of the zip file and even download specific files from within it.

 

-- Jim

Sebastian Karcher

Sep 18, 2025, 5:22:37 PM (12 days ago) Sep 18
to dataverse...@googlegroups.com
Performance with large numbers of files is something we're working on. We find 2k unproblematic. We recently published a dataset with 10k files and hit a number of issues with that one (a massive JSON-LD file in the header, problems with DataCite, and more), though none of them were insurmountable. One remaining concern is getting the files _out_, given how the ZIP downloads work. So I would try to avoid going above 2k.
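For the download side, a rough sketch of pulling the files one at a time through the access API instead of the single ZIP bundle (the persistent ID and token are placeholders, and the JSON field names should be double-checked against the native API guide for your version):

# Sketch: list a dataset version's files and download each one individually.
import requests
from pathlib import Path

BASE_URL = "https://demo.dataverse.org"          # placeholder installation
PID = "doi:10.5072/FK2/EXAMPLE"                  # placeholder persistent ID
HEADERS = {"X-Dataverse-key": "YOUR-API-TOKEN"}  # only needed for restricted/draft files

out = Path("download")
out.mkdir(exist_ok=True)

listing = requests.get(
    f"{BASE_URL}/api/datasets/:persistentId/versions/:latest-published/files",
    params={"persistentId": PID},
    headers=HEADERS,
).json()["data"]

for fm in listing:
    file_id = fm["dataFile"]["id"]
    name = fm.get("label", str(file_id))
    with requests.get(f"{BASE_URL}/api/access/datafile/{file_id}",
                      headers=HEADERS, stream=True) as r:
        r.raise_for_status()
        with open(out / name, "wb") as fh:
            for chunk in r.iter_content(chunk_size=1 << 20):
                fh.write(chunk)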

Sebastian 

Sent from my phone


Dario Basset

Sep 23, 2025, 11:57:20 AM (7 days ago) Sep 23
to Dataverse Users Community

Hi Philip and Sebastian, thanks for your help.

We are handling our datasets with thousands of files; fortunately there are only a few of them and they are unpublished, so for now it is possible to remove the files. To avoid the problem in the future, we want to have a policy for managing thousands of files.

We have to define: 

1) what the maximum number of files per dataset should be

2) when we split the data into many datasets, how to connect the different datasets. In case they are replication data for an article, how to manage them in the publication (e.g. creating a "summary dataset" so the article has a single DOI to cite, or having the list of all DOIs, e.g. in a summary file).

Do you have such policies for your Dataverse installations? We would appreciate it if you or any other group member could share them.

Thanks and best regards

Dario Basset & Stefano Bolelli
