Is there a configurable maximum number of files per dataset? I have Dataverse 6.3 and I'm having problems with datasets that have more than 20,000 files. I was wondering if there is a specific limit and whether that limit is configurable.
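In case it is relevant, here is a rough sketch of how the current database settings can be listed via the admin API, to check whether such a limit is hiding among them (this assumes the admin API is only reachable on localhost, as in the default setup, and that the response uses the usual data envelope):

import requests

# Assumption: run on the Dataverse server itself, since by default the
# admin API is only reachable on localhost.
BASE_URL = "http://localhost:8080"

# List every database setting currently defined on the installation and
# look for anything that sounds like a per-dataset file limit.
resp = requests.get(f"{BASE_URL}/api/admin/settings")
resp.raise_for_status()
settings = resp.json().get("data", {})  # assumes the usual {"status": ..., "data": {...}} envelope
for name, value in sorted(settings.items()):
    print(f"{name} = {value}")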
Hi Philip,
I'm aware of the configuration options available with the new version of Dataverse, but I'm not entirely sure about the real-world capabilities and the performance/infrastructure impact that such a large dataset could have.
I came across this thread: https://groups.google.com/g/dataverse-community/c/pi4A-_D-yrQ/m/hLhDZDQhBAAJ
Are the concerns raised there still relevant, or has the new version introduced significant performance improvements that would allow us to manage this scenario without issues? What would be your recommendation for dealing with this kind of dataset?
A bit more context: Dario is working with over 20,000 image files in a single collection, totaling around 12 GB. As the collection has grown, we've experienced several problems: the collection UI has become very slow (e.g. if you click the paginator, you often see the loader but the result page never appears), and we're currently unable to upload additional files.
Any insights or suggestions would be really appreciated!
Thanks,
Alfredo
As you’ve seen, there is no strict cutoff after which things just won’t work – the dataset gets slower to load and to edit, but you can potentially add hardware and memory to keep it working. (There are a number of settings that make large datasets more efficient, e.g. using an S3 store with direct upload/download.) That said, my guess is that most people would recommend fewer than a few thousand files, and some (larger Dataverse instances) only hundreds. Conversely, QDR has accepted a dataset with over 10K files that works well enough, and I think others have used the more efficient APIs to create even larger datasets.
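If the web UI is the bottleneck for adding files, a minimal sketch of going through the native add-file API instead might look like the following (the server URL, DOI and API token are placeholders, and the jsonData metadata is just an example):

import json
from pathlib import Path

import requests

# Placeholders: replace with your installation, dataset DOI and API token.
SERVER = "https://dataverse.example.edu"
PERSISTENT_ID = "doi:10.5072/FK2/EXAMPLE"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

def add_file(path, description=""):
    """Add one file to the dataset's draft version via the native API."""
    metadata = {"description": description, "restrict": False}
    with open(path, "rb") as fh:
        resp = requests.post(
            f"{SERVER}/api/datasets/:persistentId/add",
            params={"persistentId": PERSISTENT_ID},
            headers={"X-Dataverse-key": API_TOKEN},
            data={"jsonData": json.dumps(metadata)},
            files={"file": fh},
        )
    resp.raise_for_status()
    return resp.json()

# Example: loop over a folder and add the files one request at a time.
# Note this is still one HTTP call per file, so for many thousands of files
# an S3 store with direct upload will be considerably faster.
for image in sorted(Path("images").glob("*.png")):
    print(add_file(image, description=image.name))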
The guidance usually given when there are too many files is to zip the files up in some logical way. If you use S3 storage with direct upload, zip files won’t be unpacked, so this is straightforward. With other storage you may need to put the zip inside another zip file (double zip) or use a program that produces a file extension other than .zip (e.g. .tar, .gz, etc.) so Dataverse’s automatic unzipping won’t be triggered.
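As a small illustration of the second option, here is a sketch that packs a folder of images into a .tar.gz so the automatic unzipping is not triggered (the paths are placeholders):

import tarfile
from pathlib import Path

# Placeholders: the folder of images to pack and the archive to produce.
SOURCE_DIR = Path("images")
ARCHIVE = Path("images.tar.gz")

# A .tar.gz (unlike a plain .zip) is stored by Dataverse as a single file
# rather than being unpacked on ingest.
with tarfile.open(ARCHIVE, "w:gz") as tar:
    for path in sorted(SOURCE_DIR.rglob("*")):
        if path.is_file():
            # Keep paths inside the archive relative to the source folder.
            tar.add(path, arcname=path.relative_to(SOURCE_DIR))

print(f"Wrote {ARCHIVE} ({ARCHIVE.stat().st_size} bytes)")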
If you do use zip files, one thing to look into is the Zip Previewer – it allows users to see the contents of the zip file and even download specific files from within it.
-- Jim
Hi Philip and Sebastian, thanks for your help.
We are handling datasets with thousands of files; fortunately there are only a few of them and they are unpublished, so for now it is possible to remove the files. To avoid the problem in the future, we want to have a policy on managing thousands of files.
We have to define:
1) what the maximum number of files per dataset should be;
2) when we split the data across multiple datasets, how to connect them. If they are replication data for an article, how should we manage them in the publication (e.g. creating a "summary dataset" so the article cites a single DOI, or listing all the DOIs, e.g. in a summary file)?
Do you have such policies for your dataverses? We would appreciate it if you or any other group member could share them.
Thanks and best regards
Dario Basset & Stefano Bolelli