The number of files in the zip archive is over the limit (1000)

Philipp at UiT

Nov 5, 2019, 2:55:46 PM
to Dataverse Users Community
We get the following message when trying to upload a zip file containing more than 1000 files:

The number of files in the zip archive is over the limit (1000); please upload a zip archive with fewer files, if you want them to be ingested as individual DataFiles.

Is this limit configurable, or do we have to split up the files into smaller zips?

Best, Philipp

Philip Durbin

Nov 5, 2019, 3:05:02 PM
to dataverse...@googlegroups.com
Yes, it's configurable, and lightly documented. :)
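
If I remember right, it's a database setting that you change through the admin API. Here's a minimal sketch of what that looks like from Python, assuming the setting name is :ZipUploadFilesLimit and that the admin endpoint is only reachable on localhost (both worth double-checking against the Installation Guide for your version):

    import requests

    # Admin settings API; typically blocked from outside the server itself.
    ADMIN_BASE = "http://localhost:8080/api/admin/settings"

    # Assumption: :ZipUploadFilesLimit is the setting behind the 1000-file cap.
    # 5000 here is just an example value.
    resp = requests.put(f"{ADMIN_BASE}/:ZipUploadFilesLimit", data="5000")
    resp.raise_for_status()
    print(resp.json())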


Philipp at UiT

Nov 5, 2019, 3:19:09 PM
to Dataverse Users Community
Great! Thanks, Phil. Do you have any recommendations on what the upper limit should be? We currently have time series datasets, each including folders with approx. 3,000 files.


Philip Durbin

Nov 5, 2019, 4:29:16 PM
to dataverse...@googlegroups.com
With that many files (3,000), I'd be concerned about how the UI would perform, so I'd probably upload via the API if there's any trouble: http://guides.dataverse.org/en/4.17/api/getting-started.html#uploading-files
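
For reference, here's a rough sketch of what that API upload can look like from Python; the server URL, API token, and dataset DOI below are placeholders you'd replace with your own:

    import json
    import requests

    SERVER = "https://dataverse.example.edu"            # placeholder installation URL
    API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # your API token
    DOI = "doi:10.5072/FK2/EXAMPLE"                     # placeholder dataset PID

    # Optional metadata for the uploaded file.
    json_data = {"description": "Time series bundle"}

    with open("timeseries.zip", "rb") as fh:
        resp = requests.post(
            f"{SERVER}/api/datasets/:persistentId/add",
            params={"persistentId": DOI},
            headers={"X-Dataverse-key": API_TOKEN},
            files={"file": ("timeseries.zip", fh)},
            data={"jsonData": json.dumps(json_data)},
        )
    resp.raise_for_status()
    print(resp.json()["status"])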

Up to you, though. :)

Please report back with any findings!

Thanks,

Phil


Philipp at UiT

Nov 6, 2019, 3:23:20 AM
to Dataverse Users Community
Thanks, Phil. We'll test it out.


Meghan Goodchild

Nov 14, 2019, 9:16:40 AM
to Dataverse Users Community
Hi Philipp, were you able to test the performance through the UI or using the API after increasing the limit? Anything to share? 

Thanks,
Meghan
Scholars Portal Dataverse

Philipp at UiT

Nov 24, 2019, 12:49:07 AM
to Dataverse Users Community
Hi Meghan, sorry for my late reply! We haven't been able to test this as our system admin has been busy/out of office. We have some quite large time series where we'll need to upload folders with more than 1000 files, so we'll definitely have to test this. I'll keep you and the list posted.

Best, Philipp

Philipp at UiT

Feb 17, 2020, 2:16:34 AM
to Dataverse Users Community
We have been running some more tests on this, but couldn't really determine an upper limit on the number of files that Dataverse can properly unpack and ingest.

We are currently discussing how to publish a dataset with even more files (more than 60,000). All the files belong to a single dataset, and the idea is that they are processed together by a program. The individual files are quite small, but their sheer number makes it impossible to upload them and assign a DOI to each one at file level. We have been considering several approaches to handle this:

1. Turn off DOI minting at file level. As far as I know, this can currently only be done at the Dataverse installation level. We could turn it off temporarily while uploading the files to this dataset (see the sketch after this list), but this solution seems somewhat unsustainable.

2. Split the dataset up into many small datasets. Not very practical for either the depositor or the end user.

3. Double-zip the files and deposit the data as zip file(s). This works fine, but to my knowledge, container files are not considered preferred file formats for long-term preservation.
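
For concreteness, here's a rough sketch of what we have in mind for option (1), assuming the installation-wide setting is :FilePIDsEnabled and the admin API is reachable on localhost (names and defaults may differ between Dataverse versions, so our system admin would have to verify this):

    import requests

    ADMIN_BASE = "http://localhost:8080/api/admin/settings"

    # Assumption: :FilePIDsEnabled controls file-level PID minting for the
    # whole installation. Disable it temporarily...
    requests.put(f"{ADMIN_BASE}/:FilePIDsEnabled", data="false").raise_for_status()

    # ...upload and publish the large dataset here...

    # ...then re-enable it afterwards.
    requests.put(f"{ADMIN_BASE}/:FilePIDsEnabled", data="true").raise_for_status()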

For the time being, we'll recommend approach (3) to our researcher, provided the files inside the zip are themselves in (a) preferred file format(s), but we'd like to hear whether anyone has discussed this or similar issues for their repository.

Best, Philipp

Philip Durbin

Feb 24, 2020, 9:05:20 AM
to dataverse...@googlegroups.com
I don't think it's any secret that Dataverse doesn't work especially well with tens of thousands of files in a single dataset. It would be nice to see this situation improve. :)

When we began the big data project with SBGrid (which ultimately resulted in the rsync feature), one of the early requirements was "Current largest dataset has >700k files (can be slow; but shouldn’t fail)." Whether it's 700,000 or 60,000 files, that's a lot for a Dataverse dataset in 2020, especially if you want DOIs for all those files! :)

Your choice of approach 3 (double zipping) makes complete sense to me. Maybe in the future that zip can be unpacked in Dataverse and each file can get a DOI. In terms of preservation, I'm wondering if the Archivematica integration or BagIt export helps here. Perhaps zip files could be extracted by some other system? From a quick look at the Archivematica documentation[1], I'm seeing "When you are processing a Dataverse dataset that includes packaged material (i.e. .zip or .tar files), Archivematica can extract the contents of these files and run preservation microservices on the contents."

My understanding from SBGrid is that most, if not all, of their datasets only make sense when you have *all* the files, and they aren't interested in DOIs at the file level. We made a new file type called a Dataverse Package[2] that looks like a single file in Dataverse but actually represents many large files or many, many small files. SBGrid uses rsync to get these files in and out.

The command-line DVUploader[3] should handle uploading large numbers of files more easily than the PrimeFaces component in the Dataverse web interface, but if you have DOI minting for files enabled, as you mentioned, you'll still probably have trouble at publish time trying to mint DOIs for 60,000 files. I agree that the workaround I mentioned of turning DOIs off temporarily[4] is not very sustainable. Perhaps a feature could be added whereby this workaround is automated. That is, files eventually each get a DOI, just not right when the dataset is published.

I hope this helps,

Phil

1. https://www.archivematica.org/en/docs/archivematica-1.8/user-manual/transfer/dataverse/

Philipp at UiT

Feb 26, 2020, 2:34:05 AM
to Dataverse Users Community
Thanks, Phil, that was useful information. I think double-zipping is currently the most practical approach for datasets with a lot of files. Also, I'm not sure it makes sense to get DOIs assigned to all these files, even if we could. Maybe it's enough that the zip file gets its own DOI.

I'd still be interested to hear what other community members think about the suitability of zip files for long-term preservation.

Best, Philipp
