using wget to download all files in a dataset not preserving file hierarchy


meghan.good...@gmail.com

Jul 12, 2022, 3:38:17 PM
to Dataverse Users Community
Hello community members,

I'm hoping someone might be able to help us!

A researcher has uploaded a large dataset (35 GB, almost 2000 files), and it's really important that users can download the dataset with its file hierarchy intact. We cannot use the native download-as-zip API because the dataset exceeds the zip download bundle size limit.

When using the wget command (example pasted below) for a dataset with a file hierarchy (i.e., a tree structure), the files are saved in a flat structure, and because many files share the same name, they overwrite one another.

`wget -r -e robots=off -nH --cut-dirs=3 --header "X-Dataverse-key: $API_TOKEN" --content-disposition "https://borealisdata.ca/api/datasets/:persistentId/dirindex?persistentId=IDENTIFIER"`
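For anyone less familiar with these flags, here is a minimal sketch of what `-nH --cut-dirs=N` should do to a crawled URL path (the example path below is hypothetical, not the exact layout Dataverse serves; the point is only the mechanics of the flag):

```python
from pathlib import PurePosixPath

def cut_dirs(url_path: str, n: int) -> str:
    """Mimic wget's -nH --cut-dirs=N: discard the host plus the first N
    directory components of the URL path, and use the remainder as the
    local save path."""
    parts = PurePosixPath(url_path).parts  # e.g. ('/', 'api', 'datasets', ...)
    kept = parts[1 + n:]                   # skip the leading '/' and N dirs
    return str(PurePosixPath(*kept)) if kept else ""

# Hypothetical file URL path with three prefix components to strip:
print(cut_dirs("/api/datasets/dirindex/data/sub/file.csv", 3))
# → data/sub/file.csv
```

So with `--cut-dirs=3` we expect `data/sub/file.csv` to be created under the current directory, not a flat file named `file.csv`.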

We expected the dataset to download with its folder hierarchy intact, as described in the documentation: “Using this API, wget --recursive (or a similar crawling client) can be used to download all the files in a dataset, preserving the file names and folder structure; without having to use the download-as-zip API.” https://guides.dataverse.org/en/latest/api/native-api.html#view-dataset-files-and-folders-as-a-directory-index

When testing on a local machine it works as expected, so we're wondering if the problem is related to our production instance using S3 storage.

I also created an issue for this, but I probably should have just asked here first instead! https://github.com/IQSS/dataverse/issues/8836


Any help would be greatly appreciated :)

Thanks,
Meghan
(Borealis team)