Options for bulk downloading datasets in a collection


Kaitlin Newson

Jun 28, 2022, 2:32:18 PM
to Dataverse Users Community
Hi Dataverse community,

We recently had a question from a member of our community about options for downloading all datasets from one or more collections. In this case, they want to bulk download around 60 datasets from two collections and are looking for alternatives to manual downloads in the UI. I didn't see much in the user guides for this particular use case, so I wanted to check with the community to see if anyone has developed anything.

Thanks!


Geneviève Michaud

Jun 29, 2022, 2:24:48 AM
to dataverse...@googlegroups.com
Hi Kaitlin,

Do you mean downloading metadata and files at once? Sorry if the answer is obvious.

Geneviève

Geneviève Michaud
CDSP - UAR 828 Sciences Po - CNRS

Centre de Données Socio-Politiques

27, rue Saint-Guillaume
75337 Paris cedex 07
Phone: +33 (0)1 45 49 72 83




Philipp at UiT

Jul 7, 2022, 1:07:15 PM
to Dataverse Users Community

Hi Kaitlin,

If this is about downloading all files from specified collections, here's how I'd solve this task, based on a similar case I recently worked with at DataverseNO:

1. Create a list of all published dataset DOIs in collection abc and collection def by running the following command in a bash command line, adapted to your case:

curl 'https://demo.dataverse.org/api/search?q=*&type=dataset&subtree=abc&subtree=def&per_page=1000' | jq -r '.data | .items | .[] | .global_id' > dataset_dois.txt

2. Copy the contents of dataset_dois.txt and paste into cell A2 in the attached LibreOffice spreadsheet dataverse_download_all_files_from_datasets.ods.

3. Copy cells B2 and C2 down to the last row that has content in column A.

4. Copy the contents of column C from cell C2 onward.

5. Paste into a plain text document, and save it as dataverse_download_all_files_from_datasets.sh (or similar).

6. In the command line, run this file:
bash dataverse_download_all_files_from_datasets.sh

This will download all the files from collections abc and def into a sub-folder called fileDownload, with one sub-sub-folder per dataset, named after the dataset DOI suffix.

Of course, all this could also be done in one single script, but currently, I'm not capable of writing such a script ;-)
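For readers who want to try the single-script route, here is a minimal Python sketch of the steps above. It is not the spreadsheet's method: it swaps wget for the standard library, and it assumes the Search API, the dataset files listing (`/api/datasets/:persistentId/versions/:latest-published/files`) and the file access endpoint (`/api/access/datafile/{id}`) are reachable without a token on your installation. The function names, the page size of 100 and the folder layout are illustrative choices. The Search API returns results in pages (10 by default), so the sketch pages through them until `total_count` is reached.

```python
import json
import os
import urllib.parse
import urllib.request


def search_url(base_url, collections, start=0, per_page=100):
    """Build a Search API URL limited to datasets in the given collections."""
    params = [("q", "*"), ("type", "dataset")]
    params += [("subtree", alias) for alias in collections]
    params += [("start", start), ("per_page", per_page)]
    return f"{base_url}/api/search?{urllib.parse.urlencode(params)}"


def files_url(base_url, pid):
    """Build the URL that lists the files of a dataset's latest published version."""
    query = urllib.parse.urlencode({"persistentId": pid})
    return f"{base_url}/api/datasets/:persistentId/versions/:latest-published/files?{query}"


def get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def list_dataset_pids(base_url, collections, fetch=get_json, per_page=100):
    """Page through the search results and collect every dataset's global_id."""
    pids, start = [], 0
    while True:
        data = fetch(search_url(base_url, collections, start, per_page))["data"]
        items = data["items"]
        pids += [item["global_id"] for item in items]
        start += len(items)
        if not items or start >= data["total_count"]:
            return pids


def download_collections(base_url, collections, target="fileDownload"):
    """Download every file, one by one, into target/<last DOI segment>/."""
    for pid in list_dataset_pids(base_url, collections):
        dataset_dir = os.path.join(target, pid.split("/")[-1])
        for entry in get_json(files_url(base_url, pid))["data"]:
            datafile = entry["dataFile"]
            # Keep the dataset's own folder structure where one is recorded.
            folder = os.path.join(dataset_dir, entry.get("directoryLabel", ""))
            os.makedirs(folder, exist_ok=True)
            url = f"{base_url}/api/access/datafile/{datafile['id']}"
            urllib.request.urlretrieve(url, os.path.join(folder, datafile["filename"]))
```

Calling `download_collections("https://demo.dataverse.org", ["abc", "def"])` would then fetch the files one at a time instead of as a ZIP bundle. Restricted or unpublished files are not handled here. A request for one would fail with an HTTP error unless an API token is added to the requests.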

Best,
Philipp

dataverse_download_all_files_from_datasets.ods

Night Owl

Jul 8, 2022, 4:49:06 PM
to Dataverse Users Community

Hi, we have also been dealing with a user who wanted to download many very large files from a dataset, and we didn’t have a solution. This option worked well for that! Downloading through the API didn’t work because it tries to ZIP the files and quickly reaches the download limit (and the limits of our server!). Jim suggested trying this as well, and it works because it builds the list of files and then uses wget to get the files one by one.

We would like to offer this as a possible solution in our guide for downloading files, for users who need to download all the files in one or more datasets. Would it be okay for us to include the spreadsheet and instructions in our guide? (Once you figure out what it’s doing, you can just adapt the script as needed, but the spreadsheet helps you get started.)

Thanks so much!

Philipp at UiT

Jul 8, 2022, 11:29:00 PM
to Dataverse Users Community

Hi Night Owl,

Yes, sure, please reuse and adapt the script and spreadsheet as needed.

Best, Philipp

Stefan Kasberger

Jul 20, 2022, 4:55:29 AM
to Dataverse Users Community
Hi,

For this, I very much recommend using pyDataverse with its get_children() function, especially as the collection gets bigger.

See a code snippet on how to use it to download a full data collection (data tree) from a Dataverse instance here:

You can also use the newly released dataverse_tests collect functionality, which implements get_children() with some helpful functionality wrapped around it.
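As a rough illustration of the get_children() route (this is not the snippet referred to above), the sketch below flattens a collection tree into a list of files to download. It rests on assumptions that should be checked against the pyDataverse documentation for your version: that `NativeApi.get_children()` accepts a `children_types` list, and that it returns nested dicts with `type`, `children`, `pid`, `datafile_id` and `filename` keys. The helper names `collect_datafiles` and `fetch_tree` are made up for this example.

```python
def collect_datafiles(tree, dataset_pid=None):
    """Flatten a nested collection tree into (dataset_pid, datafile_id, filename) tuples.

    Assumed node shape: {"type": "dataverse" | "dataset" | "datafile",
    "children": [...]}, with "pid" on datasets and "datafile_id" and
    "filename" on datafiles.
    """
    found = []
    for node in tree:
        kind = node.get("type")
        if kind == "datafile":
            found.append((dataset_pid, node.get("datafile_id"), node.get("filename")))
            continue
        # A dataset passes its own PID down; a sub-collection passes nothing new.
        pid = node.get("pid") if kind == "dataset" else dataset_pid
        found += collect_datafiles(node.get("children", []), pid)
    return found


def fetch_tree(base_url, alias, api_token=None):
    """Fetch the tree below a collection alias (assumed pyDataverse usage)."""
    # Imported here so collect_datafiles() works without pyDataverse installed.
    from pyDataverse.api import NativeApi

    api = NativeApi(base_url, api_token)
    return api.get_children(alias, children_types=["dataverses", "datasets", "datafiles"])
```

Under those assumptions, `collect_datafiles(fetch_tree("https://demo.dataverse.org", "abc"))` would yield one tuple per file, and each file could then be downloaded one by one through the Data Access API.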

Cheerz, Stefan