
Sergej Zr

May 7, 2026, 8:53:22 AM
to dataverse...@googlegroups.com

Dear Dataverse Community,

We would like to share a small open-source tool that we are developing at the University of Bonn to simplify downloading large and complex datasets from Dataverse installations.

We are the Service Center for Research Data at the University of Bonn and operate our Dataverse installation within the university's IT center.

The motivation behind the tool was a problem we repeatedly encountered with large or complex datasets:
downloads may fail, interrupted transfers are difficult to resume, and it can become unclear which files have already been downloaded successfully, especially when datasets are organized in hierarchical folder structures that make partial downloads hard to track.

To address this, we developed a lightweight desktop downloader with features such as:

  • resumable downloads
  • selective file and folder downloads
  • checksum verification
  • progress tracking
  • preservation of folder structures
  • support for large datasets and unstable connections
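For the curious, the two core mechanisms behind resumable downloads and checksum verification can be sketched in a few lines of Python. This is a simplified illustration, not the tool's actual code: the `/api/access/datafile/{id}` endpoint and per-file MD5 checksums are standard Dataverse, but the function names and resume logic here are assumptions.

```python
# Sketch of resume-offset and checksum logic for a Dataverse downloader.
# Illustrative only; function names and logic are not the tool's code.
import hashlib
import os


def resume_range_header(local_path: str) -> dict:
    """Build an HTTP Range header that continues a partial download.

    If a partial file exists locally, ask the server for the remaining
    bytes only; otherwise request the whole file.
    """
    if os.path.exists(local_path):
        offset = os.path.getsize(local_path)
        if offset > 0:
            return {"Range": f"bytes={offset}-"}
    return {}


def verify_md5(local_path: str, expected_md5: str) -> bool:
    """Compare a finished file against the MD5 that Dataverse reports
    in the dataset's file metadata."""
    h = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_md5
```

A client would send the Range header with its GET request to `/api/access/datafile/{id}` and, once the transfer completes, run the checksum comparison before marking the file as done.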

The project is fully open source and available here:

https://github.com/sergejzr/harvard-dataverse-downloader

We have already received first positive feedback from the German Dataverse community and would be very happy to hear from others as well:

  • Are you already using similar tools (and if so, which ones)?
  • Which download workflows work well for your users?
  • Are there features you would consider important?

We are also currently testing direct Dataverse integration via custom deep links (e.g. opening datasets directly from the browser into the downloader application).

Feedback, ideas, and contributions are very welcome (here or directly at GitHub).

Best regards from Bonn,

Sergej Zerr & the RDM Team
University of Bonn
Service Center for Research Data / University IT

PS: See you in Barcelona next week! :)

-- 
Dr. Sergej Zerr
Hochschulrechenzentrum Bonn 
Servicestelle Forschungsdatenmanagement - SFD
Tel: +49 228 73-4121
Raum: 3.011
Wegelerstrasse 6
53115 Bonn
www.hrz.uni-bonn.de

Philip Durbin

May 8, 2026, 5:49:11 AM
to dataverse...@googlegroups.com
Hi Sergej!

This download tool seems great, especially since it has a graphical user interface (GUI). The only other download tool I'm aware of is also great, but it's a command line interface (CLI) tool. It was added in this pull request: https://github.com/gdcc/dataverse-recipes/pull/17

Back to your tool, it mostly worked but I did get a "checksum validation failed" on the one tabular file in https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP (which is my dataset). I'll attach a screenshot.

Thanks and see you in Barcelona!

Phil

[Attachment: Screenshot 2026-05-08 at 12.43.39 PM.png]

Martin Schorcht

May 8, 2026, 7:31:27 AM
to Dataverse Users Community
Hi Sergej,

Great tool!

We are facing an issue where some users with slow download speeds are unable to download large files. This is due to Payara's timeout setting, which is already set quite high at 30 minutes. For security reasons, we will not be increasing this limit.

In your GitHub repository, you wrote “resume support.” Does that refer to the dataset or to individual files as well? In other words, would a large file resume downloading from the point where it was interrupted?

thx & best
martin


Sergej Zr

May 8, 2026, 10:21:04 AM
to dataverse...@googlegroups.com

Hi Martin,

Thanks! Yes, we are facing exactly the same problem. For example, a file may download successfully from Europe, but from Australia the transfer can take slightly longer and eventually break due to the timeout.

And yes — the tool supports resuming downloads of individual files. It gives fairly complete control over the entire download process. Users can download selected subfolders or entire datasets, and interrupted downloads can be resumed at any time.

There is currently no dedicated feature to resume all failed downloads at once. However, restarting the application achieves much the same result, since it automatically detects and continues any interrupted downloads.
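One common way to implement that restart behavior can be sketched as follows. This is purely hypothetical pseudologic, not the tool's actual implementation: the manifest format and all names here are illustrative assumptions.

```python
# Hypothetical sketch of restart-time resume detection: compare each
# local file against the size recorded when the download queue was
# built, and re-enqueue anything missing or incomplete.
import os


def files_to_resume(manifest: dict, download_dir: str) -> list:
    """Return the relative paths that still need downloading.

    `manifest` maps a relative file path to its expected size in bytes,
    as recorded when the download queue was first built.
    """
    incomplete = []
    for rel_path, expected_size in manifest.items():
        local = os.path.join(download_dir, rel_path)
        if not os.path.exists(local) or os.path.getsize(local) < expected_size:
            incomplete.append(rel_path)
    return incomplete
```

On startup, everything this returns would simply be fed back into the download queue, with partial files continued via HTTP Range requests rather than restarted from zero.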

Best regards,
Sergej


Martin Schorcht

May 8, 2026, 10:54:08 AM
to Dataverse Users Community
Wonderful, thanks for the clarification. We'll recommend your tool to users who are having issues ;)

Eryk Kulikowski

May 9, 2026, 6:01:29 PM
to Dataverse Users Community
Hi all,

From KU Leuven: we've been building a browser-side approach to some of the same problems: streaming zip downloads with per-file S3 resume, folder-aware selection, embedded into both the SPA and the legacy JSF dataset page. Screenshots and details are in https://github.com/IQSS/dataverse/pull/12382

It doesn't replace Sergej's tool for unattended or cross-session jobs: closing the browser tab kills an in-flight zip, which is a hard limit of running in the page. So the two are complementary: desktop for queue-and-walk-away, browser for the interactive case on the dataset page.

Happy to talk concretely at the DCM next week. We're presenting on this on May 14.

Best,
Eryk
KU Leuven


Range, Jan

May 10, 2026, 2:02:28 AM
to dataverse...@googlegroups.com
Hi everyone,

The Dataverse CLI (DVCLI) also supports resumable (bulk) downloads, if you prefer staying in the terminal :-)


See you in Barcelona!

All the best,
Jan

———————————

Jan Range
Research Data Software Engineer

University of Stuttgart
Stuttgart Center for Simulation Science (SC SimTech)
Cluster of Excellence EXC 2075 „Data-Integrated Simulation Science“ (SimTech)

Pfaffenwaldring 5a | Room 01.013 | 70569 Stuttgart Germany

Phone: 0049 711 685 60095
E-Mail: jan....@simtech.uni-stuttgart.de

——— Meet me ———

https://calendly.com/jan-range/meeting

