Institutional datasets with monthly updates — looking for examples & best practice

Lora Leligdon

Oct 6, 2025, 11:32:02 AM
to Dataverse Users Community

Hi everyone,

We have a couple of important institutional datasets that will grow over time (new files to be deposited every month). While we would prefer to deposit them as static datasets, the volume and timeliness of the data, along with the way people will use them, make this impractical. So, we are trying to figure out a way to archive them in Dataverse that's citable, clear, and sustainable as "growing" datasets.

Could anyone share:

  • Examples of Dataverses / datasets doing this
  • Best practices for documenting update frequency (in README, metadata, etc.)
  • Strategies for versioning
  • Workflow tips or automated methods (using API, scripts)
  • How you guide users to cite evolving datasets

Thanks in advance for any pointers.

Lora
Lora Leligdon | Dartmouth Libraries Head of Research Facilitation | Dartmouth | Hanover, NH 03755 | Make an Appointment

Leonid Andreev

Oct 10, 2025, 4:37:03 PM
to Dataverse Users Community
Hi Lora, 
There are real-life examples of such "growing datasets," where authors publish a new version of a dataset at scheduled intervals and/or when new data become available.
There are different ways to organize and/or package the actual files, depending on the nature of your data and the update process. 
In the most straightforward scenario, you simply add one or more new files every month, with a month's worth of data in each.
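If you script this, it can be a single monthly call to the native API's add-file endpoint. Here's a minimal sketch in Python with the requests library (the server URL, DOI, API token, and file name below are placeholders, not real values):

    import json
    import requests

    SERVER = "https://dataverse.example.edu"   # placeholder: your installation
    DOI = "doi:10.70122/FK2/EXAMPLE"           # placeholder: the dataset's persistent ID
    API_TOKEN = "xxxxxxxx-xxxx"                # placeholder: a depositor's API token

    def add_monthly_file(path, description):
        """Add one new file to the dataset's draft version."""
        with open(path, "rb") as fh:
            resp = requests.post(
                f"{SERVER}/api/datasets/:persistentId/add",
                params={"persistentId": DOI},
                headers={"X-Dataverse-key": API_TOKEN},
                files={"file": fh},
                data={"jsonData": json.dumps({"description": description})},
            )
        resp.raise_for_status()
        return resp.json()

    add_monthly_file("data_2025-10.csv", "October 2025 monthly update")
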
Or you can take advantage of the "file replacement" mechanism in Dataverse. That way the number of files in the latest published version stays manageable, but for every file users can still trace the history of updates and additions, and access archived copies of the file from specific past versions of the dataset.
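A minimal sketch of that variant, using the native API's file replace endpoint (FILE_ID is a placeholder for the database ID of the file being replaced; forceReplace allows the replacement to have a different content type):

    import json
    import requests

    SERVER = "https://dataverse.example.edu"   # placeholder: your installation
    API_TOKEN = "xxxxxxxx-xxxx"                # placeholder: a depositor's API token

    def replace_file(file_id, path):
        """Replace an existing file; prior copies stay accessible in past dataset versions."""
        with open(path, "rb") as fh:
            resp = requests.post(
                f"{SERVER}/api/files/{file_id}/replace",
                headers={"X-Dataverse-key": API_TOKEN},
                files={"file": fh},
                data={"jsonData": json.dumps({"forceReplace": True})},
            )
        resp.raise_for_status()
        return resp.json()
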
Here's a real-life example: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VH6GVH. Note that this happens to be the dataset with the highest number of published versions at Harvard (425 as of this writing!). There are 10 files in the latest published version, but 8 of them have been continuously replaced with new versions (almost) every time a new version of the dataset was published.
Please note that there are known performance issues with datasets that have hundreds of versions. For example, with the dataset above, you can easily access specific versions directly in the UI (e.g., https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VH6GVH&version=301.0); however, if you click on the "Versions" tab, which tracks the overall revision history of the dataset, it will take an appreciable amount of time to populate that view. But if you are planning to publish monthly, it should take some years before you accumulate enough versions to cause such problems... and by then we will, hopefully, have it all resolved.
I'll try to look for some scripts that may have been created for similar workflows in the past. You may also try contacting the author of the dataset above (note that he has a few more datasets in the collection - https://dataverse.harvard.edu/dataverse/layline - with similar, if slightly less frightening, numbers of versions). Chances are he'll be willing to share the API scripts he's been using.
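In the meantime, whichever packaging you pick, a scripted monthly cycle ends by publishing the draft via the native API's publish endpoint; a minimal sketch with the same placeholders as above ("minor" turns, e.g., 1.0 into 1.1, "major" turns it into 2.0):

    import requests

    SERVER = "https://dataverse.example.edu"   # placeholder: your installation
    DOI = "doi:10.70122/FK2/EXAMPLE"           # placeholder: the dataset's persistent ID
    API_TOKEN = "xxxxxxxx-xxxx"                # placeholder: a depositor's API token

    resp = requests.post(
        f"{SERVER}/api/datasets/:persistentId/actions/:publish",
        params={"persistentId": DOI, "type": "minor"},  # or "major" for a new x.0 version
        headers={"X-Dataverse-key": API_TOKEN},
    )
    resp.raise_for_status()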

All the best,
-Leo

Lora Leligdon

Oct 13, 2025, 9:04:00 AM
to Dataverse Users Community
Thank you for your reply, Leo! I hadn't considered the performance issues with large numbers of versions, but that makes sense. Thanks for sharing this example; it will help us set our institutional policy.

Best,
Lora