Ingest solutions?

Pedro Luis

Aug 29, 2025, 10:31:28 AM
to Dataverse Users Community
Hi everyone,

We're uploading 300MB files to Dataverse in CSV format. However, the ingest time for these files is over an hour each, which we consider a long time for a simple CSV file to process.

The server is virtualized with 2 vCPUs and 12GB of memory, but the questions are:
- How can I get better performance for files in this format?
- Is disabling ingest for these files a best practice?
- What are the default limits if no limit is set in TabularIngestSizeLimit?

Thank you all in advance.

Pedro Luís
FABICO - UFRGS

Philip Durbin

Sep 4, 2025, 2:15:11 PM
to dataverse...@googlegroups.com
Hi Pedro,

https://github.com/IQSS/dataverse/issues/8954 was originally about slow ingest of SPSS files but more recently a comment was added about slow CSV (and Stata) ingest as well. You're welcome to check out those observations.

I think a dedicated issue about CSV ingest speed would be nice. If you feel like creating one, please go ahead.

Off the top of my head, I don't have any suggestions for how to speed it up apart from maybe trying to throw more hardware at the problem (more CPUs and more memory), but I don't know if this would make a difference or not. For Harvard Dataverse we've considered setting up a dedicated server for ingest: https://github.com/IQSS/dataverse.harvard.edu/issues/111

Certainly disabling ingest, as you say, is an option. This is what I sometimes do when I want to quickly load up sample data from https://github.com/IQSS/dataverse-sample-data in a development environment. As you may know, you can always force ingest for a particular file later via API: https://guides.dataverse.org/en/6.7.1/api/native-api.html#reingest-a-file
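
For anyone who wants to script the "ingest later" step, here is a minimal Python sketch of the reingest call described in the guide linked above. It only builds the request and does not send it. The function name, the server URL, the file id, and the token placeholder are assumptions for illustration, and the guide notes that this endpoint requires a superuser API token.

```python
import urllib.request


def build_reingest_request(server_url: str, file_id: int, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/files/{id}/reingest for one file."""
    url = f"{server_url.rstrip('/')}/api/files/{file_id}/reingest"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"X-Dataverse-key": api_token},  # API token header used by the native API
    )


if __name__ == "__main__":
    # Hypothetical values: replace with your own server, database file id, and token.
    req = build_reingest_request(
        "https://demo.dataverse.org", 24, "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Keeping the request construction separate from sending makes it easy to review the URL before anything touches the server.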

The docs at https://guides.dataverse.org/en/6.7.1/installation/config.html#tabularingestsizelimit are not particularly clear*, but out of the box there is no limit. That is, Dataverse will always try to ingest a file, no matter how large.
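
If you do want a limit, the linked guide sets it in bytes through the admin settings API. Below is a hedged Python sketch of that PUT request. The 200 MB threshold and the function name are assumptions chosen only so that 300MB files would skip ingest, and the admin API is normally reachable only from the server itself (localhost).

```python
import urllib.request


def build_size_limit_request(limit_bytes: int,
                             admin_base: str = "http://localhost:8080") -> urllib.request.Request:
    """Build (but do not send) a PUT that sets :TabularIngestSizeLimit, in bytes."""
    url = f"{admin_base.rstrip('/')}/api/admin/settings/:TabularIngestSizeLimit"
    # The setting value is sent as the plain request body, e.g. b"200000000".
    return urllib.request.Request(url, data=str(limit_bytes).encode("ascii"), method="PUT")


if __name__ == "__main__":
    # Illustrative threshold only: files larger than 200 MB would not be ingested.
    req = build_size_limit_request(200 * 1000 * 1000)
    print(req.get_method(), req.full_url, req.data)
    # To actually send it (on the Dataverse host): urllib.request.urlopen(req)
```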

Currently, the ingest functionality of Dataverse is part of the monolith, the main app. We've talked on and off about splitting ingest off into its own service: https://github.com/IQSS/dataverse/issues/7852

If you can somehow process your CSV files in the same way outside of Dataverse and can construct a DDI (XML) file to feed into Dataverse, you can upload it with this API: https://guides.dataverse.org/en/6.7.1/api/native-api.html#editing-variable-level-metadata . This is a bit theoretical, but if we were to split off the ingest functionality of Dataverse into a separate service, I believe that service would call this API to send the summary statistics, etc. to Dataverse.

I hope this helps! Please keep the questions coming!

Thanks,

Phil

* Don't worry, the docs are being clarified in https://github.com/IQSS/dataverse/pull/11654


Victoria Lubitch

Sep 5, 2025, 12:02:50 PM
to Dataverse Users Community
Just a note regarding the editing variable metadata API, https://guides.dataverse.org/en/6.7.1/api/native-api.html#editing-variable-level-metadata . It does not update summary statistics or any of the other data produced during ingest. It assumes that the file has already gone through ingest, but it lets you add or update variable metadata that is not created during ingest, such as questions (literal, interview, etc.), weights, weighted frequencies, and notes. These are all the things that Data Explorer can update.
It is possible to upload a file through the API with tabIngest set to false, though: https://guides.dataverse.org/en/6.7.1/api/native-api.html#add-a-file-to-a-dataset
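
A rough Python sketch of such an upload, following the add-a-file guide linked above: the file goes in a multipart form together with a jsonData field that sets tabIngest to "false". The function name, the DOI, and the file name are hypothetical, and the sketch holds the whole file in memory, which may not be what you want for 300MB files.

```python
import json
import urllib.parse
import urllib.request
import uuid


def build_upload_request(server_url: str, persistent_id: str, api_token: str,
                         filename: str, content: bytes) -> urllib.request.Request:
    """Build (but do not send) a multipart POST that adds a file with ingest skipped."""
    boundary = uuid.uuid4().hex
    json_data = json.dumps({"tabIngest": "false"})  # ask Dataverse not to ingest this file
    body = b"".join([
        f"--{boundary}\r\n".encode("ascii"),
        b'Content-Disposition: form-data; name="jsonData"\r\n\r\n',
        json_data.encode("utf-8"),
        b"\r\n",
        f"--{boundary}\r\n".encode("ascii"),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode("utf-8"),
        b"Content-Type: text/csv\r\n\r\n",
        content,
        b"\r\n",
        f"--{boundary}--\r\n".encode("ascii"),
    ])
    query = urllib.parse.urlencode({"persistentId": persistent_id})
    url = f"{server_url.rstrip('/')}/api/datasets/:persistentId/add?{query}"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Dataverse-key": api_token,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


if __name__ == "__main__":
    # Hypothetical dataset DOI, file name, and token.
    req = build_upload_request(
        "https://demo.dataverse.org", "doi:10.5072/FK2/ABCDEF",
        "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", "data.csv", b"a,b\n1,2\n",
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```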

Philip Durbin

Sep 5, 2025, 1:26:06 PM
to dataverse...@googlegroups.com
Ah, so that editing-variable-level-metadata API wouldn't work after all. Bummer. Thanks for the clarification, Victoria! And for adding that API. 😄
