Ingest of Stata files (dta)


Philipp at UiT

May 19, 2020, 12:02:31 PM
to Dataverse Users Community
Yesterday, one of our researchers uploaded a 600 MB Stata file (.dta). Ingest ran for about a day before Dataverse finally reported that the file had been ingested successfully (1153 variables, 510133 observations).
>> Has anyone experienced similarly long ingest periods?

I also uploaded the same file to the Harvard Demo Dataverse (https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/HLHYQG; Phil: I gave you curator access to the dataverse). There, file ingest was "completed" much faster (a few minutes), but afterwards, I got the following message:
"Tabular data ingest failed. Ingest failed to produce Summary Statistics and/or UNF signatures; /tmp/tempTabfile.6815203645101011389. (No such file or directory)"
>> Any idea what went wrong?

Best, Philipp

Philip Durbin

May 19, 2020, 1:15:48 PM
to dataverse...@googlegroups.com
I was able to download the file and get ingest started on my laptop but I only let it run for half an hour so I don't know if it would have completed or not.

A 600 MB Stata file strikes me as somewhat large (half a million observations, like you said) but I'd be curious to hear what's common, if people use the :TabularIngestSizeLimit setting, etc.
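
For reference, :TabularIngestSizeLimit is a database setting (a threshold in bytes above which tabular ingest is not attempted) that is changed through the admin settings API. Below is a minimal Python sketch that only builds the request and does not send it. The base URL http://localhost:8080 and the 500 MB value are illustrative assumptions, and build_ingest_limit_request is a hypothetical helper name, not part of Dataverse.

```python
import urllib.request


def build_ingest_limit_request(limit_bytes, base_url="http://localhost:8080"):
    """Build (but do not send) a PUT request that sets :TabularIngestSizeLimit.

    base_url is an assumption: the admin API is typically only reachable
    from the Dataverse server itself.
    """
    url = f"{base_url}/api/admin/settings/:TabularIngestSizeLimit"
    body = str(limit_bytes).encode("ascii")  # the setting value is sent as the request body
    return urllib.request.Request(url, data=body, method="PUT")


# Illustrative value only: skip tabular ingest for files larger than 500 MB.
req = build_ingest_limit_request(500 * 1000 * 1000)
print(req.get_method(), req.full_url, req.data.decode())

# To apply it on a server you administer:
# urllib.request.urlopen(req)
```

With a limit like this, larger files would still be uploaded and stored, just not ingested as tabular data.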


Christian Bischof

Dec 6, 2024, 7:48:18 AM
to Dataverse Users Community
We have now also seen an extremely long ingest time: two Stata .dta files took 23 hours (each 123 MB, 6221 variables, 11184 observations; https://data.aussda.at/dataset.xhtml?persistentId=doi%3A10.11587%2FHNUFCC&version=1.0 ). Is this normal, and is there a configuration setting that can be changed? Because of data protection regulations, I can't test the files on https://demo.dataverse.org

Thanks, Christian

Philip Durbin

Dec 6, 2024, 3:29:39 PM
to dataverse...@googlegroups.com
Wow, that's slow. https://github.com/IQSS/dataverse/issues/8954 is about SPSS ingest being slow. I'm not aware of Stata ingest being slow as well. Can you please create an issue?

Also, if anyone out there has a public file we can test with, please let us know.

Amber Leahey

Dec 12, 2024, 4:18:34 PM
to Dataverse Users Community
ohhh here is a big SPSS file 

How can ingest be optimized for large files? Can it be given more compute resources? 

Philip Durbin

Dec 12, 2024, 7:35:29 PM
to dataverse...@googlegroups.com
Thanks, Amber, I added a link to that 12 hour SPSS file to the "SPSS ingest is slow" issue: https://github.com/IQSS/dataverse/issues/8954#issuecomment-2540284364 (so we have data to work with some day when we work on it).

Christian Bischof

Dec 17, 2024, 3:48:07 AM
to Dataverse Users Community

I have now generated two test datasets (uniformly distributed random values 0-10, without variable or value labels), saved each as dta, sav, and csv, and ingested them on two of our test servers (dv03, dv06). The attached Stata code generates the test data.
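
For anyone without Stata, here is a rough Python sketch of comparable test data in CSV form. It is not the attached gen_test_data.do. The function name write_test_csv, the column names v1..vN, the fixed seed, and the use of continuous values with six decimals are all assumptions made for illustration.

```python
import csv
import random


def write_test_csv(path, n_obs, n_vars, seed=0):
    """Write n_obs rows by n_vars columns of uniform random values in [0, 10].

    Column names (v1, v2, ...) and the six-decimal format are assumptions;
    no variable or value labels are written, since CSV has none.
    """
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"v{i}" for i in range(1, n_vars + 1)])
        for _ in range(n_obs):
            writer.writerow([f"{rng.uniform(0, 10):.6f}" for _ in range(n_vars)])


# Small demonstration with the same row/column ratio as the first test file;
# use n_obs=11000, n_vars=6200 for the full-size shape.
write_test_csv("test_data_small.csv", n_obs=110, n_vars=62)
```

Swapping n_obs and n_vars gives the transposed shape.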

11000 observations, 6200 variables:

File                         Size     dv03     dv06
test_data_11000x6200v.dta    270MB    14h10    13h40
test_data_11000x6200v.sav    533MB    6h11     7h55
test_data_11000x6200v.csv    139MB    6h36     7h52

6200 observations, 11000 variables:

File                         Size     dv03     dv06
test_data_6200x11000v.dta    273MB    21h58    28h27
test_data_6200x11000v.sav    533MB    11h46    14h54
test_data_6200x11000v.csv    139MB    14h1     14h53

Interestingly, dta takes much longer than sav and csv, and sav is about as fast as csv or even faster. If the data matrix is transposed (more variables than observations), ingest time increases significantly.
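
To put numbers on the effect of transposing, the reported durations can be converted to minutes and compared. This is a small Python sketch using only the timings listed above; to_minutes is a hypothetical helper, and reading "14h1" as 14 hours 1 minute is an assumption.

```python
import re


def to_minutes(duration):
    """Convert a string like '14h10' to minutes ('14h1' is read as 14 h 1 min)."""
    hours, minutes = re.fullmatch(r"(\d+)h(\d+)", duration).groups()
    return int(hours) * 60 + int(minutes)


# (shape, format) -> (dv03, dv06), copied from the timings reported above.
timings = {
    ("11000x6200", "dta"): ("14h10", "13h40"),
    ("11000x6200", "sav"): ("6h11", "7h55"),
    ("11000x6200", "csv"): ("6h36", "7h52"),
    ("6200x11000", "dta"): ("21h58", "28h27"),
    ("6200x11000", "sav"): ("11h46", "14h54"),
    ("6200x11000", "csv"): ("14h1", "14h53"),
}

# Ratio of ingest time for the transposed shape to the original shape.
ratios = {}
for fmt in ("dta", "sav", "csv"):
    for i, server in enumerate(("dv03", "dv06")):
        original = to_minutes(timings[("11000x6200", fmt)][i])
        transposed = to_minutes(timings[("6200x11000", fmt)][i])
        ratios[(fmt, server)] = transposed / original
        print(f"{fmt} on {server}: transposed takes {ratios[(fmt, server)]:.2f}x as long")
```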

Christian Bischof

Dec 17, 2024, 3:51:27 AM
to Dataverse Users Community
gen_test_data.do