Replacing tabular data files

Jennifer Doty

Oct 5, 2017, 3:51:07 PM
to Dataverse Users Community
I'm running into some errors when trying to use the "replace files" feature to update existing data files in a published dataset. This is in the Emory Dataverse within the UNC Dataverse, v.4.7.1.

File #1: CSV format when originally uploaded in 2016 and is listed as "plain text" format in the file list. When I tried to replace it today with an updated version of the CSV file, it told me they weren't the same format, but let me click the "Continue" button before throwing an error: "Dataset Save Failed add.add_file_error (see logs)." I'm not sure if this is related to newer versions of Dataverse now treating CSV files as tabular data on ingest, so I moved on to another file to see if the experience would be different.

File #2: SPSS (.sav) format when originally uploaded in 2016, listed as "tabular data" and able to be explored within Dataverse. When I tried to replace it with an updated version of the SAV file, it also told me they weren't the same format. Again I clicked "Continue" and this time Dataverse acted like it was ingesting the file before it stalled and threw an error that the file upload was aborted.

Also interesting to note is that while I was attempting to replace file #2, file #1 totally disappeared from the file list. This freaked me out a bit, so after the second error popped up, I backed out of this dataset update attempt and chose to just delete the draft dataset so as to preserve the integrity of the originally published dataset.

Am I missing something important about using the "replace files" feature? Would it be better to just stick to adding updated files as new files and using the metadata to indicate which files are the new versions? Any feedback is much appreciated.

Thanks!
Jen Doty

julian...@g.harvard.edu

Oct 5, 2017, 5:58:47 PM
to Dataverse Users Community
Hi Jen!

Thanks for all of the great detail! Just writing to say I'm trying to reproduce what you've described with all of the variables you've described, on demo.dataverse, which is also running 4.7.1. But from reading the issues, it does sound like a bug, and not what's intended. I'd have to defer to our developers to say if it's related to changes to how csv files are ingested.

When you write "Would it be better to just stick to adding updated files as new files and using the metadata to indicate which files are the new versions?", is this what you've had to do for other datasets? If so, could you link to any datasets where this has been done?

Best,
Julian

julian...@g.harvard.edu

Oct 5, 2017, 6:45:20 PM
to Dataverse Users Community
Just realized that saying "I'm trying to reproduce what you've described with all of the variables you've described" might be confusing since we're talking about tabular data, and I'm not the punny type... I meant I'm trying to reproduce what you've described with the same conditions, but I'm not able to on demo.dataverse:

On demo.dataverse I was able to replace a csv file, which was published, uningested, in 2015 as plain text, with another csv file. I clicked continue on the warning message that the file types are different (one plain text, one tabular) but didn't get an error. The new csv was ingested, I was able to publish the new dataset version, and the previous version had the previous plain text file.

When you tried with the spss files and got the warning that the replacement file wasn't the same format, was it that "The original file (Tab-Delimited) and replacement file (SPSS SAV) are different file types"? If the original file was ingested, I'm assuming its file type would be tab-delimited. Just wanted to make sure. I was able to replace an ingested .sav file, which had been transformed to a tab file, with a different .sav file without errors or any stalling.
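For anyone who would rather script the replacement than click "Continue" in the UI, here is a rough sketch of how the call could be put together. It assumes the native API's file replace endpoint (`/api/files/{id}/replace`) and its `forceReplace` flag are available on your version, which is worth checking against the API guide for your installation. The file id `42` and the token are made-up placeholders, and the sketch only builds the pieces of the request without sending anything.

```python
import json


def build_replace_call(file_id, api_token, force_replace,
                       base_url="https://demo.dataverse.org"):
    """Assemble the URL, headers and form field for a file replace call.

    Assumption: setting forceReplace to True plays the same role as clicking
    "Continue" on the "different file types" warning in the UI.
    """
    url = f"{base_url}/api/files/{file_id}/replace"
    headers = {"X-Dataverse-key": api_token}
    form_fields = {"jsonData": json.dumps({"forceReplace": force_replace})}
    return url, headers, form_fields


# Hypothetical file id and API token, for illustration only.
url, headers, fields = build_replace_call(42, "xxxxxxxx-xxxx", force_replace=True)
print(url)
print(fields["jsonData"])
```

To actually send it, the replacement file would go in the same multipart POST under a `file` field, alongside `jsonData`.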

Jennifer Doty

Oct 6, 2017, 12:19:08 PM
to Dataverse Users Community
Thanks for looking into this, Julian! The format warning I got was the same text, "The original file (Tab-Delimited) and replacement file (SPSS SAV) are different file types." I don't know if file size could be relevant to this issue as well. I should have mentioned that both files were close to 30MB each and the ingest process was running for several minutes before the error messages displayed. Would you recommend trying the replace function with these particular files myself in demo.dataverse?

And in answer to the question in your first reply, I don't have an example dataset with updated files added as new files. This is the first time it's come up for this particular dataset (http://dx.doi.org/10.15139/S3/12193). In talking to the data creators, they were interested in trying the replace option so as to avoid confusion for data users.

Gautier, Julian

Oct 6, 2017, 1:33:05 PM
to dataverse...@googlegroups.com, d...@email.unc.edu, Sonia Barbosa
Hi Jen,

Yes, please feel free to try those larger files on demo.dataverse. Thanks for the info. Since the large .sav file in the dataset you linked to was already ingested, I'm assuming that any file size limits that the UNC Dataverse places on different tabular file types wouldn't be the problem. But I've CC'ed Don Sizemore from UNC just in case.

And I completely agree about using file replace in this case to not confuse users. (Considering the planned support for file PIDs, I'm curious if adding updated files as new files and using the metadata to indicate which files are the new versions is commonly done.)

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-community+unsub...@googlegroups.com.
To post to this group, send email to dataverse-community@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/8d821066-9284-4595-abf6-e02ee95cfa79%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Julian Gautier
Product Research Specialist, IQSS

Don Sizemore

Oct 6, 2017, 1:50:32 PM
to dataverse...@googlegroups.com, d...@email.unc.edu, Sonia Barbosa
Good afternoon,

UNC Dataverse doesn't set any upload limit in Glassfish or Apache, or at least we haven't done so on purpose:
"Setting :MaxFileUploadSizeInBytes not found"
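For anyone wanting to run the same check on their own installation, a minimal sketch of the lookup follows. It assumes the admin settings API is reachable at `http://localhost:8080` (the usual default, but an assumption here) and that a missing setting comes back as an HTTP 404. The shape of the JSON response is also an assumption, so treat `read_setting` as a starting point rather than a reference.

```python
import json
import urllib.error
import urllib.request

# Assumption: the admin API listens on localhost:8080; adjust as needed.
BASE_URL = "http://localhost:8080"


def setting_url(name, base_url=BASE_URL):
    """Build the admin API URL for a setting such as ':MaxFileUploadSizeInBytes'."""
    return f"{base_url}/api/admin/settings/{name}"


def read_setting(name, base_url=BASE_URL):
    """Return the setting's value, or None if the server says it is not set."""
    try:
        with urllib.request.urlopen(setting_url(name, base_url)) as resp:
            body = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumed to mean "Setting ... not found"
            return None
        raise
    # Assumed response shape: {"status": "OK", "data": {"message": "<value>"}}
    return body.get("data", {}).get("message")


print(setting_url(":MaxFileUploadSizeInBytes"))
```

Calling `read_setting(":MaxFileUploadSizeInBytes")` on the server itself should then return `None` when no upload limit has been configured.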

Jen: if you can get me some approximate times of the upload failures, I can get the exact error messages. Mandy says she's unable to recreate the problem, so if you can get us the replacement files, we can try to wedge them in.

Julian: our archivist Mandy has been having trouble with a few CSV and R files in the 200MB range. We're happy to send you links if you need examples of problem files (ingest of these files chokes on 4.8 as well).

Donald

Jennifer Doty

Oct 6, 2017, 2:32:42 PM
to Dataverse Users Community
Thanks for checking on the UNC end, Don. My replace attempts were around 3-3:30pm yesterday. I will also send you the replacement files.

Julian, I successfully replaced one SAV file with another in the demo dataverse, so I also couldn't replicate my error there.

julian...@g.harvard.edu

Oct 6, 2017, 3:26:46 PM
to Dataverse Users Community
Hi,

Hmm, thanks for testing, Jen.

Donald, thanks for the help. It would be great if you could send links to those large files. I think the limit on Harvard Dataverse for any tabular files is 2.6GB (developers correct me if I'm wrong :), so I think it would help to see why 200MB files are a problem.

Don Sizemore

Oct 6, 2017, 3:49:18 PM
to dataverse...@googlegroups.com
Julian,

Jen shared the files with us in a Box folder, and the problem now isn't with upload but with ingest:
"Unknown exception occurred  during ingest (supressed stack trace); re-setting ingest status."

Jen notes that the original dataset was uploaded when UNC was still running 4.5, though now we're on 4.7.1 to match demo.dataverse.org (where the same file both uploads and ingests properly).

Still investigating on this end.

Two particular problem files may be found at https://demo.dataverse.org/dataset.xhtml?persistentId=doi%3A10.5072%2FFK2%2FMMHB4B once you log in as admin, or you can look for them on the filesystem. They don't seem to ingest, even after what I think is 9 days now on one of our 4.8 servers.

Donald

julian...@g.harvard.edu

Oct 6, 2017, 6:00:27 PM
to Dataverse Users Community
Thanks, I see the "ingest in progress" messages on that dataset on demo.dataverse (so the dataset is locked). There might be several issues, so I'll try separating them by the two types of files:

R files
There is no file size limit on demo for ingesting any type of tabular files, but on Harvard Dataverse the limit for R files is set to 1 MB. I uploaded the R file again on demo.dataverse (after it was just updated to 4.8), and I see the same "ingest in progress" message for the R file (so the dataset is locked).

When I uploaded the R file on Harvard Dataverse, it doesn't ingest because of the 1 MB limit (but after uploading, there's no messaging about why it didn't ingest, an issue with messaging captured in this ticket). The 1 MB limit was set at Harvard Dataverse because of known issues with ingesting large R files. Perhaps the limit at UNC (and on demo.dataverse) should also be set to 1 MB?
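If an installation did want to try a 1 MB cap on R file ingest, a minimal sketch of the admin call might look like the following. It assumes the format-specific setting is named `:TabularIngestSizeLimit:Rdata`, that the admin API is on `http://localhost:8080`, and that "1 MB" means 1,000,000 bytes. All three are assumptions to verify against the installation guide for your version. The request is only built here, not sent.

```python
import urllib.request


def build_limit_request(fmt, limit_bytes, base_url="http://localhost:8080"):
    """Build (but do not send) a PUT that sets a per-format ingest size limit.

    Assumption: the setting name follows ':TabularIngestSizeLimit:<format>'.
    """
    url = f"{base_url}/api/admin/settings/:TabularIngestSizeLimit:{fmt}"
    return urllib.request.Request(
        url, data=str(limit_bytes).encode("ascii"), method="PUT"
    )


# 1 MB taken as 1,000,000 bytes here; an assumption, not a confirmed value.
req = build_limit_request("Rdata", 1_000_000)
print(req.get_method(), req.full_url)

# To apply it on the server itself, uncomment the next line:
# urllib.request.urlopen(req)
```

Files over the limit would still upload; they would just be skipped by tabular ingest, which matches what was seen on Harvard Dataverse.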

CSV files
There's basically no file size limit for ingesting CSV files on demo or Harvard Dataverse. Interestingly, while 4.7 demo.dataverse froze when trying to ingest the ~200 MB csv file, 4.8 demo.dataverse and 4.8 Harvard Dataverse upload it successfully but give this error: "Invalid header rows. One of the cells is empty." (This might be because of the 4.8 csv ingest changes that Jen mentioned.)

The first column in that csv is blank, so I added a value to that cell, saved it as a second copy and uploaded it to 4.8 demo.dataverse. It's been trying to ingest for about an hour now. (I should note too that this second csv file with presumably valid header rows is much smaller (159 MB) than the original csv.)
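A quick way to spot this before uploading is to look only at the first row of the file. The sketch below is a standalone check written for illustration, not the validation Dataverse itself runs; the function name and the sample data are made up.

```python
import csv
import io


def empty_header_cells(stream):
    """Return the 0-based positions of header cells that are empty or blank.

    Only the first row is read, so this stays fast even on very large files.
    """
    reader = csv.reader(stream)
    header = next(reader, [])
    return [i for i, cell in enumerate(header) if not cell.strip()]


# Made-up sample whose first header cell is empty.
sample = ",age,income\n1,34,52000\n2,29,48000\n"
print(empty_header_cells(io.StringIO(sample)))  # [0]
```

For a real file, pass an open handle instead, for example `open(path, newline="")`.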

So maybe there are issues in 4.8 with ingesting large csv files?
