Tabular ingest request to keep CSV files in CSV format

Lena Schmidt

Jul 24, 2020, 1:02:37 PM
to Dataverse Users Community

Dear all,

I recently ran into a problem when uploading CSV files to Dataverse and was redirected to this forum for discussion. Background: I tried to upload a CSV file whose columns contain natural language (text, tabs, HTML tags, any character really). My intention in uploading this to Dataverse was to make the files publicly available.

Unfortunately, the ingest process seemed to destroy my files: columns got broken up, possibly due to the conversion to tab-separated format. The online document viewer was unable to display the data correctly, and when I downloaded the ingested data it was scrambled. The only way to upload and preserve my data was to upload it in proprietary Excel format, which triggered a failure of the ingest process but also allowed the file to remain unchanged.

Is it possible at all to keep CSV files in CSV format? I guess I am asking not only for myself, but also for everyone doing natural language processing research who wants to share datasets similar to mine.

All the best,
Lena

danny...@g.harvard.edu

Jul 24, 2020, 2:09:57 PM
to Dataverse Users Community
Hi Lena,

The administrators of the Dataverse installation have the option to uningest the file (http://guides.dataverse.org/en/latest/api/native-api.html#uningest-a-file) using an API, but this is not something available through the UI. 
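For administrators, the uningest call linked above can be prepared roughly as follows. This is a minimal sketch in Python, assuming the endpoint form `POST /api/files/{id}/uningest` and the `X-Dataverse-key` header described in the linked native API guide. The helper name `build_uningest_request`, the server URL, the file ID, and the token are placeholders for illustration, not values from this thread.

```python
import urllib.request


def build_uningest_request(server_url: str, file_id: int, api_token: str) -> urllib.request.Request:
    """Prepare (but do not send) an uningest request for one file.

    Assumption: the installation exposes POST /api/files/{id}/uningest
    and accepts the API token in the X-Dataverse-key header.
    """
    url = f"{server_url.rstrip('/')}/api/files/{file_id}/uningest"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"X-Dataverse-key": api_token},
    )


# Hypothetical values; replace with a real server, file ID, and admin token.
req = build_uningest_request("https://demo.dataverse.org/", 42, "xxxx-xxxx")
print(req.get_method(), req.full_url)

# To actually send it (requires a token with sufficient privileges):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

The request is only built here, not sent, so the sketch can be checked without touching a live installation.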

This is a challenging topic that comes up from time to time. In the vast majority of cases, the ingest process provides great value for reproducibility and interoperability, but there are some cases where it's a negative. Instead of trying to build a workflow that forces a depositor to choose at deposit time, we've made the decision to make ingestion the default and use administrative action (through the API) to clear up the cases where it's not appropriate. As Dataverse gains adoption in other disciplines, this may be something that's revisited. Please note that the original file is always available, and that any formats created through the ingestion process are additional formats available for download/access. 

Hope this helps,

Danny

Philip Durbin

Jul 28, 2020, 4:16:25 PM
to dataverse...@googlegroups.com
To reinforce what Danny's saying, the original csv file is still there, but I understand your frustration.
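One way to fetch the untouched original of an ingested file is through the Data Access API with the `format=original` query parameter. The sketch below only builds the download URL. The helper name `original_file_url`, the server URL, and the file ID are hypothetical, and a restricted or unpublished file would additionally need an API token.

```python
from urllib.parse import urlencode


def original_file_url(server_url: str, file_id: int) -> str:
    """Build the URL that requests the file as originally uploaded.

    Assumption: the installation serves files at /api/access/datafile/{id}
    and honours format=original for ingested tabular files.
    """
    query = urlencode({"format": "original"})
    return f"{server_url.rstrip('/')}/api/access/datafile/{file_id}?{query}"


# Hypothetical server and file ID, for illustration only.
print(original_file_url("https://demo.dataverse.org", 42))
```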

One thought is that we have a "sample data" repo on GitHub that we use to collect files for testing and demos. If you are able to contribute a file to test with, please open an issue at https://github.com/IQSS/dataverse-sample-data/issues

Thanks,

Phil

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/1798ffd7-f093-4206-b910-58cb2365d2e2n%40googlegroups.com.


Janet McDougall - Australian Data Archive

Aug 2, 2020, 9:27:16 PM
to Dataverse Users Community
Hi all,
To get around ingest where we don't want it, we double-zip files. Dataverse unzips only once, so the files sit on Dataverse as a zipped file in their original format and are downloaded as a zipped file.
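The double-zip workaround can be scripted. The following is a rough sketch using Python's standard `zipfile` module; the function name `double_zip` and the naming of the inner archive (`<filename>.zip`) are assumptions made for illustration, not part of the workflow described here.

```python
import io
import zipfile


def double_zip(filename: str, data: bytes) -> bytes:
    """Wrap a file in a zip archive, then wrap that archive in a second zip.

    If the outer archive is unpacked once on upload, the inner zip
    (holding the file in its original format) is what remains stored.
    """
    # Inner archive: contains the original file, byte for byte.
    inner_buf = io.BytesIO()
    with zipfile.ZipFile(inner_buf, "w", zipfile.ZIP_DEFLATED) as inner:
        inner.writestr(filename, data)

    # Outer archive: contains only the inner archive.
    outer_buf = io.BytesIO()
    with zipfile.ZipFile(outer_buf, "w", zipfile.ZIP_DEFLATED) as outer:
        outer.writestr(filename + ".zip", inner_buf.getvalue())

    return outer_buf.getvalue()


# Example with made-up content; write the result to disk before uploading.
payload = double_zip("corpus.csv", b"id,text\n1,hello\n")
with open("corpus.csv.zip.zip", "wb") as fh:
    fh.write(payload)
```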
Thanks
Janet

Lena Schmidt

Aug 7, 2020, 2:21:48 PM
to dataverse...@googlegroups.com
Hi,

Double zipping sounds like a great idea! It did not seem to me that the original file was still available after ingestion and the format change. I'll make the file available on the GitHub repo, as suggested in one of the earlier messages.

Thank you all for the kind feedback!
Lena

Philipp at UiT

Aug 12, 2020, 6:48:31 AM
to Dataverse Users Community
Hi Lena,

In the Tromsø Repository of Language and Linguistics (TROLLing; https://info.trolling.uit.no/), we require depositors to deposit their data in plain text format with Unicode UTF-8 encoding. Instead of CSV, we recommend tab-separated files with .txt as the file extension. That way, the files are not ingested in Dataverse, but are kept unchanged. You are welcome to submit your data to TROLLing and/or contact us with any questions.
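For data whose text fields themselves contain tabs, HTML, or quotes, a tab-separated file still needs some quoting so the columns survive a round trip. Below is a minimal sketch with Python's standard `csv` module; the helper names `rows_to_tsv` and `tsv_to_rows` are hypothetical, and whether quoted fields are acceptable to a given repository is an assumption to check with its curators.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialise rows as tab-separated text, quoting only fields that need it."""
    buf = io.StringIO(newline="")
    writer = csv.writer(
        buf,
        delimiter="\t",
        quoting=csv.QUOTE_MINIMAL,  # quote fields containing tabs, quotes, or newlines
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


def tsv_to_rows(text):
    """Parse tab-separated text produced by rows_to_tsv back into rows."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))


# Made-up sample rows with a tab, HTML tags, and quotes inside one field.
sample = [["id", "text"], ["1", 'a\tb <b>bold</b> "quoted"']]
tsv_text = rows_to_tsv(sample)

# Save as UTF-8 with a .txt extension.
with open("sample.txt", "w", encoding="utf-8", newline="") as fh:
    fh.write(tsv_text)
```

Reading the file back with the same delimiter and default quoting rules is intended to recover the original fields, including the embedded tab.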

Best, Philipp