First row of csv file deleted at ingest


Philipp at UiT

Jan 3, 2020, 4:46:06 AM
to Dataverse Users Community
We are having problems with the first row of a csv tabular file being deleted at upload/ingest in Dataverse. We haven't experienced this before. Any suggestions on how to avoid this? We are running version 4.17.

Best, Philipp

Philip Durbin

Jan 3, 2020, 7:19:04 AM
to dataverse...@googlegroups.com
Very strange. Do you get the same behavior on https://demo.dataverse.org ?


Philipp at UiT

Jan 3, 2020, 3:48:36 PM
to Dataverse Users Community
Yes, the same behavior on demo.dataverse.org. Tabular csv files created with Norwegian locale settings (e.g. in Excel) are usually semicolon-separated, not comma-separated (we use the comma as the decimal sign). I have replaced the semicolons in the file with commas, and now the first row is no longer removed! This reminded me of problems we have from time to time with ingesting csv files. Maybe the reason for these problems is that our csv files actually use semicolons and not commas as delimiters? In our deposit guide, we recommend that our users provide tabular files as tab-separated plain text files with the extension .txt. However, such files are never ingested properly. Maybe we could have a discussion on what delimiters and extensions ingest should accept?
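
For anyone wanting to do the same replacement without breaking values that contain decimal commas, here is a minimal sketch (just an illustration, not anything Dataverse does itself; I am assuming Apache Commons CSV, and the file names are placeholders). Reading with ';' and writing with ',' lets the library quote fields that contain commas, so the columns stay intact, although those quoted numbers will then probably be typed as character data at ingest.

import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class SemicolonToComma {
    public static void main(String[] args) throws Exception {
        // Read semicolon-delimited input, write comma-delimited output.
        CSVFormat semicolonFormat = CSVFormat.DEFAULT.withDelimiter(';');
        CSVFormat commaFormat = CSVFormat.DEFAULT; // quotes fields that contain commas
        try (Reader in = Files.newBufferedReader(Paths.get("norwegian-locale.csv"), StandardCharsets.UTF_8);
             Writer out = Files.newBufferedWriter(Paths.get("comma-separated.csv"), StandardCharsets.UTF_8);
             CSVParser parser = CSVParser.parse(in, semicolonFormat);
             CSVPrinter printer = new CSVPrinter(out, commaFormat)) {
            for (CSVRecord record : parser) {
                printer.printRecord(record); // CSVRecord is Iterable<String>
            }
        }
    }
}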

Best, Philipp

P.S. Happy New Year to the Dataverse Community!




Philip Durbin

Jan 3, 2020, 4:28:57 PM
to dataverse...@googlegroups.com
I just downloaded a CSV from https://tovare.com/post/norwegian_unemployment/ and sure enough it had semicolons in it (!) and failed ingest with errors like in the attached screenshot ("Tabular data ingest failed. Reading mismatch, line 425 of the Data file: 1 delimited values expected, 2 found."). A similar error appears in server.log: "Ingest failure (IO Exception): Reading mismatch, line 425 of the Data file: 1 delimited values expected, 2 found."

I just took a quick look at the code[1] and we use the same "CSVFileReader" for both CSV and TSV files. The only difference is passing a comma (,) vs. a tab (\t). I would think it would be easy enough to pass a semicolon but the harder part is figuring out when. Right now we assume CSV files have, well, commas. :) It would be easy enough to define a global setting to always pass a semicolon to "CSVFileReader" instead of a comma for CSV files but I'm not sure if you'd want that (you may have some of both).
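
Just to make the "figuring out when" part concrete: one option would be to sniff the delimiter from the header line before handing the file to "CSVFileReader". Below is a purely hypothetical sketch (nothing like this is in the code today) that counts candidate delimiters outside of quotes and picks the most frequent one; mixed collections would still need care, and a per-file or global setting might be the safer route.

import java.io.BufferedReader;
import java.io.IOException;

public class DelimiterSniffer {

    // Guess the delimiter by counting candidate characters (outside quotes)
    // in the first line. Falls back to a comma for empty files. The caller
    // would need to mark()/reset() the reader so the header line is not lost.
    public static char sniff(BufferedReader reader) throws IOException {
        String firstLine = reader.readLine();
        if (firstLine == null) {
            return ',';
        }
        char[] candidates = {',', ';', '\t'};
        char best = ',';
        int bestCount = -1;
        for (char candidate : candidates) {
            int count = 0;
            boolean inQuotes = false;
            for (char c : firstLine.toCharArray()) {
                if (c == '"') {
                    inQuotes = !inQuotes;
                } else if (c == candidate && !inQuotes) {
                    count++;
                }
            }
            if (count > bestCount) {
                bestCount = count;
                best = candidate;
            }
        }
        return best;
    }
}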

Of course, what I'm describing above is "complete failure to ingest" rather than "mostly successful ingest but the first row is missing", which is even stranger. Would you be able to provide a test file for this? It sounds like you're already naming them .txt, which should be able to be uploaded to a GitHub issue just fine.

As a workaround, maybe you could use TSV files. :)

Phil


semicolon.png

Philipp at UiT

Jan 10, 2020, 7:51:13 AM
to Dataverse Users Community
I have just curated another dataset with tabular data. I tried uploading the files in several variants:

tab-separated .tsv
tab-separated .txt
semicolon-separated .csv
semicolon-separated .txt
comma-separated .csv
comma-separated .txt

The only variant resulting in a successful tabular ingest was:
comma-separated .csv

Our deposit guidelines say that files must be uploaded in a preferred file format. In addition, they can be uploaded in the original file format (if that is not a preferred one). For spreadsheets, for example, we say that they should be uploaded as tab-separated text files with the extension .txt (which is the extension Excel usually chooses, at least with Norwegian settings). So we have many datasets with Excel files and corresponding .txt files.

But when I tested the different options above today, I realized that when I upload a comma-separated .csv file in addition to an identical Excel file (which has been correctly ingested), the system recognizes that the content of the two files is identical and renames the .csv file by adding "-1". This is not quite what we want. We haven't been aware of this problem before, because we recommend that our depositors provide tabular data as tab-separated .txt files, and since those are not ingested properly, the system does not rename them when an identical Excel file exists.

You could of course say that in cases where you already have a correctly ingested Excel file, you don't really need to upload an identical plain text file, because a user can download the Excel file as a tab-separated .tab file. BUT my question is then: is this .tab file actually stored in the repository? From what I read in this conversation, the answer is no; it's only produced on the fly. Our preservation policy requires depositors to provide their data in preferred file formats, but for spreadsheet files (and probably other tabular data), the scenario I just described makes it impossible to deposit and store a tabular file both in its original file format (e.g. .xlsx) and as a properly ingested plain text file, without having to rename the properly ingested plain text file. The option of only storing the Excel file and producing a plain text version on the fly is not compliant with our preservation policy, because we want the plain text (= preferred) version actually to be stored on upload.

Any thoughts on this?

Best, Philipp




Philip Durbin

Jan 10, 2020, 8:23:15 AM
to dataverse...@googlegroups.com
I guess my first thought is that it might be nice to have a dataset in the "sample data" collection that illustrates the files that you would ideally upload to a dataset.

To add to the "sample data" you would do something like this:

- Create a dataset on https://demo.dataverse.org
- Upload the files in line with your replication policy
- Export the metadata in Dataverse's native JSON format (see the sketch after this list)
- Create a branch and add the JSON file and the data files under the "data" directory at https://github.com/IQSS/dataverse-sample-data
- Make a pull request
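
For the export step, once the dataset is published on the demo site its native JSON should be retrievable from the metadata export endpoint. A minimal sketch (the DOI below is a placeholder; for an unpublished draft I believe you'd need the versions API with an API token instead):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ExportDatasetJson {
    public static void main(String[] args) throws Exception {
        String persistentId = "doi:10.5072/FK2/EXAMPLE"; // placeholder DOI
        String url = "https://demo.dataverse.org/api/datasets/export"
                + "?exporter=dataverse_json&persistentId=" + persistentId;
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // the dataset's metadata in native JSON
    }
}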

I know this is asking a lot! Fun times with git! Perhaps you could start with the first two steps? My thought is that with the example in the "sample data" we can write automated tests that make assertions on how Dataverse behaves.

Thanks,

Phil


James Myers

Jan 10, 2020, 10:10:18 AM
to dataverse...@googlegroups.com

FWIW, one thought w.r.t. a workaround: for QDR I implemented buttons to uningest files (and to ingest them again if desired). That could be made into a pull request. I think the main QDR use case was to be able to clear errors when ingest fails and to be able to retry after updates (i.e. when we made ingest work on files with more columns), but there may have been cases where the upload set had multiple formats of the same data. I think we talked about making ingest optional to start with for that, but, since QDR is a curated archive, I think we decided that just allowing uningest via the GUI was enough of a solution to drop the priority on going further.
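
If it helps in the meantime, I believe the native API already exposes the same operations for superusers (something like POST /api/files/{id}/uningest and /api/files/{id}/reingest); please double-check the API guide for your version, since the sketch below is written from memory and the file id and token are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UningestFile {
    public static void main(String[] args) throws Exception {
        String fileId = "42";                   // placeholder database id of the file
        String apiToken = "xxxxxxxx-xxxx-xxxx"; // placeholder superuser API token
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://demo.dataverse.org/api/files/" + fileId + "/uningest"))
                .header("X-Dataverse-key", apiToken)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}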

 

I don't want this to get in the way of discussing potentially better solutions (making derived files permanent? making derived files first-class objects linked to the originals instead of auxiliary entries?), but I wanted to mention it in case it might help in the shorter term.

 

-- Jim


Philipp at UiT

Jan 10, 2020, 12:41:07 PM
to Dataverse Users Community
Thanks, Phil and Jim! I'll provide some sample data on Demo, but unfortunately, I won't have time before February :-/
Best, Philipp


Philipp at UiT

Jun 5, 2020, 1:56:34 AM
to Dataverse Users Community
Finally, I got some time to continue on this task.

@Phil: Could you please explain further what the dataset should contain? Do you need one dataset per tabular data file type (e.g. one for .xlsx, and another for tab-separated .txt), or can I upload both into the same dataset?

Thanks!
Philipp

Philip Durbin

Jun 5, 2020, 11:49:58 AM
to dataverse...@googlegroups.com
Hi Philipp,

You can absolutely upload more than one file to the same dataset.

A while back I wrote some contribution guidelines at https://github.com/IQSS/dataverse-sample-data/blob/master/CONTRIBUTING.md

However, they're pretty short. Here are a few more thoughts:

- You are welcome to add a dataverse under the root to help organize your datasets.
- To contribute a dataset, you'll need its JSON representation. You're welcome to create a dataset on the demo site and then export it.
- File hierarchy is supported so please feel free to take advantage of that.
- File metadata (description, tags) is also supported. See "open-source-at-harvard" for an example.

To back up: at a high level, it's useful for the development team to have realistic-looking data to play with. Thanks for any contributions!

Phil


Philipp at UiT

Jun 8, 2020, 2:25:44 AM
to Dataverse Users Community
Thanks Phil. I have added my pull request. Please let me know if I need to change things.
Best, Philipp

Philip Durbin

Jun 8, 2020, 2:14:29 PM
to dataverse...@googlegroups.com
I just did some light editing* and added your sample data alongside the rest. Thanks for the contribution!

