Excel files with multiple sheets (again)


Thomas Jouneau

Sep 23, 2022, 11:53:39 AM
to dataverse...@googlegroups.com

Dear all,

I'm wearing my Recherche Data Gouv hat (our national repository) for this message. We would like to revive an oldish subject: the partial ingestion of Excel files with multiple sheets.

This has been discussed here :

https://github.com/IQSS/dataverse/issues/7452

and what follows is a kind of recap. Feel free to correct us.

As a reminder, Dataverse as of today only ingests the first sheet, if this sheet is correctly structured. 

Some details may escape us, but the point is that only a part of the uploaded file is actually converted and ingested. Maybe there's already something in place within the code, to detect Excel files with multiple sheets?

In any case, the current behaviour causes two problems.

  • First, from the depositor's point of view, the file appears to have been completely uploaded and ingested, when in fact it has not. Of course, there's no hard data loss, since the original Excel file is still available, but there is a risk of misunderstanding. Also, the sheets that have not been ingested are not available in the more interoperable TAB format, which is not good for the FAIR status of the dataset.
  • Second, from the user/downloader's point of view, anyone who chooses to download the converted TAB files gets a truncated dataset. This could lead to misunderstandings, which may be critical if it happens during peer review, for example.

We also saw a ticket here that proposes to automatically decompose Excel files into separate tabs : https://github.com/IQSS/dataverse/issues/8518

We're not sure it's the right way to go :

  • The process suggested is rather complex, error-prone and maybe too high a goal to achieve ;
  • It's not clear what Dataverse would do with the relations, formulas, graphs... that sometimes intrinsically link the sheets together.

We would rather suggest two modest options to alter the current behaviour :

  • warnings displayed during the upload, encouraging users to split their files into separate single-sheet files and warning them that the file has only been partially ingested ;
  • maybe an option to prevent altogether the ingestion of Excel files with multiple sheets.

Both options could be activated through a JVM option or a server-side setting.

We already discourage users from uploading files with multiple sheets. But this effort is not effective without some user guidance in the software itself.

We at Recherche Data Gouv would be happy to contribute development effort on this, maybe in the course of 2023. However, we would first like to gather your opinions.

Best,

-- 
Thomas JOUNEAU
Université de Lorraine
Soutien aux données de la recherche
Direction de la Documentation - Mission appui recherche
B.U. Ile du Saulcy BP 20728
57045 Metz Cedex 01
Tél. : 03 72 74 10 27

Sherry Lake

Sep 26, 2022, 12:44:36 PM
to dataverse...@googlegroups.com
Hi Thomas,

For UVa's Dataverse, we have turned off tabular ingestion for Excel. https://guides.dataverse.org/en/latest/installation/config.html#tabularingestsizelimit

We have come to the conclusion that we cannot stop researchers from making multiple sheets in Excel, so to protect them, we have turned off tabular ingestion. This of course has also turned off ingest for those with one sheet whose files would ingest with no problems.

Here's an interesting example of how some of our researchers use (abuse) Excel:

I would love to see the Dataverse software's ingestion program first check whether the file has multiple sheets and, if so, not ingest it, while still ingesting (or at least trying to ingest) all other one-sheet Excel files.
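The check described above can be sketched in a few lines. This is only an illustration, not Dataverse's actual ingest code: the function names `sheet_names` and `ingest_decision`, the `block_multi_sheet` switch, and the returned labels are all hypothetical. It relies on the fact that an .xlsx file is a ZIP archive whose `xl/workbook.xml` part lists the sheets, and it does not handle legacy .xls files.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List


def sheet_names(xlsx_bytes: bytes) -> List[str]:
    """Return the sheet names declared in xl/workbook.xml of an .xlsx file."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    # Match on the local tag name so any XML namespace on <sheet> is accepted.
    # Hidden sheets are declared here too, so they are counted as well.
    return [
        el.get("name", "")
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "sheet"
    ]


def ingest_decision(xlsx_bytes: bytes, block_multi_sheet: bool = True) -> str:
    """Hypothetical policy: ingest single-sheet files, skip or warn otherwise."""
    count = len(sheet_names(xlsx_bytes))
    if count <= 1:
        return "ingest"
    # Assumption: an installation-level switch picks between the two behaviours.
    return "skip" if block_multi_sheet else "ingest-first-sheet-and-warn"
```

With `block_multi_sheet=False` the same check could drive a warning at upload time instead of blocking ingest, which would cover both behaviours discussed in this thread.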

Thanks,
Sherry Lake
UVA Dataverse Repository http://dataverse.lib.virginia.edu




Amber Leahey

Oct 5, 2022, 9:56:02 AM
to Dataverse Users Community
Hi Thomas, Sherry, 

We too have had conversations about this in the Borealis community and would like to see an improvement. Here are some additional comments and ideas to consider:
  • While we may not be able to stop researchers from using Excel in certain ways, we don't necessarily want to turn off ingest, as it is the basis for our Data Explorer and Data Curation Tool, which support open data exploration, curation, reuse, and preservation in the platform.
  • We have tried to document some common errors with Excel ingest over the years (see the Borealis User Guide and Dataverse's User Guide); this documentation can certainly be improved.
  • Many would like to see these error messages improved to make troubleshooting easier at the point of ingest/upload. We certainly get a lot of questions about these errors, and some researchers are quite alarmed by them and think something is wrong -- not sure this is the approach we want to take either.
  • There is now an API to uningest; this can be performed by most users who can upload, not just super admins (please see https://guides.dataverse.org/en/5.11.1/api/native-api.html#add-a-file-to-a-dataset).
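For illustration, here is a rough sketch of how an upload request that skips ingest might be assembled. It assumes that the linked guide section (adding a file to a dataset) is the relevant API and that its `jsonData` field accepts a `tabIngest` flag given as the string "false"; this skips ingest at upload time rather than undoing it afterwards. The helper name `build_add_file_request` and the example values are hypothetical, and nothing is sent over the network.

```python
import json
from typing import Dict, Tuple
from urllib.parse import urlencode


def build_add_file_request(
    server_url: str,
    persistent_id: str,
    api_token: str,
    description: str,
    tab_ingest: bool,
) -> Tuple[str, Dict[str, str], Dict[str, str]]:
    """Return (url, headers, form_fields) for a native API 'add file' call."""
    query = urlencode({"persistentId": persistent_id})
    url = f"{server_url}/api/datasets/:persistentId/add?{query}"
    headers = {"X-Dataverse-key": api_token}
    # Assumption: the flag is passed as a string, as in the guide's example.
    json_data = {
        "description": description,
        "tabIngest": "true" if tab_ingest else "false",
    }
    # The file itself would be sent as a separate multipart part named "file".
    form_fields = {"jsonData": json.dumps(json_data)}
    return url, headers, form_fields
```

An HTTP client would then POST these fields as multipart form data together with the file.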
Moving forward we could look at the following options:
  1. Offer option for end-users to select when to use ingest or not in the UI (by default this could be checked and labelled along these lines: "Select to ingest data for open display and preservation", currently there is only an API)
  2. Offer option for Admins to select when to use ingest or not in UI  (similar to above, but not visible unless an Admin role is assigned)
  3. Improve the error messages and warnings in the UI. We suggest making this less like an error and more like a tip to improve the openness of their data. The error message would become an improvement tip such as "Flagged for improvement. Learn more about Dataverse's open file ingestion features", etc.
  4. Redevelop ingest tools to support Excel use cases better e.g. multi-sheet use case, parsing common formatting, etc. (seems like the harder approach, but definitely an area of interest)
Related GitHub tickets: #2199 (recently reopened), #8526 (covers many ingest-related issues), #8518, and many other related ones.

Really curious to hear and learn more about what you are thinking to improve this feature! 

Best, 
Amber Leahey