Strange upload error - Dataverse says the file is already there - but it isn't


Jonathan Bohan

Jan 23, 2018, 12:33:48 PM
to Dataverse Users Community
I have a dataset with more than 70 files which I batch uploaded. One file did not upload. When I search for it I get no results (see first attachment), but when I try to upload it, it says it is already in the dataset (second image). Is this a known issue? Any fix for it?
 
Best regards,
Jonathan Bohan

paint 1.jpg
paint 2.jpg

Matthew Dunlap

Jan 23, 2018, 12:43:38 PM
to Dataverse Users Community
My understanding from looking at the code is that Dataverse generates a checksum for each file and uses that to see if the file already exists. Do two of your files have different names but the exact same contents?
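
To illustrate the kind of check described above, here is a minimal sketch in Python. It is not Dataverse's actual code: the class `DatasetSketch`, its `upload` method, and the file names are all hypothetical, and it only assumes that duplicates are detected by comparing MD5 checksums of file contents.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of some file content."""
    return hashlib.md5(data).hexdigest()


class DatasetSketch:
    """Toy stand-in for a dataset; NOT Dataverse's real implementation."""

    def __init__(self):
        # Maps checksum -> name of the first file stored with that content.
        self.checksums = {}

    def upload(self, name: str, data: bytes) -> bool:
        """Store the file unless identical content is already present."""
        checksum = md5_of_bytes(data)
        if checksum in self.checksums:
            return False  # rejected as a duplicate, whatever the file is called
        self.checksums[checksum] = name
        return True


dataset = DatasetSketch()
# Hypothetical files: different names, byte-for-byte identical content.
print(dataset.upload("data_1985.csv", b"year,value\n1985,42\n"))  # True
print(dataset.upload("data_1986.csv", b"year,value\n1985,42\n"))  # False
```

Because the file name never enters the checksum, searching the dataset for the second file's name finds nothing even though its content is already there.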

Sonia Barbosa

Jan 23, 2018, 12:48:00 PM
to dataverse...@googlegroups.com
That would be the likely issue, the MD5s are the same but the files have different names.


--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-community+unsub...@googlegroups.com.
To post to this group, send email to dataverse-community@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/0c476bb8-4d9f-42ad-b91c-3c0f4f927e46%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Jonathan Bohan

Jan 23, 2018, 12:48:29 PM
to Dataverse Users Community
Yes! The 1986 data I have is actually the same as the 1985 data, now that I actually open the file and look at it! Thanks, I hadn't even thought of that.

Sherry Lake

Jan 24, 2018, 8:16:30 AM
to Dataverse Users Community
But I would like to point out, again, that sometimes a file with the same contents (and thus the same MD5) but a different name is important, and Dataverse should not stop these from being uploaded as duplicates.

I have a researcher who runs scripts based on filenames, and to him files with different names but the same contents are not the same. The fact that there is the same information in both means something in his research. Who am I to question a researcher's logic? And if we are trying to capture research outputs for reproducibility, then I would think that having "all" files used in analysis (even though they have the same content) would be necessary.


Thanks for listening.
Sherry Lake

Jonathan Bohan

Jan 24, 2018, 8:22:12 AM
to Dataverse Users Community
I would agree with this - I think in my particular case it was an erroneously named file, but there certainly may be reasons for the same file to appear multiple times.

Also, based on your comments in that other thread, not sure if you've figured this out on your own but if you zip a zipped file it will upload to Dataverse as the first zipped file.

Best,

Jonathan Bohan

Sonia Barbosa

Jan 24, 2018, 10:16:10 AM
to dataverse...@googlegroups.com
Just this week we discussed having the ability to upload data files with the same name. This is definitely not a new request, and I certainly see it as I assist people with uploading and organizing their data for upload.




Laura Waugh

Sep 30, 2019, 10:33:32 AM
to Dataverse Users Community
We are having this issue with Dataverse not uploading certain files in a set because it states "file is a duplicate of an already uploaded file." This is problematic and the researcher simply has another file with the same name. Are there any updates for managing this or suggestions?

Many thanks,

Laura Waugh

Sonia Barbosa

Oct 7, 2019, 1:32:45 PM
to dataverse...@googlegroups.com
Hi Laura:
This sounds like a file with a duplicate checksum, not just a name issue. Did you figure it out?
This happens very often.



Laura Waugh

Oct 8, 2019, 10:53:54 AM
to Dataverse Users Community
Hi Sonia, 

We have not found a solution. Do you have any tips for determining or fixing this? Or, info we can pass on to researchers to avoid this? I found this thread: https://github.com/IQSS/dataverse/issues/2621

Many thanks in advance for thoughts on this. 

Laura Waugh


Sherry Lake

Oct 8, 2019, 11:10:44 AM
to Dataverse Users Community
Hi Laura,

Going back to your original problem - does the file with the same name appear in a different directory? This issue is closer to your problem and has not been resolved: https://github.com/IQSS/dataverse/issues/4813

And maybe that issue needs to be split, as I see two different problems that are being flagged by the "duplicate" error. There is the error with files of the same name (see Phil's example in the issue) in different directories AND there is the error with files that have different names but the same content (that is my example in the issue).

--
Sherry

Barbosa, Sonia

Oct 8, 2019, 11:25:12 AM
to dataverse...@googlegroups.com
Hi Laura:
I'll need to think about a possible way for you to find the duplicates, if it's the checksums. Phil may have some suggestions for that.

Did Dataverse by any chance show you the MD5 it was flagging as a duplicate, during the upload error?

I wish I had a recommendation for avoiding this other than the obvious one: be sure files are not duplicated prior to upload. If there is ever a reason to have a duplicate file, bundle the files as a .tar or .7z file before uploading. I don't recommend compressed files for obvious reasons, but if and when it's needed, it's the option to use.

Phil, anything they can use to check the MD5s of the files before uploading to DV?
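
One possible way to check MD5s locally before uploading, as a minimal sketch using only Python's standard library. The function names `md5_of_file` and `find_duplicates` are made up for this example, and the sketch assumes (as discussed in this thread) that files with identical MD5s are the ones that get flagged as duplicates.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder):
    """Return {checksum: [paths]} for every checksum shared by 2+ files."""
    by_checksum = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            by_checksum[md5_of_file(path)].append(path)
    # Keep only the checksums that occur more than once.
    return {
        checksum: paths
        for checksum, paths in by_checksum.items()
        if len(paths) > 1
    }
```

Calling `find_duplicates("path/to/files")` on the folder to be uploaded returns an empty dictionary when every file has unique content; otherwise each entry lists the files that share a checksum, regardless of their names or subdirectories.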

Thanks




--
Sonia Barbosa
Manager of Data Curation, The Dataverse Project
Manager of the Murray Research Archive, IQSS
Data Science
Harvard University

All Harvard Dataverse Repository inquiries should be sent to:  sup...@dataverse.harvard.edu
All software inquiries should be sent to: sup...@dataverse.org


Need to deposit data? Visit http://dataverse.harvard.edu

All test dataverses should be created  in our demo environment: https://demo.dataverse.org/


Philip Durbin

Oct 9, 2019, 9:10:01 AM
to dataverse...@googlegroups.com
I don't have much to add except that when reviewing https://github.com/IQSS/dataverse-sample-data/pull/12 yesterday I was surprised to see comments about "file names must be unique per dataset". When I have a minute I'm planning on digging in more and trying to understand the current state of things.

Especially now that Dataverse supports file hierarchy, I think we should allow files with the same name as long as they are in different directories. For example, you might have a README.md in a directory called "code" and a README.md in a directory called "data".

Additionally, I think Dataverse should accept multiple files with the same checksum. With an eye toward reproducibility and putting more code into Dataverse, it's common in the Python world to have multiple zero-byte files (all with the same checksum, of course) called __init__.py. This should be supported.
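
As a rough illustration of the two suggestions above, the following Python sketch treats only an identical full path as a conflict, so neither a repeated file name nor a repeated checksum blocks an upload. This is one possible reading of the proposal, not existing Dataverse behavior; the function `conflicts` and the example paths are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def conflicts(existing_paths, candidate):
    """Proposed rule (sketch): only an identical full path is a conflict."""
    existing = {PurePosixPath(p) for p in existing_paths}
    return PurePosixPath(candidate) in existing


existing = ["code/README.md", "pkg/__init__.py"]
print(conflicts(existing, "data/README.md"))       # False: same name, other directory
print(conflicts(existing, "pkg/sub/__init__.py"))  # False: same (empty) content, other path
print(conflicts(existing, "code/README.md"))       # True: exact same path

# Every zero-byte file has the same MD5, so a checksum-based check
# would flag all empty __init__.py files after the first one.
print(hashlib.md5(b"").hexdigest())  # d41d8cd98f00b204e9800998ecf8427e
```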




Laura Waugh

Oct 9, 2019, 6:34:49 PM
to Dataverse Users Community
Hi all, 

I looked into this further and in our case it is not a duplicate file name. It is different file names with the same content. The data is related to a correction of a previous study and two of the files have different file names but empty rows to be filled in for replication/reuse - so the lack of content looks like duplication. I confirmed with the researcher that this is intentional and necessary for this particular dataset. 

In case this comes up again, wanted to provide the info and case in point here. Thank you again to all for looking into options on this. 

Much appreciated,

Laura Waugh


Sonia Barbosa

Oct 9, 2019, 8:33:30 PM
to dataverse...@googlegroups.com
The software compares files by content. If two files have exactly the same content, their MD5s will be the same, hence the error.

In this case, for now, until the issue of supporting duplicate MD5s is discussed, I recommend bundling the files using either .tar or .7z.
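
The .tar workaround recommended above could be scripted along these lines. This is a minimal sketch using Python's standard `tarfile` module; the helper name `bundle_as_tar` and its `base_dir` parameter are made up for the example, and mode `"w"` writes a plain, uncompressed .tar archive.

```python
import tarfile
from pathlib import Path


def bundle_as_tar(paths, archive_path, base_dir):
    """Write the given files into one uncompressed .tar archive.

    Paths inside the archive are stored relative to base_dir, so files
    with the same name in different subdirectories stay distinct.
    """
    base_dir = Path(base_dir)
    with tarfile.open(str(archive_path), "w") as archive:
        for path in paths:
            path = Path(path)
            relative = path.relative_to(base_dir).as_posix()
            archive.add(str(path), arcname=relative)
    return Path(archive_path)
```

The resulting archive is then uploaded as a single file, so the identical files inside it are not compared against each other or against files already in the dataset.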

Thanks for looking into this further with the author. 





Laura Waugh

Oct 10, 2019, 6:27:22 PM
to Dataverse Users Community
Thanks so much, Sonia!
