Updating a dataset but keeping filenames.


Ken Mankoff

Jun 26, 2020, 3:18:17 AM
to Dataverse Users Community
Hello again,

I was planning on using our Dataverse to host a near-realtime dataset that updates every week (maybe every day). I don't mind having v365 at the end of the year, or v3650 after 10 years.

But I just noticed that the uploaded files get a "-1" added to them (confusing since the dataset is now v2).

I'm already a bit confused that I upload "file.csv" and the user sees "file.tab". I find it problematic that I upload "file.csv" and the user sees "file-2.csv", and unacceptable that my scripts that upload "file.csv" and then download "file.csv" don't get back the same file they just uploaded...

Is there a work-around or option for this? I found this ticket https://github.com/IQSS/dataverse/issues/6574 that seems to address it, but the ticket is closed, and it isn't clear to me what the resolution is.

Thanks,

  -k.

Philipp at UiT

Jun 26, 2020, 9:08:11 AM
to Dataverse Users Community
Hello,

It seems you tried to add a new file with the same file name as an existing file. Therefore, the system added a "-1" to the file name.
To avoid this, I think you should first delete the existing file. When you then publish the new DRAFT version, the old file.csv will be in V1, and the new file.csv will be in V2.

Dataverse ingests tabular files (.csv and .xlsx) so that it can offer a .tab version of them when a user wants to download your data.
The original file format is still available as a choice when you click the download button.

Best, Philipp

Ken Mankoff

Jun 26, 2020, 9:25:45 AM
to dataverse...@googlegroups.com, Philipp at UiT

Ah - this is simple. So two changes are required when "updating" a dataset (delete, add), not just one. Thank you for the suggestion.

-k.

On 2020-06-26 at 06:08 -07, Philipp at UiT <uit.p...@gmail.com> wrote...

Philipp at UiT

Jun 26, 2020, 9:29:57 AM
to Dataverse Users Community
There is also the Replace files feature, which I haven't tested enough to be able to say whether it would be useful in your case.

Philipp

Philip Durbin

Jun 26, 2020, 11:17:06 AM
to dataverse...@googlegroups.com
I agree that the "replace file" feature is probably worth investigating. It maintains a link between the old file and the new file. Before that feature was added, people indeed simply deleted the old file before adding the new one. All the old files are available in previous versions, of course.

The status of https://github.com/IQSS/dataverse/issues/6574 is that it will be included in the next version of Dataverse. The code (pull request #6893) has already been merged.

The other thing I wanted to mention is that when you are downloading files that have gone through ingest, you can always retrieve the saved original. So while the UI might show "file.tab" you can still download "file.csv". Also, there's some discussion of .csv being converted to .tab at https://github.com/IQSS/dataverse/issues/6385
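For scripted downloads, the saved original can be requested through the Data Access API by adding format=original to the file URL. A minimal sketch of building such a URL follows; the helper name and the file id 42 are made up for illustration, and the result should be checked against your own installation:

```python
from urllib.parse import urlencode


def original_download_url(server_url: str, file_id: int) -> str:
    """Build a Data Access API URL asking for the saved original of an ingested file."""
    # format=original requests the file as uploaded (e.g. .csv) instead of the .tab version.
    query = urlencode({"format": "original"})
    return f"{server_url.rstrip('/')}/api/access/datafile/{file_id}?{query}"


# 42 is a hypothetical database id, used only to show the shape of the URL.
print(original_download_url("https://demo.dataverse.org", 42))
```

Fetching that URL (with an API token if the file is restricted) should return the .csv rather than the .tab.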

I hope this helps,

Phil

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/5a77c002-b221-4941-8f1b-655cf5bc75d7o%40googlegroups.com.


Stefan Kasberger

Jul 13, 2020, 5:56:44 PM
to Dataverse Users Community
Hi,

I implemented the replace file endpoint in the develop branch of pyDataverse a few weeks ago to support the CoronaWhy community.
So it works, and can be used with pyDataverse (develop): replace_datafile, see https://github.com/AUSSDA/pyDataverse/blob/develop/src/pyDataverse/api.py#L1805

But beware: you have to request the file's metadata beforehand, so you don't lose it.

Regards, Stefan

Ken Mankoff

Aug 7, 2020, 12:29:48 PM
to Dataverse Users Community
Hi Phil & Stefan,

I'm trying to replace a file but only the metadata is updating. Our DV version is {"status":"OK","data":{"version":"4.19","build":"331-affbf4f"}}. Is the "replace" feature documented here http://guides.dataverse.org/en/latest/api/native-api.html#replacing-files working on this version of the API?

Stefan - Does pyDataverse implement file replacement when the API does not yet officially support it?

Thanks,

  -k.

Philip Durbin

Aug 7, 2020, 2:16:55 PM
to dataverse...@googlegroups.com
Hi Ken,

Yes, it should work. I can't think of any changes between 4.19 and the latest release, which is 4.20 (or the next release, 5.0). Also, file replace via API is tested regularly by our API test suite so it should be working fine. You are very welcome to try it at https://demo.dataverse.org (running 4.20 as of this writing) to see if you get the same results. Even if the only bug you find is confusing documentation, please feel free to open a GitHub issue.

If it helps, here's the method we use to test "replace file" in our test suite: https://github.com/IQSS/dataverse/blob/v4.20/src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java#L611

Thanks,

Phil


James Myers

Aug 7, 2020, 2:39:48 PM
to dataverse...@googlegroups.com

FWIW – Replace doesn’t work before you’ve published the dataset once. QDR is interested in allowing replace when you only have a draft dataset as well – see https://github.com/IQSS/dataverse/issues/7149.

-- Jim

Ken Mankoff

Aug 9, 2020, 10:23:08 AM
to dataverse...@googlegroups.com, Philip Durbin
Hi Phil,

For this DOI and server

export SERVER_URL=https://demo.dataverse.org
export PERSISTENT_ID=doi:10.70122/FK2/EIB6CG/QUN1LQ

The DOI doesn't resolve yet, so the URL is
https://demo.dataverse.org/file.xhtml?persistentId=doi:10.70122/FK2/EIB6CG/QUN1LQ&version=1.0

I'm trying to upload a different file as a replacement:

export FILENAME=~/Desktop/goodbye.txt

curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F 'file=@'${FILENAME} -F 'jsonData={"description":"Goodbye", "forceReplace":true}' "$SERVER_URL/api/files/:persistentId/metadata?persistentId=$PERSISTENT_ID"


Documentation suggestion: The response to the Curl commands should be included in the documentation? I don't know what I should expect to see, but here is what the response is:

File Metadata update has been completed: {"label":"hello.txt","description":"Goodbye","restricted":false,"id":1401539}$

I'm trying to replace the *file*, in addition to the *metadata*.

On 2020-08-07 at 11:16 -07, Philip Durbin <philip...@harvard.edu> wrote...
> If it helps, here's the method we use to test "replace file" in our
> test suite:
> https://github.com/IQSS/dataverse/blob/v4.20/src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java#L611

Ah no, at least not yet. I'm not sure how to translate that Java test to my attempt at Curl API access.

-k.

James Myers

Aug 9, 2020, 2:15:23 PM
to dataverse...@googlegroups.com, Philip Durbin
Ken,

Looks like 6 months ago some of the documentation examples got a cut/paste error. The correct command to replace is:

curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F "file=@${FILENAME}" -F 'jsonData={"description":"Goodbye", "forceReplace":true}' "$SERVER_URL/api/files/:persistentId/replace?persistentId=$PERSISTENT_ID"

(with /metadata changed to /replace)
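For anyone scripting this outside the shell, here is a rough Python equivalent of the corrected command. It is only a sketch: the function names are invented for illustration, and the upload step assumes the third-party requests package is installed.

```python
import json
from urllib.parse import quote


def build_replace_request(server_url, persistent_id, description, force_replace=True):
    """Return (url, json_data) for the /replace endpoint, mirroring the curl command."""
    # Keep ':' and '/' unescaped so the DOI appears exactly as in the curl example.
    pid = quote(persistent_id, safe=":/")
    url = f"{server_url.rstrip('/')}/api/files/:persistentId/replace?persistentId={pid}"
    json_data = json.dumps({"description": description, "forceReplace": force_replace})
    return url, json_data


def replace_file(api_token, path, url, json_data):
    """POST the new file as multipart form data, as curl -F does."""
    import requests  # third-party; imported here so the URL helper works without it

    with open(path, "rb") as fh:
        return requests.post(
            url,
            headers={"X-Dataverse-key": api_token},
            files={"file": fh},
            data={"jsonData": json_data},
        )


url, json_data = build_replace_request(
    "https://demo.dataverse.org", "doi:10.70122/FK2/EIB6CG/QUN1LQ", "Goodbye"
)
print(url)
```

The point to check is that the URL ends in /replace rather than /metadata; with /metadata only the description changes and the file itself is left alone.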

I'll submit an issue to get the docs corrected.

-- Jim


Ken Mankoff

Aug 9, 2020, 8:50:47 PM
to dataverse...@googlegroups.com, Philip Durbin, James Myers

On 2020-08-09 at 11:15 -07, James Myers <qqm...@hotmail.com> wrote...
> Looks like 6 months ago some of the documentation examples got a
> cut/paste error. The correct command to replace is:
>
> [...]
>
> (with /metadata changed to /replace)

Nice catch. Now it works. Thank you.

-k.

Ken Mankoff

Aug 24, 2020, 1:55:23 PM
to dataverse...@googlegroups.com, Philip Durbin, James Myers, Stefan Kasberger
Hello,

It turns out I'm still having trouble with file replacement. Things are much easier with the help of the pyDataverse tool (thank you Stefan!) but I'm experiencing an issue both with pyDataverse and with direct curl access to the API.

If I replace a single file everything seems to work fine. But if I want to replace multiple files, everything after the 1st file gets a "-1" appended to the filename. Is it possible that the API treats published dataverses and draft dataverses differently? It seems like after 1 curl command, a copy of the dataverse is created and now exists in draft mode. Subsequent curl calls access that version perhaps? And then things behave differently?

Here is a simple example showing the issue using https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/EIB6CG where I'm attempting to replace both "foo.txt" and "bar.txt"

export SERVER_URL=https://demo.dataverse.org

export ID_foo=doi:10.70122/FK2/EIB6CG/OFMC8B
export ID_bar=doi:10.70122/FK2/EIB6CG/YCYSP7

export FILE_foo=/home/kdm/tmp/DV/two/foo.txt
export FILE_bar=/home/kdm/tmp/DV/two/bar.txt

curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F "file=@${FILE_foo}" -F 'jsonData={"description":"Foo", "forceReplace":true, "directoryLabel":"."}' "$SERVER_URL/api/files/:persistentId/replace?persistentId=$ID_foo"

curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F "file=@${FILE_bar}" -F 'jsonData={"description":"Bar", "forceReplace":true, "directoryLabel":"."}' "$SERVER_URL/api/files/:persistentId/replace?persistentId=$ID_bar"

The return from the first curl call shows "foo.txt" as the replacement filename:

{"status":"OK","data":{"files":[{"description":"Foo","label":"foo.txt","restricted":false,"version":1,"datasetVersionId":167418,"dataFile":{"id":1579547,"persistentId":"","pidURL":"","filename":"foo.txt","contentType":"text/plain","filesize":6,"description":"Foo","storageIdentifier":"file://1742194cca8-40f3489766ee","rootDataFileId":1579546,"previousDataFileId":1579546,"md5":"e19673b6e69c5f73192e4f78f6e771ab","checksum":{"type":"MD5","value":"e19673b6e69c5f73192e4f78f6e771ab"},"creationDate":"2020-08-24"}}]}}

The return from the second curl call shows "bar-1.txt" as the replacement filename:

{"status":"OK","data":{"files":[{"description":"Bar","label":"bar-1.txt","restricted":false,"version":1,"datasetVersionId":167418,"dataFile":{"id":1579548,"persistentId":"","pidURL":"","filename":"bar-1.txt","contentType":"text/plain","filesize":6,"description":"Bar","storageIdentifier":"file://1742194d927-2a91b6b40468","rootDataFileId":1579545,"previousDataFileId":1579545,"md5":"17f50f4b842bd98f58f2e11c0848c821","checksum":{"type":"MD5","value":"17f50f4b842bd98f58f2e11c0848c821"},"creationDate":"2020-08-24"}}]}}
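A script can catch this silent rename by comparing the "label" in the response with the name of the file it uploaded. A small sketch, using a trimmed copy of the second response above as sample input (the helper name is made up):

```python
import json
import os


def renamed_files(response_text, uploaded_path):
    """Return (file id, label) pairs whose label differs from the uploaded filename."""
    expected = os.path.basename(uploaded_path)
    payload = json.loads(response_text)
    return [
        (f["dataFile"]["id"], f["label"])
        for f in payload["data"]["files"]
        if f["label"] != expected
    ]


# Trimmed copy of the response to the second curl call.
sample = (
    '{"status":"OK","data":{"files":[{"description":"Bar","label":"bar-1.txt",'
    '"dataFile":{"id":1579548,"filename":"bar-1.txt"}}]}}'
)
print(renamed_files(sample, "/home/kdm/tmp/DV/two/bar.txt"))
# prints [(1579548, 'bar-1.txt')]
```

An empty list means the server kept the uploaded name; anything else is a file whose name changed on the way in.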

Is there some API feature that I'm missing that will let me replace >1 file at a time while controlling the filenames?

Thanks,

-k.

James Myers

Aug 24, 2020, 3:01:52 PM
to dataverse...@googlegroups.com
Hmm - sounds like a bug to me (without digging too much into the code). I wouldn't be surprised if, after the first replace has created a new draft version, the code that stops you from uploading two files with the same name (not replacing) gets triggered.

The only possible work-around I can think of (aside from db edits) is making a second call to change the file name. If the replace call actually removes the file being replaced from the version as it should, you may be able to change the file name after that without hitting the same issue/bug.

I know there were changes in 5.0 to how/when file renaming occurs - so it's possible that this has been fixed/changed in 5.0. I'd suggest at least adding this to GitHub as an issue and seeing if it's confirmed to exist in 5.0.

-- Jim


Ken Mankoff

Aug 24, 2020, 7:24:37 PM
to dataverse...@googlegroups.com, James Myers
Hi James,

On 2020-08-24 at 12:01 -07, James Myers <qqm...@hotmail.com> wrote...
> Hmm - sounds like a bug to me (without digging too much into the code)
> and I wouldn't be surprised if, after the first replace has created a
> new draft version, that the code to stop you from uploading two files
> with the same name (not replacing) gets triggered.

Bug report filed: https://github.com/IQSS/dataverse/issues/7223

> The only possible work-around I can think of (aside from db edits) is
> making a second call to change the file name. If the replace call
> actually removes the file being replaced from the version as it
> should, you may be able to change the file name after that without
> hitting the same issue/bug.

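This appears to work fine. Thanks for the suggestion.

# Step 1: replace the file (the server may rename it, e.g. to "bar-1.txt").
d = api.replace_datafile(persistent_id, filename, json_str)
assert d.json()["status"] != "ERROR"
# The response carries the database id of the newly created file.
file_id = d.json()['data']['files'][0]['dataFile']['id']
# Step 2: apply the same metadata again to the new file, addressed by database id.
d2 = api.update_datafile_metadata(file_id, json_str=json_str, is_filepid=False)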
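Without pyDataverse, the second step is a POST to the file metadata endpoint, using the database id returned by the replace call. A sketch of building that request; the helper name is invented, and whether the rename is accepted may depend on the server version, as discussed in this thread:

```python
import json


def build_rename_request(server_url, file_id, label, description=None):
    """Return (url, json_data) for restoring a file's name via the metadata endpoint."""
    url = f"{server_url.rstrip('/')}/api/files/{file_id}/metadata"
    fields = {"label": label}  # "label" is the filename shown to users
    if description is not None:
        fields["description"] = description
    return url, json.dumps(fields)


# 1579548 is the id of "bar-1.txt" from the response quoted earlier in the thread.
url, json_data = build_rename_request("https://demo.dataverse.org", 1579548, "bar.txt", "Bar")
print(url)
print(json_data)
```

The JSON string goes in the jsonData form field of the POST, the same way the curl -F commands in this thread send it.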

> I know there were changes in 5.0 to how/when file renaming occurs - so
> it's possible that this has been fixed/changed in 5.0. I'd suggest at
> least adding this to github as an issue and see if it's confirmed to
> exist in 5.0.

I'll test it when our instance or the demo server updates to v5.

-k.

Stefan Kasberger

Sep 7, 2020, 5:58:33 AM
to Dataverse Users Community
No, pyDataverse only offers the functionality. So far, the user has to know which endpoints/functions are already available in their own Dataverse instance and then call them.