problems with files after uploading to s3


Jacek Chudzik

May 29, 2024, 6:25:52 AM
to Dataverse Users Community
Hi,

recently one of our users uploaded about 60 files (most of them 1-5 GB) via dvuploader to an S3 bucket. The files are in S3, but we have two issues with them:

1. the file names are changed: in the S3 directory we get hashes/S3 IDs, not file names, so you can't tell which file is which (what am I missing in the S3 upload configuration?)
2. the files are not visible in the dataset draft - I tried reindexing Solr but it didn't help.

server.log doesn't seem to show any errors on upload:

[2024-05-29T10:40:17.896+0200] [Payara 6.2023.8] [INFO] [] [] [tid: _ThreadID=93 _ThreadName=http-thread-pool::jk-connector(4)] [timeMillis: 1716972017896] [levelValue: 800] [[
  jsonData: {"description":"","directoryLabel":"","mimeType":"text/plain","categories":["DATA"],"restrict":false,"storageIdentifier":"s3://BUCKET_NAME:18fc37c8662-df6b1c8cd7cd","fileName":"Unknown_A60BMK240408-BY775-ZX01-010001-01_good_1.fq.gz","checksum":{"@type":"MD5","@value":"82f52fb21aic96d809f39458349cbf50"}}]]


The dataset is in draft state. The Dataverse version we are using is 6.1.

Can you help me with that?
Jacek

James Myers

May 29, 2024, 9:58:15 AM
to dataverse...@googlegroups.com

Jacek,

 

Direct upload (which I’m assuming is what is being used) is a multi-step process in which the files are first uploaded to S3 and then Dataverse is called to add them to the dataset. It sounds like that last step failed for some reason in your case. In a case like that, I would have expected the tool to report some error, so if this problem is repeatable with no error messages, it would be worth submitting an issue.

 

Using opaque names in the storage (like 18fc37c8662-df6b1c8cd7cd) is how Dataverse works. The link between the file name and this identifier is kept in the database. (Note that file names can be changed per dataset version.) So that itself is not an indication of an S3 misconfiguration. It could still be that you have some issue with Dataverse being able to access the bucket – if you do, uploads of smaller files via the Dataverse UI would also be failing for that S3 store/bucket.
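For reference, the filename-to-identifier mapping Jim describes can be inspected directly in the Dataverse Postgres database. This is only a sketch: the table and column names (filemetadata.label, dvobject.storageidentifier) and the connection details are assumptions from memory of the Dataverse schema, so verify them against your own installation.

```shell
# Sketch only: list each file's display name next to its opaque storage
# identifier. Table/column names are assumptions -- verify before use.
QUERY="SELECT fm.label AS file_name, o.storageidentifier
       FROM filemetadata fm
       JOIN dvobject o ON o.id = fm.datafile_id;"

# Shown with a leading 'echo' for safety; remove it to run the query
# against your database (host/user/dbname are placeholders):
echo psql -h localhost -U dvnapp dvndb -c "$QUERY"
```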

 

Things you could try:

 

If you want to keep the files already in S3, you could try manually calling the last step in the direct-upload API – e.g. https://guides.dataverse.org/en/latest/developers/s3-direct-upload-api.html#adding-the-uploaded-file-to-the-dataset for a single file, or the multifile version in the next section. The JSON payload required for these calls is what you are seeing in the log entry you've shown below. If that works, it may have been a one-time issue with your upload, or perhaps a bug in the tool. If it doesn't work, you'll at least have a repeatable case to debug with.
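For illustration, a manual call to that last registration step could look roughly like this. The server URL, API token, and DOI are placeholders; the jsonData value is the payload copied from the server.log entry in the original message. The curl commands are prefixed with `echo` so you can review them before actually running anything.

```shell
# Placeholders -- substitute your own values:
SERVER_URL="https://dataverse.example.edu"
API_TOKEN="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
PERSISTENT_ID="doi:10.5072/FK2/EXAMPLE"

# jsonData taken from the server.log entry (one object per file):
JSON_DATA='{"description":"","directoryLabel":"","mimeType":"text/plain","categories":["DATA"],"restrict":false,"storageIdentifier":"s3://BUCKET_NAME:18fc37c8662-df6b1c8cd7cd","fileName":"Unknown_A60BMK240408-BY775-ZX01-010001-01_good_1.fq.gz","checksum":{"@type":"MD5","@value":"82f52fb21aic96d809f39458349cbf50"}}'

# Single-file registration ("Adding the Uploaded File to the Dataset").
# Shown with a leading 'echo'; remove it to actually make the call:
echo curl -X POST -H "X-Dataverse-key: $API_TOKEN" \
  -F "jsonData=$JSON_DATA" \
  "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"

# The multifile variant takes a JSON array of such objects at /addFiles:
echo curl -X POST -H "X-Dataverse-key: $API_TOKEN" \
  -F "jsonData=[$JSON_DATA]" \
  "$SERVER_URL/api/datasets/:persistentId/addFiles?persistentId=$PERSISTENT_ID"
```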

 

Given the log entry below where "categories":["DATA"] is set, it looks like you are not using the DVUploader – it doesn't set categories (or this log entry wasn't from the tool?). Perhaps it was python-dvuploader? There may be a local log file from that tool (I know DVUploader creates one, but I don't know how python-dvuploader handles errors) that indicates the underlying issue, though perhaps not, given that nothing shows up in the Dataverse log.

In any case, I don't think any of these tools can handle the case where the files are already in place. If you want to retry with the same or a different tool, you'd need to delete the files now in S3 and re-run the uploads. (I would definitely make sure you have the latest version of whatever tool is being used.) If these are the only files in the dataset, deleting them with the AWS command-line client might be easiest. Alternately, if there are good files in the dataset as well, using the "Cleanup storage of a dataset" API call should remove the files not listed in the dataset.
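To make the two cleanup routes concrete, here is a hedged sketch. The bucket name, prefix, server, token, and DOI are all placeholders (the actual S3 prefix for a dataset depends on your store configuration), and the second command uses the native API's "Cleanup storage of a dataset" call. Both commands are echoed rather than executed so nothing is deleted by accident.

```shell
# Placeholders -- substitute your own values:
SERVER_URL="https://dataverse.example.edu"
API_TOKEN="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
PERSISTENT_ID="doi:10.5072/FK2/EXAMPLE"

# Option 1: remove the stray objects directly with the AWS CLI (only if
# these are the sole files under the dataset's prefix). Remove the 'echo'
# to run for real:
echo aws s3 rm "s3://BUCKET_NAME/10.5072/FK2/EXAMPLE/" --recursive

# Option 2: the "Cleanup storage of a dataset" native API call; start with
# dryrun=true to see what would be removed before committing:
echo curl -X GET -H "X-Dataverse-key: $API_TOKEN" \
  "$SERVER_URL/api/datasets/:persistentId/cleanStorage?persistentId=$PERSISTENT_ID&dryrun=true"
```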

 

I hope that helps. If you have/get additional details about when the failure occurs, we can perhaps identify an issue that can be fixed in Dataverse or the upload tool.

 

-- Jim



Jacek Chudzik

Jun 11, 2024, 6:42:42 AM
to Dataverse Users Community
Dear Jim,

Thanks for your answer.

Yes, the issue was with https://github.com/gdcc/python-dvuploader. I have uploaded files via this tool recently and everything went OK. I asked the owner of this "corrupted" data to re-upload, or to give me the possibility to download the files and add them myself. I did the cleanup process, so now I'm just waiting. I'm inclined to think the problem was a corrupted add-data step (a network problem?). I believe it will work fine next time.

Regards,
Jacek

Range, Jan

Jun 11, 2024, 9:22:31 AM
to dataverse...@googlegroups.com
Hi Jacek,

I maintain the Python DVUploader. Sorry to hear that there was an issue. Which version did you use for the upload?

I suspect that the final step of registering the uploaded files in Dataverse may have failed, which would explain why the files are present in S3 but not visible in the dataset. In a previous version this was an issue because a different endpoint was being used, but it has been resolved in recent versions.

Happy to reproduce the error and find a fix if the recent version doesn't work either.

All the best,
Jan

———————————

Jan Range
Research Data Software Engineer

University of Stuttgart
Stuttgart Center for Simulation Science (SC SimTech)
Cluster of Excellence EXC 2075 „Data-Integrated Simulation Science“ (SimTech)

Pfaffenwaldring 5a | Room 01.013 | 70569 Stuttgart Germany

Phone: 0049 711 685 60095
E-Mail: jan....@simtech.uni-stuttgart.de

——— Meet me ———

https://calendly.com/jan-range/meeting

——— My Projects ———

🧬 PyEnzyme - https://github.com/EnzymeML/PyEnzyme

🏛 PyDaRUS - https://github.com/JR-1991/pyDaRUS

🪐 EasyDataverse - https://github.com/gdcc/easyDataverse

Jacek Chudzik

Jun 12, 2024, 1:54:46 AM
to Dataverse Users Community
Hi Jan,

Sorry, I wasn't clear; I didn't mean to suggest that Python DVUploader was the source of the problem. The version of the tool used was probably the latest, as we sent the upload instructions to the author of the data several days ago.
I was unable to reproduce this error, and I don't have the possibility to upload the original data that were corrupted, so I asked the author for access to the data in order to upload it myself.
Every time I have been unable to upload files using the Python DVUploader, the problem was, I guess, on the network side (timeouts, etc.). Uploading the same files in smaller batches, or from a better network, has always succeeded.
I hope the case will end successfully too. I'll let you know.

Have a nice day,
Jacek