--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/ace4ec21-8d23-42f3-a400-8f28d099ed37n%40googlegroups.com.
It streams the file from storage and computes the hash (using whichever algorithm is configured). It does not create a new copy of the file on disk anywhere, but it will use bandwidth to stream the whole file.
It doesn’t use eTags for a few reasons:
· eTags are MD5-based (if I recall correctly) – in any case they use only one algorithm and, unlike Dataverse, can't be configured to use others
· eTags aren’t a simple hash of the file contents for files uploaded to S3 in multiple parts – they combine the hashes of the parts, which is not what Dataverse stores in its database
· if you’ll trust an eTag, why not trust the Dataverse database? (That’s probably too flippant – see below: if your S3 provider is periodically checking against eTags, they may be more trustworthy, but at some point you presumably want a live check.)
· eTags are S3-specific, whereas the API call is generic across any storage mechanism
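To make the multipart point concrete, here is a sketch of how multipart eTags are commonly understood to be computed (AWS doesn't formally document this, so treat it as an assumption): the eTag is the MD5 of the concatenated per-part MD5 digests, with "-&lt;number of parts&gt;" appended – not the MD5 of the whole file, so it can't be compared directly against the whole-file checksum Dataverse stores.

```python
# Sketch of the commonly observed S3 multipart eTag scheme:
# MD5 of the concatenated binary per-part MD5 digests, plus "-<N>".
import hashlib


def multipart_etag(data, part_size):
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        # Single-part uploads get a plain MD5 of the content.
        return hashlib.md5(data).hexdigest()
    part_digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(part_digests).hexdigest()}-{len(parts)}"
```

The "-N" suffix and the hash-of-hashes construction are why a multipart eTag never matches the MD5 Dataverse computed over the whole file at upload time.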
It could make sense, particularly for large files, for Dataverse to also keep track of the eTag (or the per-part eTags) and allow some level of validation that the stored file matches what Dataverse expects, without the overhead of recomputing the full hash. (One could also argue that, given the random names Dataverse generates for files – drawn from a very large space of names – if you trust your S3 provider (which you would if you believe their eTags), the mere existence of the file is enough to know it’s the right file and that things haven’t changed since the original upload.)
-- Jim