--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/ace4ec21-8d23-42f3-a400-8f28d099ed37n%40googlegroups.com.
It streams the file from storage and computes the hash (using whichever algorithm is configured). It does not create a new copy of the file on disk anywhere, but it will use bandwidth to stream the whole file.
It doesn’t use eTags for a few reasons:
· eTags are MD5-based (if I recall correctly) – in any case they use only one algorithm and, unlike Dataverse, can't be configured to use others
· eTags aren’t a simple hash of the file contents for files uploaded to S3 in multiple parts – they combine the hashes of the parts, which is not what Dataverse stores in its database
· if you’ll trust an eTag, why not trust the Dataverse database? (That’s probably too flippant – see below: if your S3 provider is periodically checking against eTags, they may be more trustworthy, but at some point you presumably want a live check.)
· eTags are S3-specific, whereas the API call is generic across any storage mechanism
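To make the multipart point concrete, here is a sketch of how multipart eTags are commonly understood to be computed (AWS doesn't formally document this, so treat it as an assumption): the eTag is the MD5 of the concatenated per-part MD5 digests, with "-&lt;number of parts&gt;" appended – not the MD5 of the whole file, so it can't be compared directly against the whole-file checksum Dataverse stores.

```python
# Sketch of the commonly observed S3 multipart eTag scheme:
# MD5 of the concatenated binary per-part MD5 digests, plus "-<N>".
import hashlib


def multipart_etag(data, part_size):
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        # Single-part uploads get a plain MD5 of the content.
        return hashlib.md5(data).hexdigest()
    part_digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(part_digests).hexdigest()}-{len(parts)}"
```

The "-N" suffix and the hash-of-hashes construction are why a multipart eTag never matches the MD5 Dataverse computed over the whole file at upload time.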
It could make sense, particularly for large files, for Dataverse to also keep track of the eTag (or the per-part eTags) and allow some level of validation that the stored file matches what Dataverse expects, without the overhead of recomputing the full hash. (One could also argue that, given the random names Dataverse generates for files – drawn from a very large space of names – if you trust your S3 provider (which you would if you believe their eTags), the mere existence of the file is enough to know it’s the right file and that things haven’t changed since the original upload.)
-- Jim