hash differs: cherryPy file-like objects <> filesystem file object


Raphael Feick

Aug 20, 2020, 11:20:51 AM
to cherrypy-users
Hi all

Currently CherryPy receives a file (via POST), stores it, computes its hash, and then acts on it depending on the hash.

Since CherryPy already holds the upload as an in-memory file-like object, I thought I could skip the store step and hash that object directly, but the results differ.

Check this:

def blake(filename):
    hash_blake2 = blake2b()
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            hash_blake2.update(chunk)
    return hash_blake2.hexdigest()

def vblake(fileLike):
    hash_blake2 = blake2b()
    with fileLike.file as f:
        for chunk in iter(lambda: f.read(8192), b""):
            hash_blake2.update(chunk)
    return hash_blake2.hexdigest()

multipart_dict = cherrypy.request.body.params
for value in multipart_dict.values():
    v_file = value  # for demo purposes
fs_file = os.path.normpath(os.path.join(upload_path, upload_filename))
with open(fs_file, 'wb') as out:
    while True:
        data = v_file.file.read(8192)
        if not data:
            break
        out.write(data)

print(vblake(v_file))
print(blake(fs_file))

>>

786a02f742015903c6c6fd852552d272912f4740e15847618a86e217f71f5419d25e1031afee585313896444934eb04b903a685b1448b755d56f701afe9be2ce

a9333a71c3341ebbbef35dea8be9e55ac23d8befe8f29bc99b6151901bce96f03e4707a6e5ea13faa4b63f77b40185587b99f719a97c21b59d762af1ee141ceb


As you can see, the hashes don't match, which I find hard to understand: shouldn't they be identical? Matching hashes would let me compare an upload against existing files (e.g. to avoid duplication).

Am I doing something wrong here, or is it simply impossible to get the same hash for the file-like object (v_file) as for the file on the filesystem (fs_file)?



Cheerio

Tim Roberts

Aug 21, 2020, 3:04:35 AM
to cherryp...@googlegroups.com
On Aug 19, 2020, at 6:21 AM, Raphael Feick <raphe...@gmail.com> wrote:
>
> Currently cherrypy receives a file (via POST), stores it, gets its hash and then does stuff with it, depending on the hash.
>
> Now since cherryPy already has the in-memory file-like asset, I thought I can skip this step and calculate the hash of it, already, but the outcomes are different.

You are assuming that the file is the only parameter. Are you absolutely sure that’s true? Have you printed the length of the two strings to see if you have actually grabbed the right data? Why don’t you fetch the value by name?
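
A minimal sketch of that check, assuming the upload arrives in a form field named `upload` (a hypothetical name) and that each uploaded part exposes a `.file` attribute, as in the code above. The `SimpleNamespace` object stands in for a CherryPy part so the snippet runs without a server; `get_upload` and `payload_size` are illustrative helpers, not CherryPy functions.

```python
import io
from types import SimpleNamespace


def get_upload(params, field_name):
    """Return the uploaded part stored under field_name.

    Raises KeyError if the field is missing and TypeError if the field
    is a plain form value rather than a file upload.
    """
    part = params[field_name]
    if not hasattr(part, "file"):
        raise TypeError(f"{field_name!r} is not a file upload")
    return part


def payload_size(part):
    """Return the number of bytes in part.file without consuming it."""
    f = part.file
    position = f.tell()            # remember where the stream currently is
    size = f.seek(0, io.SEEK_END)  # seek() returns the new absolute position
    f.seek(position)               # restore the original position
    return size


# Stand-in for cherrypy.request.body.params: one text field, one upload.
params = {
    "comment": "hello",
    "upload": SimpleNamespace(file=io.BytesIO(b"example payload")),
}

part = get_upload(params, "upload")
print(payload_size(part))
```

Looking the part up by name avoids accidentally picking a non-file parameter, and printing the size shows quickly whether the stream still holds the data you expect.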

Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

Raph

Aug 21, 2020, 4:31:06 AM
to cherryp...@googlegroups.com
Hi Tim

I found and solved the issue, but it goes way deeper.
TL;DR: with blocks close buffers, making them inaccessible after first use.

So, first off, you're correct: one of the blake2b hashes,

786a02f742015903c6c6fd852552d272912f4740e15847618a86e217f71f5419d25e1031afee585313896444934eb04b903a685b1448b755d56f701afe9be2ce

comes from b'', i.e. empty binary input. That alone was an important find, but it gets more interesting from here:

When a "with" block over a file or file-like (buffer) object completes, it calls __exit__, which closes that object.
So the first function call that uses a with block returns the correct result, and the second only reads b'' (empty bytes).
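
A small sketch for checking these effects in isolation, using `io.BytesIO` as a stand-in for the uploaded file (an assumption; the real object is a CherryPy part's `.file`). In this stand-in, a stream that has already been read to the end yields `b""` on the next read, which hashes to the empty-input digest, while a stream that a `with` block has closed raises `ValueError` on a later read rather than returning `b""`.

```python
import io
from hashlib import blake2b

EMPTY_DIGEST = blake2b(b"").hexdigest()

stream = io.BytesIO(b"some uploaded bytes")

# 1) Consume the stream, as a copy-to-disk loop would.
stream.read()

# 2) A second read starts at end-of-file and yields b"", so hashing now
#    produces the digest of empty input.
leftover = stream.read()
print(leftover == b"")
print(blake2b(leftover).hexdigest() == EMPTY_DIGEST)

# 3) Rewinding makes the data readable again.
stream.seek(0)
print(stream.read() == b"some uploaded bytes")

# 4) Leaving a with block closes the stream; later reads raise ValueError.
with stream as f:
    f.seek(0)
    f.read()
try:
    stream.read()
    closed_read_failed = False
except ValueError:
    closed_read_failed = True
print(closed_read_failed)
```

So both the read position and the open/closed state matter when the same object is read more than once.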

Once I figured that out, it was easy to use a while True loop instead of a with block:

def hash_of_filebuffer(file):  # type <class '_io.BufferedRandom'>
    hash_blake = blake2b()
    while True:
        chunk = file.read(8192)
        if chunk:
            hash_blake.update(chunk)
        if len(chunk) == 0:
            break

    file.seek(0)  # this is the important part
    return hash_blake.hexdigest()

Note in particular the file.seek(0), which rewinds to the start of the buffer without closing it, so it can be re-read later.
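
A slightly more defensive variant of the same idea, as a sketch: rewind before reading as well as after, so the result does not depend on where an earlier reader left the position. `hash_stream` is a hypothetical name, and the stream is assumed to be seekable (as `_io.BufferedRandom` is); `io.BytesIO` is used here only as a stand-in.

```python
import io
from hashlib import blake2b


def hash_stream(stream, chunk_size=8192):
    """Return the BLAKE2b hex digest of a seekable binary stream.

    The stream is rewound before and after hashing and is left open.
    """
    digest = blake2b()
    stream.seek(0)  # start from the beginning regardless of prior reads
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    stream.seek(0)  # leave the stream ready for the next reader
    return digest.hexdigest()


data = b"x" * 20000
stream = io.BytesIO(data)
stream.read()  # simulate an earlier reader that consumed everything

first = hash_stream(stream)
second = hash_stream(stream)
print(first == second == blake2b(data).hexdigest())
```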


Granted, I have no idea whether this is anywhere near best practice. I find many people advocating "doing everything that needs to be done with a file while it's being read", but that is impossible with a two-step approach where (1) I first need the file hash and (2) based on that result I decide whether to store the file on the filesystem at all.
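
The two-step flow described here could look roughly like the sketch below: hash the in-memory upload first, then write it to disk only if that hash has not been seen before. `store_if_new`, the `known_hashes` set, and naming stored files by their digest are all assumptions for illustration, not part of the original code.

```python
import io
import os
import shutil
import tempfile
from hashlib import blake2b


def store_if_new(stream, known_hashes, upload_dir):
    """Hash a seekable stream and store it only if the hash is new.

    Returns (digest, path); path is None when the content is a duplicate.
    """
    digest = blake2b()
    stream.seek(0)
    for chunk in iter(lambda: stream.read(8192), b""):
        digest.update(chunk)
    hexdigest = digest.hexdigest()

    if hexdigest in known_hashes:
        return hexdigest, None  # duplicate: skip the write entirely

    stream.seek(0)  # rewind so the copy sees the whole payload
    path = os.path.join(upload_dir, hexdigest)
    with open(path, "wb") as out:
        shutil.copyfileobj(stream, out)
    known_hashes.add(hexdigest)
    return hexdigest, path


known = set()
with tempfile.TemporaryDirectory() as tmp:
    first = store_if_new(io.BytesIO(b"same bytes"), known, tmp)
    second = store_if_new(io.BytesIO(b"same bytes"), known, tmp)
    stored_ok = os.path.exists(first[1])
    print(stored_ok, second[1] is None)
```

The upload is read twice at most (once to hash, once to copy), and the second read only happens when the file is actually going to be kept.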

So here it is, in all its glory: half a day of desperation with a silver lining at the end ^^

All the best!

