
Does hashlib support a file mode?


Phlip

Jul 6, 2011, 1:54:50 AM
Pythonistas:

Consider this hashing code:

import hashlib
file = open(path)
m = hashlib.md5()
m.update(file.read())
digest = m.hexdigest()
file.close()

If the file were huge, the file.read() would allocate a big string and
thrash memory. (Yes, in 2011 that's still a problem, because these
files could be movies and whatnot.)

So if I do the stream trick - read one byte, update one byte, in a
loop, then I'm essentially dragging that movie thru 8 bits of a 64 bit
CPU. So that's the same problem; it would still be slow.

So now I try this:

sum = os.popen('sha256sum %r' % path).read()

Those of you who like to lie awake at night thinking of new ways to
flame abusers of 'eval()' may have a good vent, there.

Does hashlib have a file-ready mode, to hide the streaming inside some
clever DMA operations?

Prematurely optimizingly y'rs

--
Phlip
http://bit.ly/ZeekLand

Thomas Rachel

Jul 6, 2011, 2:37:47 AM
On 06.07.2011 07:54, Phlip wrote:
> Pythonistas:
>
> Consider this hashing code:
>
> import hashlib
> file = open(path)
> m = hashlib.md5()
> m.update(file.read())
> digest = m.hexdigest()
> file.close()
>
> If the file were huge, the file.read() would allocate a big string and
> thrash memory. (Yes, in 2011 that's still a problem, because these
> files could be movies and whatnot.)
>
> So if I do the stream trick - read one byte, update one byte, in a
> loop, then I'm essentially dragging that movie thru 8 bits of a 64 bit
> CPU. So that's the same problem; it would still be slow.

Yes. That is why you should read with a reasonable block size. Not too
small and not too big.

def filechunks(f, size=8192):
    while True:
        s = f.read(size)
        if not s: break
        yield s
    # f.close() # maybe...

import hashlib
file = open(path)
m = hashlib.md5()

fc = filechunks(file)
for chunk in fc:
    m.update(chunk)
digest = m.hexdigest()
file.close()

So you are reading in 8 kiB chunks. Feel free to modify this - maybe use
os.stat(path).st_blksize instead (which is AFAIK the recommended
minimum), or a value of about 1 MiB...
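A minimal sketch of that suggestion (not from the original post; st_blksize is
only available on Unix-like systems, and file_md5 and the 1 MiB fallback are
illustrative choices):

    import hashlib
    import os

    def file_md5(path):
        # Prefer the filesystem's reported I/O block size; fall back to 1 MiB.
        blocksize = getattr(os.stat(path), 'st_blksize', 0) or 1024 * 1024
        m = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(blocksize), b''):
                m.update(chunk)
        return m.hexdigest()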


> So now I try this:
>
> sum = os.popen('sha256sum %r' % path).read()

This is not as nice as the above, especially not with a path containing
strange characters. What about, at least,

def shellquote(*strs):
    return " ".join([
        "'" + st.replace("'", "'\\''") + "'"
        for st in strs
    ])

sum = os.popen('sha256sum %s' % shellquote(path)).read()


or, even better,

import subprocess
sp = subprocess.Popen(['sha256sum', path],
                      stdin=subprocess.PIPE, stdout=subprocess.PIPE)
sp.stdin.close() # generate EOF
sum = sp.stdout.read()
sp.wait()

?


> Does hashlib have a file-ready mode, to hide the streaming inside some
> clever DMA operations?

AFAIK not.


Thomas
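
For anyone reading this thread on a modern Python: 3.11 later added
hashlib.file_digest(), which does this chunked reading internally. A minimal
sketch, assuming path is defined as in the original post:

    import hashlib

    with open(path, 'rb') as f:          # requires Python 3.11+
        digest = hashlib.file_digest(f, 'md5').hexdigest()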

Chris Rebert

Jul 6, 2011, 2:44:28 AM
to Phlip, pytho...@python.org
On Tue, Jul 5, 2011 at 10:54 PM, Phlip <phli...@gmail.com> wrote:
> Pythonistas:
>
> Consider this hashing code:
>
>  import hashlib
>  file = open(path)
>  m = hashlib.md5()
>  m.update(file.read())
>  digest = m.hexdigest()
>  file.close()
>
> If the file were huge, the file.read() would allocate a big string and
> thrash memory. (Yes, in 2011 that's still a problem, because these
> files could be movies and whatnot.)
>
> So if I do the stream trick - read one byte, update one byte, in a
> loop, then I'm essentially dragging that movie thru 8 bits of a 64 bit
> CPU. So that's the same problem; it would still be slow.
>
> So now I try this:
>
>  sum = os.popen('sha256sum %r' % path).read()
>
> Those of you who like to lie awake at night thinking of new ways to
> flame abusers of 'eval()' may have a good vent, there.

Indeed (*eyelid twitch*). That one-liner is arguably better written as:
sum = subprocess.check_output(['sha256sum', path])
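
If only the hex digest is wanted, it can be split off the command's output; a
small sketch (the file name is a placeholder, not from the thread):

    import subprocess

    out = subprocess.check_output(['sha256sum', '/tmp/example.bin'])
    digest = out.split()[0].decode()   # sha256sum prints "<hexdigest>  <filename>\n"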

> Does hashlib have a file-ready mode, to hide the streaming inside some
> clever DMA operations?

Barring undocumented voodoo, no, it doesn't appear to. You could
always read from the file in suitably large chunks instead (rather
than byte-by-byte, which is indeed ridiculous); see
io.DEFAULT_BUFFER_SIZE and/or the os.stat() trick referenced therein
and/or the block_size attribute of hash objects.
http://docs.python.org/library/io.html#io.DEFAULT_BUFFER_SIZE
http://docs.python.org/library/hashlib.html#hashlib.hash.block_size

Cheers,
Chris
--
http://rebertia.com

Anssi Saari

Jul 6, 2011, 5:47:19 AM
Phlip <phli...@gmail.com> writes:

> If the file were huge, the file.read() would allocate a big string and
> thrash memory. (Yes, in 2011 that's still a problem, because these
> files could be movies and whatnot.)

I did a crc32 calculator like that and actually ran into some kind of
string length limit with large files. So I switched to 4k blocks and
the speed is about the same as a C implementation in the program
cksfv. Well, of course crc32 is usually done with a table lookup, so
it's always fast.

I just picked 4k, since it's the page size in x86 systems and also a
common block size for file systems. Seems to be big enough.
io.DEFAULT_BUFFER_SIZE is 8k here. I suppose using that would be the
proper way.

Adam Tauno Williams

Jul 6, 2011, 6:55:31 AM
to pytho...@python.org
On Tue, 2011-07-05 at 22:54 -0700, Phlip wrote:
> Pythonistas
> Consider this hashing code:
> import hashlib
> file = open(path)
> m = hashlib.md5()
> m.update(file.read())
> digest = m.hexdigest()
> file.close()
> If the file were huge, the file.read() would allocate a big string and
> thrash memory. (Yes, in 2011 that's still a problem, because these
> files could be movies and whatnot.)

Yes, the simple rule is do not *ever* file.read(). No matter what the
year this will never be OK. Always chunk reading a file into reasonable
I/O blocks.

For example I use this function to copy a stream and return a SHA512 and
the output stream's size:

def write(self, in_handle, out_handle):
    m = hashlib.sha512()
    data = in_handle.read(4096)
    while True:
        if not data:
            break
        m.update(data)
        out_handle.write(data)
        data = in_handle.read(4096)
    out_handle.flush()
    return (m.hexdigest(), in_handle.tell())

> Does hashlib have a file-ready mode, to hide the streaming inside some
> clever DMA operations?

Chunk it to something close to the block size of your underlying
filesystem.

Phlip

Jul 6, 2011, 9:47:02 AM
wow, tx y'all!

I forgot to mention that hashlib itself is not required; I could also
use Brand X. But y'all agree that blocking up the file in python adds
no overhead to hashing each block in C, so hashlib in a loop it is!

Phlip

Jul 6, 2011, 11:59:43 AM
Tx, all! But...

> For example I use this function to copy a stream and return a SHA512 and
> the output streams size:
>
>     def write(self, in_handle, out_handle):
>         m = hashlib.sha512()
>         data = in_handle.read(4096)
>         while True:
>             if not data:
>                 break
>             m.update(data)
>             out_handle.write(data)
>             data = in_handle.read(4096)
>         out_handle.flush()
>         return (m.hexdigest(), in_handle.tell())

The operation was a success but the patient died.

My version of that did not return the same hex digest as the md5sum
version:


def file_to_hash(path, m = hashlib.md5()):

    with open(path, 'r') as f:

        s = f.read(8192)

        while s:
            m.update(s)
            s = f.read(8192)

    return m.hexdigest()

You'll notice it has the same control flow as yours.

That number must eventually match an iPad's internal MD5 opinion of
that file, after it copies up, so I naturally cannot continue working
this problem until we see which of the two numbers the iPad likes!

Peter Otten

Jul 6, 2011, 12:26:09 PM
to pytho...@python.org
Phlip wrote:

- Open the file in binary mode.
- Do the usual dance for default arguments:
  def file_to_hash(path, m=None):
      if m is None:
          m = hashlib.md5()


Phlip

Jul 6, 2011, 12:49:10 PM
> - Open the file in binary mode.

I had tried open(path, 'rb') and it didn't change the "wrong" number.

And I added --binary to my evil md5sum version, and it didn't change
the "right" number!

Gods bless those legacy hacks that will never die, huh? But I'm using
Ubuntu (inside VMWare, on Win7, on a Core i7, because I rule), so that
might explain why "binary mode" is a no-op.

> - Do the usual dance for default arguments:
>     def file_to_hash(path, m=None):
>         if m is None:
>             m = hashlib.md5()

Not sure why, if that's what the defaulter does? I did indeed get an
MD5-style string of what casually appeared to be the right length, so
that implies the defaulter is not to blame...

Peter Otten

Jul 6, 2011, 1:06:13 PM
to pytho...@python.org
Phlip wrote:

>> - Open the file in binary mode.
>
> I had tried open(path, 'rb') and it didn't change the "wrong" number.
>
> And I added --binary to my evil md5sum version, and it didn't change
> the "right" number!
>
> Gods bless those legacy hacks that will never die, huh? But I'm using
> Ubuntu (inside VMWare, on Win7, on a Core i7, because I rule), so that
> might explain why "binary mode" is a no-op.

Indeed. That part was a defensive measure mostly meant to make your function
Windows-proof.

>> - Do the usual dance for default arguments:
>> def file_to_hash(path, m=None):
>>     if m is None:
>>         m = hashlib.md5()
>
> Not sure why if that's what the defaulter does? I did indeed get an
> MD5-style string of what casually appeared to be the right length, so
> that implies the defaulter is not to blame...

The first call will give you the correct checksum; the second will not. As the
default md5 instance remembers the state from the previous function call,
you'll get the checksum of both files combined.
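
A tiny demonstration of that effect, using byte strings instead of files (the
function name is made up for illustration, not from the thread):

    import hashlib

    def digest(data, m=hashlib.md5()):   # the problematic default
        m.update(data)
        return m.hexdigest()

    print(digest(b'spam'))                       # md5 of b'spam' -- correct
    print(digest(b'eggs'))                       # md5 of b'spameggs', NOT of b'eggs'
    print(hashlib.md5(b'eggs').hexdigest())      # what the second call was expected to return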

Phlip

Jul 6, 2011, 1:38:10 PM
> >> def file_to_hash(path, m=None):
> >>     if m is None:
> >>         m = hashlib.md5()
>
> The first call will give you the correct checksum; the second will not. As the
> default md5 instance remembers the state from the previous function call,
> you'll get the checksum of both files combined.

Ouch. That was it.

Python sucks. m = md5() looks like an initial assignment, not a
special magic storage mode. Principle of least surprise fail, and
principle of most helpful default behavior fail.

Chris Torek

Jul 6, 2011, 1:54:07 PM
>> - Do the usual dance for default arguments:
>> def file_to_hash(path, m=None):
>>     if m is None:
>>         m = hashlib.md5()

[instead of

def file_to_hash(path, m = hashlib.md5()):

]

In article <b317226a-8008-4177...@e20g2000prf.googlegroups.com>


Phlip <phli...@gmail.com> wrote:
>Not sure why if that's what the defaulter does?

For the same reason that:

def spam(somelist, so_far = []):
    for i in somelist:
        if has_eggs(i):
            so_far.append(i)
    return munch(so_far)

is probably wrong. Most beginners appear to expect this to take
a list of "things that pass my has_eggs test", add more things
to that list, and return whatever munch(adjusted_list) returns ...
which it does. But then they *also* expect:

result1_on_clean_list = spam(list1)
result2_on_clean_list = spam(list2)
result3_on_partly_filled_list = spam(list3, prefilled3)

to run with a "clean" so_far list for *each* of the first two
calls ... but it does not; the first call starts with a clean
list, and the second one starts with "so_far" containing all
the results accumulated from list1.

(The third call, of course, starts with the prefilled3 list and
adjusts that list.)
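
A runnable sketch of that behaviour, with trivial stand-ins for has_eggs() and
munch() (neither stand-in is from the thread):

    def spam(somelist, so_far=[]):
        for i in somelist:
            if i > 0:               # stand-in for has_eggs(i)
                so_far.append(i)
        return list(so_far)         # stand-in for munch(so_far)

    print(spam([1, -2, 3]))         # [1, 3]
    print(spam([4]))                # [1, 3, 4] -- so_far still holds the earlier results
    print(spam([5], []))            # [5] -- an explicitly passed list is unaffected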

>I did indeed get an MD5-style string of what casually appeared
>to be the right length, so that implies the defaulter is not to
>blame...

In this case, if you do:

print('big1:', file_to_hash('big1'))
print('big2:', file_to_hash('big2'))

you will get two md5sum values for your two files, but the
md5sum value for big2 will not be the equivalent of "md5sum big2"
but rather that of "cat big1 big2 | md5sum". The reason is
that you are re-using the md5-sum-so-far on the second call
(for file 'big2'), so you have the accumulated sum from file
'big1', which you then update via the contents of 'big2'.
--
In-Real-Life: Chris Torek, Wind River Systems
Intel require I note that my opinions are not those of WRS or Intel
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: gmail (figure it out) http://web.torek.net/torek/index.html

Andrew Berg

Jul 6, 2011, 2:42:12 PM
to comp.lang.python
On 2011.07.06 12:38 PM, Phlip wrote:
> Python sucks. m = md5() looks like an initial assignment, not a
> special magic storage mode. Principle of least surprise fail, and
> principle of most helpful default behavior fail.
func() = whatever the function returns
func = the function object itself (in Python, everything's an object)

Maybe you have Python confused with another language (I don't know what
exactly you mean by initial assignment). Typically one does not need
more than one name for a function/method. When a function/method is
defined, it gets created as a function object and bound to a name in the
namespace in which it's defined.

Phlip

Jul 6, 2011, 3:07:56 PM

If I call m = md5() twice, I expect two objects.

I am now aware that Python bends the definition of "call" based on
where the line occurs. Principle of least surprise.

Ian Kelly

Jul 6, 2011, 3:48:36 PM
to Phlip, pytho...@python.org
On Wed, Jul 6, 2011 at 1:07 PM, Phlip <phli...@gmail.com> wrote:
> If I call m = md5() twice, I expect two objects.
>
> I am now aware that Python bends the definition of "call" based on
> where the line occurs. Principle of least surprise.

There is no definition-bending. The code:

"""
def file_to_hash(path, m = hashlib.md5()):

    # do stuff...

file_to_hash(path1)
file_to_hash(path2)
"""

does not call hashlib.md5 twice. It calls it *once*, at the time the
file_to_hash function is defined. The returned object is stored on
the function object, and that same object is passed into file_to_hash
as a default value each time the function is called. See:

http://docs.python.org/reference/compound_stmts.html#function

Ethan Furman

Jul 6, 2011, 4:24:31 PM
to pytho...@python.org
Phlip wrote:
>> On 2011.07.06 12:38 PM, Phlip wrote:
>>> Python sucks. m = md5() looks like an initial assignment, not a
>>> special magic storage mode. Principle of least surprise fail, and
>>> principle of most helpful default behavior fail.
>>>
>
> If I call m = md5() twice, I expect two objects.

You didn't call md5 twice -- you called it once when you defined the
function.

Phlip's naive code:
---

def file_to_hash(path, m = hashlib.md5()):
                           \-----------/
                            happens once, when
                            the def line is executed


If you want separate md5 objects, don't create just one when you create
the function, create one inside the function:

def file_to_hash(path, m = None):
    if m is None:
        m = hashlib.md5()


You should try the Principle of Learning the Language.

~Ethan~

Andrew Berg

Jul 6, 2011, 3:42:30 PM
to comp.lang.python
On 2011.07.06 02:07 PM, Phlip wrote:
> If I call m = md5() twice, I expect two objects.
You get two objects because you make the function run again. Of course,
the first one is garbage collected if it doesn't have another reference.

>>> m1 = hashlib.md5()
>>> m2 = hashlib.md5()
>>> m1 is m2
False

Are you assuming Python acts like another language or is there something
confusing in the docs or something else?

Mel

Jul 6, 2011, 3:43:36 PM
Phlip wrote:

> If I call m = md5() twice, I expect two objects.
>
> I am now aware that Python bends the definition of "call" based on
> where the line occurs. Principle of least surprise.

Actually, in

def file_to_hash(path, m = hashlib.md5()):

hashlib.md5 *is* called once; that is when the def statement is executed.

Later on, when file_to_hash gets called, the value of m is either used as
is, as the default parameter, or is replaced for the duration of the call by
another object supplied by the caller.

Mel.

geremy condra

Jul 6, 2011, 3:15:54 PM
to Phlip, pytho...@python.org

Python doesn't do anything to the definition of call. If you call
hashlib.md5() twice, you get two objects:

>>> import hashlib


>>> m1 = hashlib.md5()
>>> m2 = hashlib.md5()

>>> id(m1)
139724897544712
>>> id(m2)
139724897544880

Geremy Condra

Carl Banks

Jul 6, 2011, 4:25:37 PM
On Wednesday, July 6, 2011 12:07:56 PM UTC-7, Phlip wrote:
> If I call m = md5() twice, I expect two objects.
>
> I am now aware that Python bends the definition of "call" based on
> where the line occurs. Principle of least surprise.

Phlip:

We already know about this violation of the least surprise principle; most of us acknowledge it as a small blip in an otherwise straightforward and clean language. (Incidentally, fixing it would create different surprises, but probably much less common ones.)

We've helped you with your problem, but you risk alienating those who helped you when you badmouth the whole language on account of this one thing, and you might not get such prompt help next time. So try to be nice.

You are wrong about Python bending the definition of "call", though. Surprising though it be, the Python language is very explicit that the default arguments are executed only once, when creating the function, *not* when calling it.


Carl Banks

Phlip

Jul 6, 2011, 5:07:47 PM
On Jul 6, 1:25 pm, Carl Banks <pavlovevide...@gmail.com> wrote:

> We already know about this violation of the least surprise principle; most of us acknowledge it as small blip in an otherwise straightforward and clean language.

Here's the production code we're going with - thanks again all:


def file_to_hash(path, hash_type=hashlib.md5):
    """
    Per: http://groups.google.com/group/comp.lang.python/browse_thread/thread/ea1c46f77ac1738c
    """
    hash = hash_type()

    with open(path, 'rb') as f:

        while True:
            s = f.read(8192)  # CONSIDER: io.DEFAULT_BUFFER_SIZE
            if not s: break
            hash.update(s)

    return hash.hexdigest()

Note the fix also avoids comparing to None, which, as usual, is also
icky and less typesafe!

(And don't get me started about the extra lines needed to avoid THIS
atrocity!

while s = f.read(8192):
    hash.update(s)

;)
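
(As it happens, later Pythons address exactly that complaint: 3.8 added
assignment expressions, so the loop can be written without the break. A sketch,
assuming f and hash are set up as in the function above:)

    while s := f.read(8192):    # Python 3.8+ only
        hash.update(s)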

Steven D'Aprano

Jul 6, 2011, 7:16:35 PM
Phlip wrote:

> Note the fix also avoids comparing to None, which, as usual, is also
> icky and less typesafe!

"Typesafe"? Are you trying to make a joke?

--
Steven

Anssi Saari

Jul 7, 2011, 7:29:38 AM
Mel <mwi...@the-wire.com> writes:

> def file_to_hash(path, m = hashlib.md5()):
>
> hashlib.md5 *is* called once; that is when the def statement is executed.

Very interesting, I certainly wasn't clear on this. So after that def,
the created hashlib object is kept on the function object and can be
accessed via file_to_hash.__defaults__[0].
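
A quick interactive check of that, with a throwaway function body (not from the
thread):

    import hashlib

    def file_to_hash(path, m=hashlib.md5()):
        pass   # body doesn't matter for this demonstration

    print(file_to_hash.__defaults__)   # (<md5 HASH object ...>,)
    d1 = file_to_hash.__defaults__[0]
    d2 = file_to_hash.__defaults__[0]
    print(d1 is d2)                    # True: one object, created at def time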

Paul Rudin

Jul 7, 2011, 8:13:22 AM
Anssi Saari <a...@sci.fi> writes:

This is also why you have to be a bit careful if you use e.g. [] or {} as a
default argument - if you then modify these things within the function
you might not end up with what you expect - it's the same list or
dictionary each time the function is called. So to avoid that kind of
thing you end up with code like:

def foo(bar=None):
    if bar is None:
        bar = []
    ...

Andrew Berg

Jul 7, 2011, 8:50:08 AM
to comp.lang.python
Maybe he has a duck phobia. Maybe he denies the existence of ducks.
Maybe he doesn't like the sound of ducks. Maybe he just weighs the same
as a duck. In any case, duck tolerance is necessary to use Python
effectively.


On a side note, it turns out there's no word for the fear of ducks. The
closest phobia is anatidaephobia, which is the fear of being /watched/
by a duck.

Phlip

Jul 7, 2011, 9:11:19 AM
> On 2011.07.06 06:16 PM, Steven D'Aprano wrote:
> > Phlip wrote:
>
> > > Note the fix also avoids comparing to None, which, as usual, is also
> > > icky and less typesafe!
>
> > "Typesafe"? Are you trying to make a joke?

No, I was pointing out that passing a type is more ... typesafe.

Andrew Berg

Jul 7, 2011, 9:24:58 AM
to comp.lang.python
On 2011.07.07 08:11 AM, Phlip wrote:
> No, I was pointing out that passing a type is more ... typesafe.
None is a type.

>>> None.__class__
<class 'NoneType'>

Phlip

Jul 7, 2011, 9:39:54 AM
On Jul 7, 6:24 am, Andrew Berg <bahamutzero8...@gmail.com> wrote:
> On 2011.07.07 08:11 AM, Phlip wrote:
> > No, I was pointing out that passing a type is more ... typesafe.
>
> None is a type.

I never said it wasn't.

Andrew Berg

Jul 7, 2011, 9:58:48 AM
to comp.lang.python
You are talking about this code, right?

def file_to_hash(path, m=None):
    if m is None:
        m = hashlib.md5()

What's not a type? The is operator compares types (m's value isn't the
only thing compared here; even a separate instance of the exact same
type would make it return False), and m can't be undefined.

Steven D'Aprano

Jul 7, 2011, 9:46:12 PM
Andrew Berg wrote:

> On 2011.07.07 08:39 AM, Phlip wrote:
>> On Jul 7, 6:24 am, Andrew Berg <bahamutzero8...@gmail.com> wrote:
>> > On 2011.07.07 08:11 AM, Phlip wrote:> No, I was pointing out that
>> > passing a type is more ... typesafe.
>> >
>> > None is a type.
>>
>> I never said it wasn't.

Unfortunately, it isn't.

None is not a type, it is an instance.

>>> isinstance(None, type) # is None a type?
False
>>> isinstance(None, type(None)) # is None an instance of None's type?
True

So None is not itself a type, although it *has* a type:

>>> type(None)
<type 'NoneType'>
>>> isinstance(type(None), type) # is NoneType itself a type?
True


> You are talking about this code, right?
>
> def file_to_hash(path, m=None):
>     if m is None:
>         m = hashlib.md5()
>
> What's not a type? The is operator compares types (m's value isn't the
> only thing compared here; even a separate instance of the exact same
> type would make it return False), and m can't be undefined.

The is operator does not compare types, it compares instances for identity.
There is no need for is to ever care about the type of the arguments --
that's just a waste of time, since a fast identity (memory location) test
is sufficient.

This is why I initially thought that Phlip was joking when he suggested
that "m is None" could be type-unsafe. It doesn't matter what type m
has, "m is <anything>" will always be perfectly safe.


--
Steven

Andrew Berg

Jul 7, 2011, 10:32:59 PM
to comp.lang.python
On 2011.07.07 08:46 PM, Steven D'Aprano wrote:
> None is not a type, it is an instance.
>
> >>> isinstance(None, type) # is None a type?
> False
> >>> isinstance(None, type(None)) # is None an instance of None's type?
> True
>
> So None is not itself a type, although it *has* a type:
>
> >>> type(None)
> <type 'NoneType'>
> >>> isinstance(type(None), type) # is NoneType itself a type?
> True
I worded that poorly. None is (AFAIK) the only instance of NoneType, but
I should've clarified the difference.

> The is operator does not compare types, it compares instances for identity.
> There is no need for is to ever care about the type of the arguments --
> that's just a waste of time, since a fast identity (memory location) test
> is sufficient.
"Compare" was the wrong word. I figured the interpreter doesn't
explicitly compare types, but obviously identical instances are going to
be of the same type.

Phlip

Jul 7, 2011, 11:26:56 PM
> I worded that poorly. None is (AFAIK) the only instance of NoneType, but
> I should've clarified the difference.
> The is operator does not compare types, it compares instances for identity.

None is typesafe, because it's strongly typed.

However, what's even MORE X-safe (for various values of X) is a method
that takes LESS for its arguments. That's why I switched from passing
an object to passing a type, because the more restrictive argument
type is more typesafe.

However, the MOST X-safe version so far simply passes a string, and
uses hashlib the way it is designed to be used:

def file_to_hash(path, hash_type):

    hash = hashlib.new(hash_type)

    with open(path, 'rb') as f:
        while True:
            s = f.read(8192)
            if not s: break
            hash.update(s)
    return hash.hexdigest()

Steven D'Aprano

Jul 8, 2011, 3:42:47 AM
Phlip wrote:

>> I worded that poorly. None is (AFAIK) the only instance of NoneType, but
>> I should've clarified the difference.
>> The is operator does not compare types, it compares instances for identity.
>
> None is typesafe, because it's strongly typed.

Everything in Python is strongly typed. Why single out None?

Python has strongly-typed objects, dynamically typed variables, and a
philosophy of preferring duck-typing over explicit type checks when
possible.


> However, what's even MORE X-safe (for various values of X) is a method
> that takes LESS for its arguments. That's why I switched from passing
> an object to passing a type, because the more restrictive argument
> type is more typesafe.

It seems to me that you are defeating duck-typing, and needlessly
restricting what the user can pass, for dubious or no benefit. I still
don't understand what problems you think you are avoiding with this tactic.

> However, the MOST X-safe version so far simply passes a string, and
> uses hashlib the way it is designed to be used:
>
> def file_to_hash(path, hash_type):
>     hash = hashlib.new(hash_type)
>     with open(path, 'rb') as f:
>         while True:
>             s = f.read(8192)
>             if not s: break
>             hash.update(s)
>     return hash.hexdigest()


There is no advantage to this that I can see. It limits the caller to using
only hashes in hashlib. If the caller wants to provide her own hashing
algorithm, your function will not support it.

A more reasonable polymorphic version might be:

def file_to_hash(path, hash='md5', blocksize=8192):
    # FIXME is md5 a sensible default hash?
    if isinstance(hash, str):
        # Allow the user to specify the hash by name.
        hash = hashlib.new(hash)
    else:
        # Otherwise hash must be an object that implements the
        # hashlib interface, i.e. a callable that returns an
        # object with appropriate update and hexdigest methods.
        hash = hash()

    with open(path, 'rb') as f:
        while True:
            s = f.read(blocksize)
            if not s: break
            hash.update(s)
    return hash.hexdigest()
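
Hypothetical usage of that version (the path is a placeholder, and it assumes
hashlib has been imported alongside the function):

    print(file_to_hash('/tmp/example.bin'))                   # default: md5 by name
    print(file_to_hash('/tmp/example.bin', 'sha256'))         # any algorithm hashlib.new() knows
    print(file_to_hash('/tmp/example.bin', hashlib.sha512))   # or any hashlib-style callable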

--
Steven

Phlip

Jul 8, 2011, 9:03:54 AM
On Jul 8, 12:42 am, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> Phlip wrote:
> >> I worded that poorly. None is (AFAIK) the only instance of NoneType, but
> >> I should've clarified the difference.
> >> The is operator does not compare types, it compares instances for identity.
>
> > None is typesafe, because it's strongly typed.
>
> Everything in Python is strongly typed. Why single out None?

You do understand these cheap shots are bad for conversations, right?

I didn't single out None. When did you stop raping your mother?

Steven D'Aprano

Jul 9, 2011, 12:44:48 AM
Phlip wrote:

Phlip, I'm not an idiot, please don't pee on my leg and tell me it's
raining. In the very sentence you quote above, you clearly and obviously
single out None:

"None is typesafe, because it's strongly typed."

Yes, None is strongly typed -- like everything else in Python. I don't
understand what point you are trying to make. Earlier you claimed that
identity testing for None is type-unsafe (or at least *less* type-safe,
whatever that means):

"Note the fix also avoids comparing to None, which, as usual, is also icky
and less typesafe!"

then you say None is type-safe -- if there is a coherent message in your
posts, it is too cryptic for me.


> When did you stop raping your mother?

What makes you think I've stopped?

--
Steven
