I recently found myself explaining to a friend how Ruby's various comparison operators work. In the process, I tried to find some decent documentation on why Ruby has so many different ways to test for equality, how they differ and how they should be implemented and used.
I was unable to find any such documentation, so I decided to have a go myself :-) You can see the fruits of my labours here:
I believe that it's an accurate reflection of both how things work and the philosophy underlying the design of this area. I would be very grateful if you could let me know if I have anything wrong or have left anything out though!
There are a number of subtle pitfalls in this area waiting to trap the unwary. I hope that this may go some way to helping a few people avoid them :-)
Comments, criticisms and suggestions all gratefully received.
> I recently found myself explaining to a friend how Ruby's various > comparison operators work. In the process, I tried to find some decent > documentation on why Ruby has so many different ways to test for > equality, how they differ and how they should be implemented and used.
> I was unable to find any such documentation, so I decided to have a go > myself :-) You can see the fruits of my labours here:
> I believe that it's an accurate reflection of both how things work and > the philosophy underlying the design of this area. I would be very > grateful if you could let me know if I have anything wrong or have left > anything out though!
> There are a number of subtle pitfalls in this area waiting to trap the > unwary. I hope that this may go some way to helping a few people avoid > them :-)
> Comments, criticisms and suggestions all gratefully received.
Thanks for this. I found it great for a quick reference and to explain it to someone else :-).
There are a couple of typos that you might want to correct:
Note that this means that, unlike the other methods we're considering here, this means that === won't in general be commutative:
String === 'foo' => true 'foo' == String => false
Should be 'foo' === String (three equals).
[...] In the vast majority of cases, you will either want to test for "natural" equality (===)
Nobody has (yet) risen to the challenge at the end of my article, so I thought that I'd ask here :-)
Just about every class in the standard library implements == and eql? as I describe in the article, i.e. eql? tests for equal values and == tests for "natural" equality (which normally means equal values).
On 10/17/07, Paul Butcher <p...@texperts.com> wrote:
> Nobody has (yet) risen to the challenge at the end of my article, so I > thought that I'd ask here :-)
> Just about every class in the standard library implements == and eql? as > I describe in the article, i.e. eql? tests for equal values and == tests > for "natural" equality (which normally means equal values). ..
> Hash, however, is an exception. Hash#== tests for equal values. > Hash.eql?, however, tests for object identity: .. > Why is hash the odd one out? I'm sure that there must be a good reason > (Matz?) but I can't at the moment work out what it might be.
> I'd be very grateful for any light anyone could cast on this. Thanks!
1) Using hashs as keys in another hash is not a common use case. I'm a little hard-pressed to think of why I'd want to, although I'm famous for lack of imagination. 2) Because of the requirement that obj1.eql? obj2 => obj1.hash == obj2.hash, implementing Hash#hash requires iterating over the keys and values and would be fairly expensive and make accessing a hash with hash keys by key impractical.
On Wed, Oct 17, 2007 at 09:12:26PM +0900, Rick DeNatale wrote:
> 1) Using hashs as keys in another hash is not a common use case. I'm a > little hard-pressed to think of why I'd want to, although I'm famous > for lack of imagination. > 2) Because of the requirement that obj1.eql? obj2 => obj1.hash == > obj2.hash, implementing Hash#hash requires iterating over the keys and > values and would be fairly expensive and make accessing a hash with > hash keys by key impractical.
foo.eql? still seems a little inconsistent (and surprising) to me. Essentially, it's == except when comparing hashes, at which point it suddenly becomes foo.equal?. Is there some instance in which foo.eql? and foo.equal? evaluate differently for hashes that escapes me at the moment? Is there a particular use-case for foo.eql? (which for the moment looks to me like the red-headed stepchild of Ruby equality) that isn't satisfied by any of the other equality comparison methods?
While we're discussing things with equal signs -- why isn't = a method?
-- CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ] print substr("Just another Perl hacker", 0, -2);
> On 10/17/07, Paul Butcher <p...@texperts.com> wrote: > > Just about every class in the standard library implements == and eql? as > > I describe in the article, i.e. eql? tests for equal values and == tests > > for "natural" equality (which normally means equal values).
> > Hash, however, is an exception. Hash#== tests for equal values. > > Hash.eql?, however, tests for object identity:
> > Why is hash the odd one out? I'm sure that there must be a good reason > > (Matz?) but I can't at the moment work out what it might be.
> I think the reason is twofold:
> 1) Using hashs as keys in another hash is not a common use case. I'm a > little hard-pressed to think of why I'd want to, although I'm famous > for lack of imagination.
I've wanted it on 3 occasions (that I can remember) now. Here's a contrived example derived from the real-world use case I can no longer remember:
You have a file like this... alpha,beta,15 gamma,delta,3 beta,alpha,4 alpha,alpha,3 delta,alpha,5 gamma,delta,7 ...and you want to sum up the numbers for each unique pair of greek letters. Naively, I'd do (and initially tried) something like:
I believe I instead resorted to sorting the keys and using a nested hash to drill down to the value. It was annoying.
> 2) Because of the requirement that obj1.eql? obj2 => obj1.hash == > obj2.hash, implementing Hash#hash requires iterating over the keys and > values and would be fairly expensive and make accessing a hash with > hash keys by key impractical.
That logic seems slightly mothering, though. "Ruby prevents you from doing A because if you did A it might be slow." Ruby doesn't prevent me from writing: my_huge_array.delete_if{ |v1| my_huge_array.find{ |v2| (v1 - v2).abs < mu } } I suppose the distinction is that the above is a foolish pairing of individually-reasonable parts, while Hash#hash is an atomic method written to optimize speed for one (reasonably useless) use case at the expense of allowing another use case.
As a related aside: Having never written a hashing function, I'm uncertain how I'd write Hash#hash in a way that reasonably prevented two hashes with different keys and/or values from ending up with the same value. (Multiply the .hash values of all keys and values in the Hash and then mod them on a big prime number?) Has anyone taken a stab at implementing this?
> On Oct 17, 6:12 am, "Rick DeNatale" <rick.denat...@gmail.com> wrote: > > On 10/17/07, Paul Butcher <p...@texperts.com> wrote: > > > Just about every class in the standard library implements == and eql? as > > > I describe in the article, i.e. eql? tests for equal values and == tests > > > for "natural" equality (which normally means equal values).
> > > Hash, however, is an exception. Hash#== tests for equal values. > > > Hash.eql?, however, tests for object identity:
> > > Why is hash the odd one out? I'm sure that there must be a good reason > > > (Matz?) but I can't at the moment work out what it might be.
> > I think the reason is twofold:
> > 1) Using hashs as keys in another hash is not a common use case. I'm a > > little hard-pressed to think of why I'd want to, although I'm famous > > for lack of imagination.
> I've wanted it on 3 occasions (that I can remember) now. Here's a > contrived example derived from the real-world use case I can no longer > remember:
> You have a file like this... > alpha,beta,15 > gamma,delta,3 > beta,alpha,4 > alpha,alpha,3 > delta,alpha,5 > gamma,delta,7 > ...and you want to sum up the numbers for each unique pair of greek > letters. Naively, I'd do (and initially tried) something like:
> > 2) Because of the requirement that obj1.eql? obj2 => obj1.hash == > > obj2.hash, implementing Hash#hash requires iterating over the keys and > > values and would be fairly expensive and make accessing a hash with > > hash keys by key impractical.
> That logic seems slightly mothering, though. "Ruby prevents you from > doing A because if you did A it might be slow." Ruby doesn't prevent > me from writing: > my_huge_array.delete_if{ |v1| > my_huge_array.find{ |v2| (v1 - v2).abs < mu } > } > I suppose the distinction is that the above is a foolish pairing of > individually-reasonable parts, while Hash#hash is an atomic method > written to optimize speed for one (reasonably useless) use case at the > expense of allowing another use case.
I suspect that it's just pragmatism
> As a related aside: > Having never written a hashing function, I'm uncertain how I'd write > Hash#hash in a way that reasonably prevented two hashes with different > keys and/or values from ending up with the same value. (Multiply > the .hash values of all keys and values in the Hash and then mod them > on a big prime number?) Has anyone taken a stab at implementing this?
That's not the requirement, the requirement is that if two objects are eql? then their hashes must also be ==, there's nothing to prevent two objects which aren't eql? to have the same hash value.
In fact let me offer:
require "benchmark"
DATA = <<-END alpha,beta,15 gamma,delta,3 beta,alpha,4 alpha,alpha,3 delta,alpha,5 gamma,delta,7 END
def hash_key_impl sums = Hash.new(0) DATA.each{ |line| _, g1, g2, num = /(\w+),(\w+),(\d+)/.match( line ).to_a sums[ { g1=>true, g2=>true } ] += num.to_i } sums end
class HashableHash < Hash def hash to_a.sort.hash end
def eql?(other) self == other end end
def hashable_key_impl sums = Hash.new(0) DATA.each{ |line| _, g1, g2, num = /(\w+),(\w+),(\d+)/.match( line ).to_a sums[ HashableHash[g1=>true, g2=>true ] ] += num.to_i } sums end
class HashableHash2 < Hash def hash 1 end
def eql?(other) self == other end end
def hashable2_key_impl sums = Hash.new(0) DATA.each{ |line| _, g1, g2, num = /(\w+),(\w+),(\d+)/.match( line ).to_a sums[ HashableHash2[g1=>true, g2=>true ] ] += num.to_i } sums end
def array_key_impl sums = Hash.new(0) DATA.each{ |line| _, g1, g2, num = /(\w+),(\w+),(\d+)/.match( line ).to_a sums[ [g1,g2].sort ] += num.to_i } sums end
p hash_key_impl p hashable_key_impl p hashable2_key_impl p array_key_impl
TESTS = 1000 Benchmark.bmbm do |results|
results.report("hash key:") do TESTS.times do hash_key_impl end end
results.report("hashable key:") do TESTS.times do hashable_key_impl end end
results.report("hashable2 key:") do TESTS.times do hashable2_key_impl end end
results.report("array key:") do TESTS.times do array_key_impl end end
user system total real hash key: 0.050000 0.000000 0.050000 ( 0.050082) hashable key: 0.100000 0.010000 0.110000 ( 0.101206) hashable2 key: 0.070000 0.000000 0.070000 ( 0.077322) array key: 0.050000 0.000000 0.050000 ( 0.051378)
The original hash implementation performs almost exactly the same as my simple substitution of sorted arrays as the keys, but it has the distinct disadvantage of getting the wrong answer
I did two different implementations of a HashableHash which use == for eql? and differ in that one uses information in the hash to compute a hash value, while the other returns a constant value for hash.
Both are slower than the array key implementation. HashableHash2 is somewhat faster than HashableHash since the hash method runs in constant time, but I'm not sure that mapping the hash value to a constant is a good idea.
On 10/17/07, Chad Perrin <per...@apotheon.com> wrote:
> While we're discussing things with equal signs -- why isn't = a method?
I assume you mean in the execution of an expression like
a = b
as opposed to
a.b = c
Now in the first case, what object is sent the message :=
Remember that variable aren't objects, they are references to objects, and assignment changes the binding of a variable it doesn't operate on an object.
On Thu, Oct 18, 2007 at 04:56:48AM +0900, Rick DeNatale wrote:
> Remember that variable aren't objects, they are references to objects, > and assignment changes the binding of a variable it doesn't operate on > an object.
Oops. I guess I should have thought of that.
Thanks.
-- CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ] John Kenneth Galbraith: "If all else fails, immortality can always be assured through spectacular error."
Rick Denatale wrote: > def hash > to_a.sort.hash > end
This requires the hash key and value to implement <=> which will not be always the case and thus breaks.
> def hash > 1 > end > ... > but I'm not sure that mapping the hash value to a > constant is a good idea.
I'm sure it is not. That way every Hash as key is stored in the same bucket, essentially making it no different from an Array in the way keys are looked up (only advantage is, that you don't have the method call overhead you'd have with an Array solution).
On 10/18/07, Stefan Rusterholz <apei...@gmx.net> wrote:
> Rick Denatale wrote: > > def hash > > to_a.sort.hash > > end
> This requires the hash key and value to implement <=> which will not be > always the case and thus breaks.
Correct, depending on the individual hash. On the other hand this is yet another aspect of the subtle differences between the 'type' of an object and its class, and I don't think that such things are probably unavoidable is such an evil thing:
> > def hash > > 1 > > end > > ... > > but I'm not sure that mapping the hash value to a > > constant is a good idea.
> I'm sure it is not. That way every Hash as key is stored in the same > bucket, essentially making it no different from an Array in the way keys > are looked up (only advantage is, that you don't have the method call > overhead you'd have with an Array solution).
Well not exactly, it means that there will always be a collision, followed by searching in the hash.
I was actually subtly reinforcing the point that performance considerations might well have driven the decision to make Hash#eql? be an alias for equal? To be fair, my naive constant hash could be improved by using some inexpensive quality which did partitioned hashes into a greater number of equivalence classes, for example perhaps: class Hash def hash length end end
Although this would be been pretty ineffective in Phrogz' use-case
It might be of interest to note that Smalltalk has both a Dictionary (which is pretty much equivalent to Ruby's Hash) and an IdentityDictionary which uses object-identity instead of equality to differentiate keys.
> Chad Perrin wrote: > On Wed, Oct 17, 2007 at 09:12:26PM +0900, Rick DeNatale wrote: >> 1) Using hashs as keys in another hash is not a common use case. >> I'm a >> littlehard-pressed to think of why I'd want to, although I'm famous >> for lack of imagination. >> 2) Because of the requirement that obj1.eql? obj2 => obj1.hash == >> obj2.hash, implementing Hash#hash requires iterating over the keys >> and >> values and would be fairly expensive and make accessing a hash with >> hash keys by key impractical.
> foo.eql? still seems a little inconsistent (and surprising) to me. > Essentially, it's == except when comparing hashes, at which point it > suddenly becomes foo.equal?.
This isn't really true. I just spent some time looking at the 1.8 sources for Object/Array/String/Hash/Set in order to understand things better and found some interesting insights into eql? vs == that I haven't seen mentioned before.
Here is how I would characterize equal?, eql? and ==
equal? object identity test eql? strict equality test (defaults to identity) == loose equality test (defaults to identity)
Strict equality for Object, Array, String, and Hash means that the class of the objects are identical. I mention this because that is not the case with loose equality.
Array and String define 'strict equality' based on their contents. The containers don't have have to be identical but the contents must follow the 'strict equality' rule. For arrays, this means considering x.eql? (y) for all the corresponding elements. For strings, it just means a byte comparison of the contents.
Object and Hash define 'strict equality' as 'object identity'. The other messages in the thread give some reasons why this is a reasonable definition for Hash (i.e. why the contents are ignored).
The more interesting concept is 'loose equality' defined by ==. What I didn't know and what isn't clear from the documentation is that Array, String, and Hash all implement a coercion 'protocol' for #==. If the object on the right hand side responds to to_ary, to_str, or to_hash respectively, then the receiver and argument are reversed and #== is called again.
str == str_pretender
becomes
str_pretender == str
at which point the implementation of == as provided by the 'pretender' comes into play. This little trick means that can you construct your own classes that pretend to be arrays, strings, or hashes and == will commute correctly with the standard classes as long as you take care to define to_ary/to_str/to_hash and ==. Nice.
In addition, it is important to note that while Hash#== uses #eql? to compare *keys* it uses #== to compare values:
Because #== is used on the values, the coercion protocol I mentioned above will come into play when comparing hash contents.
I didn't look at <=> as closely but it looks like Array#<=> will happily convert the right hand side via #to_ary if available. String#<=> looks for to_str and then applies the coercion protocol as with == taking care to reverse the results. Objects and Hashes don't have a natural ordering so <=> isn't defined.
Set
Set is a bit strange. Set#eql? depends on Hash#eql? on an internal hash but it isn't possible for two sets to share the same internal hash and since Hash#eql? is based on object identity it isn't possible for Set#eql? to ever return true, which means that 'strict equality' for sets is the same as object identity. If the definition of Hash#eql? changed, then Set#eql? would follow. Set#== doesn't implement a coercion protocol and sets aren't ordered so no Set#<=>.