Re: [julia-users] The unique function and iterables of custom composite types

Matt Bauman

unread,

Jul 16, 2015, 12:30:49 PM7/16/15

to julia...@googlegroups.com

On Thursday, July 16, 2015 at 12:25:01 PM UTC-4, Matt Bauman wrote:

On Thursday, July 16, 2015 at 12:19:25 PM UTC-4, milktrader wrote:
Also, back to the OP question, is the correct solution to simply define

Base.hash(f::Foo) = f.x

No, I'd define Base.hash(f::Foo) = hash(f.x, 0x64c74221932dea5b), where I chose the constant by rand(UInt). This way it won't collide with other types. I really need to spend a bit more time with my interfaces chapter and add this info there.

Gah, that won't work on 32 bit, sorry. The current idiom for this in base is something like:

const _foo_seed = UInt === UInt64 ? 0x64c74221932dea5b : 0x80783eb4
Base.hash(f::Foo) = hash(f.x, _foo_seed)

It'd be nice to make this easier.

Seth

unread,

Jul 16, 2015, 12:35:25 PM7/16/15

to julia...@googlegroups.com

On Thursday, July 16, 2015 at 9:25:01 AM UTC-7, Matt Bauman wrote:

On Thursday, July 16, 2015 at 12:19:25 PM UTC-4, milktrader wrote:
Also, back to the OP question, is the correct solution to simply define

Base.hash(f::Foo) = f.x

No, I'd define Base.hash(f::Foo) = hash(f.x, 0x64c74221932dea5b), where I chose the constant by rand(UInt). This way it won't collide with other types. I really need to spend a bit more time with my interfaces chapter and add this info there.

How would you do this (in a performant way) for a type that has two internal values? That is, I have

immutable Foo{T}
   x::T
   y::T
end

Obviously, the hash should be based on both values, right? I could do

Base.hash(f::Foo) = hash(hash(f.x, f.y), 0x64c74221932dea5b)

But that calls hash twice. (Is this even necessary with immutables?)

Matt Bauman

unread,

Jul 16, 2015, 12:46:12 PM7/16/15

to julia...@googlegroups.com

Goodness, I'll get this right one of these times. Sorry for spouting off so much disinformation! I'll try to make up for it by adding more examples to the documentation.

const _foo_seed = UInt === UInt64 ? 0x64c74221932dea5b : 0x80783eb4

Base.hash(f::Foo, h::UInt) = hash(f.x, h += _foo_seed)

The canonical hash function to define is the one that takes a second `UInt` argument. Here's what the help has to say about hashing multiple values: "New types should implement the 2-argument form, typically by calling the 2-argument `hash` method recursively in order to mix hashes of the contents with each other (and with `h`)". And that's exactly how most base functions do it.

Seth

unread,

Jul 16, 2015, 1:51:04 PM7/16/15

to julia...@googlegroups.com

OK, I think I got it:

type Foo{T}


  x::T
  y::T
end

Base.hash(f::Foo) = hash(f.x, hash(f.y, pair_seed))

Stefan Karpinski

unread,

Jul 16, 2015, 1:55:25 PM7/16/15

to Julia Users

Immutables automatically have value-based hashing defined for them. That's a dangerous default for mutable values since it makes it easy to stick something in a dict, then mutate it, and "lose it", e.g.:

type Mutable
x::Int
end

Base.hash(m::Mutable, h::UInt) = hash(m.x, h + (0x17d88030d571c6e3 % UInt))
==(m1::Mutable, m2::Mutable) = m1.x == m2.x

julia> m = Mutable(0)
Mutable(0)

julia> d = Dict()
Dict{Any,Any} with 0 entries

julia> d[m] = "here"
"here"

julia> m.x = 1
1

julia> d[m]
ERROR: KeyError: Mutable(1) not found
in getindex at dict.jl:695

julia> d
Dict{Any,Any} with 1 entry:
Mutable(1) => "here"

Jeffrey Sarnoff

unread,

Jul 17, 2015, 3:32:55 AM7/17/15

to julia...@googlegroups.com

just curious, why are the full hashes not compared when least significant octets match (above)?

On Thursday, July 16, 2015 at 12:25:01 PM UTC-4, Matt Bauman wrote:

On Thursday, July 16, 2015 at 12:19:25 PM UTC-4, milktrader wrote:
Also, back to the OP question, is the correct solution to simply define

Base.hash(f::Foo) = f.x

No, I'd define Base.hash(f::Foo) = hash(f.x, 0x64c74221932dea5b), where I chose the constant by rand(UInt). This way it won't collide with other types. I really need to spend a bit more time with my interfaces chapter and add this info there.

On Thursday, July 16, 2015 at 12:07:50 PM UTC-4, Seth wrote:
If I'm reading between the lines correctly, the default/existing hash function is based on the last byte of the ID? Is there a reason we don't make the table wider to reduce the chances of collisions (or would this have bad effects on memory utilization)?

It's not really a hash collision, but rather a hashtable index collision. A new dict only has 16 slots for its keys, and the index for an element is simply chosen by the last 8 bits of its hash. If those last 8 bits match the hash of an object already in the table, there's a collision, and the Dict checks to see if the existing element `isequal` to the new one (and doesn't look at the hash again). So the hashes are different, but isequal says the objects are the same. This is why things are wonky.

Stefan Karpinski

unread,

Jul 17, 2015, 8:40:42 AM7/17/15

to Julia Users

Two reasons:

That would entail recomputing the hash of any object that's already in the same table slot (which is not expected to be *that* rare in a hash table).
The == function (well, isequal, really), if correctly defined should always imply that hashes match. If it's possible to get different hashes when two objects are == then either your hash or your == function are broken.

If we stored hash values in the hash table, then reason 1 would go away (at a significant memory cost), so we could warn the user when 2 occurs and they have a mismatch between their == and hash functions.

Seth

unread,

Jul 17, 2015, 10:55:31 AM7/17/15

to julia...@googlegroups.com

But doesn't mean that unique() will stay broken for custom mutable types? That is, if unique() on custom mutable types relies on the fact that the last hash octet is different (and therefore there are no hash collisions), then there has to be SOME method of ensuring differentiation. What am I missing?

Stefan Karpinski

unread,

Jul 17, 2015, 11:08:30 AM7/17/15

to Julia Users

All custom types – mutable or immutable – work correctly by default. There is only a problem if you define a custom == and don't define a hash function that matches, which technically is user error, although it's probably also a documentation issue. But I'm not sure that it's a good idea to have a design where it's this easy to accidentally break things without realizing it.

Matt Bauman

unread,

Jul 17, 2015, 11:34:59 AM7/17/15

to julia...@googlegroups.com

unique() is only broken for custom mutable types that are broken. That method of ensuring differentiation is equality. Here's what happened:

* Define mutable type Foo

* Define equality by Foo's contents

* But don't define a new hash function, and instead fall back on the default implementation (which hashes differently based upon object identity)

Now we're in a situation where two objects that compare as equal have different hashes. When we go to insert an object of this type into a keys of dictionary (or set, as used by unique), here's what happens:

* The dictionary asks for the hash of the key

* Dictionaries start with 16 "slots" where they can store objects, so it just looks at the last hex digit from the hash to decide which one to use

* If there's already an object in that slot, the dictionary must determine if it's the object it's looking for or simply a collision (which is expected). So it checks if the existing object is equal to the new one. If it's equal, then that's the slot it's looking for. If not, then it checks if it's in the next slot. And so on.

* So now we're in this inconsistent state: most of the time (15 out of 16 times) objects that claim to be equal get put in different slots because their hashes are different. But sometimes they'll happen to end up in the same slot, and you get this very different behavior (which happened to be what you want in this case — the one-element uniqued vector).

Seth

unread,

Jul 17, 2015, 11:49:55 AM7/17/15

to julia...@googlegroups.com

I finally understand it! Thank you.

It seems to me that it's not obvious that there's a linkage between == and unique() (and anything else that relies on hash). Here's where this would cause a problem:

* a new user creates a custom type (let's use type Foo; x::Int; end for this discussion).

* he sets a = Foo(4) and b = Foo(4)

now, equality doesn't hold that a == b. So when he sees that, he redefines == to check Foo.x, and it works. But now he's (silently) broken unique() for 15/16 of his structures.

I wonder if this is something the linter could detect.

Stefan Karpinski

unread,

Jul 17, 2015, 11:50:22 AM7/17/15

to Julia Users

I wonder if this whole business could be addressed better with a better interface. When doing both hashing and equality, the real question is just "what parts of this thing actually matter?" and "how do I compare them?" If you had a "important_parts" function that equality and hashing were both generically defined in terms of, then they couldn't possibly mismatch. Obviously that function needs a better name.

Seth

unread,

Jul 17, 2015, 11:52:52 AM7/17/15

to julia...@googlegroups.com

Perhaps a @unique macro in front of the type field could specify that this is to be used in equality/hashing?

Stefan Karpinski

unread,

Jul 17, 2015, 11:55:18 AM7/17/15

to Julia Users

On Fri, Jul 17, 2015 at 11:49 AM, Seth <catc...@bromberger.com> wrote:

I wonder if this is something the linter could detect.

The easy thing to check is if some type has a custom == method but no custom hash method. That's a fairly sure sign that they've broken things. I'd rather make it harder to break things though. Just thinking about how. One approach is to just not define equality or hashing for mutable values by default. Then defining equality would be ok and unique would raise an error since you'd still not have defined hashing – which you'd hopefully do in a way that matches. But I'm still struck by the fact that if you know what fields are important, equality and hashing both fall out in a fairly generic way.

Stefan Karpinski

unread,

Jul 17, 2015, 11:58:37 AM7/17/15

to Julia Users

I think a function that you can override that returns an identity tuple would make sense. In a sense the (poorly named) decompose function for numbers does this same thing, but allows for some amount of post-normalization for performance reasons.

Stefan Karpinski

unread,

Jul 17, 2015, 4:55:52 PM7/17/15

to julia...@googlegroups.com, mba...@gmail.com

https://github.com/JuliaLang/julia/issues/12198

Stefan Karpinski

unread,

Jul 17, 2015, 4:56:01 PM7/17/15

to julia...@googlegroups.com, mba...@gmail.com

andrew cooke

unread,

Jul 19, 2015, 8:03:39 AM7/19/15

to julia...@googlegroups.com

you may be interested in https://github.com/andrewcooke/AutoHashEquals.jl, but it uses all entries. i will consider adding that feature, though (with the idea that if it's not used at all, all are used).

more generally - and this may just be a problem with my programming style - i find that there's a problem with julia (or me) conflating two different things:

* record types
* immutable types

immutable types in julia are intended, as far as i can see, for small, value-like things that are stored on the stack. while "types" are for larger, record like things.

some of us (perhaps coming more from a functional programming background) like to program with records that are immutable, we have no way to do that in julia except by convention.

hence the need for AutoHashEquals (imho) (there are two problems - first, most obviously, you then want record equality to depend on contents, but second, less obviously, when these "immutable records" are used in immutable types then you want the immutable type's hash to depend on the record's contents, and not on the pointer value. hence applying the macro to both types).

andrew

Stefan Karpinski

unread,

Jul 19, 2015, 10:53:43 AM7/19/15

to julia...@googlegroups.com

I don't really get what's lacking. If you want an immutable record type why not just use an immutable type?

andrew cooke

unread,

Jul 19, 2015, 11:44:49 AM7/19/15

to julia...@googlegroups.com

1 - i asked myself the same question after posting and went back and looked at the use case that motivated this. and it was for values that i need to mutate early in their lifetime, and which then remain invariant. i've been thinking about how i could have done it differently, and one approach i am considering is re-creating the objects during the start-up process. i think that would work. so that particular use case is perhaps not so convincing.

2 - however, isn't there still a case for things that should be semantically equal, even though they are composite? maybe i am misunderstanding how things work, but i could imagine some object that stored, say, name and age. if i used an immutable type then equality would depend on interning of strings, wouldn't it? another example would be immutable arrays. in short: immutable things that are accessed by a pointer, and whose contents you want to include in equality.

andrew

Yichao Yu

unread,

Jul 19, 2015, 11:48:57 AM7/19/15

to Julia Users

On Sun, Jul 19, 2015 at 11:44 AM, andrew cooke <and...@acooke.org> wrote:
>
> 1 - i asked myself the same question after posting and went back and looked
> at the use case that motivated this. and it was for values that i need to
> mutate early in their lifetime, and which then remain invariant. i've been
> thinking about how i could have done it differently, and one approach i am
> considering is re-creating the objects during the start-up process. i think
> that would work. so that particular use case is perhaps not so convincing.

If I understand correctly, this is what motivate this julep[1] (the
more general case)

[1] https://github.com/JuliaLang/julia/issues/11902

Reply all

Reply to author

Forward