The unique function and iterables of custom composite types

140 views
Skip to first unread message

Marc Gallant

unread,
Jul 16, 2015, 9:16:01 AM7/16/15
to julia...@googlegroups.com
The unique function doesn't appear to work using iterables of custom composite types, e.g.,

julia> type Foo
       x::Int
       end

julia> import Base: ==

julia> ==(f1::Foo, f2::Foo) = f1.x == f2.x
== (generic function with 85 methods)

julia> unique(foos)
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

julia> unique(foos)[1] == unique(foos)[2]
true

Is this the intended behaviour?

milktrader

unread,
Jul 16, 2015, 9:33:26 AM7/16/15
to julia...@googlegroups.com
How did you get  foos?

Marc Gallant

unread,
Jul 16, 2015, 9:35:18 AM7/16/15
to julia...@googlegroups.com
Whoops! Here is the updated code:


julia> type Foo
       x::Int
       end

julia> import Base: ==

julia> ==(f1::Foo, f2::Foo) = f1.x == f2.x
== (generic function with 85 methods)

julia> foos = [Foo(4), Foo(4)]
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

julia> unique(foos)
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

julia> unique(foos)[1] == unique(foos)[2]
true

Stefan Karpinski

unread,
Jul 16, 2015, 9:36:21 AM7/16/15
to julia...@googlegroups.com
You need to also define a hash method for this type.

milktrader

unread,
Jul 16, 2015, 10:29:57 AM7/16/15
to julia...@googlegroups.com
Julia 0.4- has different behavior ...

First, with 0.3.9

julia> versioninfo()
Julia Version 0.3.9
Commit 31efe69 (2015-05-30 11:24 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM)2 Duo CPU     P7350  @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> type Foo
            x::Int
            end

julia> import Base: ==

julia> ==(f1::Foo, f2::Foo) = f1.x == f2.x
== (generic function with 80 methods)

julia> foos = [Foo(4), Foo(4)]
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

julia> unique(foos)
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

julia> unique(foos)[1] == unique(foos)[2]
true

And now 0.4-dev

julia> versioninfo()
Julia Version 0.4.0-dev+5587
Commit 78760e2 (2015-06-25 14:27 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM)2 Duo CPU     P7350  @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> type Foo
            x::Int
            end

julia> import Base: ==

julia> ==(f1::Foo, f2::Foo) = f1.x == f2.x
== (generic function with 108 methods)

julia> foos = [Foo(4), Foo(4)]
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

julia> unique(foos)
1-element Array{Foo,1}:
 Foo(4)

julia> unique(foos)[1] == unique(foos)[2]
ERROR: BoundsError: attempt to access 1-element Array{Foo,1}:
 Foo(4)
  at index [2]
 in getindex at array.jl:292

Stefan Karpinski

unread,
Jul 16, 2015, 10:52:03 AM7/16/15
to Julia Users
I don't see that on 0.4-dev – it also doesn't seem possible without having defined a hash method since unique is implemented with a dict.

Seth

unread,
Jul 16, 2015, 11:02:46 AM7/16/15
to julia...@googlegroups.com
I can confirm this works as described by milktrader on 0.4.0-dev+5860 (2015-07-08 20:57 UTC) Commit 7fa43ed (7 days old master).

julia> unique(foos)
1-element Array{Foo,1}:
 
Foo(4)


Matt Bauman

unread,
Jul 16, 2015, 11:30:39 AM7/16/15
to julia...@googlegroups.com
Bizarre.  I happen to have last updated on *exactly* the same commit SHA, but I'm seeing the original (expected) behavior:

$ julia -q
julia> versioninfo()
Julia Version 0.4.0-dev+5860
Commit 7fa43ed (2015-07-08 20:57 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.3.0)
  CPU: Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY NEHALEM)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> type Foo
           x::Int
       end

julia> ==(f1::Foo, f2::Foo) = f1.x == f2.x
== (generic function with 109 methods)

julia> unique([Foo(4),Foo(4)])
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

julia> @which hash(Foo(4), zero(UInt))
hash(x::ANY, h::UInt64) at hashing.jl:10

Might there be some package that changes this behavior?  Is the result of `@which hash(Foo(4), zero(Uint))` the same as what I show above?

Stefan Karpinski

unread,
Jul 16, 2015, 11:34:18 AM7/16/15
to Julia Users
I just tried it on that exact version and do not see that behavior: 

julia> versioninfo()
Julia Version 0.4.0-dev+5587
Commit 78760e2 (2015-06-25 14:27 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.3.0)
  CPU: Intel(R) Core(TM) M-5Y71 CPU @ 1.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> type Foo
           x::Int
       end

julia> ==(f1::Foo, f2::Foo) = f1.x == f2.x
== (generic function with 108 methods)

julia> foos = [Foo(4), Foo(4)]
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

julia> unique(foos)
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

If you look at the definition of unique, it doesn't see like it could possibly do that without defining a hash method for Foo:

function unique(C)
    out = Array(eltype(C),0)
    seen = Set{eltype(C)}()
    for x in C
        if !in(x, seen)
            push!(seen, x)
            push!(out, x)
        end
    end
    out
end

The only way I can think of that this might occur is if you happen to get a hash collision with the default hashing function – which actually has about a 1/16 chance of happening. Makes me wonder if we shouldn't check both hash values and equality in our Dict implementation. Saving hash values is actually fairly common in hash table implementations since it allows rehashing the table without actually recomputing the hash of every key.

Stefan Karpinski

unread,
Jul 16, 2015, 11:36:03 AM7/16/15
to Julia Users
Dan and/or Seth, can you try that again and check if hash(foos[1]) and hash(foos[2]) have the same last hex digit?

milktrader

unread,
Jul 16, 2015, 11:39:34 AM7/16/15
to julia...@googlegroups.com
julia> hash(foos[1]) #and hash(foos[2])
0xfa40ebab47e8bee1

julia> hash(foos[2])
0x00ef97f955461671

Matt Bauman

unread,
Jul 16, 2015, 11:50:01 AM7/16/15
to julia...@googlegroups.com
Sure enough:

julia> unique([Foo(4),Foo(4)])
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

julia> unique([Foo(4),Foo(4)])
1-element Array{Foo,1}:
 Foo(4)

julia> unique([Foo(4),Foo(4)])
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

I think they just got "unlucky."  This seems to collide far too easily.

Stefan Karpinski

unread,
Jul 16, 2015, 11:50:52 AM7/16/15
to Julia Users
Well, that's it then. Cause: accidental hash collision. The fix is to define hash for the type. This makes me wonder if we shouldn't just leave hash undefined for custom mutable types and make it easy to opt into hashing by identity. At least then you'll get a clear no method error (and we could trap that and add more helpful information), instead of weird behavior when you define == but not hash for your types.

Seth

unread,
Jul 16, 2015, 12:04:13 PM7/16/15
to julia...@googlegroups.com
I can't because I just rebuilt to latest to test https://github.com/JuliaLang/julia/issues/12063 - but I'll try on the latest master...

... and Julia Version 0.4.0-dev+6005 Commit 242bf47 does not appear to have the issue (I'm getting two results returned for unique()).

milktrader

unread,
Jul 16, 2015, 12:04:56 PM7/16/15
to julia...@googlegroups.com
Yep, restarting I don't get the one-element array

julia> type Foo
            x::Int
            end

julia> import Base: ==

julia> ==(f1::Foo, f2::Foo) = f1.x == f2.x
== (generic function with 108 methods)

julia> foos = [Foo(4), Foo(4)]
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)

julia> unique(foos)
2-element Array{Foo,1}:
 Foo(4)
 Foo(4)


Seth

unread,
Jul 16, 2015, 12:07:50 PM7/16/15
to julia...@googlegroups.com
Stefan,

If I'm reading between the lines correctly, the default/existing hash function is based on the last byte of the ID? Is there a reason we don't make the table wider to reduce the chances of collisions (or would this have bad effects on memory utilization)?

Stefan Karpinski

unread,
Jul 16, 2015, 12:09:08 PM7/16/15
to Julia Users
This is just an artifact of memory layout – hashing by identity is based on object_id which is based on memory address. It's not a meaningful behavioral difference between Julia versions. Sticking mutable objects for which hash and == disagree is an undefined behavior and causes dictionaries to do potentially weird things.

Stefan Karpinski

unread,
Jul 16, 2015, 12:12:00 PM7/16/15
to Julia Users
On Thu, Jul 16, 2015 at 12:07 PM, Seth <catc...@bromberger.com> wrote:
Stefan,

If I'm reading between the lines correctly, the default/existing hash function is based on the last byte of the ID? Is there a reason we don't make the table wider to reduce the chances of collisions (or would this have bad effects on memory utilization)?

The initial size of a Dict is 16 slots, which is why two values whose hash values have the same last hex digit collide – the slot is the hash modulo the size of the Dict slots array. The reason not to make it bigger is that if you have a lot of small Dicts you don't want to waste memory. Making it bigger doesn't fix the problem, it just makes it harder to discover, which strikes me as worse, not better.

Seth

unread,
Jul 16, 2015, 12:13:15 PM7/16/15
to julia...@googlegroups.com
Ah, ok, thanks. This might cause issues with one of my packages, which is why I'm interested. How would you approach creating the hash dispatch for custom types, and does this impact immutables as well?

milktrader

unread,
Jul 16, 2015, 12:19:25 PM7/16/15
to julia...@googlegroups.com
Also, back to the OP question, is the correct solution to simply define 

 Base.hash(f::Foo) = f.x

Matt Bauman

unread,
Jul 16, 2015, 12:25:01 PM7/16/15
to julia...@googlegroups.com
On Thursday, July 16, 2015 at 12:19:25 PM UTC-4, milktrader wrote:
Also, back to the OP question, is the correct solution to simply define 

 Base.hash(f::Foo) = f.x

No, I'd define Base.hash(f::Foo) = hash(f.x, 0x64c74221932dea5b), where I chose the constant by rand(UInt).  This way it won't collide with other types.  I really need to spend a bit more time with my interfaces chapter and add this info there.

On Thursday, July 16, 2015 at 12:07:50 PM UTC-4, Seth wrote:
If I'm reading between the lines correctly, the default/existing hash function is based on the last byte of the ID? Is there a reason we don't make the table wider to reduce the chances of collisions (or would this have bad effects on memory utilization)?

It's not really a hash collision, but rather a hashtable index collision.  A new dict only has 16 slots for its keys, and the index for an element is simply chosen by the last 8 bits of its hash.  If those last 8 bits match the hash of an object already in the table, there's a collision, and the Dict checks to see if the existing element `isequal` to the new one (and doesn't look at the hash again).  So the hashes are different, but isequal says the objects are the same.  This is why things are wonky.
Reply all
Reply to author
Forward
0 new messages