Generated hashcode() returns different values across JVM instances?

3,517 views
Skip to first unread message

Jay Booth

unread,
May 9, 2011, 11:25:09 AM5/9/11
to Protocol Buffers
I'm testing an on-disk hashtable with Protobufs and noticed that with
the java generated hashcode function, it seems to return a different
hashcode across JVM invocations for the same logically equivalent
object (tested with a single string protobuf, same string for both
instances).

Is this known behavior? Bit busy right now backporting this to work
with String keys instead but I could provide a bit of command line
code that demonstrates the issue when I get a chance.

Glancing at the generated hashcode() function, it looks like the
difference comes from etiher getDescriptorForType().hashCode() or
getUnknownFields().hashCode(), both of which are incorporated.

Dmitriy Ryaboy

unread,
May 11, 2011, 12:01:11 AM5/11/11
to Jay Booth, Protocol Buffers
Hi Jay,

I encountered that before. Unfortunately this is a legitimate thing to
do, as documented in Object.hashCode()

I have a write-up of the problem and how we wound up solving it (not
elegant.. suggestions welcome) here:
http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hashcode-implementations/

D

> --
> You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
> To post to this group, send email to prot...@googlegroups.com.
> To unsubscribe from this group, send email to protobuf+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.
>
>

Jay Booth

unread,
May 11, 2011, 3:02:08 PM5/11/11
to Protocol Buffers
It can be legitimate, especially in the case of Object.hashCode(), but
it's supposed to be in sync with equals() by contract. As it stands,
two objects which are equal() will produce different hashes, or the
same logical object will produce different hashes across JVMs. That
breaks the contract.. if the equals() method simply did return (other
== this), then it'd be fine, albeit a little useless.

I created an issue and posted a 1-liner patch that would eliminate the
problem by using getClass().getName().hashCode() to incorporate type
information into the hashCode without depending on a Descriptor
object's memory address.

On May 11, 12:01 am, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> Hi Jay,
>
> I encountered that before. Unfortunately this is a legitimate thing to
> do, as documented in Object.hashCode()
>
> I have a write-up of the problem and how we wound up solving it (not
> elegant.. suggestions welcome) here:http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash...

Ben Wright

unread,
May 11, 2011, 4:54:13 PM5/11/11
to Protocol Buffers
Jay:

Using the class name to generate the hashcode is logically incorrect
because the class name can be derived by the options java_package_
name and java_outer_classname.

Additionally (although less likely to matter), separate protocol
buffer files can define an identical class names with different
protocol buffers.

Lastly, and most importantly...

If the same Message is being used with generated code and with dynamic
code, the hash code for the descriptor would still be identical if
generated from the descriptor instance, whereas the dynamic usage does
not have a classname from which to derive a hashcode. While in your
case this should not matter, it does matter for other users of
protobuf. The hashcode function would be better served by being
implemented correctly from state data for the descriptor.
Additionally, in generated code it seems that this hashcode could be
pre-computed by the compiler and Descriptor.hashcode() could return a
constant integer - which would be much more efficient than any other
method.

Jay Booth

unread,
May 11, 2011, 5:20:47 PM5/11/11
to Protocol Buffers
Well, to sidestep the whole discussion, we could just eliminate types
from hashcode completely and work strictly on field hashcodes. The
type check is there in equals(), hashmaps will function correctly, and
hash collisions are already allowed between inequal objects, they're a
fact of life. Return 1; would be legal, it would just be a really
bad idea for anything that actually uses the hashcode.

I figured getClass().getName() would provide the additional hash
spread benefit in the majority of cases and HashMaps and the like
could fall back on equals() as they always do upon the rare collision
between differently-typed-but-same-fields objects. Descriptor has a
bunch of stuff attached to it, which would lead to a potentially
complicated/expensive hashCode() implementation and I didn't
understand that code well enough to suggest a patch. I also don't
know the protobuf compiler well enough to have an opinion on
generating a constant per-class as you suggest, I could see it leading
to trouble when you regenerate classes that haven't changed unless
it's deterministic somehow. Just going with the KISS principle.

Ben Wright

unread,
May 11, 2011, 5:25:49 PM5/11/11
to Protocol Buffers
Alternatively... instead of putting the onus on the compiler, the
hashcode could be computed by the JVM at initialization time for the
Descriptor instance, (which would also help performance of dynamically
parsed Descriptor instance hashcode calls).

i.e.

private final int computedHashcode;

public Descriptor() {
//initialization

computedHashcode = do_compute_hashCode();
}

public int hashCode() {
return computedHashcode;
}

punlic int do_compute_hashCode(){
return // compute hashcode
}

This is all talking towards optimum performance implementation... the
real problem is the need for a hashCode implementation for Descriptor
based on the actual Descriptor's content...


On May 11, 4:54 pm, Ben Wright <compuware...@gmail.com> wrote:

Ben Wright

unread,
May 11, 2011, 5:32:58 PM5/11/11
to Protocol Buffers
I think we wrote those replies at the same time : )

You're right, at the cost of some additional hash collisions, the
simplest solution is to simply not include the type / descriptor in
the hash calculation at all.

The best / least-collision solutions with good performance would be
what I wrote in my previous post, but that requires that someone
(presumably a current committer) with sufficient knowledge of the
Descriptor types to have enough time to update the compiler and java
libraries accordingly.

Any input from a committer for this issue? Seems the simple solution
would take less than an hour to push into the stream and could make it
into the next release.

Pherl Liu

unread,
May 16, 2011, 10:03:06 AM5/16/11
to Ben Wright, Protocol Buffers
We discussed internally and decided not to make the hashCode() return deterministic result. If you need consistent hashcode in different runs, use toByteString().hashCode().

Quoted from Kenton:

Hashing the content of the descriptor would actually be incorrect, because two descriptors with exactly the same content are still considered different types.  Descriptors are compared by identity, hence they are hashed by pointer.

Removing the descriptor from the calculation would indeed make hashCode() consistent between two runs of the same binary, and probably insignificant runtime cost.  Of course, once you do that, you will never be able to introduce non-determinism again because people will depend on it.

But there's a much bigger risk.  People may actually start depending on hashCode() returning consistent results between two different versions of the binary, or two completely separate binaries that compile in the same protocol, or -- most dangerously -- two different versions of the same protocol (e.g. with fields added or removed).  I think it would be very difficult and limiting to make these guarantees, so I would be extremely cautious about this.

Certainly, there is no implementation of hashCode() that would be any safer than .toByteString().hashCode().  So, I'd advise steering people to the latter.  Note that if unknown fields are present, the results may still be inconsistent.  However, there is no reasonable way to implement a hashCode() that is consistent in the presence of unknown fields.

Jay Booth

unread,
May 18, 2011, 2:48:28 PM5/18/11
to Protocol Buffers
Well, that's your prerogative, I guess, but why even implement
hashcode at all then? Just inherit from object and you're getting
effectively the same behavior. Is that what you're intending?

Jay Booth

unread,
May 18, 2011, 2:50:55 PM5/18/11
to Protocol Buffers
I'd also note that this precludes usage of protobuf objects in
HashSets or as keys in HashMaps -- any time someone does that, it will
be a (subtle and confusing) bug, as hashcode() isn't aligned with
equals().

Jason Hsueh

unread,
May 18, 2011, 3:00:19 PM5/18/11
to Jay Booth, Protocol Buffers
Jumping in late to the thread, and I'm not really a Java person, so I may be misunderstanding something here. But as far as I can tell, you are asking for hashCode() to be a 'consistent' hash across processes. hashCode() as implemented is still useful within a single JVM, allowing you to use protobufs in HashMaps based on content rather than object identity. That was the intended use case.

Jason Hsueh

unread,
May 18, 2011, 3:04:26 PM5/18/11
to Jay Booth, Protocol Buffers
Ah, I didn't get this message. How does this break use in a HashMap? Given two protobufs of the same type in the same JVM, with the same contents, does hashCode() not return the same value?

Ron Reynolds

unread,
May 18, 2011, 3:08:37 PM5/18/11
to Jason Hsueh, Jay Booth, Protocol Buffers
the corner-stone of Hash* containers is:
 (A.equals(B)) => (A.hashCode() == B.hashCode()) for all A, B.  

tho it's not explicitly stated it would seem to be implied that is within a single JVM.
not sure if the code in question maintains that rule within a JVM (if not that's a big deal).  
if so that would seem sufficient for all but the most distributed of Hash* containers, such as where a client which is remote from the storage (i.e, in another JVM) determines the bucket (based on hashCode()) to find that the element in question has been placed into another bucket because the hashCode() within the containing JVM has evaluated to another value.  that's a pretty far-fetched, but not unimaginable, situation.


From: Jason Hsueh <jas...@google.com>
To: Jay Booth <jayb...@gmail.com>
Cc: Protocol Buffers <prot...@googlegroups.com>
Sent: Wed, May 18, 2011 12:00:19 PM
Subject: Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

Jason Hsueh

unread,
May 18, 2011, 3:13:32 PM5/18/11
to Ron Reynolds, Jay Booth, Protocol Buffers
On Wed, May 18, 2011 at 12:08 PM, Ron Reynolds <tequi...@ymail.com> wrote:
the corner-stone of Hash* containers is:
 (A.equals(B)) => (A.hashCode() == B.hashCode()) for all A, B.  

From what I"ve gathered, this invariant is true for protocol buffer's implementation of equals() and hashCode(). The descriptor object identity is included in both methods.

Jay Booth

unread,
May 18, 2011, 3:19:24 PM5/18/11
to Protocol Buffers
Right, I forgot that the Descriptor object is statically initialized
so it'll be consistent within the JVM.

I still think that in the context of serializable objects, (i.e.
objects intended to be transported between JVMs in some way, shape or
form) a consistent hashCode would be useful for a lot of cases (mine
included). If it can't be consistent in the presence of unknown
fields, perhaps a well-documented caution to that (relatively
uncommon?) case would be more useful than completely punting on all
cases.

On May 18, 3:08 pm, Ron Reynolds <tequila...@ymail.com> wrote:
> the corner-stone of Hash* containers is:
>  (A.equals(B)) => (A.hashCode() == B.hashCode()) for all A, B.  
>
> tho it's not explicitly stated it would seem to be implied that is within a
> single JVM.
> not sure if the code in question maintains that rule within a JVM (if not that's
> a big deal).  
> if so that would seem sufficient for all but the most distributed of Hash*
> containers, such as where a client which is remote from the storage (i.e, in
> another JVM) determines the bucket (based on hashCode()) to find that the
> element in question has been placed into another bucket because the hashCode()
> within the containing JVM has evaluated to another value.  that's a pretty
> far-fetched, but not unimaginable, situation.
>
> ________________________________
> From: Jason Hsueh <jas...@google.com>
> To: Jay Booth <jaybo...@gmail.com>
> Cc: Protocol Buffers <prot...@googlegroups.com>
> Sent: Wed, May 18, 2011 12:00:19 PM
> Subject: Re: [protobuf] Re: Generated hashcode() returns different values across
> JVM instances?
>
> Jumping in late to the thread, and I'm not really a Java person, so I may be
> misunderstanding something here. But as far as I can tell, you are asking for
> hashCode() to be a 'consistent' hash across processes. hashCode() as implemented
> is still useful within a single JVM, allowing you to use protobufs in HashMaps
> based on content rather than object identity. That was the intended use case.
>
> To post to this group, send email to ...
>
> read more »

Jason Hsueh

unread,
May 18, 2011, 3:33:44 PM5/18/11
to Jay Booth, Protocol Buffers
Well, hashing the serialization as Pherl/Kenton suggested gives you that, and seems cleaner and more direct than making hashCode() of the in-memory object assume that you are doing something across JVMs.

(Also, unknown fields are pretty common when you may have a jungle of binaries, which is true here.)

Jay Booth

unread,
May 18, 2011, 3:44:29 PM5/18/11
to Jason Hsueh, Protocol Buffers
That definitely seems acceptable. Any reason not to make the default
implementation return toByteString().hashCode() and get the best of
both worlds? It's a little more expensive probably, but
Object.hashCode() is already deceptively expensive to begin with.

I personally switched my particular problem to use Strings as keys,
since that works for the use case here and is easier for corresponding
command-line tools, I just generally see a consistent hashcode as a
good idea if supportable.

Jason Hsueh

unread,
May 18, 2011, 5:43:18 PM5/18/11
to Jay Booth, Protocol Buffers
On Wed, May 18, 2011 at 12:44 PM, Jay Booth <jayb...@gmail.com> wrote:
That definitely seems acceptable.  Any reason not to make the default
implementation return toByteString().hashCode() and get the best of
both worlds?  It's a little more expensive probably, but
Object.hashCode() is already deceptively expensive to begin with.

I'm not well-enough versed in Java to say with authority, but as far as I can tell, hashCode() is intended for HashMap/HashSet, and assuming that Object.hashCode() provides some consistent hash is incorrect. Yes, protobuf could attempt to provide a consistent hash with toByteString().hashCode(), except we already know that there are caveats to that, so it's not like we can provide any contractual guarantee. Applications like yours that need a stable hash could implement a wrapper object that overrides hashCode() to use this approach, or perhaps simply use a different method altogether to express the stability requirement.
 
I personally switched my particular problem to use Strings as keys,

I think you want to avoid using Strings to store the serializations since this is binary data, and String will coerce that to UTF.
Reply all
Reply to author
Forward
0 new messages