Is it worth commenting that Strings.common* expose potential memory leak from JDK String?

212 views
Skip to first unread message

Emily Soldal

unread,
Jun 19, 2012, 9:34:32 AM6/19/12
to guava-...@googlegroups.com
I noticed that Strings.commonSuffix and Strings.commonPrefix don't actually make the CharSequence into a concrete instance after calling subSequence which means that if the input "a" is a string, then it will show a view rather than a new char array.

* shrug *

Louis Wasserman

unread,
Jun 19, 2012, 10:18:56 AM6/19/12
to Emily Soldal, guava-...@googlegroups.com
Why would that be a bad thing?
On Tue, Jun 19, 2012 at 9:34 AM, Emily Soldal <em...@soldal.org> wrote:
I noticed that Strings.commonSuffix and Strings.commonPrefix don't actually make the CharSequence into a concrete instance after calling subSequence which means that if the input "a" is a string, then it will show a view rather than a new char array.

* shrug *

Osvaldo Doederlein

unread,
Jun 19, 2012, 10:55:22 AM6/19/12
to Louis Wasserman, Emily Soldal, guava-...@googlegroups.com
In a related note, Oracle is now changing the implementation of java.lang.String; in JDK 7u6, the char[] sharing goes away, so the offset and length fields are not necessary anymore. Operations like substring will simply copy the characters they need to build the new string, so this old pitfall of potential "leaking" (when somebody keeps a substring of a huge source string that could be discarded) will disappear. I guess we'll get this change in some future update of Google JDK. In any event, code like Strings.commonSuffix() will play by the new rules as it invokes JDK's subSequence(), which is implemented with substring(), which will return an independent String instead of a "view".

P.S.: This change is supposed to improve performance, s most strings are small so the overhead of offset/length is worth saving, many methods will be simpler too. As for the extra allocation/GC overhead, recent HotSpot releases have been gaining many new optimizations to better intrinsify arraycopy and also String/StringBuffer/StringBuilder operations, so lots of redundant coping should be elided.

A+
Osvaldo

Louis Wasserman

unread,
Jun 26, 2012, 9:22:57 AM6/26/12
to Osvaldo Doederlein, Emily Soldal, guava-...@googlegroups.com
...Huh.

That seems like a change that could go wrong in lots of different ways.  Mind, I can think of slightly more reasonable alternatives -- for example, always doing the copy if the substring is small, but just doing the view if the substring is comparatively large.

Osvaldo Doederlein

unread,
Jun 29, 2012, 6:14:36 AM6/29/12
to Ian Robertson, guava-...@googlegroups.com, Emily Soldal
The only thing that could go wrong with the new implementation is worse performance for code that makes lots of substring operations where the source string is not discarded, so the heap has to keep two copies of the common characters. More exotic scenarios (involving JNI or privileged reflection) are illegal and would cause failure in the current impl too.

Making the copy/view decision dynamically is not viable because the main objective of this change is to get rid of the offset and count fields, plus the extra code to handle them; the new String class becomes just a wrapper over a single field with the char[].  (Well, that and the cached hash too, they tried to get rid of that but the experiment was not successful.)  The savings from the two removed fields are significant, Strings are typically the most common objects in Java heaps and most Strings are small -- see the classic paper http://www.cs.ucsb.edu/~urs/oocsb/papers/TRCS98-33.pdf, maybe Urs wants to update us on this ;)

Frankly this change should have come years before, if not for conservatism. Too bad that Sun/Oracle didn't also bit the bullet to fix String's very bad hashCode function, just the opposite they recently documented it because the traditional function had become a de-facto standard and some weird stuff would break if that function changed.

A+
Osvaldo


On Fri, Jun 29, 2012 at 2:19 AM, Ian Robertson <ian.b.r...@gmail.com> wrote:
One of the few changes Azul made to the core JDK classes in their open sourced Managed Runtime Initiative was a similar change to java.lang.String. I was never able to find any documentation as to why, however.

Colin Decker

unread,
Jun 29, 2012, 10:38:28 AM6/29/12
to Osvaldo Doederlein, Ian Robertson, guava-...@googlegroups.com, Emily Soldal
If I recall correctly, String's hashCode function has been documented all along, which is precisely why they haven't been able to change it to something better.





--
Colin Decker | Software Engineer | cgde...@google.com | Java Core Libraries Team

Osvaldo Doederlein

unread,
Jun 29, 2012, 12:24:48 PM6/29/12
to Colin Decker, Ian Robertson, guava-...@googlegroups.com, Emily Soldal
On Fri, Jun 29, 2012 at 11:38 AM, Colin Decker <cgde...@google.com> wrote:
If I recall correctly, String's hashCode function has been documented all along, which is precisely why they haven't been able to change it to something better.

Hummm... thanks for the correction. I can see that this documentation exists since 1.2 (it's not there in 1.1.8, but that's pre-history now).

Still, there was debate about the algorithm in the JDK 1.7 project. The original reason for this debate was actually the strings-in-switch feature, which can be implemented more efficiently if the hashcodes of 'case' literals can be computed at compile time and stored in the classfile (desugared switch over the hashcode).  So there was some polemic about wether the hash function, even though described in the javadoc, was "de juris", but they finally decided that it was, another reason is that all independent implementations like Classpath and Harmony use the same exact function.

Emily Soldal

unread,
Jun 29, 2012, 1:17:03 PM6/29/12
to guava-...@googlegroups.com, Colin Decker, Ian Robertson, Emily Soldal
The Java string hashcode function is downright awful because of how it may vulnerability to equivalent substring attacks.

As for the Strings.common*, at least we could select the smallest of the two strings to reduce the impact of any accidental leak.

Éamonn McManus

unread,
Jul 2, 2012, 6:10:54 AM7/2/12
to Osvaldo Doederlein, Colin Decker, Ian Robertson, guava-...@googlegroups.com, Emily Soldal
If I recall correctly, the String.hashCode() algorithm used to be
specified directly in the JLS. And, the original algorithm was
different: it would sample characters from longer strings rather than
examining all of them. So back then it was felt that the algorithm
*could* be changed, but meanwhile such a huge amount of Java code has
built up that changing it now would indeed be sure to break a
nontrivial amount of code. Too bad, though, that when it could be
changed it wasn't changed to something less awful.

Éamonn

Maaartin G

unread,
Jul 2, 2012, 10:17:07 AM7/2/12
to guava-...@googlegroups.com, Osvaldo Doederlein, Colin Decker, Ian Robertson, Emily Soldal
On Monday, July 2, 2012 12:10:54 PM UTC+2, Éamonn McManus wrote:
If I recall correctly, the String.hashCode() algorithm used to be
specified directly in the JLS. And, the original algorithm was
different: it would sample characters from longer strings rather than
examining all of them. So back then it was felt that the algorithm
*could* be changed, but meanwhile such a huge amount of Java code has
built up that changing it now would indeed be sure to break a
nontrivial amount of code. Too bad, though, that when it could be
changed it wasn't changed to something less awful.

In the first place it should never have been specified. But I would not call the algorithm awful, although I see no reason for it working the way it does:

- Multiplication by 31 can't be optimized especially well using instructions like LEA.
- There's no reason for using a prime here, any odd number sort of works here.
- It leads to many trivial collision even for strings of length 2.

Switching to any other deterministic algorithm does NOT help at all. Even with the (truncated) secure hash, finding collisions in 32 bits is very easy and finding multicollisions is feasible. Finding N strings hashing to 0 needs about N*2**31 operations, so in a few hours I could compute thousands of them.

The determinism is the main problem here, the particular algorithm doesn't matter.


Éamonn McManus

unread,
Jul 2, 2012, 11:00:01 AM7/2/12
to Maaartin G, guava-...@googlegroups.com, Osvaldo Doederlein, Colin Decker, Ian Robertson, Emily Soldal
I call String.hashCode() awful, not because collisions can be found,
which as you note is inevitable, but for the other reason you touch
on: many strings hash to the same values, which is not at all
inevitable.

Éamonn

Osvaldo Doederlein

unread,
Jul 2, 2012, 12:48:50 PM7/2/12
to Éamonn McManus, Colin Decker, Ian Robertson, guava-...@googlegroups.com, Emily Soldal
On Mon, Jul 2, 2012 at 7:10 AM, Éamonn McManus <emcm...@google.com> wrote:
If I recall correctly, the String.hashCode() algorithm used to be
specified directly in the JLS. And, the original algorithm was

This only happened because the JLS1 has an appendix with the entire javadocs of the core libs; this chapter was removed in JLS2+...

And, on the vulnerability to hash collision attacks, I'm skeptic about the value of making any performance compromises in the vain attempt to fight such DoS attacks. The hashcode has only 16 bits so it's impossible to implement a really secure hash function; the hackers will always win. If some web application is fragile to DoS because it contains massive hashtables that are indexed by Strings and these strings can be directly provided by some form of user input, this has to (and can) be fixed at the application (or web-container) layer, not at the JVM platform layer.  We should not lose the perspective that the whole point of hashCode() is optimization: it makes lookups faster; but if computing the hashcode itself is expensive because we're paranoid with malicious datasets, then (even with hashcode caching) this benefit will be severely compromised so the whole thing becomes pointless.

A+
Osvaldo

Osvaldo Doederlein

unread,
Jul 2, 2012, 1:42:41 PM7/2/12
to Maaartin G, guava-...@googlegroups.com, Colin Decker, Ian Robertson, Emily Soldal
On Mon, Jul 2, 2012 at 11:17 AM, Maaartin G <graj...@seznam.cz> wrote:

In the first place it should never have been specified. But I would not call the algorithm awful, although I see no reason for it working the way it does:

- Multiplication by 31 can't be optimized especially well using instructions like LEA.
- There's no reason for using a prime here, any odd number sort of works here.

31 is 0b11111; multiplication by that causes the value to "spread" in a neat way, e.g. b101010101010101010 * 31 = b10100101010101010010110, so you gain 5 [binary] digits, but keep roughly the same entropy (at least that's what seems to me, not an expert in this analysis).

My aggravation with the String.hashcode function is that it cannot be parallelized (with SIMD/ILP-friendly code patterns). The function should allow the hashcode to be computed in chunks, say for each group of 4 or 8 consecutive chars, then combined. Some people also suggested (in coin-dev IIRC) a composable function, so that e.g. a concatenation of strings A,B would produce a string C which hashcode can be derived from A's and B's hashcodes; but this would be difficult to achieve in a parallel-friendly function.

A+
Osvaldo

Ian Robertson

unread,
Jul 2, 2012, 2:58:37 PM7/2/12
to guava-...@googlegroups.com, Éamonn McManus, Colin Decker, Ian Robertson, Emily Soldal
It would have been possible to have a more DoS resistant hash-code; the javadoc for Object.hashCode explicitly states that the computed hashCode "need not remain consistent from one execution of an application to another execution of the same application". Ruby and Perl have fixed this by using some random bits to determine their hash algorithm for a given program run.

Osvaldo Doederlein

unread,
Jul 2, 2012, 3:13:10 PM7/2/12
to Ian Robertson, guava-...@googlegroups.com, Éamonn McManus, Colin Decker, Emily Soldal
On Mon, Jul 2, 2012 at 3:58 PM, Ian Robertson <ian.b.r...@gmail.com> wrote:
It would have been possible to have a more DoS resistant hash-code; the javadoc for Object.hashCode explicitly states that the computed hashCode "need not remain consistent from one execution of an application to another execution of the same application". Ruby and Perl have fixed this by using some random bits to determine their hash algorithm for a given program run.

It's a good solution, but I think this idea wouldn't fly at Sun (at least back in the 90's...), with their obsession for reproducible results across platforms, processes and VM versions/implementations. See serialVersionUID and all of java.lang.Math :)

A+
Osvaldo

Maaartin G

unread,
Jul 2, 2012, 3:38:58 PM7/2/12
to guava-...@googlegroups.com, Ian Robertson, Éamonn McManus, Colin Decker, Emily Soldal
On Monday, July 2, 2012 7:42:41 PM UTC+2, Osvaldo Doederlein wrote:
> 31 is 0b11111; multiplication by that causes the value to "spread" in a neat way, ...

I don't think it works really nice, since
- 31 = 32-1, so it sort of has only 2 bits set.
- 31 is a small number and higher bits get set only slowly

> My aggravation with the String.hashcode function is that it cannot be parallelized
> (with SIMD/ILP-friendly code patterns). The function should allow the hashcode
> to be computed in chunks, say for each group of 4 or 8 consecutive chars, then combined.

A paralelization is actually possible, e.g., like
hash(s) = 31 * (31 * (31 * s5 + s3) + s1) + (31 * (31 * s4 + s2) + s0)
These two parts can be computed in paralel, that's nothing for SSE,
but it can use instruction level paralelism.

Combining the chunks is possible, too:
Assert.assertEquals((s1+s0).hashCode(), pow(31, s0.length()) * s1.hashCode() + s0.hashCode());

On Monday, July 2, 2012 6:48:50 PM UTC+2, Osvaldo Doederlein wrote:
> And, on the vulnerability to hash collision attacks, I'm skeptic about the value
> of making any performance compromises in the vain attempt to fight such DoS attacks.
> The hashcode has only 16 bits so it's impossible to implement
> a really secure hash function; the hackers will always win.

Sure, with 32 bits collisions are easy to compute, but with the non-determinism,
the hacker must first learn String.HASHING_SEED used in
sun.misc.Hashing.murmur3_32(HASHING_SEED, value, 0, value.length).
This HASHING_SEED gets generated when String.class gets loaded using
some unknown values, like System.nanoTime() and System.currentTimeMillis(),
so it should be hard to guess.

>  but if computing the hashcode itself is expensive because
> we're paranoid with malicious datasets, then (even with hashcode caching)
> this benefit will be severely compromised so the whole thing becomes pointless.

I don't think the Murmur hash is expensive.

On Monday, July 2, 2012 9:13:10 PM UTC+2, Osvaldo Doederlein wrote:
> It's a good solution, but I think this idea wouldn't fly at Sun
> (at least back in the 90's...), with their obsession for reproducible
> results across platforms, processes and VM versions/implementations.

Instead of changing String.hashCode they're changing code using it:

I actually doubt there's much code depending on the current implementation
of String.hashCode, so I'd vote for changing it directly.
It's just that nobody's asking me for my opinion. :D

In any case, a better hashCode could be made available using JVM arguments or properties.

Osvaldo Doederlein

unread,
Jul 2, 2012, 4:28:27 PM7/2/12
to Maaartin G, guava-...@googlegroups.com, Ian Robertson, Éamonn McManus, Colin Decker, Emily Soldal
On Mon, Jul 2, 2012 at 4:38 PM, Maaartin G <graj...@seznam.cz> wrote:
On Monday, July 2, 2012 7:42:41 PM UTC+2, Osvaldo Doederlein wrote:
> 31 is 0b11111; multiplication by that causes the value to "spread" in a neat way, ...
I don't think it works really nice, since
- 31 = 32-1, so it sort of has only 2 bits set.
- 31 is a small number and higher bits get set only slowly

Strings typically have multiple characters, so the entire 32bit range of hashcode will be used.  I think the purpose of that "spreading" is just making the composition of consecutive characters more complex: a larger number of bits are combined by the addition.
 
> My aggravation with the String.hashcode function is that it cannot be parallelized
> (with SIMD/ILP-friendly code patterns). The function should allow the hashcode
> to be computed in chunks, say for each group of 4 or 8 consecutive chars, then combined.
A paralelization is actually possible, e.g., like
hash(s) = 31 * (31 * (31 * s5 + s3) + s1) + (31 * (31 * s4 + s2) + s0)
These two parts can be computed in paralel, that's nothing for SSE,
but it can use instruction level paralelism.
Combining the chunks is possible, too:
Assert.assertEquals((s1+s0).hashCode(), pow(31, s0.length()) * s1.hashCode() + s0.hashCode());

Hum... for the parallelization, I think you're right, because you can do that in chunks and then combine the pieces with a multiplier that's scaled in a fixed 31ˆN factor each step.  For the composition though it doesn't work, you need the full power of pow() but you need infinite precision as the exponent can be very large and you always need the last 32 bits (at most), and BigInteger is not an option also for perf.
 
On Monday, July 2, 2012 6:48:50 PM UTC+2, Osvaldo Doederlein wrote:
> And, on the vulnerability to hash collision attacks, I'm skeptic about the value
> of making any performance compromises in the vain attempt to fight such DoS attacks.
> The hashcode has only 16 bits so it's impossible to implement
> a really secure hash function; the hackers will always win.
Sure, with 32 bits collisions are easy to compute, but with the non-determinism,
the hacker must first learn String.HASHING_SEED used in
sun.misc.Hashing.murmur3_32(HASHING_SEED, value, 0, value.length).
This HASHING_SEED gets generated when String.class gets loaded using
some unknown values, like System.nanoTime() and System.currentTimeMillis(),
so it should be hard to guess.

>  but if computing the hashcode itself is expensive because
> we're paranoid with malicious datasets, then (even with hashcode caching)
> this benefit will be severely compromised so the whole thing becomes pointless.
I don't think the Murmur hash is expensive.

Well yes, a random seed makes this much better.
 
On Monday, July 2, 2012 9:13:10 PM UTC+2, Osvaldo Doederlein wrote:
> It's a good solution, but I think this idea wouldn't fly at Sun
> (at least back in the 90's...), with their obsession for reproducible
> results across platforms, processes and VM versions/implementations.

Instead of changing String.hashCode they're changing code using it:

I actually doubt there's much code depending on the current implementation
of String.hashCode, so I'd vote for changing it directly.
It's just that nobody's asking me for my opinion. :D

In any case, a better hashCode could be made available using JVM arguments or properties.

Not sure if I like this change.  Even if murmur is cheap enough (= execution is limited by L1 fill rate), you lose the hashcode caching. Strings stored as keys in hashtables have their hashcode cached in the entries, but each string that is looked up in operations like get() doesn't benefit from that, the murmur hash will have to be computed every time. Using this scheme only for large hashtables doesn't sem to avoid the potential regressions, as these would happen with large lookup keys, not large hashtables.  Of course, people don't usually use large strings as keys, but they can ;-) and even this limited use of the new hash can break some code, so they should just bite the bullet and put murmur in String.hashCode().
 

Maaartin G

unread,
Jul 2, 2012, 7:11:08 PM7/2/12
to guava-...@googlegroups.com, Maaartin G, Ian Robertson, Éamonn McManus, Colin Decker, Emily Soldal
On Monday, July 2, 2012 10:28:27 PM UTC+2, Osvaldo Doederlein wrote:
Hum... for the parallelization, I think you're right, because you can do that in chunks and then combine the pieces with a multiplier that's scaled in a fixed 31ˆN factor each step.  For the composition though it doesn't work, you need the full power of pow() but you need infinite precision

NO! You need mod 32 arithmetic, i.e., the fastest thing for current computers. Moreover, you could precompute a couple of them.
 
Not sure if I like this change.  Even if murmur is cheap enough (= execution is limited by L1 fill rate), you lose the hashcode caching. Strings stored as keys in hashtables have their hashcode cached in the entries, but each string that is looked up in operations like get() doesn't benefit from that, the murmur hash will have to be computed every time.

No, there's also a field called hash32 storing this value. Moreover, 0 gets replaced by 1 so you never have to recompute it again.
 
Using this scheme only for large hashtables doesn't sem to avoid the potential regressions, as these would happen with large lookup keys, not large hashtables.  Of course, people don't usually use large strings as keys, but they can ;-)

At the first glance, Murmur seems to do about ten times as much work, but processes 2 chars at once and there seem to be enough ILP. For large strings I'd suppose that the cache miss penalty dominates.

 and even this limited use of the new hash can break some code, so they should just bite the bullet and put murmur in String.hashCode().

Agreed. Doing it optionally, defaulting to the old behavior now, and switching the default to the new one in JDK8 or 9 could be the best.

Osvaldo Doederlein

unread,
Jul 2, 2012, 10:07:46 PM7/2/12
to Maaartin G, guava-...@googlegroups.com, Ian Robertson, Éamonn McManus, Colin Decker, Emily Soldal
On Mon, Jul 2, 2012 at 8:11 PM, Maaartin G <graj...@seznam.cz> wrote:
On Monday, July 2, 2012 10:28:27 PM UTC+2, Osvaldo Doederlein wrote:
Hum... for the parallelization, I think you're right, because you can do that in chunks and then combine the pieces with a multiplier that's scaled in a fixed 31ˆN factor each step.  For the composition though it doesn't work, you need the full power of pow() but you need infinite precision

NO! You need mod 32 arithmetic, i.e., the fastest thing for current computers. Moreover, you could precompute a couple of them.

I don't see how to optimize (say) 31^500?  Well, caching would work I agree...

 
Not sure if I like this change.  Even if murmur is cheap enough (= execution is limited by L1 fill rate), you lose the hashcode caching. Strings stored as keys in hashtables have their hashcode cached in the entries, but each string that is looked up in operations like get() doesn't benefit from that, the murmur hash will have to be computed every time.

No, there's also a field called hash32 storing this value. Moreover, 0 gets replaced by 1 so you never have to recompute it again.

Which field? the field in Map.Entry will cache the hashcode from strings stored as keys, but not the hashes of strings passed as parameters for any operation. The field in the String object cannot cache the murmur codes.
 
 
Using this scheme only for large hashtables doesn't sem to avoid the potential regressions, as these would happen with large lookup keys, not large hashtables.  Of course, people don't usually use large strings as keys, but they can ;-)

At the first glance, Murmur seems to do about ten times as much work, but processes 2 chars at once and there seem to be enough ILP. For large strings I'd suppose that the cache miss penalty dominates.

What about the minimum overheads / setup code? I was previously worried about exceptional cases (huge keys), but in practice, most objects that are used as keys of anything are small. So if murmur is 3x slower for strings of length 0-10, that's pretty bad because this is by far the dominant case.
 
A+
Osvaldo

Maaartin G

unread,
Jul 3, 2012, 8:24:01 AM7/3/12
to guava-...@googlegroups.com, Maaartin G, Ian Robertson, Éamonn McManus, Colin Decker, Emily Soldal
On Tuesday, July 3, 2012 4:07:46 AM UTC+2, Osvaldo Doederlein wrote:
> I don't see how to optimize (say) 31^500?  Well, caching would work I agree...

Look at `com.google.common.math.IntMath.pow`.

> > > the murmur hash will have to be computed every time.

> > No, there's also a field called hash32 storing this value. Moreover, 0 gets replaced by 1 so you never have to recompute it again.

> Which field?

int String.hash32;

> the field in Map.Entry will cache the hashcode from strings stored as keys, but not the hashes of strings passed as parameters for any operation.

No, `HashMap.Entry.hash` caches whatever gets used for the map lookup, so it's fine.

> The field in the String object cannot cache the murmur codes.

There are two caching fields, `hash` and `hash32`. I don't like it either.

> What about the minimum overheads / setup code? I was previously worried about exceptional cases (huge keys), but in practice, most objects that are used as keys of anything are small. So if murmur is 3x slower for strings of length 0-10, that's pretty bad because this is by far the dominant case.

I can't tell. But it might be offset by better quality and thus fewer collisions.

Osvaldo Doederlein

unread,
Jul 3, 2012, 8:41:40 AM7/3/12
to Maaartin G, guava-...@googlegroups.com, Ian Robertson, Éamonn McManus, Colin Decker, Emily Soldal
 
> The field in the String object cannot cache the murmur codes.
There are two caching fields, `hash` and `hash32`. I don't like it either.

Wow I had not noticed that--it's only in the hashing patch, and only at the end of String.java; sorry.

OMG this is a terrible idea.  So, just after changing String to remove the offset/count fields, dropping substring sharing optimizations to save some precious memory for each string, they are adding a new field that's only useful to store strings as keys of very large hashtables. String will have no less than 64 bits of overhead for hashing, with two fields to cache different hash functions.  And to add ugliness to inefficiency, this needs a public hash32() method. :( Makes sense to anybody here?

A+
Osvaldo 

Martin Buchholz

unread,
Jul 3, 2012, 2:30:43 PM7/3/12
to Osvaldo Doederlein, Maaartin G, guava-...@googlegroups.com, Ian Robertson, Éamonn McManus, Colin Decker, Emily Soldal
I am also pretty unhappy about the state of attempts to fix the denial-of-service vulnerabilities.

String and HashMap will get a little slower and more bloated, and a small cottage industry of alternatives that remove the security overhead ("HashMap Classic" ?) will arise.

There is no hashCode method for String that will provide any real security advantage, unless it includes randomization.

The current String.hashCode is baked into the String switch feature bytecode, making it very hard or impossible to ever change.

The one thing we should certainly do is to introduce a new map type that uses both hash codes and an ordering, a hash table of balanced binary trees, so that we can have expected O(1) and worst case O(log N) performance, and encourage the use of that for all maps that may contain untrusted data.

Martin
Reply all
Reply to author
Forward
0 new messages