I noticed that Strings.commonSuffix and Strings.commonPrefix don't actually make the CharSequence into a concrete instance after calling subSequence which means that if the input "a" is a string, then it will show a view rather than a new char array.* shrug *
--
guava-...@googlegroups.com
Project site: http://guava-libraries.googlecode.com
This group: http://groups.google.com/group/guava-discuss
This list is for general discussion.
To report an issue: http://code.google.com/p/guava-libraries/issues/entry
To get help: http://stackoverflow.com/questions/ask (use the tag "guava")
One of the few changes Azul made to the core JDK classes in their open sourced Managed Runtime Initiative was a similar change to java.lang.String. I was never able to find any documentation as to why, however.
--
guava-...@googlegroups.com
Project site: http://guava-libraries.googlecode.com
This group: http://groups.google.com/group/guava-discuss
This list is for general discussion.
To report an issue: http://code.google.com/p/guava-libraries/issues/entry
To get help: http://stackoverflow.com/questions/ask (use the tag "guava")
If I recall correctly, String's hashCode function has been documented all along, which is precisely why they haven't been able to change it to something better.
If I recall correctly, the String.hashCode() algorithm used to be
specified directly in the JLS. And, the original algorithm was
different: it would sample characters from longer strings rather than
examining all of them. So back then it was felt that the algorithm
*could* be changed, but meanwhile such a huge amount of Java code has
built up that changing it now would indeed be sure to break a
nontrivial amount of code. Too bad, though, that when it could be
changed it wasn't changed to something less awful.
If I recall correctly, the String.hashCode() algorithm used to be
specified directly in the JLS. And, the original algorithm was
In the first place it should never have been specified. But I would not call the algorithm awful, although I see no reason for it working the way it does:- Multiplication by 31 can't be optimized especially well using instructions like LEA.- There's no reason for using a prime here, any odd number sort of works here.
It would have been possible to have a more DoS resistant hash-code; the javadoc for Object.hashCode explicitly states that the computed hashCode "need not remain consistent from one execution of an application to another execution of the same application". Ruby and Perl have fixed this by using some random bits to determine their hash algorithm for a given program run.
On Monday, July 2, 2012 7:42:41 PM UTC+2, Osvaldo Doederlein wrote:> 31 is 0b11111; multiplication by that causes the value to "spread" in a neat way, ...I don't think it works really nice, since- 31 = 32-1, so it sort of has only 2 bits set.- 31 is a small number and higher bits get set only slowly
> My aggravation with the String.hashcode function is that it cannot be parallelized> (with SIMD/ILP-friendly code patterns). The function should allow the hashcode> to be computed in chunks, say for each group of 4 or 8 consecutive chars, then combined.A paralelization is actually possible, e.g., likehash(s) = 31 * (31 * (31 * s5 + s3) + s1) + (31 * (31 * s4 + s2) + s0)These two parts can be computed in paralel, that's nothing for SSE,but it can use instruction level paralelism.Combining the chunks is possible, too:Assert.assertEquals((s1+s0).hashCode(), pow(31, s0.length()) * s1.hashCode() + s0.hashCode());
On Monday, July 2, 2012 6:48:50 PM UTC+2, Osvaldo Doederlein wrote:> And, on the vulnerability to hash collision attacks, I'm skeptic about the value> of making any performance compromises in the vain attempt to fight such DoS attacks.> The hashcode has only 16 bits so it's impossible to implement> a really secure hash function; the hackers will always win.Sure, with 32 bits collisions are easy to compute, but with the non-determinism,the hacker must first learn String.HASHING_SEED used insun.misc.Hashing.murmur3_32(HASHING_SEED, value, 0, value.length).This HASHING_SEED gets generated when String.class gets loaded usingsome unknown values, like System.nanoTime() and System.currentTimeMillis(),so it should be hard to guess.> but if computing the hashcode itself is expensive because> we're paranoid with malicious datasets, then (even with hashcode caching)> this benefit will be severely compromised so the whole thing becomes pointless.I don't think the Murmur hash is expensive.
On Monday, July 2, 2012 9:13:10 PM UTC+2, Osvaldo Doederlein wrote:> It's a good solution, but I think this idea wouldn't fly at Sun> (at least back in the 90's...), with their obsession for reproducible> results across platforms, processes and VM versions/implementations.Instead of changing String.hashCode they're changing code using it:I actually doubt there's much code depending on the current implementationof String.hashCode, so I'd vote for changing it directly.It's just that nobody's asking me for my opinion. :DIn any case, a better hashCode could be made available using JVM arguments or properties.
Hum... for the parallelization, I think you're right, because you can do that in chunks and then combine the pieces with a multiplier that's scaled in a fixed 31ˆN factor each step. For the composition though it doesn't work, you need the full power of pow() but you need infinite precision
Not sure if I like this change. Even if murmur is cheap enough (= execution is limited by L1 fill rate), you lose the hashcode caching. Strings stored as keys in hashtables have their hashcode cached in the entries, but each string that is looked up in operations like get() doesn't benefit from that, the murmur hash will have to be computed every time.
Using this scheme only for large hashtables doesn't sem to avoid the potential regressions, as these would happen with large lookup keys, not large hashtables. Of course, people don't usually use large strings as keys, but they can ;-)
On Monday, July 2, 2012 10:28:27 PM UTC+2, Osvaldo Doederlein wrote:Hum... for the parallelization, I think you're right, because you can do that in chunks and then combine the pieces with a multiplier that's scaled in a fixed 31ˆN factor each step. For the composition though it doesn't work, you need the full power of pow() but you need infinite precisionNO! You need mod 32 arithmetic, i.e., the fastest thing for current computers. Moreover, you could precompute a couple of them.
Not sure if I like this change. Even if murmur is cheap enough (= execution is limited by L1 fill rate), you lose the hashcode caching. Strings stored as keys in hashtables have their hashcode cached in the entries, but each string that is looked up in operations like get() doesn't benefit from that, the murmur hash will have to be computed every time.No, there's also a field called hash32 storing this value. Moreover, 0 gets replaced by 1 so you never have to recompute it again.
Using this scheme only for large hashtables doesn't sem to avoid the potential regressions, as these would happen with large lookup keys, not large hashtables. Of course, people don't usually use large strings as keys, but they can ;-)At the first glance, Murmur seems to do about ten times as much work, but processes 2 chars at once and there seem to be enough ILP. For large strings I'd suppose that the cache miss penalty dominates.
> The field in the String object cannot cache the murmur codes.There are two caching fields, `hash` and `hash32`. I don't like it either.