Benchmark Mode Samples Mean Mean error Units
JMH3.MyBenchmark.testCowMap avgt 10 6.467 0.051 ns/op
JMH3.MyBenchmark.testCowMapCounter avgt 10 8.160 0.146 ns/op
JMH3.MyBenchmark.testHashMap avgt 10 7.516 0.029 ns/op
JMH3.MyBenchmark.testHashMapCounter avgt 10 8.840 0.145 ns/op
JMH3.MyBenchmark.testImmutableMap avgt 10 5.699 0.109 ns/op
JMH3.MyBenchmark.testImmutableMapCounter avgt 10 7.328 0.050 ns/op
JMH3.MyBenchmark.testTMap avgt 10 7.795 0.053 ns/op
JMH3.MyBenchmark.testTMapCounter avgt 10 8.325 0.102 ns/op
JMH3.MyBenchmark.testVolatileImmutableMap avgt 10 5.745 0.045 ns/op
JMH3.MyBenchmark.testVolatileImmutableMapCounter avgt 10 7.297 0.172 ns/op
Benchmark (m_count) Mode Samples Mean Mean error Units
JMH3.MyBenchmark.testCowMap 1024 avgt 5 18582.845 234.249 ns/op
JMH3.MyBenchmark.testCowMap 128 avgt 5 996.411 3.943 ns/op
JMH3.MyBenchmark.testCowMap 16 avgt 5 131.175 9.319 ns/op
JMH3.MyBenchmark.testCowMap 256 avgt 5 2393.750 88.229 ns/op
JMH3.MyBenchmark.testCowMap 32 avgt 5 278.281 16.156 ns/op
JMH3.MyBenchmark.testCowMap 512 avgt 5 6283.906 315.368 ns/op
JMH3.MyBenchmark.testCowMap 64 avgt 5 545.406 8.250 ns/op
JMH3.MyBenchmark.testHashMap 1024 avgt 5 11116.434 441.358 ns/op
JMH3.MyBenchmark.testHashMap 128 avgt 5 1192.682 38.543 ns/op
JMH3.MyBenchmark.testHashMap 16 avgt 5 131.985 1.475 ns/op
JMH3.MyBenchmark.testHashMap 256 avgt 5 2578.242 84.384 ns/op
JMH3.MyBenchmark.testHashMap 32 avgt 5 269.071 0.170 ns/op
JMH3.MyBenchmark.testHashMap 512 avgt 5 5467.981 129.064 ns/op
JMH3.MyBenchmark.testHashMap 64 avgt 5 557.057 27.698 ns/op
JMH3.MyBenchmark.testImmutableMap 1024 avgt 5 18583.994 707.223 ns/op
JMH3.MyBenchmark.testImmutableMap 128 avgt 5 913.665 7.975 ns/op
JMH3.MyBenchmark.testImmutableMap 16 avgt 5 109.837 2.221 ns/op
JMH3.MyBenchmark.testImmutableMap 256 avgt 5 2163.555 59.064 ns/op
JMH3.MyBenchmark.testImmutableMap 32 avgt 5 242.334 2.021 ns/op
JMH3.MyBenchmark.testImmutableMap 512 avgt 5 5839.833 32.638 ns/op
JMH3.MyBenchmark.testImmutableMap 64 avgt 5 495.163 2.683 ns/op
JMH3.MyBenchmark.testTMap 1024 avgt 5 8572.425 130.532 ns/op
JMH3.MyBenchmark.testTMap 128 avgt 5 1130.815 25.054 ns/op
JMH3.MyBenchmark.testTMap 16 avgt 5 141.371 2.886 ns/op
JMH3.MyBenchmark.testTMap 256 avgt 5 2252.436 26.358 ns/op
JMH3.MyBenchmark.testTMap 32 avgt 5 281.937 2.609 ns/op
JMH3.MyBenchmark.testTMap 512 avgt 5 4381.460 7.444 ns/op
JMH3.MyBenchmark.testTMap 64 avgt 5 568.282 4.086 ns/op
JMH3.MyBenchmark.testVolatileImmutableMap 1024 avgt 5 18136.249 244.302 ns/op
JMH3.MyBenchmark.testVolatileImmutableMap 128 avgt 5 966.954 190.930 ns/op
JMH3.MyBenchmark.testVolatileImmutableMap 16 avgt 5 115.455 3.071 ns/op
JMH3.MyBenchmark.testVolatileImmutableMap 256 avgt 5 2244.283 69.049 ns/op
JMH3.MyBenchmark.testVolatileImmutableMap 32 avgt 5 254.552 4.415 ns/op
JMH3.MyBenchmark.testVolatileImmutableMap 512 avgt 5 5990.018 123.352 ns/op
JMH3.MyBenchmark.testVolatileImmutableMap 64 avgt 5 520.538 3.223 ns/op
According to a flight recorder profile, my application spends 6% of its time in ImmutableMap or HashMap get. Most of the call sites are lookup tables that I would expect to be cached fairly well, since the cardinality of the values being looked up is small, as is the cardinality of the entire map. I have done what I can to reduce the number of lookups, and now I am trying to see if I can improve the performance and locality of the maps. I compared Guava's ImmutableMap, HashMap, and Trove's long-to-Object hash map.
Benchmark (m_count) Mode Samples Mean Mean error Units
JMH3.MyBenchmark.testArrayBinary 16 avgt 5 198.829 23.230 ns/op
JMH3.MyBenchmark.testArrayBinary 24 avgt 5 320.971 13.531 ns/op
JMH3.MyBenchmark.testArrayBinary 32 avgt 5 468.476 6.551 ns/op
JMH3.MyBenchmark.testArrayBinary 8 avgt 5 74.012 0.885 ns/op
JMH3.MyBenchmark.testArrayScan 16 avgt 5 129.399 0.862 ns/op
JMH3.MyBenchmark.testArrayScan 24 avgt 5 231.715 5.599 ns/op
JMH3.MyBenchmark.testArrayScan 32 avgt 5 359.104 18.624 ns/op
JMH3.MyBenchmark.testArrayScan 8 avgt 5 48.804 0.495 ns/op
JMH3.MyBenchmark.testHashMap 16 avgt 5 158.899 2.509 ns/op
JMH3.MyBenchmark.testHashMap 24 avgt 5 240.234 20.686 ns/op
JMH3.MyBenchmark.testHashMap 32 avgt 5 316.633 7.020 ns/op
JMH3.MyBenchmark.testHashMap 8 avgt 5 81.146 9.180 ns/op
JMH3.MyBenchmark.testImmutableMap 16 avgt 5 111.630 2.572 ns/op
JMH3.MyBenchmark.testImmutableMap 24 avgt 5 173.245 6.796 ns/op
JMH3.MyBenchmark.testImmutableMap 32 avgt 5 240.638 5.967 ns/op
JMH3.MyBenchmark.testImmutableMap 8 avgt 5 54.846 1.372 ns/op
JMH3.MyBenchmark.testImmutableSortedMap 16 avgt 5 436.898 10.064 ns/op
JMH3.MyBenchmark.testImmutableSortedMap 24 avgt 5 760.812 261.168 ns/op
JMH3.MyBenchmark.testImmutableSortedMap 32 avgt 5 975.969 29.037 ns/op
JMH3.MyBenchmark.testImmutableSortedMap 8 avgt 5 175.758 8.764 ns/op
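For reference, a minimal sketch of what the array-based lookup tables in the benchmark above (testArrayScan and testArrayBinary) might look like: parallel sorted long/Object arrays queried either by linear scan or by binary search. The names and layout here are illustrative, not the actual benchmark code from the pastebin.

```java
import java.util.Arrays;

// Hypothetical sketch of array-backed lookup tables: a sorted long[] of keys
// and a parallel Object[] of values, queried by linear scan or binary search.
final class ArrayLookup {
    private final long[] keys;     // sorted ascending
    private final Object[] values; // values[i] corresponds to keys[i]

    ArrayLookup(long[] keys, Object[] values) {
        this.keys = keys;
        this.values = values;
    }

    // Linear scan: branch-predictable and cache-friendly for small n.
    Object scan(long key) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i] == key) return values[i];
        }
        return null;
    }

    // Binary search: fewer comparisons, but the data-dependent branches are
    // harder to predict, which may be why it loses to the scan at these sizes.
    Object binary(long key) {
        int i = Arrays.binarySearch(keys, key);
        return i >= 0 ? values[i] : null;
    }
}
```

With the assumption that the tables are built once and queried many times, the scan's advantage at n ≤ 32 in the numbers above is consistent with its sequential access pattern.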
You can see the code http://pastebin.com/rnUn2KU9 and results http://pastebin.com/kfSSjEHv
For starters how is my benchmark hygiene?
The maps are all pretty close, and I can't help but wonder whether the results would be different if the benchmark accurately represented a real-world use case where things aren't always in L1 cache. How can I compare these data structures so that the costs accurately reflect how they would perform in my application?
Thanks,
Ariel
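One common way to keep a microbenchmark's working set out of L1 is to spread lookups across a pool of map instances large enough to exceed cache, picking one at random per operation. A hypothetical plain-Java sketch of the idea (the pool and key counts are illustrative; in JMH you would move the pool into `@State` and feed the result to a `Blackhole`):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: instead of hammering one small map (which stays hot in
// L1), build many maps and pick one at random per lookup so each operation is
// likely to touch cold memory. Sizes here are placeholders.
final class ColdLookupHarness {
    @SuppressWarnings("unchecked")
    static long run(int mapCount, int keysPerMap, int lookups, long seed) {
        Map<Long, Long>[] pool = new HashMap[mapCount];
        for (int m = 0; m < mapCount; m++) {
            pool[m] = new HashMap<>();
            for (long k = 0; k < keysPerMap; k++) pool[m].put(k, k * 2);
        }
        Random rnd = new Random(seed);
        long sum = 0; // fold results so the JIT cannot eliminate the lookups
        for (int i = 0; i < lookups; i++) {
            Map<Long, Long> m = pool[rnd.nextInt(mapCount)];
            sum += m.get((long) rnd.nextInt(keysPerMap));
        }
        return sum;
    }
}
```

Scaling `mapCount * keysPerMap` past the L1/L2 sizes of the target machine makes the per-lookup cost converge toward what a real application with a large live set would see.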
Faster Hash Tables
fastutil 6.1.0 significantly changes the implementation of its hash-based classes. Instead of double hashing, we now use linear probing. This has some consequences:
I still don't follow when I will run into this situation, and with which map implementations. I found that NonBlockingHashMap would reliably OOM if I added and removed incrementing integers. Is that something that can happen with open addressing as implemented by Trove and fastutil?
Generally speaking, my map usages are either fixed lookup tables that are completely rewritten, or in-flight state tracking with IDs that will never be reused, will only be in the map for a few milliseconds, and will be removed FIFO. If there is a hiccup, the maps will have to grow to accommodate a backlog of several thousand things until backpressure kicks in at the producers, but that is a failure state where all they have to do is not break.
In this situation, would consistently removing every entry prevent long chains from forming? I can afford to go with pretty low load factors if that will keep the chain length and frequency down. I am not clear on how tombstones get removed in open-addressing hash maps.
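For context on the tombstone question: linear-probing tables can delete without tombstones at all by shifting later entries in the probe chain back into the freed slot (Knuth's Algorithm R; fastutil's linear-probing maps work this way). A simplified sketch, assuming long keys with 0 reserved as the empty marker, a power-of-two capacity, and no resizing:

```java
// Sketch of shift-back deletion in a linear-probing table: on remove, walk the
// rest of the probe chain and pull back any entry whose home slot would
// otherwise become unreachable, so no tombstones ever accumulate.
final class LinearProbeSet {
    private final long[] keys; // 0 means empty; key 0 itself is not storable
    private final int mask;

    LinearProbeSet(int capacityPow2) {
        keys = new long[capacityPow2];
        mask = capacityPow2 - 1;
    }

    private int hash(long k) { return (int) (k ^ (k >>> 32)) & mask; }

    boolean add(long k) {
        int i = hash(k);
        while (keys[i] != 0) {
            if (keys[i] == k) return false;
            i = (i + 1) & mask;
        }
        keys[i] = k;
        return true;
    }

    boolean contains(long k) {
        int i = hash(k);
        while (keys[i] != 0) {
            if (keys[i] == k) return true;
            i = (i + 1) & mask;
        }
        return false;
    }

    boolean remove(long k) {
        int i = hash(k);
        while (keys[i] != k) {
            if (keys[i] == 0) return false;
            i = (i + 1) & mask;
        }
        // Shift-back: an entry at j may fill the hole unless its home slot
        // lies cyclically in (hole, j], in which case moving it would break
        // its own probe chain.
        int hole = i;
        int j = (hole + 1) & mask;
        while (keys[j] != 0) {
            int home = hash(keys[j]);
            boolean between = (hole < j) ? (home > hole && home <= j)
                                         : (home > hole || home <= j);
            if (!between) {
                keys[hole] = keys[j];
                hole = j;
            }
            j = (j + 1) & mask;
        }
        keys[hole] = 0;
        return true;
    }
}
```

Under this scheme, removing every entry FIFO really does keep chains from accumulating: deletion compacts the chain immediately, so the table's probe lengths depend only on the current load, not on deletion history.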
Resize performance isn't important for me because the maps grow to a steady state that isn't very large.
Yes, optimization is a black art :-). Profilers help in finding the "big ones", but at a certain point they might even mislead you. First you need to split your measurement/optimization into single-threaded and concurrency-related work. Since I spent the last week optimizing my serialization (hunting two-digit nanosecond gains), here is my list of surprising candidates for single-threaded optimization:
- Avoid temporary allocation. Profilers rarely show this, but the effects on locality frequently matter a lot. Just check what your program is allocating in a typical test run. The effects show up in throughput, not in a profiler.
- HashMap lookups aren't that cheap. Don't use collections with boxed types. Remove any redundant hash lookups (they creep in accidentally as code evolves). I had significant gains by caching hash lookup results: if the same lookup is likely to happen several times in a row, caching the last result and comparing before doing another hash lookup can help.
- instanceof can be expensive! I had an "instanceof Serializable"; removing it sped things up from 1400 ns to 1200 ns, which is a lot given that my program has been optimized to death. If possible use x.getClass() == Other.class, or even better define constants and compare them instead (like "getTypeId() { return MY_ID; }"). This only works if the classes to be checked are a limited set, of course.
- Because of branch prediction, highly predictable checks (like "if (DEBUG)" or "if (fastmode)") are very cheap.
- switch() can be slower than if/else chains if there are frequently taken paths. I don't know why, and maybe I am wrong here, but I saw this in my code.
- Comparing pointers is more expensive than comparing ints (maybe related to 64-bit with compressed references).
- Locality: try to reduce pointer chasing. E.g. if you have a configuration object with some flags etc. and some processing reads from it, you can improve locality by extracting the relevant parameters at init time and storing them redundantly.
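The "cache the last hash lookup result" point above can be sketched like this (the wrapper class and names are illustrative, not from the thread's code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of memoizing the last map lookup: when consecutive
// lookups frequently repeat the same key, one equals() check on the cached key
// is cheaper than rehashing and probing the table again. Not thread-safe.
final class CachingLookup<K, V> {
    private final Map<K, V> map;
    private K lastKey;
    private V lastValue;

    CachingLookup(Map<K, V> map) { this.map = map; }

    V get(K key) {
        if (key.equals(lastKey)) return lastValue; // hit: skip the hash lookup
        V v = map.get(key);
        lastKey = key;
        lastValue = v;
        return v;
    }
}
```

Whether this pays off depends entirely on the key repeat pattern; with random keys the extra compare is pure overhead.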
On Thursday, March 13, 2014 7:13:33 PM UTC+1, Ariel Weisberg wrote:
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.