Hi Nathan,
While working on the new hash table, we spent a lot of time trying to devise a criterion that could tell us which of two given implementations is faster. We wrote and threw away a dozen artificial benchmarks that tried to simulate real-life loads, and eventually settled on the following procedure: run the high-level load test for the Search Web Server. The Search Web Server is a huge binary responsible for the lion's share of CPU consumed by Google's machines. The transitive closure of the server's source code contains countless references to hash tables, all with different load patterns. The load test starts a separate instance of the server and sends it real production traffic.
This gave us a way to move forward when considering various implementation techniques. However, running this massive benchmark is expensive and time consuming. Moreover, it doesn't tell us *why* a certain implementation is faster than the alternative. For this, we wrote a bundle of focused synthetic benchmarks. One group of benchmarks targets "hot" workflows, where every memory address accessed by the benchmark is in L1. The other group is "cold" -- with all memory accesses resulting in cache misses. With such benchmarks we can make claims of the following kind: implementation X is faster than implementation Y when looking up existing values in hot hash tables.

However, even if one implementation is faster in all benchmarks -- hot and cold -- than another, it doesn't mean that it'll win the Search Web Server contest. The reason is that the same workload can result in "hot" conditions for one implementation and "cold" for another. In the synthetic benchmarks the hot conditions are so hot that they result in 100% L1 cache hits even for the most terrible implementation with a lot of pointer chasing. In production, however, all hash tables are under loads that are somewhere in between hot and cold. Some data will be in L1, some in L2, and so on all the way to uncached RAM. What matters most for real-life performance is how hot a hash table gets under the given load.
Thus, to speed up a specific hash table operation, such as a lookup for an existing value, we can do three things:
1. Make it faster under hot conditions.
2. Make it faster under cold conditions.
3. Turn some cold conditions into hot(ter) ones.
The performance difference between hot and cold is so vast (an order of magnitude or so) that improvements of kind (3) are crucial for overall performance gains. Unfortunately, that's not something we can measure with a synthetic benchmark (it's likely possible, but we simply don't yet have benchmarks of this kind). Thankfully, the rules of thumb for achieving (3) are fairly simple. First, minimize the number of cache lines you have to access. Second, if you must access two cache lines, it's better if they are adjacent. The penalty for a random memory access is so big that pretty much nothing can overcome it.
To summarize: at Google we fully realize that targeted, small, artificial benchmarks cannot tell us which hash table implementation will perform better in production. We attempted to write realistic benchmarks exhibiting load patterns similar to those found in the real world, but eventually settled on running a high-level load test (not specific to hash tables) and treating its answers as the arbiter. Nevertheless, synthetic benchmarks can be very useful for generating insights into the behavior of different implementations -- but for this they must NOT attempt to be realistic. The simpler the benchmark's model, the better. We've found that separating benchmarks into "100% hot" and "100% cold" is conducive to this goal.
Roman.