microbenchmark vs. production

Nathan Bronson

Jul 19, 2018, 4:51:08 PM
to hashtable-benchmarks
Hi hashtable-benchmarks,

I'm the original author of folly::F14. Thanks for including it.

Facebook's experience deploying F14 has been that microbenchmarks are not a very good predictor of the actual production impact: in the majority of cases F14 has been a bigger win than we expected, but sometimes we don't even get the direction of change correct. Better benchmarks have the potential to be very useful.

Different execution contexts produce a different relative cost for straight-line CPU instructions, mispredicted branches, cache misses, and cache pressure (evictions that slow down the surrounding code). Different algorithms have different sequences of possible bucket_count values, so benchmarking at a small set of sizes also misses something important (the load factors never match up). There are far too many variables to grid-sweep them all. Our internal microbenchmarks have a reasonable way of handling the load-factor problem (we sweep n in [x,2x) and weight each sample by 1/n), but we haven't solved the other problems.
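For illustration, a minimal sketch of that sweep might look like the following (this is not our actual harness; std::unordered_set and the timing scheme are just stand-ins):

  #include <chrono>
  #include <cstddef>
  #include <unordered_set>

  // Average lookup cost over every table size n in [x, 2x), weighting each
  // sample by 1/n so that no single load factor dominates the result.
  template <typename Set>
  double weighted_lookup_ns(std::size_t x) {
    double total = 0.0;
    for (std::size_t n = x; n < 2 * x; ++n) {
      Set s;
      for (std::size_t i = 0; i < n; ++i) s.insert(i);
      std::size_t hits = 0;
      auto start = std::chrono::steady_clock::now();
      for (std::size_t i = 0; i < n; ++i) hits += s.count(i);
      auto stop = std::chrono::steady_clock::now();
      volatile std::size_t sink = hits;  // keep the lookups from being elided
      (void)sink;
      double sample_ns =
          std::chrono::duration<double, std::nano>(stop - start).count();
      total += sample_ns / static_cast<double>(n);  // weight the sample by 1/n
    }
    return total / static_cast<double>(x);  // mean ns per lookup over the sweep
  }

  // e.g. double ns = weighted_lookup_ns<std::unordered_set<std::size_t>>(1024);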

I think that a great outcome would be a small set (10 to 20) of configurations modeled after real-world use cases. For each of these we could deterministically pollute the I-cache and D-cache, use multiple maps/sets with a distribution of sizes, and use a realistic combination of operations. 
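For the D-cache half of that, even something as blunt as the following would work (a sketch; the 64 MiB buffer size is an assumption, pick anything comfortably larger than the last-level cache):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Walk a buffer much larger than the LLC between benchmark operations so
  // that the table's cache lines are deterministically evicted.
  constexpr std::size_t kPollutionBytes = 64 << 20;  // assumption: > LLC size

  std::uint64_t pollute_dcache(std::vector<std::uint64_t>& scratch) {
    if (scratch.size() * sizeof(std::uint64_t) < kPollutionBytes) {
      scratch.assign(kPollutionBytes / sizeof(std::uint64_t), 1);
    }
    std::uint64_t sum = 0;
    // Stride one cache line (8 x 8-byte words) at a time to touch every line.
    for (std::size_t i = 0; i < scratch.size(); i += 8) sum += scratch[i];
    return sum;  // returned so the walk isn't optimized away
  }

Polluting the I-cache deterministically is harder, since it requires actually executing a large code footprint.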

Thanks,
  Nathan

Roman Perepelitsa

Jul 20, 2018, 3:54:29 AM
to Nathan Bronson, hashtable-benchmarks
Hi Nathan,

Disclaimer: I'm one of the authors of Google's hash table implementation that was the subject of Matt Kulukundis's talk (https://cppcon2017.sched.com/event/Bgte/designing-a-fast-efficient-cache-friendly-hash-table-step-by-step). These benchmarks were written as part of that project. I'm no longer with Google.

While working on the new hash table, we spent a lot of time trying to devise a criterion that could tell us which of two given implementations is faster. We wrote and threw away a dozen artificial benchmarks that tried to simulate real-life loads, and eventually settled on the following procedure: run the high-level load test for the Search Web Server. The Search Web Server is a huge binary responsible for the lion's share of the CPU consumed by Google's machines. The transitive closure of the server's source code contains countless references to hash tables, all with different load patterns. The load test starts a separate instance of the server and sends it real production traffic.

This gave us a way to move forward when considering various implementation techniques. However, running this massive benchmark is expensive and time-consuming. Moreover, it doesn't tell us *why* a certain implementation is faster than the alternative. For that, we wrote a bundle of focused synthetic benchmarks. One group of benchmarks targets "hot" workflows, where every memory address accessed by the benchmark is in L1. The other group is "cold", with every memory access resulting in a cache miss. With such benchmarks we can make claims of the following kind: implementation X is faster than implementation Y when looking up existing values in hot hash tables.

However, even if one implementation is faster than another in all benchmarks, hot and cold alike, that doesn't mean it will win the Search Web Server contest. The reason is that the same workload can result in "hot" conditions for one implementation and "cold" conditions for another. In the synthetic benchmarks the hot conditions are so hot that they result in 100% L1 cache hits even for the most terrible implementation with a lot of pointer chasing. In production, however, all hash tables are under loads that are somewhere in between hot and cold. Some data will be in L1, some in L2, and so on all the way to uncached RAM. What matters most for real-life performance is how hot a hash table gets under a given load.
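To make the hot/cold distinction concrete, the two shapes look roughly like this (a sketch, not the actual benchmarks in the repository; std::unordered_set and the sizes are placeholders):

  #include <cstddef>
  #include <cstdint>
  #include <unordered_set>
  #include <vector>

  // "Hot": one tiny table hammered in a loop, so everything it touches stays
  // resident in L1 after the first pass.
  std::size_t hot_lookups(std::size_t iters) {
    std::unordered_set<std::uint64_t> s;
    for (std::uint64_t i = 0; i < 64; ++i) s.insert(i);
    std::size_t hits = 0;
    for (std::size_t i = 0; i < iters; ++i) hits += s.count(i % 64);
    return hits;
  }

  // "Cold": round-robin over enough tables that the combined working set is
  // far larger than the last-level cache, so every lookup misses.
  std::size_t cold_lookups(std::size_t iters) {
    std::vector<std::unordered_set<std::uint64_t>> tables(4096);
    for (auto& t : tables)
      for (std::uint64_t i = 0; i < 1024; ++i) t.insert(i);
    std::size_t hits = 0;
    for (std::size_t i = 0; i < iters; ++i)
      hits += tables[i % tables.size()].count(i % 1024);
    return hits;
  }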

Thus, to speed up a specific hash table operation, such as a lookup for an existing value, we can do three things:
1. Make it faster under hot conditions.
2. Make it faster under cold conditions.
3. Turn some cold conditions into hot(er) ones.

The performance difference between hot and cold is so vast (an order of magnitude or so) that improvements of kind (3) are crucial for overall performance gains. Unfortunately, it's not something we can measure with a synthetic benchmark (it's likely possible, but we simply don't yet have benchmarks of this kind). Thankfully, the rules of thumb for achieving (3) are fairly simple. First, minimize the number of cache lines you have to access. If you must access two cache lines, it's better if they are adjacent. The penalty for a random memory access is so big that pretty much nothing can overcome it.

To summarize: at Google we fully realize that targeted, small, artificial benchmarks cannot tell us which hash table implementation will perform better in production. We've attempted to write realistic benchmarks exhibiting load patterns similar to the ones found in the real world, but eventually settled for running a high-level load test (not specific to hash tables) and treating its answer as the arbiter. Nevertheless, synthetic benchmarks can be very useful in generating insights into the behavior of different implementations, but for this they must NOT attempt to be realistic. The simpler the benchmark's model, the better. We've found that separating benchmarks into "100% hot" and "100% cold" is conducive to this goal.

Roman.

Matthew Fowles Kulukundis

Jul 20, 2018, 11:02:02 AM
to Roman Perepelitsa, ngbr...@gmail.com, hashtable-...@googlegroups.com
Nathan~

We also discovered a ton of issues around the quality of the hash function used by tables. dense_hash_set uses an identity hash by default, which often has very bad performance. The benchmarks in this repo fill the table with random data so they can safely use the identity hash and focus on the overhead imposed by the table, but in real-world settings hash function quality is also very important.

We have some stuff coming down the pipe on that front, but nothing to share yet.

Matt

Nathan Bronson

Jul 20, 2018, 12:47:36 PM
to hashtable-benchmarks
Roman, that makes a lot of sense. Some of our workloads are front-end bound, so we also worked on the code size. #3 is important enough that we also extended the API so that we could manually pipeline lookups when it is easy to predict future keys. For example, when fetching from an array of keys we can do something like for (i..) { token[(i+7)%8] = map.prehash(k[i+7]); ... map.find(token[i%8], k[i]); }. The prehash method does both bit mixing and a software prefetch -- https://github.com/facebook/folly/blob/master/folly/container/F14Map.h#L662
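For illustration, a fleshed-out version of that loop might look roughly like the following (a sketch only; the window of 8 tokens, the prefetch distance of 7, and the surrounding types are illustrative rather than the exact production code):

  #include <cstddef>
  #include <string>
  #include <vector>
  #include <folly/container/F14Map.h>

  long sum_values(const folly::F14FastMap<std::string, long>& map,
                  const std::vector<std::string>& keys) {
    constexpr std::size_t kWindow = 8;  // ring buffer of hash tokens
    constexpr std::size_t kAhead = 7;   // how far ahead to prehash/prefetch
    folly::F14HashToken tokens[kWindow];
    const std::size_t n = keys.size();
    // Prime the pipeline: hash and prefetch the first kAhead keys.
    for (std::size_t i = 0; i < kAhead && i < n; ++i) {
      tokens[i % kWindow] = map.prehash(keys[i]);
    }
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
      if (i + kAhead < n) {
        // Issue the bit mixing and software prefetch for a future key...
        tokens[(i + kAhead) % kWindow] = map.prehash(keys[i + kAhead]);
      }
      // ...so that by the time we probe for keys[i], the prefetch issued
      // kAhead iterations ago has had time to land.
      auto it = map.find(tokens[i % kWindow], keys[i]);
      if (it != map.end()) sum += it->second;
    }
    return sum;
  }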

Is there any interest in trying to help those doing hashtable research who don't have access to production workloads? Something that could be used in the motivation section of an academic paper that proposes a better hashtable ("my new algo X is better than baseline across all of the workloads proposed by industry in github.com/google/...").

Matt, F14 applies a bit mixer to uint32_t keys by default. It's controlled by a template specialization of folly::IsAvalanchingHasher that lets us turn it off when we know that something like CityHash or Murmur was used and there is a full size_t of bits available. It sounds like I should disable that for the purposes of this benchmark. In production the mixer uses a hardware CRC instruction, but it looks like the results that are currently uploaded use the fallback (more expensive) mixer.
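For anyone wiring this up, disabling the extra mixing looks roughly like the following (a sketch from memory of folly's headers, so treat the trait's exact spelling as an assumption; CityHasher is a hypothetical example built on CityHash64):

  #include <cstddef>
  #include <cstdint>
  #include <type_traits>
  #include <city.h>                      // CityHash; any well-mixed hash works
  #include <folly/container/F14Set.h>
  #include <folly/hash/Hash.h>

  struct CityHasher {
    std::size_t operator()(std::uint64_t k) const {
      return CityHash64(reinterpret_cast<const char*>(&k), sizeof(k));
    }
  };

  namespace folly {
  // Promise F14 that CityHasher's output already avalanches, so it won't
  // apply its own mixer on top of the hash.
  template <typename K>
  struct IsAvalanchingHasher<CityHasher, K> : std::true_type {};
  }  // namespace folly

  using FastSet = folly::F14FastSet<std::uint64_t, CityHasher>;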

- Nathan

Matthew Fowles Kulukundis

Jul 20, 2018, 1:20:14 PM
to ngbr...@gmail.com, hashtable-...@googlegroups.com
Nathan~

Yeah, it sounds like this benchmark should disable that extra mixing for F14.

Matt

Roman Perepelitsa

Jul 20, 2018, 2:16:53 PM
to Nathan Bronson, hashtable-benchmarks
On Fri, Jul 20, 2018 at 6:47 PM, Nathan Bronson <ngbr...@gmail.com> wrote:
> Is there any interest in trying to help those doing hashtable research who don't have access to production workloads? Something that could be used in the motivation section of an academic paper that proposes a better hashtable ("my new algo X is better than baseline across all of the workloads proposed by industry in github.com/google/...").

Yes, definitely. Such benchmarks would be extremely useful within Google, too. Running large-scale load tests is expensive both in terms of time and machine cost.

I should've been more explicit in my prior email. I didn't mean to say that benchmarks approximating production loads are unnecessary or undesired. My main point was that Google doesn't claim to have such benchmarks. The benchmarks released on https://github.com/google/hashtable-benchmarks are the opposite of realistic.

Roman.
