Fastest V10 F1 Car

Ena Baccari

Jul 25, 2024, 2:33:10 AM
to chiemesnueca


The trick is to use Robin Hood hashing with an upper limit on the number of probes. If an element has to be more than X positions away from its ideal position, you grow the table and hope that with a bigger table every element can be close to where it wants to be. Turns out that this works really well. X can be relatively small which allows some nice optimizations for the inner loop of a hashtable lookup.
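As a rough illustration of that trick, here is a minimal sketch of Robin Hood hashing with linear probing and an upper probe limit. All names here (BoundedRobinHoodSet, kMaxProbes) are made up for this example, and a real table stores more than ints; the point is just the mechanism: every slot remembers how far its element sits from its ideal position, and an insert that would exceed the limit grows the table instead of probing further.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

class BoundedRobinHoodSet {
public:
    explicit BoundedRobinHoodSet(std::size_t capacity = 8) : slots_(capacity) {}

    void insert(int key) {
        if (contains(key)) return;
        Slot incoming{key, 0, true};
        std::size_t idx = std::hash<int>{}(incoming.key) % slots_.size();
        while (true) {
            if (incoming.dist > kMaxProbes) { // probe limit exceeded:
                grow();                       // reallocate and rehash,
                insert(incoming.key);         // then retry this element
                return;
            }
            Slot &s = slots_[idx];
            if (!s.occupied) { s = incoming; return; }
            if (s.dist < incoming.dist)       // Robin Hood: steal the slot
                std::swap(s, incoming);       // from a "richer" element
            idx = (idx + 1) % slots_.size();
            ++incoming.dist;
        }
    }

    bool contains(int key) const {
        std::size_t idx = std::hash<int>{}(key) % slots_.size();
        for (int dist = 0; dist <= kMaxProbes; ++dist) {
            const Slot &s = slots_[idx];
            if (s.occupied && s.key == key) return true;
            // Robin Hood invariant: past this point the key can't exist.
            if (!s.occupied || s.dist < dist) return false;
            idx = (idx + 1) % slots_.size();
        }
        return false;
    }

private:
    struct Slot { int key = 0; int dist = 0; bool occupied = false; };
    static constexpr int kMaxProbes = 4;

    void grow() {
        std::vector<Slot> old = std::move(slots_);
        slots_.assign(old.size() * 2, Slot{});
        for (const Slot &s : old)
            if (s.occupied) insert(s.key);
    }

    std::vector<Slot> slots_;
};
```

Note how small the limit can be: the lookup loop runs at most kMaxProbes + 1 iterations, which is what makes the inner loop easy to optimize.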

Linear probing means that if you try to insert an element into the array and the current slot is already full, you just try the next slot over. If that one is also full, you pick the slot next to that etc. There are known problems with this simple approach, but I believe that putting an upper limit on the probe count resolves that.

Using a prime number of slots means that the underlying array has a prime-number size, so it grows for example from 5 slots to 11 slots to 23 slots to 47 slots and so on. To find the insertion point you then simply use the modulo operator to map the hash value of an element to a slot. The other most common choice is to use powers of two to size your array. Later in this blog post I will go more into why I chose prime numbers by default and when you want to use which.
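A sketch of what prime-number sizing looks like in practice. The prime list below is just the start of such a doubling sequence, not any particular library's table, and the function names are illustrative:

```cpp
#include <array>
#include <cstddef>

// Each prime is roughly double the previous one, mirroring the usual
// growth policy of a hash table.
constexpr std::array<std::size_t, 8> kPrimes = {5, 11, 23, 47, 97, 197, 397, 797};

// Pick the next capacity in the sequence after the current one.
std::size_t next_prime_capacity(std::size_t current) {
    for (std::size_t p : kPrimes)
        if (p > current) return p;
    return current * 2 + 1; // past the sketch's table: crude fallback
}

// Map a hash to a slot with the modulo operator. A prime capacity
// mixes in all bits of the hash, unlike a power of two, which only
// looks at the lower bits.
std::size_t slot_for(std::size_t hash, std::size_t capacity) {
    return hash % capacity;
}
```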

So in this new benchmark, where I try to force a cache miss, we see big differences very early on: the pattern from the end of the last graph emerges here starting at just ten elements in the table, with clear winners in terms of performance. This is actually pretty impressive: all of these hash tables maintain consistent performance across many different orders of magnitude.

What I take from these graphs is that my new table is a really big improvement: The red line, with the powers of two, is my table configured the same way as dense_hash_map: With max_load_factor 0.5 and using a power of two to size the table so that a hash can be mapped to a slot just by looking at the lower bits. The only big difference is that my table requires one byte of extra storage (plus padding) per slot in the table. So my table will use slightly more memory than dense_hash_map.

Another example of bad performance due to using powers of two is how the standard hashtable in Rust was accidentally quadratic when inserting keys from one table into another. So using powers of two can bite you in non-obvious ways.

In your custom hash function you typedef ska::power_of_two_hash_policy as hash_policy. Then my flat_hash_map will switch to using powers of two. Also if you know that std::hash is good enough in your case, I provide a type called power_of_two_std_hash that will just call std::hash but will use the power_of_two_hash_policy:
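Reconstructed from the description above, the opt-in looks roughly like this (the type and policy names are as described in the text; check the actual flat_hash_map header for exact signatures):

```
#include <cstddef>
#include <functional>
#include <string>
#include "flat_hash_map.hpp" // ska::flat_hash_map and its hash policies

struct CustomHash {
    std::size_t operator()(const std::string &s) const {
        return std::hash<std::string>{}(s);
    }
    // This typedef tells flat_hash_map to switch to powers of two:
    typedef ska::power_of_two_hash_policy hash_policy;
};

ska::flat_hash_map<std::string, int, CustomHash> map_a;

// Or, if std::hash is already good enough for your key type:
ska::flat_hash_map<std::string, int,
                   ska::power_of_two_std_hash<std::string>> map_b;
```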

This graph is also spiky, but the spikes point in the other direction. Any time that the table has to reallocate the average cost shoots up. Then that cost gets amortized until the table has to reallocate again.

The other point about this graph is that on the left half you once again only have tables that fit entirely in the L3 cache. I decided to not write a cache-miss-triggering test for this one because that would take time and we learned above that just looking at the right half is a good approximation for a cache miss.

The node based containers are slow once again, and the flat containers are all roughly equally fast. dense_hash_map is slightly faster than my hash table, but not by much: It takes roughly 20 nanoseconds to erase something from dense_hash_map and it takes roughly 23 nanoseconds to erase something from my hash table. Overall these are both very fast.

What this means though is that lookups get slightly slower once you have tombstones in the table. So dense_hash_map has a fast erase at the cost of slowing down lookups after an erase. Measuring the impact of that is a bit difficult, but I believe I have found a test that works for this purpose: I insert and erase elements over and over again:
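The stress pattern described here can be sketched like this, using std::unordered_map as a stand-in for the tables under comparison (the function name and sizes are illustrative): keep the table at a steady size while churning keys, so that tombstone and rehash costs show up in the average time per operation.

```cpp
#include <chrono>
#include <cstddef>
#include <unordered_map>

// Average nanoseconds per erase+insert pair at a steady table size.
double churn_ns_per_op(std::size_t table_size, std::size_t iterations) {
    std::unordered_map<int, int> map;
    for (std::size_t i = 0; i < table_size; ++i)
        map.emplace(static_cast<int>(i), 0);

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        map.erase(static_cast<int>(i));                    // oldest key out
        map.emplace(static_cast<int>(i + table_size), 0);  // fresh key in
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count()
        / static_cast<double>(iterations);
}
```

In a table with tombstones, the erased slots accumulate and lookups during the insert half of each iteration slowly degrade, which is exactly what this loop is meant to expose.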

This is the same graph as the very first graph in this blog post, except all the tables use a max_load_factor of 0.5. And then I wanted to only measure these tables when they really do have the same load factor, so I measured each table just before it would reallocate its internal storage. So if you look back at the very first graph in this blog post, imagine that I drew lines from one peak to the next. If we want to directly compare performance of hashtables and we want to eradicate the effect of different hash tables using different max_load_factor values and different strategies for when they reallocate, I think this is the right graph.

But the main point of this was to compare boost::multi_index and std::unordered_map, which use a max_load_factor of 1.0, to my flat_hash_map and dense_hash_map, which use a max_load_factor of 0.5. As you can see, even if we use the same max_load_factor for every table, the flat tables are faster.

This was expected, but I still think this was worth measuring. In a sense this is the truest measure of hash table performance, because here all hash tables are configured the same way and have the same load factor: every single data point has a current load factor of 0.5. That being said, I did not use this method of measuring for my other graphs, because in the real world you will probably never change the max_load_factor. And in the real world you will see the spiky performance of the initial graph, where similar tables can have very different performance depending on how many hash collisions there are. (The load factor is actually only one part of that, as I also discussed above when talking about powers of two vs prime numbers.) This graph also hides one benefit of my table: limiting the probe count leads to more consistent performance, making the lines of my hash_map less spiky than those of other tables.

So far every graph was measuring performance of a map from int to int. However there might be differences in performance when using different keys or larger values. First, here are the graphs for successful lookups and unsuccessful lookups when using strings as keys:

That being said a string comparison that only compares a single character should be really cheap. And indeed the overhead is not that big. It just looks big above because every lookup is a cache hit. The cache miss picture looks different:

Once again dense_hash_map is slow because it initializes all those bytes. The other tables are pretty much the same because the copying cost dominates. Except that my flat_hash_map_power_of_two has that same weird spike at exactly 16385 elements due to increased time spent in clear_page_c_e that I also had when inserting ints with a 1024 byte value.

Lesson learned from this: If you have a large type, inserts will be equally slow in all tables, you should call reserve ahead of time, and the node based containers are a much more competitive option for large types than they are for small types.

Otherwise the main difference here is that erasing from flat_hash_map has gotten much more spiky than it was in the other erase picture above, and the line has moved up considerably, getting almost as expensive as in the node based containers. I think the reason for this is that the flat_hash_map has to move elements around when an item gets erased, and that is expensive if each element is 1028 bytes of data.

So I really like the fastrange idea, but you run into similar problems as when using powers of two: You have to make extra sure that there are no patterns of similar integers in the outputs of your hash function.
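For reference, the fastrange mapping replaces the modulo with a multiply and a shift (the function name here is mine). Because the result is dominated by the high bits of the hash, it shares the power-of-two weakness described above: patterned hash output can still cluster.

```cpp
#include <cstdint>

// Map a 32-bit hash into [0, range) without a division:
// interpret the hash as a fraction of 2^32 and scale it by range.
uint32_t fastrange32(uint32_t hash, uint32_t range) {
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(hash) * static_cast<uint64_t>(range)) >> 32);
}
```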

Nice work! What about an attacker constructing a request that invokes table growth with every single insert? Once you have that covered and can also set a linear worst-case bound on the growth, this table is cool. Btw, without bounding the growth under attack, your worst-case performance will be terrible (because an attack is just a worst case), so I think you might want to do all those analyses a bit more in depth.

That being said this does not affect the worst case for lookups. Lookups will actually be very fast in this case because the table is mostly empty. This only affects the worst case for inserts, which you can make really slow with malicious data.

Or to put it another way: My hashtable detects really slow cases and tries to "save itself" by reallocating. If the input data is really bad so that even after reallocating all the keys hash to the same value, that trick won't work. Most of the time it will work though and then my hash table will be faster than others.

These are significant improvements with a probe size of 10. The probability of touching two cache lines instead of one decreases a lot. Also the hash table becomes more dense, meaning that the probability of finding existing data in the L3 cache increases (further improving performance). Most of the additional indexing math should be performed at compile time (the template type is known at compile time).
