--
You received this message because you are subscribed to the Google Groups "hashtable-benchmarks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hashtable-benchm...@googlegroups.com.
To post to this group, send email to hashtable-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hashtable-benchmarks/CAApERuYUrqA667-BjtVSoUfK6f7ya9nUbe_Vsc8FAQeZV84CwQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
On Wed, Oct 3, 2018 at 3:28 PM Nathan Bronson <ngbr...@gmail.com> wrote:

From a performance perspective, the main design difference between SwissTable and F14 is whether the metadata and data are separated (as in SwissTable) or colocated (as in F14). I would expect colocation to work better when the entries are small, and separation to work better when the entries are large. F14 also pushes users toward a third storage strategy, F14VectorSet, for medium and large values. This stores elements in a contiguous array and puts a 4-byte index into the main hash table, which makes lookup slower but iteration (and hence destruction) faster. This indirect strategy also saves memory, because the vector effectively has a perfect load factor.
To see how the tradeoffs change with the size of the item, I expanded the figure so that each row shows a single configuration at 5 different value sizes. I've also separated out the Hot case from the Cold case. To help make small relative differences look small, I have included the origin on the y axis.
- The FindHit Cold rows show that F14Vector is always the slowest (because of its indirection), that F14Value is the fastest for small values (because the tag/H2 values are in the same cache line or cache line pair as the values), and that flat_hash_set is the fastest for large values. The tradeoff between F14Value and flat_hash_set changes smoothly across value sizes, a surprisingly clean result.
- FindMiss Cold shows that separated metadata is better, which is also not surprising since most FindMiss calls only need to access the metadata. In fact, for the larger value sizes the entire flat_hash_set ctrl_ array fits in the L3. flat_hash_set's quadratic probing might also help locality in this case. F14 uses double-hashing, which reduces both clumping (good) and probe locality (bad).
To me these benchmark results show that you'd need to know a lot about the operation profile and working set of a program to predict whether SwissTable or F14 would be faster. For example, if you take a lot of cache misses reading from tables with large entries then flat_hash_set is clearly best, but if you construct and destroy those tables a lot then F14Vector seems like the most compelling.
- Iterate Cold shows that F14Vector's contiguously-packed elements allow much faster iteration, as expected. This helps destruction (and copy construction) performance.
- The Hot results show that the tables are all quite close to each other in performance.
- FindHit Hot shows a 1 or 2 cycle advantage for F14Value. I don't know if this is because it executes fewer instructions, because it achieves higher IPC, or if this result would go away with a stateless equality functor.
- flat_hash_set for FindMiss Hot seems to get a bit of a slowdown with Max load factor. My guess is that this is because quadratic probing leads to longer probe lengths than double hashing. Also, flat_hash_set needs to probe when a chunk is full but hasn't actually overflowed, whereas F14 only probes if a live (non-erased) key was actually displaced.
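For readers following along, here is a minimal sketch (hypothetical function names, not the real Abseil or F14 code) of the two probe strategies being compared. Both walk a power-of-two-sized table; the quadratic sequence uses triangular-number offsets, while double hashing derives an odd stride from a second hash:

```cpp
#include <cstdint>
#include <vector>

// SwissTable-style quadratic probing (sketch): offsets grow by 1, 2, 3, ...
// (triangular numbers), which visits every position exactly once when the
// table size is a power of two.
std::vector<uint64_t> quadratic_probe(uint64_t h1, uint64_t mask, int steps) {
  std::vector<uint64_t> seq;
  uint64_t offset = h1 & mask;
  for (int i = 0; i < steps; ++i) {
    seq.push_back(offset);
    offset = (offset + i + 1) & mask;  // +1, +2, +3, ... between probes
  }
  return seq;
}

// F14-style double hashing (sketch): the stride comes from a second hash and
// is forced odd so it is coprime with the power-of-two table size, which
// guarantees a full cycle. Note the extra live value (`stride`) that must
// stay in a register across the probe loop.
std::vector<uint64_t> double_hash_probe(uint64_t h1, uint64_t h2,
                                        uint64_t mask, int steps) {
  std::vector<uint64_t> seq;
  uint64_t index = h1 & mask;
  uint64_t stride = (h2 & mask) | 1;  // odd => visits every position
  for (int i = 0; i < steps; ++i) {
    seq.push_back(index);
    index = (index + stride) & mask;
  }
  return seq;
}
```

Quadratic probing keeps early probes near the home position (better locality when cold), while the hash-dependent stride of double hashing scatters colliding keys (less clumping), which is the tradeoff described above.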
If I was going to try to drop absl::flat_hash_map/set into a program already using F14FastMap/Set, the most likely blocker would be the memory overhead of small hash tables.
Both SwissTable and F14 have more variation between Min and Max load factor than I expected. Perhaps we should revisit the design choice of using a fixed max_load_factor?
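On that last point, a back-of-the-envelope model (my simplification, not something measured in the benchmarks): right after a doubling rehash a table sits at roughly half its max load factor, so memory per element between the Min and Max measurement points can differ by about 2x:

```cpp
// Rough model: total table bytes per live element at a given load factor.
// `slot_bytes` bundles the per-bucket cost (element plus ~1 control byte).
double bytes_per_element(double slot_bytes, double load_factor) {
  return slot_bytes / load_factor;
}
```

For example, an 8-byte element with one control byte costs about 9/0.875 bytes per element at a 14/16 load factor, and twice that at 7/16 just after a rehash.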
To view this discussion on the web visit https://groups.google.com/d/msgid/hashtable-benchmarks/CALo6A%2Bso6ZkjwEm%3DzxdQH482EDzgcikJvRYdYAgD1aipvCQGDg%40mail.gmail.com.
On Wed, Oct 3, 2018 at 5:23 PM Samuel Benzaquen <sbe...@google.com> wrote, replying inline to Nathan:

> - FindMiss Cold shows that separated metadata is better [...] F14 uses double-hashing, which reduces both clumping (good) and probe locality (bad).
We can actually experiment with this. That is, we could easily test flat_hash_set with double-hashing. Making F14 try quadratic probing should not be hard either, though I don't know that code well enough to say.

Double-hashing does lead to smaller code size, which might just mean that it is faster in some cases because there is less code to run.
> If I was going to try to drop absl::flat_hash_map/set into a program already using F14FastMap/Set, the most likely blocker would be the memory overhead of small hash tables.

Funny that you mention this. We are considering a change that reduces the minimum capacity to 3 without impacting the find/insert code paths, only the reallocation.
> Both SwissTable and F14 have more variation between Min and Max load factor than I expected. Perhaps we should revisit the design choice of using a fixed max_load_factor?

Note that they use a different max load factor. F14 uses 12/14, while flat_hash_set uses 14/16, so there is that too.
--
You received this message because you are subscribed to the Google Groups "hashtable-benchmarks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hashtable-benchm...@googlegroups.com.
To post to this group, send email to hashtable-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hashtable-benchmarks/CALo6A%2BvnycaRN-w997XsNgxJp3U8gtAfYTvkmUQ3Unow7te_MQ%40mail.gmail.com.
On Wed, Oct 3, 2018 at 6:03 PM Nathan Bronson <ngbr...@gmail.com> wrote, replying inline to Samuel:

> We can actually experiment with this. [...] Double-hashing does lead to smaller code size, which might just mean that it is faster in some cases because there is less code to run.

I don't think double-hashing saves code size, since we need an iteration counter anyway to avoid a rare infinite loop.
That means double-hashing occupies two registers in the core loop rather than one (assuming you do the multiply instead of doing power reduction). It should be trivial to try quadratic probing in F14 once I get some more free cycles.
> Funny that you mention this. We are considering a change that reduces the minimum capacity to 3 without impacting the find/insert code paths, only the reallocation.

Nice, it should be quite easy given that you materialize the number of insertions before resize. How did you pick 3?
Hmm, another followup for me. It should be possible to pick an initial capacity with minimal internal fragmentation from the allocator. For example, if sizeof(value_type) is 4 or 8, then an initial capacity of 4 will actually reserve the same amount of memory as 3, due to 16-byte alignment.
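A sketch of that followup (hypothetical helper names, and assuming the 16-byte allocation granule from the example above): round the requested footprint up to the allocator's granule and see how many slots actually fit:

```cpp
#include <cstddef>

// Round an allocation request up to the allocator's granule.
size_t round_up(size_t bytes, size_t granule) {
  return (bytes + granule - 1) / granule * granule;
}

// Largest capacity whose slots fit in the same rounded allocation as
// `capacity` slots would. Footprint is modeled as capacity * value_size
// only; the real tables also carry control bytes.
size_t free_extra_capacity(size_t capacity, size_t value_size,
                           size_t granule) {
  return round_up(capacity * value_size, granule) / value_size;
}
```

For sizeof(value_type) of 4 or 8, a request for 3 slots rounds up to an allocation that already holds 4, matching the example above.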
On Thu, Oct 4, 2018 at 11:15 AM Samuel Benzaquen <sbe...@google.com> wrote, replying inline to Nathan:

> I don't think double-hashing saves code size, since we need an iteration counter anyway to avoid a rare infinite loop.

Oh. It did in my case, because we don't use the iteration counter for anything other than quadratic probing.
> Nice, it should be quite easy given that you materialize the number of insertions before resize. How did you pick 3?

Our capacity has to be a power-of-two-minus-one right now, and our find/insert algorithm assumes there is at least one empty slot in the table. This means that a capacity of 1 would not actually allow any object in it; a capacity of 3 would allow 2 elements in.
This thread actually made me think about this problem a little more, and I think we can forgo the empty-slot requirement for very small tables, which would make a capacity of 1 feasible and allow a max load factor of 1.0 for these small tables.

> Hmm, another followup for me. It should be possible to pick an initial capacity with minimal internal fragmentation from the allocator. For example if sizeof(value_type) was 4 or 8 then an initial capacity of 4 will actually reserve the same amount of memory as 3 due to 16 byte alignment.

We don't have a 16-byte alignment requirement on the data. The memory usage right now is (capacity + 1) * sizeof(value_type) + 16. The 16 comes from sentinel control bytes we add at the end of the control block, which allow for unaligned queries on find and iterator's operator++. Having floating groups reduced the average probe chain length by distributing the elements more uniformly.

With the proposed change we would still have those 16 bytes, though. So a capacity of 3 would use 4*sizeof(value_type)+16. It is not very useful for `int`, but it can save a lot of memory for larger types.
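Plugging numbers into the footprint formula above (a hypothetical helper name; the formula itself is the one stated in this thread):

```cpp
#include <cstddef>

// Stated model for the proposed capacity-3 layout:
// (capacity + 1) * sizeof(value_type) + 16 sentinel control bytes.
size_t small_table_bytes(size_t capacity, size_t value_size) {
  return (capacity + 1) * value_size + 16;
}
```

For `int` this gives 4*4+16 = 32 bytes, where the 16-byte sentinel overhead dominates; for a 48-byte value_type it gives 4*48+16 = 208, where it is negligible.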
We could push it to get a minimum allocation of 16 bytes, but it might require some clever tricks to add the special small-table logic without affecting regular probing.
...snip...
On Thu, Oct 4, 2018 at 11:15 AM 'Samuel Benzaquen' via hashtable-benchmarks <hashtable-...@googlegroups.com> wrote; Nathan Bronson replied inline:

> Oh. It did in my case, because we don't use the iteration counter for anything other than quadratic probing.

That's because you always keep at least 1 empty slot, right?
> Our capacity has to be a power-of-two-minus-one right now. And our find/insert algorithm assumes there is at least one empty slot in the table. This means that a capacity of 1 would not actually allow any object in it. A capacity of 3 would allow 2 elements in.

That makes sense. Floating groups are very nice, but they mean that you might have a hole in the elements even in a very small map. Fixed groups would let you guarantee that, if there was only a single group, the elements were all packed into the bottom.
If you could map all hash values to index 0 when the capacity was <= 1 group (and disable the debug end-of-group insert) then you could do any sub-group size exactly while retaining floating groups for the case when they matter.
> We don't have a 16 byte alignment requirement on the data. The memory usage right now is (capacity + 1) * sizeof(value_type) + 16. [...]

16-byte alignment is a restriction on malloc, and comes from alignof(std::max_align_t). malloc doesn't know about the type that you are going to put in the allocated memory, so it has to give you max_align_t-aligned memory. We use jemalloc, so the underlying allocation sizes are 8, 16, 32, 48, {64, 80, 96, 112} * {1, 2, 3, ...}.
Samuel Benzaquen replied:

> If you could map all hash values to index 0 when the capacity was <= 1 group (and disable the debug end-of-group insert) then you could do any sub-group size exactly while retaining floating groups for the case when they matter.

The 16 extra control bytes for floating groups actually also help with the iterator implementation. The iterator does SSE lookups to find the next element on operator++, but that requires a 16-byte read at arbitrary positions in the control block. Before floating groups we were overflowing the control block and reading some of the payload bytes. Theoretically this doesn't affect program logic, because we never actually used the payload bytes there; we would always find and stop at the sentinel control byte. However, the payload bytes can be uninitialized memory, which is UB, and this was continuously triggering MemorySanitizer.

One solution we tried was to explicitly initialize the first 16 bytes of the payload block to zeros to avoid the UB, but this doesn't really work when you have user-defined payload types. The common source of MSan reports was user-defined types with padding: MSan considered those padding bytes to be uninitialized after the constructors of those objects ran, and we didn't have any control over those bytes after we constructed the object.

If we are willing to have specialized implementations, we could go to a more compact layout for small tables of types we know are safe, like fundamental types. Overflowing the control block there would be fine if we ensure it is not uninitialized to begin with.
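The padding problem described above, in miniature (a made-up type, not from either library): after the constructor runs, the compiler-inserted padding bytes remain uninitialized, so any wide read that sweeps over the whole object — like a 16-byte SSE control scan — touches uninitialized memory:

```cpp
#include <cstdint>

// On a typical 64-bit ABI this struct is 16 bytes: 1 byte of `tag`, 7 bytes
// of padding, then the 8-byte-aligned `id`. The constructor initializes
// both members but not the padding, which MSan tracks as uninitialized.
struct Padded {
  char tag;
  uint64_t id;
  Padded(char t, uint64_t i) : tag(t), id(i) {}
};

static_assert(sizeof(Padded) == 16,
              "1 data byte + 7 padding bytes + 8 data bytes");
```

Zeroing the allocation before constructing a `Padded` would silence MSan for library-controlled memory, but as noted above the library has no control over those bytes once the user's constructor has run.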
I don't know if specializing for fundamental types is worth it, as it is extra complexity, but they are very common and are also the types that would most benefit from a compact control block representation. Having an overhead of 16 bytes for a 1-element flat_hash_set<std::string> is not much of a problem, since the string itself is already large enough; having that overhead for a 1-element flat_hash_set<int> is more relevant.

> 16-byte alignment is a restriction on malloc, and comes from alignof(std::max_align_t). [...] We use jemalloc, so the underlying allocation sizes are 8, 16, 32, 48, {64, 80, 96, 112} * {1, 2, 3, ...}.

Yes.
malloc is discouraged here for this and other reasons, like not supporting sized deletes. We prefer allocating memory through std::allocator, where the alignment is limited to the requested object's alignment. We use TCMalloc, and its preferred alignment is 8 bytes. Using malloc is significantly slower than std::allocator in these cases.
> We could push it to get a minimum allocation of 16 bytes, but might require some clever tricks to add the special small-table logic without affecting regular probing.

If you have good knowledge of the allocation classes it might work to squeeze out a free item for other sizes as well.

> Note that they use a different max load factor. F14 uses 12/14, while flat_hash_set uses 14/16, so there is that too.

Yes, although that's only a 2% difference. Also, when storing 4-byte items (including the vector indexes for F14Vector), F14 actually uses a chunk size of 12 and a max load factor of 10/12, to make a chunk fit exactly in one cache line.
It's outside the scope of benchmarking, but is there somebody with whom I could follow up regarding the correctness of the slot_type union in absl/container/internal/container_memory.h? We faced a similar problem and concluded that it required relying on undefined behavior (which we have attempted to quarantine with std::launder).

- Nathan
...snip...
--
Nathan Grasso Bronson
ngbr...@gmail.com
> We use TCMalloc, and its preferred alignment is 8 bytes.

We use jemalloc underneath. Do you plumb the alignment constraints through to TCMalloc from operator new? If you don't, then it can't know whether the type being allocated has 8- or 16-byte alignment (assuming the allocation itself is >= 16 bytes), so it seems like it must be conservative and give you 16-byte alignment.
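For reference (standard C++ behavior, not a claim about TCMalloc's internals): since C++17 the compiler does plumb alignment through operator new, but only for types whose alignment exceeds __STDCPP_DEFAULT_NEW_ALIGNMENT__; anything at or below that threshold goes through the plain overload, where the allocator must assume max_align_t alignment:

```cpp
#include <new>

// A deliberately over-aligned type: `new Wide` calls the
// operator new(std::size_t, std::align_val_t) overload, so the allocator
// sees the 32-byte requirement. A type with ordinary alignment goes
// through operator new(std::size_t) with no alignment information at all.
struct alignas(32) Wide {
  float v[8];
};

// True on platforms where the default new alignment (commonly 16) is
// below 32, i.e. where the aligned overload would be selected.
bool new_sees_alignment() {
  return alignof(Wide) > __STDCPP_DEFAULT_NEW_ALIGNMENT__;
}
```

std::allocator<T>, by contrast, always knows T and can request alignof(T) directly, which matches the preference for std::allocator expressed above.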
> This thread actually made me think about this problem a little more, and I think we can forgo the empty-slot requirement for very small tables, which would make a capacity of 1 feasible and allow a max load factor of 1.0 for these small tables.

Yes, this is great! Did you consider supporting other sub-group capacities if the user calls explicit reserve() or provides an initial capacity?