Date: Wed, 3 Jul 2024 11:36:29 +0200
From: "Steinar H. Gunderson" <
se...@google.com>
To:
platform-arc...@chromium.org
Subject: Replacing StringHasher with rapidhash
[Resending from @
google.com; evidently my last email got stuck in some
list spam filter]
Hi,
I'm throwing this out to get some early feedback before I go way too deep
in the code; both positive and negative sentiments are welcome. :-)
We spend a lot of time making AtomicStrings (e.g., 10% of CSS parsing is
creating them), and a significant of that goes into making hashes.
StringHasher is an implementation of SuperFastHash from 2013, which is now
looking rather dated. In particular, it has really poor hash properties
(it fails nearly all the SMHasher tests), and it's generally not using
the 64-bit ALUs present in effectively all modern CPUs.
I'd like to replace SuperFastHash with rapidhash, which is currently topping
the SMHasher lists in terms of performance for short strings (and also is
about ten times as fast as SuperFastHash for long strings). It is licensed
under a 2-clause BSD license, and comes as a single header file
(see
https://github.com/Nicoshev/rapidhash), although we'd need to make some
modifications to get e.g. constexpr support and the ability to lowercase
strings on the fly as we're going.
This is not a small change, but it is also not a huge one. What we primarily
lose is the ability to do incremental hashing, but there are actually not
that many users of that (fewer than ten). I'm also planning to retire the
current concept of the string hash expanding 8-bit data on the fly to 16-bit
(SuperFastHash works natively in 16-bit input increments, rapidhash does
not), which will require a little bit of fiddly code. For instance, it means
that whenever we have a 16-bit String that we want to hash, we'll need to
check whether it contains only 8-bit data or not, and if so, hash only ever
other byte. (The check plus new hash should already be much faster than the
old hash alone, since it already needs to do similar checking.) Similarly,
when adding UTF-8 data to AtomicStringTable, we'll probably need to convert
it to UTF-16 ahead of hashing if it's non-ASCII, as opposed to after hashing
and lookup like today.
The benefits, though, are worth it IMO. We gain an immediate ~2% on CSS
parsing, and a prototype shows ~0.5% Speedometer3 and MotionMark gain.
Furthermore, we end up with a just better hash, so fewer collisions.
It is also possible to parameterize much easier then SuperFastHash (you can
generate and choose secrets, randomizing the hash), so if we wanted to, we
could leverage ASLR-like techniques in the future to make DoS-ing hash tables
harder. This is not something I intend to do at this point, though.
What do you think?
/* Steinar */
----- End forwarded message -----