Hi,
TL;DR: I'd like to implement compressed pointers (like Oilpan has)
to reduce WTF::String, and by extension AtomicString, to 32 bits
on 64-bit platforms.
I've been looking at ways to make our data structures more compact,
and one of the things that keep standing out is that AtomicString
is pointer-sized and thus 64 bits long on 64-bit platforms.
I experimented at some point with SSO (small-string optimization)
for AtomicString, without much success, but it seems like shrinking
it to 32 bits may fare better.
The scheme is deliberately very similar to Oilpan/V8's; StringImpls
are allocated in a special 2G cage of memory where bit 31 of the
address is always set. (I.e., the memory pool consists of the
addresses from xxxxxxxx80000000 to xxxxxxxxffffffff, for some
xxxxxxxx fixed at process initialization.) This allows compression
to be just a truncation to 32 bits, and decompression to be a sign
extension to 64 bits (a signed shift) followed by an AND with a
mask, namely the cage base with the low 32 bits set. (Notably, this
allows null pointers to be represented as simply 0 without a branch.)
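In pseudo-C++, the scheme is essentially this (a sketch with
hypothetical names, not the actual prototype code):

  #include <cstdint>

  // Set up at process initialization when the cage is reserved:
  // the 4GB-aligned cage base with the low 32 bits all set.
  uint64_t g_cage_mask = 0;

  inline uint32_t Compress(const void* ptr) {
    // Truncation; any in-cage address has bit 31 set,
    // and nullptr simply becomes 0.
    return static_cast<uint32_t>(reinterpret_cast<uintptr_t>(ptr));
  }

  inline void* Decompress(uint32_t val) {
    // Sign-extend (the signed shift): non-null values become
    // 0xffffffff'8xxxxxxx, and the AND swaps the all-ones upper
    // half for the cage base; 0 stays 0, so null needs no branch.
    const int64_t extended = static_cast<int32_t>(val);
    return reinterpret_cast<void*>(static_cast<uint64_t>(extended) &
                                   g_cage_mask);
  }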
Oilpan also has a shift on top of this, so that they can store
more than 2GB of data (at the cost of an extra shift and of course
pointer alignment limitations). I don't know if we need to store
more than 2GB of strings in a given process, but the option is
there should we wish to explore it.
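For the curious, a shift-by-1 variant could look something like the
below (numbers for illustration only; Oilpan's actual parameters may
differ). The cage would then be the upper 4GB half of an 8GB-aligned
reservation, so that bit 32 of every in-cage address is set and
(addr >> 1) still has its top bit set:

  uint64_t g_shifted_cage_mask = 0;  // reservation base | 0x1ffffffff

  inline uint32_t CompressShifted(const void* ptr) {
    // We drop the low bit, so allocations must be 2-byte aligned.
    return static_cast<uint32_t>(reinterpret_cast<uintptr_t>(ptr) >> 1);
  }

  inline void* DecompressShifted(uint32_t val) {
    // Sign-extend, shift back up, and mask in the upper address
    // bits; 0 still decompresses to null without a branch.
    const uint64_t extended =
        static_cast<uint64_t>(
            static_cast<int64_t>(static_cast<int32_t>(val))) << 1;
    return reinterpret_cast<void*>(extended & g_shifted_cage_mask);
  }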
I've made a prototype but without a real allocator; it just mmaps
a huge cage and uses a bump allocator, so it never frees anything.
Nevertheless, the savings are real, e.g., loading a given DOM and
styling it goes down from ~30MB to ~29MB of Oilpan RAM. (There's
probably also some PartitionAlloc RAM saved -- perhaps even more --
but since strings are no longer allocated on the main partition
but instead in my mmap hack, measurements would be unfair. I can
try to investigate this if people are interested.) The prototype
is fully featured on the platforms it supports, and passes all
tests on e.g. mac-rel.
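The prototype's allocator is morally just this (a simplified sketch
with hypothetical names, Linux-only):

  #include <sys/mman.h>
  #include <cstdint>
  #include <cstdlib>

  constexpr size_t kCageSize = 0x80000000ull;  // 2GB

  static char* g_cage_start = nullptr;
  static char* g_cage_cursor = nullptr;
  extern uint64_t g_cage_mask;  // from the earlier sketch

  void InitCage() {
    // Over-reserve 8GB so a 4GB-aligned region is guaranteed to
    // fit, then use its top 2GB half so every address has bit 31
    // set. (We leak the unused parts; fine for a prototype.)
    const size_t reservation = 4 * kCageSize;
    void* raw = mmap(nullptr, reservation, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) abort();
    const uintptr_t aligned =
        (reinterpret_cast<uintptr_t>(raw) + 2 * kCageSize - 1) &
        ~(2 * kCageSize - 1);
    g_cage_start = reinterpret_cast<char*>(aligned + kCageSize);
    g_cage_cursor = g_cage_start;
    g_cage_mask = aligned | 0xffffffffull;
    // The real thing would fault pages in lazily; here we just
    // make the whole cage read/write up front.
    mprotect(g_cage_start, kCageSize, PROT_READ | PROT_WRITE);
  }

  void* BumpAlloc(size_t size) {
    // No freeing, no thread safety; just enough to measure sizes.
    char* ret = g_cage_cursor;
    g_cage_cursor += (size + 7) & ~size_t{7};  // 8-byte alignment
    if (g_cage_cursor > g_cage_start + kCageSize) abort();
    return ret;
  }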
There is, of course, some performance cost in compressing and
decompressing, but it's not bad at all. Speedometer3 is about -0.2%
performance on M1 (pretty much the smallest we can measure reliably
on Pinpoint), +0.2% on Pixel 6 and +0.3% on Pixel 9 (although the
Android measurements have somewhat wide confidence intervals);
presumably the wins on Android are due to better L1/L2 cache usage.
Of course, adding a real allocator will influence this _somehow_,
although it's not obvious in which way; when never freeing anything,
we get poor cache locality and we keep needing to fault in more
pages from the OS (which, from a quick profile on Linux, seems to
be at least as much CPU as PartitionAlloc::Alloc()). I assume that
PGO won't move this much either way, but as usual, it's hard to say
since we don't have full PGO bots. Of course, if your system is
actually strapped for RAM (and with RAM prices nearly tripling
in 2025, we cannot really expect low-end devices to increase their
RAM amounts anytime soon!), the win may be huge.
It's possible that we'd also want to include some other adjacent
structures, such as QualifiedNameImpl. I haven't checked the effect
of this. I also haven't tried to really shuffle structs around to
fill any holes that may occur when String/AtomicString shrinks in
size and leaves extra padding (except the few cases where needed
for correctness).
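To illustrate the kind of holes I mean (hypothetical structs, not
real Blink code):

  #include <cstdint>

  struct CompressedAtomicString { uint32_t compressed_impl; };

  struct Before {
    CompressedAtomicString name;   // 4 bytes
    void* parent;                  // 8 bytes; forces 4 bytes of
                                   // padding above it
    CompressedAtomicString value;  // 4 bytes (+4 bytes tail padding)
  };  // sizeof == 24 on typical 64-bit ABIs

  struct After {
    void* parent;                  // 8 bytes
    CompressedAtomicString name;   // 4 bytes
    CompressedAtomicString value;  // 4 bytes
  };  // sizeof == 16; the holes are gone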
We'd need some help from PartitionAlloc to see if we can actually
get it to give us a separate memory pool with the given alignment
constraints; I'm not going to reinvent malloc for this. (Of course,
on 32-bit platforms, we'll just keep using pointers as before.)
As with V8 pointer compression, we gain the added nice benefit
that corrupting a compressed pointer cannot corrupt anything but
other strings (since it will always stay within the cage);
gaining a write primitive to a String will not allow you to
e.g. overwrite a vtable pointer to gain code execution. So there
is a small defense-in-depth security gain.
Any thoughts? Insights?
/* Steinar */
--
Homepage:
https://www.sesse.net/