Compressed pointers in String/AtomicString


Steinar H. Gunderson

Dec 17, 2025, 3:06:41 PM
to platform-arc...@chromium.org
Hi,

TL;DR: I'd like to implement compressed pointers (like Oilpan has)
to reduce WTF::String, and by extension AtomicString, to 32 bits
on 64-bit platforms.

I've been looking at ways to make our data structures more compact,
and one of the things that keep standing out is that AtomicString
is pointer-sized and thus 64 bits long on 64-bit platforms.
I experimented at some point with SSO for AtomicString, without
much success, but it seems like shrinking it to 32 bits may fare
better.

The scheme is deliberately very similar to Oilpan/V8's; StringImpls
are allocated in a special 2G cage of memory where the top bit is
always set. (I.e., the memory pool consists of the addresses from
xxxxxxxx80000000 to xxxxxxxxffffffff, for some xxxxxxxx fixed at
process initialization.) This allows compression to be just a
truncation to 32 bits, and decompression to be a sign extension
to 64 bits followed by an AND with a mask. (Notably, this allows
null pointers to be represented as simply 0 without a branch.)
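
Roughly, in code -- a sketch only; Compress, Decompress and kCageMask
are made-up names, not the prototype's actual API:

  #include <cstdint>

  // Sketch only: assumes StringImpls live in a 2 GB cage covering
  // [base + 0x80000000, base + 0xffffffff] for some fixed base whose
  // low 32 bits are zero.
  extern const uint64_t kCageMask;  // made up; would be base | 0xffffffff

  inline uint32_t Compress(const void* ptr) {
    // Truncation; non-null pointers in the cage always have bit 31 set.
    return static_cast<uint32_t>(reinterpret_cast<uintptr_t>(ptr));
  }

  inline void* Decompress(uint32_t compressed) {
    // Sign-extend so that bit 31 fills the upper half, then AND with the
    // mask. Null stays null (0 & mask == 0), with no branch.
    const int64_t sign_extended = static_cast<int32_t>(compressed);
    return reinterpret_cast<void*>(static_cast<uint64_t>(sign_extended) &
                                   kCageMask);
  }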

Oilpan also has a shift on top of this, so that they can store
more than 2GB of data (at the cost of an extra shift and of course
pointer alignment limitations). I don't know if we need to store
more than 2GB of strings in a given process, but the option is
there should we wish to explore it.
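
A sketch of what the shifted variant could look like (again with made-up
names; kShift = 2 is only an illustrative value):

  #include <cstdint>

  // Sketch only. With a shift of kShift, the cage can grow to
  // 2^(31 + kShift) bytes, at the cost of 2^kShift-byte alignment and
  // one extra shift in each direction. The cage base is assumed to be
  // aligned to 1 << (32 + kShift).
  constexpr unsigned kShift = 2;
  extern const uint64_t kCageMask;  // made up; here it would be
                                    // base | ((1ull << (32 + kShift)) - 1)

  inline uint32_t Compress(const void* ptr) {
    // Drop the alignment bits; the top bit of the result is still always
    // set for pointers inside the cage.
    return static_cast<uint32_t>(reinterpret_cast<uintptr_t>(ptr) >> kShift);
  }

  inline void* Decompress(uint32_t compressed) {
    const int64_t sign_extended = static_cast<int32_t>(compressed);
    return reinterpret_cast<void*>(
        (static_cast<uint64_t>(sign_extended) << kShift) & kCageMask);
  }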

I've made a prototype but without a real allocator; it just mmaps
a huge cage and uses a bump allocator, so it never frees anything.
Nevertheless, the savings are real, e.g., loading a given DOM and
styling it goes down from ~30MB to ~29MB of Oilpan RAM. (There's
probably also some PartitionAlloc RAM saved -- perhaps even more --
but since strings are now not allocated on the main partition
but instead in my mmap hack, measurements would be unfair. I can
try to investigate this if people are interested.) The prototype
is fully-featured on the platforms it supports, and passes all
tests on e.g. mac-rel.
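
The allocation side of such a prototype can be as simple as something
like this (a sketch, not the actual prototype code; all names are made
up, and error handling, overflow checks and cage placement are glossed
over):

  #include <sys/mman.h>

  #include <atomic>
  #include <cstddef>
  #include <cstdint>

  // Sketch of "mmap a huge cage, bump-allocate, never free".
  constexpr size_t kCageSize = size_t{2} << 30;  // 2 GB
  char* g_cage_base = nullptr;
  std::atomic<size_t> g_bump_offset{0};

  void InitStringCage() {
    // Reserve the whole cage up front; the OS faults pages in lazily,
    // which is where much of the prototype's CPU time goes.
    g_cage_base = static_cast<char*>(
        mmap(nullptr, kCageSize, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
  }

  void* AllocateStringBytes(size_t size) {
    // Bump allocation with 8-byte alignment; freeing would be a no-op.
    const size_t aligned = (size + 7) & ~size_t{7};
    const size_t offset = g_bump_offset.fetch_add(aligned);
    return g_cage_base + offset;
  }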

There is, of course, some performance cost in compressing and
decompressing, but it's not bad at all. Speedometer3 is about -0.2%
performance on M1 (pretty much the smallest we can measure reliably
on Pinpoint), +0.2% on Pixel 6 and +0.3% on Pixel 9 (although the
Android measurements have somewhat wide confidence intervals);
presumably the wins on Android are due to better L1/L2 cache usage.
Of course, adding a real allocator will influence this _somehow_,
although it's not obvious in which way; when never freeing anything,
we get poor cache locality and we keep needing to fault in more
pages from the OS (which, from a quick profile on Linux, seems to
be at least as much CPU as PartitionAlloc::Alloc()). I assume that
PGO won't move this much either way, but as usual, it's hard to say
since we don't have full PGO bots. Of course, if your system is
actually strapped for RAM (and with RAM prices nearly tripling
in 2025, we cannot really expect low-end devices to increase their
RAM amounts anytime soon!), the win may be huge.

It's possible that we'd also want to include some other adjacent
structures, such as QualifiedNameImpl. I haven't checked the effect
of this. I also haven't tried to really shuffle structs around to
fill any holes that may occur when String/AtomicString shrinks in
size and leaves extra padding (except the few cases where needed
for correctness).

We'd need some help from PartitionAlloc to see if we can actually
get it to give us a separate memory pool with the given alignment
constraints; I'm not going to reinvent malloc for this. (Of course,
on 32-bit platforms, we'll just keep using pointers as before.)
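
To make the alignment constraint concrete, here is one way such a cage
could be reserved with plain POSIX mmap (a sketch with made-up names,
just to illustrate the constraint, not a suggestion for how
PartitionAlloc would actually do it):

  #include <sys/mman.h>

  #include <cstdint>

  // Sketch: reserve a 2 GB cage whose addresses all have bit 31 set,
  // i.e. the upper half of a 4 GB-aligned window.
  constexpr uint64_t k4GB = uint64_t{1} << 32;
  constexpr uint64_t k2GB = uint64_t{1} << 31;

  char* ReserveStringCage() {
    // Over-reserve 8 GB of address space (nothing is committed, since it
    // is PROT_NONE) so that a 4 GB-aligned window is guaranteed to fit.
    void* raw_mem = mmap(nullptr, 2 * k4GB, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw_mem == MAP_FAILED)
      return nullptr;
    char* raw = static_cast<char*>(raw_mem);

    // The cage is the upper 2 GB half of the first aligned window.
    uintptr_t base =
        (reinterpret_cast<uintptr_t>(raw) + k4GB - 1) & ~(k4GB - 1);
    char* cage = reinterpret_cast<char*>(base + k2GB);

    // Return the slack on both sides and make the cage itself usable.
    munmap(raw, cage - raw);
    munmap(cage + k2GB, (raw + 2 * k4GB) - (cage + k2GB));
    mprotect(cage, k2GB, PROT_READ | PROT_WRITE);
    return cage;  // every address in [cage, cage + 2 GB) has bit 31 set
  }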

As with V8 pointer compression, we gain the added nice benefit
that corrupting a compressed pointer cannot corrupt anything else
than other strings (since it will always stay in the cage);
gaining a write primitive to a String will not allow you to
e.g. overwrite a vtable pointer to gain code execution. So there
is a small defense-in-depth security gain.

Any thoughts? Insights?

/* Steinar */
--
Homepage: https://www.sesse.net/

Jeremy Roman

Dec 17, 2025, 4:19:13 PM
to Steinar H. Gunderson, platform-arc...@chromium.org
Interesting! There's some complexity here (but not an unmanageable amount), so I guess the question is -- is this a compelling enough performance improvement?

It also sounds like it's neutral to negative on Speedometer, so do we have an idea of how much we value the memory savings? (And I guess tangentially, though I'm not asking you to answer this question, I'm curious whether there are particular places where we could have our cake and eat it too -- are there some String use sites that are very time- but not very memory-sensitive, and the reverse? Obviously that comes with its own headaches.)

I'm particularly curious to hear what people involved in PartitionAlloc and Oilpan think from our past experiences with both (some may be on this list already, not sure).


Dave Tapuska

Dec 17, 2025, 4:40:13 PM
to Jeremy Roman, Steinar H. Gunderson, platform-architecture-dev
Do you have numbers on how many Strings are in the heap on a set of benchmark pages (say, the top 100 URLs)? How many are static, atomic, and non-static & non-atomic? Then we would know what the savings would be.

I've always wanted to reduce the cost of atomic strings that are static. There really should be no allocation for them, because they come from .rodata. I believe WebKit did something with ASCIILiteral here. But alas, the size I computed was only something like 10k-25k, so it didn't justify the work/complexity.

Dave

Kentaro Hara

Dec 17, 2025, 6:24:16 PM
to Dave Tapuska, Jeremy Roman, Steinar H. Gunderson, platform-architecture-dev
+1 to get more numbers and evaluate the benefit and cost.

A few technical questions:

1. We are talking about AtomicString but can we apply the optimization to String as well?

2. This means that we can support <2GB for strings in total. Is this restriction okay?

3. Why 2 GB? Pointers on 64-bit systems are aligned to 8-byte boundaries. Can we support up to 8 GB?


Kentaro Hara, Tokyo


Steinar H. Gunderson

Dec 18, 2025, 2:38:41 AM
to Kentaro Hara, Dave Tapuska, Jeremy Roman, platform-architecture-dev
On Thu, Dec 18, 2025 at 08:24:03AM +0900, Kentaro Hara wrote:
> 1. We are talking about AtomicString but can we apply the optimization to
> String as well?

Yes. AtomicString is just a String with a special flag that says it has been
interned, so my prototype applies equally to both.

In theory, we could have an UncompressedString that keeps the same
performance as before, but I haven't seen the need for it yet.

> 2. This means that we can support <2GB for strings in total. Is this
> restriction okay?

This is unclear to me. But pages needing >2GB would surely just crash on
32-bit browsers.

> 3. Why 2 GB? Pointers in 64 bit systems are aligned with 8 bit boundaries.
> Can we support up to 8 GB?

Yes, we can support up to 8 GB (or even 16 GB if we accept 16-byte alignment,
which I believe is already the case for String under PartitionAlloc today;
although I think it is somewhat wasteful) if we add an extra shift on
compression and decompression. I haven't looked into how this affects
benchmarks yet.

Steinar H. Gunderson

Dec 18, 2025, 2:40:30 AM
to Dave Tapuska, Jeremy Roman, platform-architecture-dev
On Wed, Dec 17, 2025 at 04:39:58PM -0500, Dave Tapuska wrote:
> Do you have numbers of how many String(s) are in the heap on a set of
> benchmark pages (say top 100 URLs). How many are static, atomic, non-static
> & non-atomic? Then we would know what the savings would be.

It's hard to know, but I could make an attempt? Since we have the refcount,
we could probably “just” add all of them together, although that would
also include the ones on the stack.

> I've always wanted to reduce the number of atomic strings that are static.
> There really should be no allocation for them because they come from the
> ro.data.

My pet peeve there is that we should be able to get Hash() for them
at compile time, but C++ probably isn't smart enough for that :-)
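
A toy sketch of the kind of thing I mean, using FNV-1a purely as a
stand-in; making WTF's actual StringHasher constexpr is the part I'm
less sure about:

  #include <cstddef>
  #include <cstdint>

  // Toy illustration of a compile-time string hash. This is FNV-1a, not
  // WTF::StringHasher; the point is only that hashing a static literal
  // can be a constant expression.
  constexpr uint32_t HashLiteral(const char* data, size_t length) {
    uint32_t hash = 2166136261u;
    for (size_t i = 0; i < length; ++i) {
      hash ^= static_cast<unsigned char>(data[i]);
      hash *= 16777619u;
    }
    return hash;
  }

  // Evaluated entirely at compile time; no runtime hashing of the literal.
  constexpr uint32_t kDivHash = HashLiteral("div", 3);
  static_assert(kDivHash == HashLiteral("div", 3), "constant expression");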

Michael Lippautz

Dec 18, 2025, 3:48:15 AM
to Steinar H. Gunderson, Dave Tapuska, Jeremy Roman, platform-architecture-dev
Since this was asked further up: When we introduced this in Oilpan (blog), there were clear incentives set up to optimize for memory, and we were able to measure relatively large overall improvements. With performance being neutral (w/ PGO), this was a clear go for us back then.

We do not want to regress competitive performance for memory at this point. (Semi-related: we'd rather invest any budget that we possibly have here in security mitigations.)

IIUC, we'd be doing this to save a little memory? I realize Strings are quite ubiquitous, so some more data could help here. Can you run the leftover system health benchmarks as a first proxy? (e.g. system_health.memory_desktop and system_health.memory_mobile)


Steinar H. Gunderson

Dec 18, 2025, 9:23:57 AM
to Dave Tapuska, Jeremy Roman, platform-architecture-dev
On Wed, Dec 17, 2025 at 04:39:58PM -0500, Dave Tapuska wrote:
> Do you have numbers of how many String(s) are in the heap on a set of
> benchmark pages (say top 100 URLs). How many are static, atomic, non-static
> & non-atomic? Then we would know what the savings would be.

I was trying to find this, but realized I don't have a good URL list.
Do you know where I could find one?

As an example, I picked a random popular page and got 28k strings,
of which 2k were static and 24k were atomic, but with a total refcount
of 170k at onload. That sounds about right to me, and in line with Anton's
earlier findings: ~1.4MB of String pointers that we can cut in half.

But pages vary a lot; e.g., for the HTML spec, which has a large DOM
but doesn't really generate that many Blink strings, we “only” got 6k strings
and a refcount of 35k.

In the meantime, I hacked in a modified copy of mimalloc so that I could
measure with a real allocator (of course, a production solution would adapt
PartitionAlloc, but I don't believe the numbers will be significantly
different); it's not the prettiest solution, but it allowed me to validate
the -0.2% number on M1. (I'm similarly re-running the Pixel 6 and 9
benchmarks, although the number of available bots on Pinpoint is pretty thin
and it takes a long time. I believe that these will remain ever so slightly
positive in performance, though.)