Making multi-threaded allocations on PartitionAlloc faster

151 views
Skip to first unread message

Kentaro Hara

unread,
Jan 11, 2017, 12:12:41 AM1/11/17
to Chris Palmer, Justin Schuh, hpayer, Michael Lippautz, platform-architecture-dev, Project TRIM
Hi

palmer@ moved PartitionAlloc to base/ in preparation for using PartitionAlloc for more objects in Chromium. Specifically, we want to use PartitionAlloc for V8 (*), Skia, CC, PDFium etc.

However, (as some people have showed concern in the past,) the current PartitionAlloc is optimized for single-threaded allocations using a spin lock. To move more objects in Chromium to PartitionAlloc, we'll need to improve the performance of multi-threaded allocations.

On this document (see the first comment), palmer@ is saying that he has an idea to replace the spin lock with futex, but Hannes is saying that we'll need a bit more work than introducing futex.

What is our plan here?

Also what benchmarks should we use to confirm that PartitionAlloc is robust for multi-threaded cases? I think blink_perf.*, Speedometer etc in telemetry benchmarks are not useful since they are single-threaded.

(*) V8 means Zone allocators and other V8 objects currently allocated on malloc. It does not mean V8 objects allocated on GC-managed heaps.


--
Kentaro Hara, Tokyo, Japan

Primiano Tucci

unread,
Jan 12, 2017, 6:39:10 AM1/12/17
to Kentaro Hara, Chris Palmer, Justin Schuh, hpayer, Michael Lippautz, platform-architecture-dev, Project TRIM
On Wed, Jan 11, 2017 at 5:12 AM Kentaro Hara <har...@chromium.org> wrote:
Hi

palmer@ moved PartitionAlloc to base/ in preparation for using PartitionAlloc for more objects in Chromium. Specifically, we want to use PartitionAlloc for V8 (*), Skia, CC, PDFium etc.

However, (as some people have showed concern in the past,) the current PartitionAlloc is optimized for single-threaded allocations using a spin lock. To move more objects in Chromium to PartitionAlloc, we'll need to improve the performance of multi-threaded allocations.

On this document (see the first comment), palmer@ is saying that he has an idea to replace the spin lock with futex, but Hannes is saying that we'll need a bit more work than introducing futex.

IMHO the key here is not really futex vs spinlock. My memory is that PA has not been designed to be used under contention (hence the spinlock: once you assume low contention go for the simplest and lowest-latency option). What really makes multi-thread scenarios go fast is either thread caching (but that comes with cost and complexity, see tc-malloc) or isolation of contention domains (read: ideally each thread should have its own partition)
A futex will just make it so that in the case of contention, threads will be deschedule away, avoiding burning CPU, at the cost of an increased latency.
My fear here is about introducing contention in the first place. futex vs spinlock doesn't solve the problem that, in case of contention, some threads will be blocked on others.
While developing the heap profiler, I recall seeing rates of 400 K alloc(and frees) / second during page load and scrolling. I don't know/remember on which threads does the contention originate, but if it is >1 thread we might want to be careful. (tip: with some tweaks the heap profiler might be a good way to figure out which are the areas subjects to major contention, happy to start a separate thread on this)
 
So, to summarize my comment above, I'm not against replacing the spinlock with a futex. I'm just not convinced that that's sufficient to be good.


What is our plan here?

Also what benchmarks should we use to confirm that PartitionAlloc is robust for multi-threaded cases? I think blink_perf.*, Speedometer etc in telemetry benchmarks are not useful since they are single-threaded.
I'd definitely use benchmarks that stimulate page load and scrolling for the reasons mentioned above. TTFMP and TTFI in system_health.common_desktop might be good choices.
Likewise the smoothness/silk benchmarks, provided they are still up and running.
 

(*) V8 means Zone allocators and other V8 objects currently allocated on malloc. It does not mean V8 objects allocated on GC-managed heaps.


--
Kentaro Hara, Tokyo, Japan

--
You received this message because you are subscribed to the Google Groups "Project TRIM" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-trim...@chromium.org.
To post to this group, send email to projec...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/project-trim/CABg10jy5F%3D3r73jWaRTiWyF4AgXUnseyRuvNhOS2vz%3DhDBUy8Q%40mail.gmail.com.

Kentaro Hara

unread,
Jan 12, 2017, 6:53:27 AM1/12/17
to Primiano Tucci, Chris Palmer, Justin Schuh, hpayer, Michael Lippautz, platform-architecture-dev, Project TRIM
On Thu, Jan 12, 2017 at 8:38 PM, Primiano Tucci <prim...@chromium.org> wrote:
On Wed, Jan 11, 2017 at 5:12 AM Kentaro Hara <har...@chromium.org> wrote:
Hi

palmer@ moved PartitionAlloc to base/ in preparation for using PartitionAlloc for more objects in Chromium. Specifically, we want to use PartitionAlloc for V8 (*), Skia, CC, PDFium etc.

However, (as some people have showed concern in the past,) the current PartitionAlloc is optimized for single-threaded allocations using a spin lock. To move more objects in Chromium to PartitionAlloc, we'll need to improve the performance of multi-threaded allocations.

On this document (see the first comment), palmer@ is saying that he has an idea to replace the spin lock with futex, but Hannes is saying that we'll need a bit more work than introducing futex.

IMHO the key here is not really futex vs spinlock. My memory is that PA has not been designed to be used under contention (hence the spinlock: once you assume low contention go for the simplest and lowest-latency option). What really makes multi-thread scenarios go fast is either thread caching (but that comes with cost and complexity, see tc-malloc) or isolation of contention domains (read: ideally each thread should have its own partition)
A futex will just make it so that in the case of contention, threads will be deschedule away, avoiding burning CPU, at the cost of an increased latency.
My fear here is about introducing contention in the first place. futex vs spinlock doesn't solve the problem that, in case of contention, some threads will be blocked on others.
While developing the heap profiler, I recall seeing rates of 400 K alloc(and frees) / second during page load and scrolling. I don't know/remember on which threads does the contention originate, but if it is >1 thread we might want to be careful. (tip: with some tweaks the heap profiler might be a good way to figure out which are the areas subjects to major contention, happy to start a separate thread on this) 

Ah, this is a very good point.

Another concern would be (CPU's) cache performance. If we share one partition among multiple threads, it might pollute their caches.

Would it be crazy (=waste memory too much) to allocate one partition per thread? Then we can eliminate the lock completely. Oilpan is doing that.


So, to summarize my comment above, I'm not against replacing the spinlock with a futex. I'm just not convinced that that's sufficient to be good.

What is our plan here?

Also what benchmarks should we use to confirm that PartitionAlloc is robust for multi-threaded cases? I think blink_perf.*, Speedometer etc in telemetry benchmarks are not useful since they are single-threaded.
I'd definitely use benchmarks that stimulate page load and scrolling for the reasons mentioned above. TTFMP and TTFI in system_health.common_desktop might be good choices.
Likewise the smoothness/silk benchmarks, provided they are still up and running.

Thanks for the benchmarks. Yeah, it looks better to collect some numbers before speculating a lot.


 

(*) V8 means Zone allocators and other V8 objects currently allocated on malloc. It does not mean V8 objects allocated on GC-managed heaps.


--
Kentaro Hara, Tokyo, Japan

--
You received this message because you are subscribed to the Google Groups "Project TRIM" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-trim+unsubscribe@chromium.org.

--
You received this message because you are subscribed to the Google Groups "platform-architecture-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platform-architecture-dev+unsub...@chromium.org.
To post to this group, send email to platform-architecture-dev@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/platform-architecture-dev/CA%2ByH71fK%2ByFLkJH6DWa9Pi8S%3Dc%2Bo9o4XHmD7TwhMEFoFr4kpSQ%40mail.gmail.com.

Primiano Tucci

unread,
Jan 12, 2017, 10:54:04 AM1/12/17
to Kentaro Hara, Chris Palmer, Justin Schuh, hpayer, Michael Lippautz, platform-architecture-dev, Project TRIM
On Thu, Jan 12, 2017 at 11:53 AM Kentaro Hara <har...@chromium.org> wrote:
On Thu, Jan 12, 2017 at 8:38 PM, Primiano Tucci <prim...@chromium.org> wrote:
On Wed, Jan 11, 2017 at 5:12 AM Kentaro Hara <har...@chromium.org> wrote:
Hi

palmer@ moved PartitionAlloc to base/ in preparation for using PartitionAlloc for more objects in Chromium. Specifically, we want to use PartitionAlloc for V8 (*), Skia, CC, PDFium etc.

However, (as some people have showed concern in the past,) the current PartitionAlloc is optimized for single-threaded allocations using a spin lock. To move more objects in Chromium to PartitionAlloc, we'll need to improve the performance of multi-threaded allocations.

On this document (see the first comment), palmer@ is saying that he has an idea to replace the spin lock with futex, but Hannes is saying that we'll need a bit more work than introducing futex.

IMHO the key here is not really futex vs spinlock. My memory is that PA has not been designed to be used under contention (hence the spinlock: once you assume low contention go for the simplest and lowest-latency option). What really makes multi-thread scenarios go fast is either thread caching (but that comes with cost and complexity, see tc-malloc) or isolation of contention domains (read: ideally each thread should have its own partition)
A futex will just make it so that in the case of contention, threads will be deschedule away, avoiding burning CPU, at the cost of an increased latency.
My fear here is about introducing contention in the first place. futex vs spinlock doesn't solve the problem that, in case of contention, some threads will be blocked on others.
While developing the heap profiler, I recall seeing rates of 400 K alloc(and frees) / second during page load and scrolling. I don't know/remember on which threads does the contention originate, but if it is >1 thread we might want to be careful. (tip: with some tweaks the heap profiler might be a good way to figure out which are the areas subjects to major contention, happy to start a separate thread on this) 

Ah, this is a very good point.

Another concern would be (CPU's) cache performance. If we share one partition among multiple threads, it might pollute their caches.

Would it be crazy (=waste memory too much) to allocate one partition per thread? Then we can eliminate the lock completely. Oilpan is doing that.

I don't think this would work too well in the browser, because: (i) conversely to what might happen in the renderer, the threading situation in the browser is a bit more organic. (ii) This situation might soon-ish change with LuckyLuke

About (i):
  • To begin with we don't know upfront the threads that we have, and dynamically creating a partition might just create memory problems like:
    • exhausting virtual address space on 32 bit (IIRC PA requires large-ish chunks of virtual address space for red zones etc, because is essentially designed for security)
    • have a memory cost due to the overhead (metadata + fragmentation) introduced by each partition.
  • A number of those threads are thread pools. When you have shared thread pools (think to the blocking pool, aka the new FILE thread) I think what we really need is a notion of partition per task (or group of tasks) regardless of where those tasks run onto. Say that you have a bunch of tasks that do network I/O and a bunch of tasks that do disk I/O to handle the profile, and they are sharded on the same thread pool. if we have the partition bound to the thread, we will still hit contentions because the net task on T0 will want to keep accessing the data they allocated by the previous net task in T1, so T0 will be effectively locking T1's heap now, preventing the new profile task running in T1 to perform its own allocation/Frees.
  • What said above does not apply only to thread pools, but to thread hopping in general. Very frequently in the browser we have tasks hopping between IO, UI, and other threads, and they keep passing ownership of heap allocated structures when hopping. Having a partition per thread is IMHO just going to introduce inter-thread locking.
  • I think a winning design here is having one partition per subsystem, regardless of the threads. This gives the good property of performance isolation and to not degenerate performance if the tasks of a subsystem are hoppy.
  • Still, this doesn't solve the problem of subsystems that are parallel by design, i.e. thread pools where tasks belonging to the same subsystem are scheduled in parallel on multiple thread (e.g., the cc raster pool). I think honestly these are the cases where PA is just not a good solution (at least as-is right now), and might need to stick to the default allocator. On the good side, I don't think we have those many cases like that. 
 About (ii): if we go for a partition per thread, once LuckyLuke will be a thing that will amplify the effects of contentions by squashing separate tasks in the same threads. Instead a partition per subsystem will still keep performance isolated and bound to the thread parallelism within that subsystem.
 


So, to summarize my comment above, I'm not against replacing the spinlock with a futex. I'm just not convinced that that's sufficient to be good.

What is our plan here?

Also what benchmarks should we use to confirm that PartitionAlloc is robust for multi-threaded cases? I think blink_perf.*, Speedometer etc in telemetry benchmarks are not useful since they are single-threaded.
I'd definitely use benchmarks that stimulate page load and scrolling for the reasons mentioned above. TTFMP and TTFI in system_health.common_desktop might be good choices.
Likewise the smoothness/silk benchmarks, provided they are still up and running.

Thanks for the benchmarks. Yeah, it looks better to collect some numbers before speculating a lot.



(*) V8 means Zone allocators and other V8 objects currently allocated on malloc. It does not mean V8 objects allocated on GC-managed heaps.


--
Kentaro Hara, Tokyo, Japan

--
You received this message because you are subscribed to the Google Groups "Project TRIM" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-trim...@chromium.org.

--
You received this message because you are subscribed to the Google Groups "platform-architecture-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platform-architect...@chromium.org.
To post to this group, send email to platform-arc...@chromium.org.



--
Kentaro Hara, Tokyo, Japan

--
You received this message because you are subscribed to the Google Groups "Project TRIM" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-trim...@chromium.org.

To post to this group, send email to projec...@chromium.org.

Chris Palmer

unread,
Jan 12, 2017, 2:10:44 PM1/12/17
to Primiano Tucci, Kentaro Hara, Justin Schuh, hpayer, Michael Lippautz, platform-architecture-dev, Project TRIM
Re: large red zones for security: PA uses 4 guard pages per super page (4 * 4 KiB → 16 KiB). How wasteful that is depends on how large the super page is — how many slot spans the super page contains. See the comments in base/allocator/partition_allocator/partition_alloc.h. PA's design intends for the guard page cost to be amortized over time, if a given partition ends up having a lot of allocations in it. If some caller turns out to have many lightly-used partitions, then yes those guard pages could become relatively a lot of overhead. But I don't think we generally have that problem now, AFAIK.

Re: reducing contention by having 1 partition per thread: That might be a good idea, but it has the potential problems you discussed. My original idea was to reduce contention in the reverse manner: by having 1 lock per partition (instead of the global lock we have now). In essence, this:

struct BASE_EXPORT PartitionRootBase {
  ...
  static subtle::SpinLock gInitializedLock;
  static bool gInitialized;
  ...
};

becomes:

struct BASE_EXPORT PartitionRootBase {
  ...
  subtle::SpinLock initialized_lock_;
  bool initialized_;
  ...
};

which then becomes:

struct BASE_EXPORT PartitionRootBase {
  ...
  ABetterKindOfLock initialized_lock_;
  bool initialized_;
  ...
};

For ABetterKindOfLock, read "something that is a simple userspace atomic until the first contention, at which time it falls back to something kernel-mediated". I've been using "futex" to mean that so far, but whatever. We'll figure out something that works when we come to it. :)

I think this solution does not have problems with thread groups or thread-hopping.

Primiano Tucci

unread,
Jan 12, 2017, 2:25:27 PM1/12/17
to Chris Palmer, Kentaro Hara, Justin Schuh, hpayer, Michael Lippautz, platform-architecture-dev, Project TRIM
Hmm I am missing something. gInitializedLock you mentioned should not be a problem anyways as it is used only to initialize partitions, that is a quite rare event (or am I misreading here).
When we were talking about contention I was thinking about the per-partition lock in PartitionRootGeneric (here).
In other words, I am a bit confused when you say "having 1 lock per partition (instead of the global lock we have now)". I though that is already the current state (% futex vs spinlock discussion) when using the *Generic api.


Michael

unread,
Jan 12, 2017, 2:49:18 PM1/12/17
to platform-architecture-dev, pal...@chromium.org, har...@chromium.org, jsc...@chromium.org, hpa...@chromium.org, mlip...@chromium.org, projec...@chromium.org, prim...@chromium.org
(Mail bounced, so trying web.)

Thanks for getting the discussion started!

I realize that PA originates from a single-threaded world with potentially
uniform sizes and that's where it really shines! Having done some research
in that area myself, I just wonder if a general lock-based approach in the
fast-path is the way to go. There is already quite some code and a lot of
papers out there for multi-threaded allocation. At the top of my head I can
think of jemalloc, tcmalloc, llalloc, tbb (industry) and hoard, streamflow,
Michael's lockless allocator, supermalloc and scalloc (academia). All of
which have in common that they try to scale for multiple threads and don't
just get away with a single lock.

I guess only experiments can show how much contention we really have to
deal with. I just want to make sure that this is not taking lightly as it
is not only throughput that should be worried about, but also latency.
Dealing with those occasional hiccups can be really painful.

-Michael
To unsubscribe from this group and stop receiving emails from it, send an email to project-trim+unsubscribe@chromium.org.

--
You received this message because you are subscribed to the Google Groups "platform-architecture-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platform-architecture-dev+unsub...@chromium.org.
To post to this group, send email to platform-architecture-dev@chromium.org.



--
Kentaro Hara, Tokyo, Japan

--
You received this message because you are subscribed to the Google Groups "Project TRIM" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-trim+unsubscribe@chromium.org.

To post to this group, send email to projec...@chromium.org.

--
You received this message because you are subscribed to the Google Groups "platform-architecture-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platform-architecture-dev+unsub...@chromium.org.
To post to this group, send email to platform-architecture-dev@chromium.org.

--
You received this message because you are subscribed to the Google Groups "Project TRIM" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-trim+unsubscribe@chromium.org.

Michael Lippautz

unread,
Jan 12, 2017, 3:10:08 PM1/12/17
to Primiano Tucci, Chris Palmer, Kentaro Hara, Justin Schuh, hpayer, Michael Lippautz, platform-architecture-dev, Project TRIM
Thanks for getting the discussion started! 

I realize that PA originates from a single-threaded world with potentially uniform sizes and that's where it really shines! Having done some research in that area myself, I just wonder if a general lock-based approach in the fast-path is the way to go. There is already quite some code and a lot of papers out there for multi-threaded allocation. At the top of my head I can think of jemalloc, tcmalloc, llalloc, tbb (industry) and hoard, streamflow, Michael's lockless allocator, supermalloc and scalloc (academia). All of which have in common that they try to scale for multiple threads and don't just get away with a single lock.

I guess only experiments can show how much contention we really have to deal with. I just want to make sure that this is not taking lightly as it is not only throughput that should be worried about, but also latency. Dealing with those occasional hiccups can be really painful.

-Michael

Chris Palmer

unread,
Jan 12, 2017, 4:26:50 PM1/12/17
to Primiano Tucci, Kentaro Hara, Justin Schuh, hpayer, Michael Lippautz, platform-architecture-dev, Project TRIM
On Thu, Jan 12, 2017 at 11:25 AM Primiano Tucci <prim...@chromium.org> wrote:

Hmm I am missing something. gInitializedLock you mentioned should not be a problem anyways as it is used only to initialize partitions, that is a quite rare event (or am I misreading here).
When we were talking about contention I was thinking about the per-partition lock in PartitionRootGeneric (here).

Oh yeah, you're right there. Well, I guess what remains is getting a better lock than SpinLock, and measuring thread contention vs. partition overhead.

Kentaro Hara

unread,
Jan 12, 2017, 8:38:41 PM1/12/17
to Chris Palmer, Primiano Tucci, Justin Schuh, hpayer, Michael Lippautz, platform-architecture-dev, Project TRIM
Yeah, I agree with primiano's analysis -- one partition per thread would not be a good idea on a browser process where a task - thread mapping is dynamic.




--
You received this message because you are subscribed to the Google Groups "platform-architecture-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platform-architecture-dev+unsub...@chromium.org.
To post to this group, send email to platform-architecture-dev@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/platform-architecture-dev/CAOuvq23j%2BuoNnugeQWxT19Vdc5w0V-V46nd_QEwgT9TO_PoRfw%40mail.gmail.com.

Hannes Payer

unread,
Jan 13, 2017, 3:36:02 AM1/13/17
to Kentaro Hara, Chris Palmer, Primiano Tucci, Justin Schuh, Michael Lippautz, platform-architecture-dev, Project TRIM
How about implementing a scalable mostly lock-free allocator (e.g. based on some principle ideas from jemalloc and others) for generic allocations? With that we would get all the profiling benefits, since it would be part of PA and it would provide fast and scalable allocations. Using locks or synchronization for every allocation will not scale. Moreover, we have to make sure that generic allocations do not result in high memory fragmentation; there was quite some research in that area how to reduce that.

On Fri, Jan 13, 2017 at 2:38 AM Kentaro Hara <har...@chromium.org> wrote:
Yeah, I agree with primiano's analysis -- one partition per thread would not be a good idea on a browser process where a task - thread mapping is dynamic.




On Fri, Jan 13, 2017 at 6:26 AM, Chris Palmer <pal...@chromium.org> wrote:
On Thu, Jan 12, 2017 at 11:25 AM Primiano Tucci <prim...@chromium.org> wrote:

Hmm I am missing something. gInitializedLock you mentioned should not be a problem anyways as it is used only to initialize partitions, that is a quite rare event (or am I misreading here).
When we were talking about contention I was thinking about the per-partition lock in PartitionRootGeneric (here).

Oh yeah, you're right there. Well, I guess what remains is getting a better lock than SpinLock, and measuring thread contention vs. partition overhead.

--
You received this message because you are subscribed to the Google Groups "platform-architecture-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platform-architect...@chromium.org.
To post to this group, send email to platform-arc...@chromium.org.

Sigbjorn Finne

unread,
Feb 13, 2017, 5:14:16 AM2/13/17
to Hannes Payer, Kentaro Hara, Chris Palmer, Primiano Tucci, Justin Schuh, Michael Lippautz, platform-architecture-dev, Project TRIM
Den 1/13/2017 09:35, Hannes Payer skreiv:
> How about implementing a scalable mostly lock-free allocator (e.g. based on
> some principle ideas from jemalloc and others) for generic allocations?
> With that we would get all the profiling benefits, since it would be part
> of PA and it would provide fast and scalable allocations. Using locks or
> synchronization for every allocation will not scale. Moreover, we have to
> make sure that generic allocations do not result in high memory
> fragmentation; there was quite some research in that area how to reduce
> that.
>

[Reviving a thread from a couple of weeks back with a thought..]

Is the multi-threaded allocation workload that we're seeking to support
here described & captured somewhere at some level?

If having a fast multi-threaded FastMalloc-like partition is unavoidable
(i.e., subsystems & their threads do not naturally map to a segregated
partition a la LayoutObject, and thereby avoid contention and other
issues from cross-thread use), then I agree that the above suggestion is
what to push on.

A "fast threaded" partition where each thread is assigned superpage
extents to allocate from, with eager recycling of superpages back to a
shared set/pool once a thread frees its pages, would be something that
fits with standard multi-threaded allocator designs (incl TCMalloc) and
could perhaps work? The security properties of such a partition would
have to be considered.

But this is a non-trivial undertaking, so knowing more about target use,
could only help.. :)

--sigbjorn
Reply all
Reply to author
Forward
0 new messages