Thanks for taking a look at it!
> Thus, single-threaded execution would be penalized only by a
> negligible boolean guard check factor.
This is where you make a wrong assumption. Refcounting is both very
tight and traditionally incredibly simple. Adding a simple branch is
a substantial portion of why safethread is slower. It actually uses a
per-object field indicating if it should use direct refcounting - the
refcounting would actually be faster if it always did asynchronous
refcounting, but I often need the semantics and memory efficiency of
direct refcounting.
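To make the cost concrete, here's a rough sketch (in C) of what that
per-object check does to the refcounting fast path. This is purely
illustrative - the struct, field, and helper names are invented, not
taken from the actual safethread source:

    typedef struct {
        long ob_refcnt;
        int  ob_async;    /* 0 = direct refcounting, 1 = asynchronous */
    } ObjectHead;

    /* Hypothetical stand-in for handing the update to an asynchronous
     * collector; the real mechanism is more involved. */
    static void queue_refcount_delta(ObjectHead *op, int delta)
    {
        (void)op; (void)delta;
    }

    static void my_incref(ObjectHead *op)
    {
        if (!op->ob_async)
            op->ob_refcnt++;               /* traditional one-instruction path */
        else
            queue_refcount_delta(op, +1);  /* buffered for the async collector */
    }

The branch is trivial on paper, but incref/decref run so often in
CPython that even a well-predicted branch plus the extra field load is
measurable.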
Additionally, some of the slowdown is due to disabling the various
specialized allocators (pymalloc, free lists, etc). These aren't
thread-safe, and they're bottlenecks anyway, so I simply have to do
without them.
However, even though I can't use a runtime option, I can use a
compile-time option. A system could have both python and python-mt
binaries installed, the user could pick which package to install, or a
distribution may pick automatically based on the number of CPUs/cores
in the system.
--
Adam Olsen, aka Rhamphoryncus
> Treating single-threaded execution as a special case looks the only
> viable option for performance (as threading simply cannot be free or
> low-overhead), "practicality beats purity" after all.
Indeed, at least for a CPython-based implementation. Alternative
implementations (IronPython, Jython, PyPy) have much greater
flexibility here. I'm essentially gambling on manycore - if a typical
desktop has 100 cores, single-threaded is going to be silly.
> Although I think you have a vision that is not easily swayed (and that
> is good), I try to argue a bit below. Admittedly I should thoroughly
> examine the source before that -- which I haven't done.
I'm afraid most of this isn't explained in the source. I should write
up a design document, but in the meantime this email will have to do.
;)
> On Jun 16, 6:30 am, "Adam Olsen" <rha...@gmail.com> wrote:
>> On Sun, Jun 15, 2008 at 8:16 PM, mrts <m...@mrts.pri.ee> wrote:
>> > Thus, single-threaded execution would be penalized only by a
>> > negligible boolean guard check factor.
>>
>> This is where you make a wrong assumption. Refcounting is both very
>> tight and traditionally incredibly simple. Adding a simple branch is
>> a substantial portion of why safethread is slower. It actually uses a
>> per-object field indicating if it should use direct refcounting - the
>> refcounting would actually be faster if it always did asynchronous
>> refcounting, but I often need the semantics and memory efficiency of
>> direct refcounting.
>
> Perhaps the global would help to both gain a little speed (as
> per-object fields result in cache misses etc. compared to a single
> variable) and keep things simple:
>
> if MULTIPLE_THREADS_RUNNING:
>     always use asynchronous refcounting
> else:
>     always use direct refcounting
Consider code like this:

    for i in xrange(1000):
        pass
In a language like C or Java, a primitive would be used for "i". No
allocation, just a register updated. In contrast, CPython would
allocate a thousand int objects, each of which needs garbage
collection. With direct refcounting that'll happen in the loop, as
soon as the object is unreachable. With asynchronous refcounting the
tracing GC needs to kick in, scan them, and find they're unreachable.
The extra delay can increase memory usage and lower cache hit ratios,
and it ends up making a big difference.
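To make the contrast concrete, here's an illustrative sketch (reusing
the invented ObjectHead and helpers from the earlier sketch, so again
not real source):

    static void free_object(ObjectHead *op) { (void)op; }  /* stub */

    /* Direct: the int from one iteration is freed before the next
     * iteration allocates, so the same hot block gets reused. */
    static void decref_direct(ObjectHead *op)
    {
        if (--op->ob_refcnt == 0)
            free_object(op);              /* immediate reclamation */
    }

    /* Asynchronous: the object is merely remembered; the tracing GC
     * frees it later, after the loop has allocated many more ints. */
    static void decref_async(ObjectHead *op)
    {
        queue_refcount_delta(op, -1);     /* reclaimed at the next GC pass */
    }

With decref_direct the loop's working set is a single int object; with
decref_async it's all thousand of them.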
> As an aside -- I'd expect that the removal of the GIL code would
> improve performance overall on a similar scale to what is lost to the
> MULTIPLE_THREADS_RUNNING check.
The GIL actually has almost no effect on single-threaded code.
sys.getcheckinterval() is used to make the lock operations fairly
coarse.
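To see why, consider roughly what CPython 2.x's eval loop does (a
simplified paraphrase of ceval.c - the helper names here are
approximations, not the real API):

    static int ticker;
    static int check_interval = 100;   /* sys.setcheckinterval() value */

    static int  threading_initialized(void) { return 0; }  /* stub */
    static void release_and_reacquire_gil(void) {}         /* stub */
    static void dispatch_next_bytecode(void) {}            /* stub */

    static void eval_loop(void)
    {
        for (;;) {
            if (--ticker <= 0) {
                ticker = check_interval;
                if (threading_initialized())
                    release_and_reacquire_gil();  /* give others a chance */
            }
            dispatch_next_bytecode();
        }
    }

Until the first thread is spawned the GIL isn't touched at all, and
after that it's only released once per check interval; the
per-bytecode cost is one decrement and test.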
>> Additionally, some of the slowdown is due to disabling the various
>> specialized allocators (pymalloc, free lists, etc). These aren't
>> thread-safe, and they're bottlenecks anyway, so I simply have to do
>> without them.
>
> Again, in pseudocode:
>
> // global default
> allocator_function_ptr = pymalloc
>
> function init_threading():
>     MULTIPLE_THREADS_RUNNING = true
>     allocator_function_ptr = threadsafe_allocator
>
> I.e. you are free to use a highly-optimized single-threaded allocator
> when no threads are running and switch to the threadsafe one at
> threading start.
Since objects created by the unsafe allocator may later be released
through the safe allocator's free function, the simplest way to
implement that would be just adding some locking around the allocator.
Unfortunately, that's a bottleneck for multithreaded code, so it won't
scale up to multiple cores.
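That simplest approach would look something like this (a sketch, with
plain malloc standing in for pymalloc):

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Correct, but every thread funnels through the one lock. */
    void *locked_malloc(size_t n)
    {
        pthread_mutex_lock(&alloc_lock);
        void *p = malloc(n);               /* stand-in for pymalloc */
        pthread_mutex_unlock(&alloc_lock);
        return p;
    }

    void locked_free(void *p)
    {
        pthread_mutex_lock(&alloc_lock);
        free(p);
        pthread_mutex_unlock(&alloc_lock);
    }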
However, it should be possible to design an allocator that's effective
at both single-threaded and multi-threaded work. Not easy of course,
and it requires a great deal of research and experimentation, which is
why it's not done yet.
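One plausible direction (not something safethread implements) is the
thread-cache design used by allocators like tcmalloc: give each thread
a private free list and only touch a shared pool when it runs dry. A
sketch that glosses over size classes and cache limits:

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct Block { struct Block *next; } Block;

    static __thread Block *local_cache;   /* per-thread; GCC extension */
    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
    static Block *global_pool;

    void *cached_alloc(void)
    {
        Block *b = local_cache;
        if (b) {                           /* fast path: no lock, hot memory */
            local_cache = b->next;
            return b;
        }
        pthread_mutex_lock(&global_lock);  /* slow path: refill from pool */
        b = global_pool;
        if (b)
            global_pool = b->next;
        pthread_mutex_unlock(&global_lock);
        return b ? (void *)b : malloc(sizeof(Block));
    }

    void cached_free(void *p)
    {
        Block *b = p;
        b->next = local_cache;             /* back onto the local list */
        local_cache = b;
    }

The parts this glosses over - size classes, capping the local caches,
and memory freed by a different thread than allocated it - are exactly
where the research and experimentation go.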
> I don't get the idea behind "they're bottlenecks anyway" -- as far as
> I can understand they are just the opposite, i.e. they improve
> performance a lot.
A lock must be placed around them. Only one thread can be within the
locked region at a time, and if it's used often enough, you spend a
significant amount of time waiting. In the context of threading,
that's what a bottleneck is: one thread is okay, that's within
capacity, but any more and you get a traffic jam.
Also note that although Linux's futexes (which glibc's pthreads uses
to implement mutexes) are quite efficient when uncontended, they
become substantially more expensive when contended.
Finally, the cache performance is poor. Rather than reusing some "hot"
memory in your own cache, you're grabbing memory out of another CPU's
cache, which is much slower. You'd be better off using a generic
allocator that hands you memory from your own cache.
>> However, even though I can't use a runtime option, I can use a
>> compile-time option. A system could have both python and python-mt
>> binaries installed, the user could pick which package to install, or a
>> distribution may pick automatically based on the number of CPUs/cores
>> in the system.
>
> How do you solve the standard library problem? As far as I can see, C
> extensions need to be mt-aware. Maintaining ordinary and mt-libs in
> parallel is assured to be extremely painful. And Python without
> batteries included doesn't sound attractive. Generally, I'd try to
> avoid the compile-time dichotomy in preference to run-time.
Basically, I haven't decided yet. :) It might be best to make
extensions use runtime checking while the core uses compile-time.
Then again, if there's two separate sets of packages, it's not *that*
painful to install the right one automatically (if the package manager
is sane *cough*).
It's not that I don't prefer runtime; it's that it usually just
doesn't work (too slow).
> I won't push this idea further, it just looks like the right way to
> cut the Gordian knot of single-threaded performance -- and I don't
> mean just "using a global", but the general idea of having two
> differently optimized code paths for single-threaded and multi-
> threaded cases.
>
> The approach may be wrong, but it's good to get the ball rolling.
Keep the ideas coming. I learned most of this stuff the hard way, so
the least I can do is share that knowledge.
>> Keep the ideas coming. I learned most of this stuff the hard way, so
>> the least I can do is share that knowledge.
>
> Let that idea settle a bit. Looks that we have a general agreement:
> * single-threaded performance is extremely important for the general
> usefulness of this project as the majority of Python code is single-
> threaded
> * unless a low-overhead threading solution emerges, separately
> optimized code paths for single- and multi-thread execution may be
> needed for achieving that performance
> * if separation is needed, runtime would be preferable to compile-
> time
> * recent research and solutions for low-overhead threading should be
> reviewed (a nice project for a CS term paper) and relevant new idioms
> put to use
The real problem here is CPython. Completely new implementations
(Jython, IronPython, PyPy) can use well-known GC techniques that work
reasonably well for threads.
My current solution (despite the performance loss) is already worth a paper.
Python doesn't have variables in the traditional sense. It has
objects and names bound to objects. Your approach would have to be
tweaked to make the objects themselves shared (and the reference count
is stored in the object).
Since any code may be passed any object, even a shared one, all code
that manipulates reference counts must be prepared to handle it.
Thus, you'd always have a branch to switch to an alternate GC mode -
which is exactly what safethread does today.
Possibly the most common shared objects are the modules, classes, and
functions (and their corresponding dicts). In another language you
could make them static (and not objects at all), but not in Python.
Even if you did, some of the most useful objects to share are basics
like int, str, or tuple. The only way around that is to copy
everything passed between threads (or monitors/actors, technically).
This is what Erlang does, but it limits you to use cases that don't
require intense sharing.
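In C terms that copy-on-send style looks something like this
(hypothetical API - real message passing needs a type-aware deep copy,
not a flat memcpy):

    #include <stdlib.h>
    #include <string.h>

    typedef struct { size_t len; char data[]; } Msg;

    /* The payload is duplicated, so sender and receiver never share
     * memory: no refcount contention, but no cheap sharing either. */
    Msg *channel_send_copy(const void *data, size_t len)
    {
        Msg *m = malloc(sizeof(Msg) + len);
        if (!m)
            return NULL;
        m->len = len;
        memcpy(m->data, data, len);
        return m;                  /* ownership passes to the receiver */
    }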
Keep trying. :)