Thanks for taking a look at it!
> Thus, single-threaded execution would be penalized only by a
> negligible boolean guard check factor.
This is where you make a wrong assumption. Refcounting is both very
tight and traditionally incredibly simple. Adding a simple branch is
a substantial portion of why safethread is slower. It actually uses a
per-object field indicating if it should use direct refcounting - the
refcounting would actually be faster if it always did asynchronous
refcounting, but I often need the semantics and memory efficiency of
direct refcounting.
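To make the cost concrete, here's a rough sketch (in C) of what that
per-object check does to the refcounting fast path. This is purely
illustrative - the struct, field, and helper names are invented, not
taken from the actual safethread source:

    typedef struct {
        long ob_refcnt;
        int  ob_async;    /* 0 = direct refcounting, 1 = asynchronous */
    } ObjectHead;

    /* Hypothetical stand-in for handing the update to an asynchronous
     * collector; the real mechanism is more involved. */
    static void queue_refcount_delta(ObjectHead *op, int delta)
    {
        (void)op; (void)delta;
    }

    static void my_incref(ObjectHead *op)
    {
        if (!op->ob_async)
            op->ob_refcnt++;               /* traditional one-instruction path */
        else
            queue_refcount_delta(op, +1);  /* buffered for the async collector */
    }

The branch is trivial on paper, but incref/decref run so often in
CPython that even a well-predicted branch plus the extra field load is
measurable.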
Additionally, some of the slowdown is due to disabling the various
specialized allocators (pymalloc, free lists, etc). These aren't
thread-safe, and they're bottlenecks anyway, so I simply have to do
without them.
However, even though I can't use a runtime option, I can use a
compile-time option. A system could have both python and python-mt
binaries installed, the user could pick which package to install, or a
distribution may pick automatically based on the number of CPUs/cores
in the system.
--
Adam Olsen, aka Rhamphoryncus
> Treating single-threaded execution as a special case looks the only
> viable option for performance (as threading simply cannot be free or
> low-overhead), "practicality beats purity" after all.
Indeed, at least for a CPython-based implementation. Alternative
implementations (IronPython, Jython, PyPy) have much greater
flexibility here. I'm essentially gambling on manycore - if a typical
desktop has 100 cores, single-threaded is going to be silly.
> Although I think you have a vision that is not easily swayed (and that
> is good), I try to argue a bit below. Admittedly I should thoroughly
> examine the source before that -- which I haven't done.
I'm afraid most of this isn't explained in the source. I should write
up a design document, but in the meantime this email will have to do.
;)
> On Jun 16, 6:30 am, "Adam Olsen" <rha...@gmail.com> wrote:
>> On Sun, Jun 15, 2008 at 8:16 PM, mrts <m...@mrts.pri.ee> wrote:
>> > Thus, single-threaded execution would be penalized only by a
>> > negligible boolean guard check factor.
>>
>> This is where you make a wrong assumption. Refcounting is both very
>> tight and traditionally incredibly simple. Adding a simple branch is
>> a substantial portion of why safethread is slower. It actually uses a
>> per-object field indicating if it should use direct refcounting - the
>> refcounting would actually be faster if it always did asynchronous
>> refcounting, but I often need the semantics and memory efficiency of
>> direct refcounting.
>
> Perhaps the global would help to both gain a little speed (as
> per-object fields result in cache misses etc. compared to a single
> variable) and keep things simple:
>
> if MULTIPLE_THREADS_RUNNING:
>     always use asynchronous refcounting
> else:
>     always use direct refcounting
Consider code like this:

    for i in xrange(1000):
        pass
In a language like C or Java, a primitive would be used for "i". No
allocation, just a register updated. In contrast, CPython would
allocate a thousand int objects, each of which needs garbage
collection. With direct refcounting that'll happen in the loop, as
soon as the object is unreachable. With asynchronous refcounting the
tracing GC needs to kick in, scan them, and find they're unreachable.
The extra delay can increase memory usage and lower cache hit ratios,
and it ends up making a big difference.
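To make the contrast concrete, here's an illustrative sketch (reusing
the invented ObjectHead and helpers from the earlier sketch, so again
not real source):

    static void free_object(ObjectHead *op) { (void)op; }  /* stub */

    /* Direct: the int from one iteration is freed before the next
     * iteration allocates, so the same hot block gets reused. */
    static void decref_direct(ObjectHead *op)
    {
        if (--op->ob_refcnt == 0)
            free_object(op);              /* immediate reclamation */
    }

    /* Asynchronous: the object is merely remembered; the tracing GC
     * frees it later, after the loop has allocated many more ints. */
    static void decref_async(ObjectHead *op)
    {
        queue_refcount_delta(op, -1);     /* reclaimed at the next GC pass */
    }

With decref_direct the loop's working set is a single int object; with
decref_async it's all thousand of them.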
> As an aside -- I'd expect that the removal of the GIL code would
> improve performance overall on a similar scale to what is lost to the
> MULTIPLE_THREADS_RUNNING check.
The GIL actually has almost no effect on single-threaded code.
sys.getcheckinterval() is used to make the lock operations fairly
coarse.
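To see why, consider roughly what CPython 2.x's eval loop does (a
simplified paraphrase of ceval.c - the helper names here are
approximations, not the real API):

    static int ticker;
    static int check_interval = 100;   /* sys.setcheckinterval() value */

    static int  threading_initialized(void) { return 0; }  /* stub */
    static void release_and_reacquire_gil(void) {}         /* stub */
    static void dispatch_next_bytecode(void) {}            /* stub */

    static void eval_loop(void)
    {
        for (;;) {
            if (--ticker <= 0) {
                ticker = check_interval;
                if (threading_initialized())
                    release_and_reacquire_gil();  /* give others a chance */
            }
            dispatch_next_bytecode();
        }
    }

Until the first thread is spawned the GIL isn't touched at all, and
after that it's only released once per check interval; the
per-bytecode cost is one decrement and test.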
>> Additionally, some of the slowdown is due to disabling the various
>> specialized allocators (pymalloc, free lists, etc). These aren't
>> thread-safe, and they're bottlenecks anyway, so I simply have to do
>> without them.
>
> Again, in pseudocode:
>
> // global default
> allocator_function_ptr = pymalloc
>
> function init_threading():
>     MULTIPLE_THREADS_RUNNING = true
>     allocator_function_ptr = threadsafe_allocator
>
> I.e. you are free to use a highly-optimized single-threaded allocator
> when no threads are running and switch to the threadsafe one at
> threading start.
Since objects created by the unsafe allocator may later be released
through the safe allocator's free function, the simplest way to
implement that would be just adding some locking around the allocator.
Unfortunately, that's a bottleneck for multithreaded code, so it won't
scale up to multiple cores.
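That simplest approach would look something like this (a sketch, with
plain malloc standing in for pymalloc):

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Correct, but every thread funnels through the one lock. */
    void *locked_malloc(size_t n)
    {
        pthread_mutex_lock(&alloc_lock);
        void *p = malloc(n);               /* stand-in for pymalloc */
        pthread_mutex_unlock(&alloc_lock);
        return p;
    }

    void locked_free(void *p)
    {
        pthread_mutex_lock(&alloc_lock);
        free(p);
        pthread_mutex_unlock(&alloc_lock);
    }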
However, it should be possible to design an allocator that's effective
at both single-threaded and multi-threaded work. Not easy of course,
and it requires a great deal of research and experimentation, which is
why it's not done yet.
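One plausible direction (not something safethread implements) is the
thread-cache design used by allocators like tcmalloc: give each thread
a private free list and only touch a shared pool when it runs dry. A
sketch that glosses over size classes and cache limits:

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct Block { struct Block *next; } Block;

    static __thread Block *local_cache;   /* per-thread; GCC extension */
    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
    static Block *global_pool;

    void *cached_alloc(void)
    {
        Block *b = local_cache;
        if (b) {                           /* fast path: no lock, hot memory */
            local_cache = b->next;
            return b;
        }
        pthread_mutex_lock(&global_lock);  /* slow path: refill from pool */
        b = global_pool;
        if (b)
            global_pool = b->next;
        pthread_mutex_unlock(&global_lock);
        return b ? (void *)b : malloc(sizeof(Block));
    }

    void cached_free(void *p)
    {
        Block *b = p;
        b->next = local_cache;             /* back onto the local list */
        local_cache = b;
    }

The parts this glosses over - size classes, capping the local caches,
and memory freed by a different thread than allocated it - are exactly
where the research and experimentation go.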
> I don't get the idea behind "they're bottlenecks anyway" -- as far as
> I can understand they are just the opposite, i.e. they improve
> performance a lot.
A lock must be placed around them. Only one thread can be within the
locked region at a time, and if it's used often enough, you spend a
significant amount of time waiting. In the context of threading,
that's what a bottleneck is: one thread is okay, that's within
capacity, but any more and you get a traffic jam.
Also note that although Linux's futexes (which glibc's pthreads uses
to implement mutexes) are quite efficient when uncontended, they
become substantially more expensive when contended.
Finally, the cache performance is poor. Rather than reusing some "hot"
memory in your own cache, you're grabbing memory out of another CPU's
cache, which is much slower. You'd be better off using a generic
allocator that hands you memory from your own cache.
>> However, even though I can't use a runtime option, I can use a
>> compile-time option. A system could have both python and python-mt
>> binaries installed, the user could pick which package to install, or a
>> distribution may pick automatically based on the number of CPUs/cores
>> in the system.
>
> How do you solve the standard library problem? As far as I can see, C
> extensions need to be mt-aware. Maintaining ordinary and mt-libs in
> parallel is assured to be extremely painful. And Python without
> batteries included doesn't sound attractive. Generally, I'd try to
> avoid the compile-time dichotomy in preference to run-time.
Basically, I haven't decided yet. :) It might be best to make
extensions use runtime checking while the core uses compile-time.
Then again, if there's two separate sets of packages, it's not *that*
painful to install the right one automatically (if the package manager
is sane *cough*).
It's not that I don't prefer runtime; it's that it usually just
doesn't work (too slow).
> I won't push this idea further, it just looks like the right way to
> cut the Gordian knot of single-threaded performance -- and I don't
> mean just "using a global", but the general idea of having two
> differently optimized code paths for single-threaded and multi-
> threaded cases.
>
> The approach may be wrong, but it's good to get the ball rolling.
Keep the ideas coming. I learned most of this stuff the hard way, so
the least I can do is share that knowledge.
>> Keep the ideas coming. I learned most of this stuff the hard way, so
>> the least I can do is share that knowledge.
>
> Let that idea settle a bit. Looks that we have a general agreement:
> * single-threaded performance is extremely important for the general
> usefulness of this project as the majority of Python code is single-
> threaded
> * unless a low-overhead threading solution emerges, separately
> optimized code paths for single- and multi-thread execution may be
> needed for achieving that performance
> * if separation is needed, runtime would be preferable to compile-
> time
> * recent research and solutions for low-overhead threading should be
> reviewed (a nice project for a CS term paper) and relevant new idioms
> put to use
The real problem here is CPython. Completely new implementations
(Jython, IronPython, PyPy) can use well-known GC techniques that work
reasonably well for threads.
My current solution (despite the performance loss) is already worth a paper.
Python doesn't have variables in the traditional sense. It has
objects and names bound to objects. Your approach would have to be
tweaked to make the objects themselves shared (and the reference count
is stored in the object).
Since any code may be passed any object, even a shared one, all code
that manipulates reference counts must be prepared to handle it.
Thus, you'd always have a branch to switch to an alternate GC mode -
which is exactly what safethread does today.
Possibly the most common shared objects are the modules, classes, and
functions (and their corresponding dicts). In another language you
could make them static (and not objects at all), but not in Python.
Even if you did, some of the most useful objects to share are basics
like int, str, or tuple. The only way around that is to copy
everything passed between threads (or monitors/actors, technically).
This is what Erlang does, but it limits you to use cases that don't
require intense sharing.
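In C terms that copy-on-send style looks something like this
(hypothetical API - real message passing needs a type-aware deep copy,
not a flat memcpy):

    #include <stdlib.h>
    #include <string.h>

    typedef struct { size_t len; char data[]; } Msg;

    /* The payload is duplicated, so sender and receiver never share
     * memory: no refcount contention, but no cheap sharing either. */
    Msg *channel_send_copy(const void *data, size_t len)
    {
        Msg *m = malloc(sizeof(Msg) + len);
        if (!m)
            return NULL;
        m->len = len;
        memcpy(m->data, data, len);
        return m;                  /* ownership passes to the receiver */
    }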
Keep trying. :)