I've had similar thoughts, though I don't have a patch.
Another idea I had was pointer compression: if you're moving the
refcounts out of the objects into some kind of lookaside, how about
moving the object pointers themselves into the lookaside as well, and
storing smaller object IDs rather than PyObject* throughout the objects
(the benefit being to avoid doubling heap usage on 64-bit archs, due to
sizeof(PyObject*) doubling)? See
http://dmalcolm.livejournal.com/4183.html?thread=11863#t11863 for some
ideas on how this might look.
(please forgive the shameless plug of my blog post)
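The lookaside idea above could be sketched roughly like this (a toy illustration in Python; all names here are made up, and a real VM would of course do this in C with a contiguous table and 32-bit IDs):

```python
# Illustrative sketch only: a lookaside table that hands out small
# integer IDs so containers can store compact IDs instead of full
# 64-bit PyObject* pointers.

class Lookaside(object):
    def __init__(self):
        self._objects = []   # ID -> object ("pointer")
        self._free = []      # recycled IDs

    def intern(self, obj):
        """Return a small integer ID for obj."""
        if self._free:
            oid = self._free.pop()
            self._objects[oid] = obj
            return oid
        self._objects.append(obj)
        return len(self._objects) - 1

    def deref(self, oid):
        """Follow an ID back to the real object."""
        return self._objects[oid]

    def release(self, oid):
        """Free an ID for reuse once the object is dead."""
        self._objects[oid] = None
        self._free.append(oid)
```

Every container slot then holds a small `oid` and pays one extra indirection through `deref()` on access, which is the trade-off the blog post discusses.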
2009/12/15 Claudio Freire <klauss...@gmail.com>:
> Another idea I had was pointer compression: if you're moving the
> refcounts out of the objects into some kind of lookaside, how about
> moving the object pointers themselves into the lookaside as well, and
> storing smaller object IDs rather than PyObject* throughout the objects
> (the benefit being to avoid doubling heap usage on 64-bit archs, due to
> sizeof(PyObject*) doubling)? See
> http://dmalcolm.livejournal.com/4183.html?thread=11863#t11863 for some
> ideas on how this might look.
To clarify, cache contention prevents linear speedup, i.e., getting the
same amount of throughput from each additional thread/core.
I've been thinking about doing a separate refcount array, and this
thread finally motivated me to do it. First though, I did a little
poking with my existing hash table, and I got closer to linear results
than I remember. Maybe I never tested the options as thoroughly as I
should have when I switched to quad core, or maybe something else
changed (like the kernel or glibc). Dunno. Anyway, doing a separate
refcount array added about 13% throughput over what I was getting,
which is decent enough.
Python2.6: 57000 pystones
Python3.0: 55000 pystones
safethread normal pystone: 29000
safethread threaded pystone 1: 28000 # Test is in a child thread
safethread threaded pystone 2: 55000 # Test is in 2 child threads
safethread threaded pystone 3: 80000 # etc
safethread threaded pystone 4: 102000
Heh, more like you'd expect than a few days ago (the 3rd thread was
doing better than the 2nd!). Might be worth it on 16 cores with a lot
of work in C libraries (fixed overhead is better than a bottleneck).
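The "separate refcount array" mentioned above could be sketched like this (a toy illustration in Python, not the actual patch; a real implementation would index a contiguous array by object slot rather than use a dict):

```python
# Hypothetical sketch: refcounts live in a side table keyed by object
# address instead of in a field inside the object itself, so refcount
# traffic never touches the object's own memory.

class SideRefcounts(object):
    def __init__(self):
        # id(obj) -> refcount; a real VM would use a flat array
        self._counts = {}

    def incref(self, obj):
        key = id(obj)
        self._counts[key] = self._counts.get(key, 0) + 1

    def decref(self, obj):
        key = id(obj)
        self._counts[key] -= 1
        if self._counts[key] == 0:
            del self._counts[key]  # deallocation would happen here

    def refcount(self, obj):
        return self._counts.get(id(obj), 0)
```

Keeping the counts out of the objects also matters for the multiprocessing case discussed later in this thread: refcount updates dirty only the side-table pages, not the copy-on-write-shared object pages.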
However, without a clear transition plan towards a more flexible VM I
don't think it's worth patching CPython like this. Jython, for
example, would be a lot better, and I doubt it'd be that hard to build
a partial CPython API bridge...
So it looks like (and correct me if I'm wrong): using a separate
refcount array and safethread (which implements the Recycler
algorithm), we get linear speedup with additional threads, but about
50% of the speed per thread. Are there any obvious points for
performance improvements within the patch itself (obviously the rest
of the VM has plenty of them ;))? Also, is this patch available
anywhere for others to look at?
Alex
--
"I disapprove of what you say, but I will defend to the death your
right to say it." -- Voltaire
"The people's good is the highest law." -- Cicero
"Code can always be simpler than you think, but never as simple as you
want" -- Me
Yup, except it's closer to 45%, and that's with a separate array
rather than a hash table. A hash table is closer to 40%.
> Are there any obvious points for
> performance improvements within the patch itself (obviously the rest
> of the VM has plenty of them ;))? Also, is this patch available
> anywhere for others to look at?
I don't believe there's much more to be eked out. Refcounting is
simply a lot of work done far too often.
It should be available via https://launchpad.net/python-safethread . I
haven't yet pushed an update with the separate refcounting array. Note
that it's not a simple patch, but rather a huge fork, as the bigger
goal was language extensions to make threading safe.
--
Adam Olsen, aka Rhamphoryncus
On Dec 16, 2:03 am, Claudio Freire <klaussfre...@gmail.com> wrote:
> Recently on the list someone touched the subject of sharing structured
> readonly data with multiprocessing - ie, not using multiprocessing's shared
> memory model, but rather simply accessing shared read-only structures from
> subprocesses.
[...]
yep, I've touched the subject, and I'm confirming that ugly packing of
the complex stuff into array.array isn't pleasant at all, albeit
helpful.
Your patch is aiming at a great goal. I don't understand, though, how
you can't see the impact of your great patch on multiprocessing...
*** in CPython: any top-level parent process that allocates objects
taking >50% of RAM means NO-GO for child processes accessing those
objects ***
Here is the test for the manual benchmarking again:
#################################
import time
from multiprocessing import Pool

MAX_CHILDREN = 2

def f(sec_delay):
    '''
    1. Children are started all together.
    2. However, they start their read access with a 1 sec delay, one
       after another.
    3. Let's ensure that a point in time exists when all children are
       keeping the refcounts high.
    '''
    time.sleep(sec_delay)
    res = sum(huge_data)
    time.sleep(MAX_CHILDREN - sec_delay + 2)
    res += sum(huge_data)  # read access again
    return res

if __name__ == '__main__':
    # my "huge" read-only data, about 1 GB
    huge_data = [1. / i for i in xrange(1, 40000000)]
    p = Pool(MAX_CHILDREN)
    res = list(p.map(f, xrange(1, MAX_CHILDREN + 1)))
    print res
#################################
Such a test will eat up (1+MAX_CHILDREN) GB of RAM on CPython. How
large is your RAM?
Of course, such a test will freeze your system during execution of the
"control implementation" (CPython), but not your great patched one.
Got huge RAM on your box? Feel free to increase MAX_CHILDREN and/or
the size of huge_data :)
regards,
Valery
A 15% or so performance hit would be a lot, but would be worth it in
cases where there just isn't enough memory to duplicate the data
structures. Better still, of course, if u-s could use its magic to
greatly reduce that overhead. Or if this mode of operation could be
easily turned off and on (for instance with a command-line option when
python is invoked).
I understand that you probably couldn't incorporate this into mainline
u-s if you can't substantially reduce the performance hit. But maybe
it could at least be a standard compile-time option.
I agree that getting rid of the GIL would be great, but as a practical
matter, and from the point of view of the kind of code I write, this
patch would make it so that I wouldn't care much anymore about the
GIL. The multiprocessing module lets me do what I need, and using
separate processes is arguably less error-prone than using threads.
The only real problem I have with using the multiprocessing module is
very large, read-only structures that I would like to share between
the processes.
Anyway, those are my personal $.02 on this issue!
Gary
You can't eliminate the performance hit (or reduce it enough to be
acceptable, IMO) without resorting to a compile time option. However,
that doesn't need to go upstream. Just use the patch, report to its
author on how well it works, etc. If it becomes mature and develops a
significant user base, then you can look at making it an upstream
compile-time option.
Right, it would be crazy to make the refcount behavior a command line
option. Then every inc/decref would need to do a global variable
load, which would kill performance.
However, I think it would really be worthwhile to take a look at the
safethread stuff and try to shave down the work done for every incref
and decref. I think there could be more savings there.
Right now every refcount operation does two memory operations: loading
the refcount offset and storing the modified value.
If we use the recycler/safethread approach with separate refcount
data, that work is offloaded to the collector thread. It isn't the
same as free, but it's almost as good these days.
The main threads should be doing about three memory ops: loading the
queue head, storing the pointer to be refcounted, and storing the
incremented head. That sounds like more work, but those two memory
areas should be hot in the cache from other refcount operations.
Maybe this is how safethread works already; I haven't dug into it
yet.
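The three memory ops described above can be sketched like this (an illustrative toy in Python, not safethread's actual code; the mutator only appends to a log, and a collector thread replays it later):

```python
# Sketch of the buffered ("recycler-style") refcounting described
# above: instead of modifying the object's refcount, each mutator
# thread logs the pointer plus a delta into a thread-local queue; a
# collector thread drains the queue and applies the deltas.

INCREF, DECREF = +1, -1

class RefLog(object):
    """Per-thread log: one append per refcount operation."""
    def __init__(self):
        self.entries = []  # (object, delta) pairs

    def log(self, obj, delta):
        # The three memory ops from the text (load the head, store the
        # pointer, store the advanced head) are folded into append()
        # here; both locations stay hot in cache across operations.
        self.entries.append((obj, delta))

def drain(log, counts):
    """Collector side: replay one thread's log into the real counts."""
    for obj, delta in log.entries:
        key = id(obj)
        counts[key] = counts.get(key, 0) + delta
        if counts[key] == 0:
            del counts[key]  # deallocation would happen here
    log.entries = []
    return counts
```

The design point is that the mutator's fast path never reads or writes the shared refcount at all, which is what makes it cheap under contention.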
Using the safethread approach with separate refcounts would be good
for both GIL removal and multiprocessing performance, so I think
someone should take a serious look at it. To start, I'd look at
safethread and try to shave the Py_*REF macros down to as few
operations as possible.
One trick would be to find a way to access thread local values for the
queue head as efficiently as possible. I don't think the naive
approach of putting them on the current PyThreadState would be fast
enough, but it might work if you have a good enough alias analysis so
the tstate is loaded once per C function call. In the JITed code, for
example, we could just use the tstate_ local variable.
Reid
The current safethread approach is actually a fair bit more
complicated than that. Each object also has an "owner" field, which
can be unowned, a specific thread, or asynchronous (shared).
If owned by a specific thread I use normal refcounting directly in the
object, while asynchronous uses a per-thread hash table (or per-thread
array). It's more costly, but it means under many conditions I never
have to stop all the threads to run the tracing GC.
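The owner-field scheme described above could be sketched like this (a toy illustration in Python only; the real safethread code also has an explicit "asynchronous" state and promotion logic, which is omitted here):

```python
# Hypothetical sketch of the owner-field scheme: an object owned by
# the current thread takes the cheap inline-refcount path; an object
# owned by a different thread falls back to a per-thread side table.

UNOWNED = None

class Obj(object):
    def __init__(self):
        self.owner = UNOWNED   # unowned until first touched
        self.refcount = 0

def incref(obj, thread_id, local_counts):
    if obj.owner is UNOWNED:
        obj.owner = thread_id          # first toucher claims ownership
    if obj.owner == thread_id:
        obj.refcount += 1              # fast path: inline refcount
    else:
        # slow path: the object belongs to another thread, so count
        # this reference in the current thread's own side table
        key = id(obj)
        local_counts[key] = local_counts.get(key, 0) + 1
```

The payoff is that the common single-thread case stays as cheap as classic refcounting, and only genuinely shared objects pay for the side table.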
Hrm, I'm wondering: if an object is shared/owned by another thread,
what are the chances you get a mispredicted branch and blow your
instruction cache? It seems to me things like that are likely to
cause a performance penalty, and it's clear that the speed of the
INCREF and DECREF operations is paramount. Therefore I'm thinking it
would be a prudent experiment to try the recycler approach Reid
suggests, since its INCREF and DECREF are less expensive.
On Dec 29, 5:48 pm, Reid Kleckner <r...@mit.edu> wrote:
> Right, it would be crazy to make the refcount behavior a command line
> option. Then every inc/decref would need to do a global variable
> load, which would kill performance.
I don't believe Gary meant some kind of global var to be checked at
run time. Rather, he meant a flag that you could pass to the
"configure" script, a flag that would influence the macros and/or
structure definitions at compile time of the u-s compiler
itself. That way, there's nothing to check at run time.
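In other words, something along these lines (a hypothetical sketch only: the flag name and the `_Py_TableIncref`/`_Py_TableDecref` helpers are made up, and real Py_DECREF also handles deallocation):

```c
/* Hypothetical sketch: a configure flag (say --with-refcount-table,
 * defining PY_REFCOUNT_TABLE) selects the refcount scheme when the
 * interpreter itself is compiled, so there is no run-time check. */
#ifdef PY_REFCOUNT_TABLE
#  define Py_INCREF(op) _Py_TableIncref((PyObject *)(op))
#  define Py_DECREF(op) _Py_TableDecref((PyObject *)(op))
#else
/* classic inline refcount; Py_DECREF as in mainline CPython */
#  define Py_INCREF(op) (((PyObject *)(op))->ob_refcnt++)
#endif
```

Each build gets exactly one definition baked in, so neither variant pays for the other.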
P.S. I would be a happy user who doesn't care about a 15% performance
loss. Such a patch is a great *enabler*.
Regards
Valery
I'm not sure what you're saying there when you mention the "configure
script," but if you mean a setting that could be given on the command
line when invoking u-s so that its JIT uses the separate refcount
array approach, then that's the kind of thing I was hoping might be
possible.
I think this would need to invoke "-j always" as well, so that
everything was done by the JIT compiler right from the start. Not sure
if this idea makes practical sense, but if it does, and it were
implemented in u-s, that would work very well for my
needs.
The reason it would be fine for me for this to require "-j always" is
the obvious one: if I have a job with a huge amount of memory shared
between multiple processes, clearly the compile time is not a big
deal.
On 31 Dez. 2009, 21:57, garyrob <garyr...@gmail.com> wrote:
> I'm not sure what you're saying there when you mention the "configure
> script," but if you mean a setting that could be given on the command
> line when invoking u-s so that its JIT uses the separate refcount
> array approach, then that's the kind of thing I was hoping might be
> possible.
nope :)
I did mean the "configure" script. It sits in the root of the
unladen-swallow source directory and is used to configure the
Makefile. In the end, it determines how the u-s compiler executable
is built.
http://code.google.com/p/unladen-swallow/wiki/GettingStarted
regards
Valery
I had someone else do the install for me (in fact I've never built
python). That saves a little bit of time, but the disadvantage is
obviously that I am not familiar enough with certain basics!
But yes, it would be great if this was an option that could be used to
configure the Makefile if, as appears to be the case, a runtime
command-line option isn't feasible.
I'd be fine with including this as a configure-time option to switch
between the two modes.
Claudio, have you had a chance to experiment with applying your patch
to Unladen?
Collin