Moving refcounts away from object data


Claudio Freire

Dec 15, 2009, 8:03:42 PM
to Unladen Swallow
Hi there...

Recently on the list someone touched the subject of sharing structured readonly data with multiprocessing - ie, not using multiprocessing's shared memory model, but rather simply accessing shared read-only structures from subprocesses.

That's a problem I've been having myself, which I've solved by packing the structures into byte arrays with lots of fancy stuff and *then* using mmap'd memory - but that's ugly, so I dug in and traced the problem to refcounting: even though the kernel performs COW on pages shared among subprocesses, Python's refcounting touches almost every page (writing to it whenever it updates a refcount within a PyObject).

So, I started toying with the idea of having indirect reference counts: moving the counter itself into packed arrays and having PyObjects hold only pointers to those counters. There's an obvious performance hit, so I wanted to measure how much of a hit it was, to see if it was worth the benefit of improved multiprocessing.

First, I applied the patch to mainline CPython (Python-2.6.1, the same version unladen bases itself on, and the same we're using at work) - partly because I'm familiar with it, and partly because that's what's of use to me. But we've also been considering unladen-swallow, and although it still isn't as stable and mature as we'd like (for us, for instance, memory efficiency is a real concern, and only recently has unladen become concerned with it as well), I thought I would try the same patch in unladen. I'm secretly hoping LLVM will remove some of the redundant checks (not so secret anymore).

But... I've been lurking here enough to know not to attempt this on my own - I know there are two versions of "stuff", one for the CPython-like VM, and one for LLVM. So I'm here asking for pointers:

The patch for Python-2.6.1 can be found in this thread: http://groups.google.com/group/comp.lang.python/browse_thread/thread/6ac1a2fab33586a3/d3cd7932c7c7339b?lnk=gst

That patch, in short, messes with Python's macros:
 INCREF
 DECREF
 DEALLOC
 PyObject_HEAD_INIT
 PyVarObject_HEAD_INIT
  ... and the like

And of course changes to .c files as required (ie: not every source file used Py_REFCNT() to access the reference count - so I had to fix that).
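
To give an idea of the shape of the change, here's a minimal sketch (not the patch itself - that's in the thread linked above):

/* Sketch only: the refcount moves into a packed side array, and the
 * object header keeps just a pointer to its slot, so INCREF/DECREF
 * never write to the page holding the object itself (keeping COW
 * pages clean). */
typedef struct _object {
    Py_ssize_t *ob_prefcnt;          /* points into a packed refcount array */
    struct _typeobject *ob_type;
} PyObject;

#define Py_REFCNT(ob)  (*(((PyObject *)(ob))->ob_prefcnt))
#define Py_INCREF(ob)  ((void)(++*(((PyObject *)(ob))->ob_prefcnt)))
#define Py_DECREF(ob)                                    \
    do {                                                 \
        if (--*(((PyObject *)(ob))->ob_prefcnt) == 0)    \
            _Py_Dealloc((PyObject *)(ob));               \
    } while (0)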

Are there other places I'd have to "sync" in unladen?
Will LLVM pick up those changes in the macros?


BTW: here's unladen's perf.py output (it's a somewhat old perf.py, sorry) comparing 2.6.1 against the patch. I'm still trying to lower the overhead. The benchmarks don't show the BIG improvement in multiprocessing, though - the comp.lang.python thread I linked above says a bit about that... I wonder if there are enough users of this pattern of sharing data among processes to warrant some tests of the sort.

call_simple:
Min: 1.169127 -> 1.414741: 17.36% slower
Avg: 1.192160 -> 1.424129: 16.29% slower
Significant (t=-55.366398, a=0.95)
Stddev: 0.02848 -> 0.00817: 248.42% smaller


django:
Min: 0.829765 -> 0.952384: 12.87% slower
Avg: 0.833870 -> 0.957168: 12.88% slower
Significant (t=-107.005045, a=0.95)
Stddev: 0.00782 -> 0.00230: 240.29% smaller


iterative_count:
Min: 0.141754 -> 0.164258: 13.70% slower
Avg: 0.142417 -> 0.165561: 13.98% slower
Significant (t=-92.760901, a=0.95)
Stddev: 0.00057 -> 0.00167: 65.84% larger


pickle:
Min: 1.160424 -> 1.384847: 16.21% slower
Avg: 1.163739 -> 1.389458: 16.25% slower
Significant (t=-503.695596, a=0.95)
Stddev: 0.00163 -> 0.00272: 40.21% larger


regex_compile:
Min: 0.613939 -> 0.713186: 13.92% slower
Avg: 0.615964 -> 0.716309: 14.01% slower
Significant (t=-180.638163, a=0.95)
Stddev: 0.00140 -> 0.00367: 61.71% larger


regex_effbot:
Min: 0.081347 -> 0.083682: 2.79% slower
Avg: 0.082043 -> 0.085620: 4.18% slower
Significant (t=-12.598481, a=0.95)
Stddev: 0.00041 -> 0.00196: 78.99% larger


regex_v8:
Min: 0.087476 -> 0.094551: 7.48% slower
Avg: 0.089029 -> 0.096162: 7.42% slower
Significant (t=-27.669808, a=0.95)
Stddev: 0.00084 -> 0.00162: 48.40% larger


slowpickle:
Min: 0.586805 -> 0.702721: 16.50% slower
Avg: 0.588886 -> 0.706138: 16.60% slower
Significant (t=-191.081180, a=0.95)
Stddev: 0.00286 -> 0.00326: 12.19% larger


slowspitfire:
Min: 0.565925 -> 0.690271: 18.01% slower
Avg: 0.569953 -> 0.740659: 23.05% slower
Significant (t=-38.638473, a=0.95)
Stddev: 0.00745 -> 0.03034: 75.46% larger


slowunpickle:
Min: 0.262492 -> 0.309159: 15.09% slower
Avg: 0.264876 -> 0.311071: 14.85% slower
Significant (t=-136.654688, a=0.95)
Stddev: 0.00097 -> 0.00218: 55.53% larger


threaded_count:
Min: 0.144455 -> 0.172720: 16.36% slower
Avg: 0.163427 -> 0.186739: 12.48% slower
Significant (t=-10.255454, a=0.95)
Stddev: 0.01193 -> 0.01077: 10.79% smaller


unpack_sequence:
Min: 0.000091 -> 0.000096: 5.22% slower
Avg: 0.000093 -> 0.000099: 6.09% slower
Significant (t=-22.503467, a=0.95)
Stddev: 0.00000 -> 0.00006: 93.96% larger


unpickle:
Min: 0.722425 -> 0.783631: 7.81% slower
Avg: 0.730294 -> 0.794011: 8.02% slower
Significant (t=-57.935990, a=0.95)
Stddev: 0.00450 -> 0.00634: 28.94% larger

Collin Winter

Dec 15, 2009, 9:13:50 PM
to Claudio Freire, Unladen Swallow
Hey Claudio,

On Tue, Dec 15, 2009 at 5:03 PM, Claudio Freire <klauss...@gmail.com> wrote:
> Hi there...
>
> Recently on the list someone touched the subject of sharing structured
> readonly data with multiprocessing - ie, not using multiprocessing's shared
> memory model, but rather simply accessing shared read-only structures from
> subprocesses.
[snip]
> So, I started toying with the idea of having indirect reference counts:
> moving the counter itself into packed arrays and having PyObjects hold only
> pointers to those counters. There's an obvious performance hit, so I wanted
> to measure how much of a hit it was, to see if it was worth the benefit of
> improved multiprocessing.
[snip]
> But... I've been lurking here enough to know not to attempt this on my own -
> I know there are two versions of "stuff", one for the CPython-like VM, and
> one for LLVM. So I'm here asking for pointers:
>
> The patch for Python-2.6.1 can be found in this thread:
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/6ac1a2fab33586a3/d3cd7932c7c7339b?lnk=gst
>
> That patch, in short, messes with Python's macros:
>  INCREF
>  DECREF
>  DEALLOC
>  PyObject_HEAD_INIT
>  PyVarObject_HEAD_INIT
>   ... and the like
>
> And of course changes to .c files as required (ie: not every source file
> used Py_REFCNT() to access the reference count - so I had to fix that).
>
> Are there other places I'd have to "sync" in unladen?

Probably not. I'd recommend you try it and see :)

> Will LLVM pick up those changes in the macros?

It should. We reuse the same macros (Py_INCREF, etc) in our
bytecode->LLVM IR compiler, so any modifications you make to those
macros should be picked up automatically. I do not believe we've done
anything outside those macros.

> BTW: here's unladen's perf.py output (it's a somewhat old perf.py, sorry)
> comparing 2.6.1 against the patch. I'm still trying to lower the overhead.
> The benchmarks don't show the BIG improvement in multiprocessing, though -
> the comp.lang.python thread I linked above says a bit about that... I wonder
> if there are enough users of this pattern of sharing data among processes to
> warrant some tests of the sort.
>
> call_simple:
> Min: 1.169127 -> 1.414741: 17.36% slower
> Avg: 1.192160 -> 1.424129: 16.29% slower
> Significant (t=-55.366398, a=0.95)
> Stddev: 0.02848 -> 0.00817: 248.42% smaller
[snip lots of similar benchmarks]

Thanks for quantifying the performance hit; that's valuable data. I'm
worried about that kind of significant performance hit across the
board, though; do you have ideas for reducing that overhead? I think
this is an interesting patch, and if the overhead can be reduced
somewhat, I think we'd accept the hit to improve COW performance.

Have you looked at how other languages implement this? I'd be curious
to see a comparison of techniques with other implementations that make
use of disjoint reference counts.

Regardless of anything else, I'd be interested in adding the
multiprocessing memory benchmark you used to demonstrate the
improvement to Unladen's benchmark suite so we can track
improvements/regressions along this axis. I'm short on spare cycles
right now, but if you could work up a patch along the lines of
http://code.google.com/p/unladen-swallow/source/detail?r=926, I would
be happy to review and commit it to our tree. This is an important
benchmark to have.

Thanks,
Collin Winter

Reid Kleckner

Dec 15, 2009, 9:14:37 PM
to Claudio Freire, Unladen Swallow
On Tue, Dec 15, 2009 at 8:03 PM, Claudio Freire <klauss...@gmail.com> wrote:
[snip]
Direct link to the patch:
http://www.deeplayer.com/claudio/misc/Python-2.6.1-refcount.patch

> That patch, in short, messes with Python's macros:
>  INCREF
>  DECREF
>  DEALLOC
>  PyObject_HEAD_INIT
>  PyVarObject_HEAD_INIT
>   ... and the like

To summarize for people who aren't going to look at the patch:
Py_INCREF(ob) becomes (after cutting down the abstraction)
++*(ob->ob_prefcnt), which adds a single memory load to every incref.
Similar changes for DECREF.

> Are there other places I'd have to "sync" in unladen?
> Will LLVM pick up those changes in the macros?

It will. We've been manipulating the refcounts by calling into
functions in Python/llvm_inline_functions.c, which are marked
always_inline and wrap the underlying macros. They should pick up
your changes.
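
From memory, the wrapper pattern is roughly this (a simplified sketch, and the function name is an assumption, not the file's exact contents):

/* Simplified sketch of the wrapper pattern in
 * Python/llvm_inline_functions.c: always_inline ensures the LLVM IR
 * for the body gets inlined into JITted code, carrying along whatever
 * Py_INCREF currently expands to. */
static inline __attribute__((always_inline)) void
_PyLlvm_WrapIncref(PyObject *obj)
{
    Py_INCREF(obj);
}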

> BTW: here's unladen's perf.py output (it's a somewhat old perf.py, sorry)
> comparing 2.6.1 against the patch. I'm still trying to lower the overhead.
> The benchmarks don't show the BIG improvement in multiprocessing, though -
> the comp.lang.python thread I linked above says a bit about that... I wonder
> if there are enough users of this pattern of sharing data among processes to
> warrant some tests of the sort.
[snip]

Those are interesting numbers, because they give you an idea of how
much overhead adding a single memory load to every refcount operation
imposes, which is about 16%.

We've recently been discussing the idea of what it would take to move
Python to a more pure GC system to make GIL removal easier, although
we've agreed that any work done in that area should be based on py3k,
not unladen. At first it seems like this would solve your problem
too, but there are a few problems.

First, GCs usually keep reachable/unreachable bits in the object
header. However, that can be solved the same way you've solved this
one: we would pay the penalty for the memory access when doing a
collection, instead of every time the object is passed around.

Second, we would still need to support refcounting to allow C
extension modules to hold references to Python objects. To keep the
objects read-only, you would have to use the same trick that you're
describing.

IMO I'd rather go the extra mile of removing the GIL, so that you can
get parallel performance with threads in a single process rather than
doing this complicated dance to keep read-only objects really
read-only.

Reid

Reid Kleckner

Dec 15, 2009, 9:22:59 PM
to Collin Winter, Claudio Freire, Unladen Swallow
> Have you looked at how other languages implement this? I'd be curious
> to see a comparison of techniques with other implementations that make
> use of disjoint reference counts.

I think most VMs don't support forking. Most of them don't have
something like the multiprocessing module because they have real
threads to achieve parallel performance. It's still an interesting
idea, though, because preforking to achieve shared read-only memory is
a nice design pattern.

Reid

Collin Winter

Dec 16, 2009, 4:59:25 PM
to Claudio Freire, Unladen Swallow, Reid Kleckner
[+the list]

On Wed, Dec 16, 2009 at 10:16 AM, Claudio Freire <klauss...@gmail.com> wrote:
>
>
> On Wed, Dec 16, 2009 at 2:30 PM, Collin Winter <collin...@google.com>
> wrote:
>>
>> Cool. I meant to mention this earlier: the developers of Go
>> (http://golang.org/) are planning to move their refcounts out of the
>> objects, too. It would be worth syncing up with them, seeing exactly
>> what their plans are. That's a group that knows a thing or two about
>> ludicrously low-level optimizations.
>
> I couldn't find any details as to their plans and considered solutions
> though... just a thread in their discussion list mentioning the problem.

Yeah, I think you'd have to mail them.

One other thing: it would be very, very interesting to try your patch
in combination with http://code.google.com/p/python-safethread/.
Safethread sped up Python on two cores but fell over at four cores;
the author's hypothesis was that this was due to cache contention over
refcounts. It would be interesting to study what disjoint reference
counts could do for safethread; that could be a very big deal for
Python if it works.

Thanks,
Collin Winter

Dave Malcolm

Dec 20, 2009, 9:35:27 PM
to Claudio Freire, Unladen Swallow
2009/12/15 Claudio Freire <klauss...@gmail.com>:
[snip the performance notes]

I've had similar thoughts, though I don't have a patch.

Another idea I had was pointer compression: if you're moving the
refcounts out of the objects into some kind of lookaside, how about
moving the object pointers themselves into the lookaside too, and
storing smaller object IDs rather than PyObject* throughout the
objects (the benefit being to avoid doubling heap usage on 64bit
archs due to sizeof(PyObject*) doubling). See
http://dmalcolm.livejournal.com/4183.html?thread=11863#t11863 for some
ideas on how this might look.

(please forgive the shameless plug of my blog post)
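
In rough outline, it might look something like this (all names hypothetical):

/* Rough outline of pointer compression (all names hypothetical):
 * references become 32-bit IDs indexing a lookaside table, so each
 * stored reference costs 4 bytes instead of 8 on 64-bit builds, and
 * the refcount lives out-of-line as well. */
#include <stdint.h>

typedef uint32_t PyObjId;

typedef struct {
    PyObject  *ptr;      /* the object's real address */
    Py_ssize_t refcnt;   /* its reference count */
} PyObjSlot;

extern PyObjSlot _Py_obj_table[];  /* grown and compacted elsewhere */

#define PYOBJ_PTR(id)     (_Py_obj_table[(id)].ptr)
#define PYOBJ_INCREF(id)  (++_Py_obj_table[(id)].refcnt)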

Claudio Freire

Dec 21, 2009, 8:33:25 AM
to Unladen Swallow


On Sun, Dec 20, 2009 at 11:35 PM, Dave Malcolm <3583by...@gmail.com> wrote:
> Another idea I had was pointer compression: if you're moving the
> refcounts out of the objects into some kind of lookaside, how about
> moving the object pointers themselves into the lookaside too, and
> storing smaller object IDs rather than PyObject* throughout the
> objects (the benefit being to avoid doubling heap usage on 64bit
> archs due to sizeof(PyObject*) doubling). See
> http://dmalcolm.livejournal.com/4183.html?thread=11863#t11863 for some
> ideas on how this might look.

Well, wouldn't that break source compatibility with extensions a bit too much?

The refcount slots aren't as widely used as the pointers themselves... and the refcount is accessed through a macro, while pointer dereferencing isn't.

I bet moving pointers to a lookaside table would be mandatory in order to implement a moving garbage collector though, so it will have to be addressed sometime if unladen wants to pursue that kind of GC (which seems likely in the long term). But still, it's a major change.


Adam Olsen

Dec 22, 2009, 3:40:25 PM
to Unladen Swallow
On Dec 16, 2:59 pm, Collin Winter <collinwin...@google.com> wrote:
> One other thing: it would be very, very interesting to try your patch
> in combination with http://code.google.com/p/python-safethread/.

> Safethread sped up Python on two cores but fell over at four cores;
> the author's hypothesis was that this was due to cache contention over
> refcounts. It would be interesting to study what disjoint reference
> counts could do for safethread; that could be a very big deal for
> Python if it works.

To clarify, cache contention prevents linear speedup, ie getting the
same amount of throughput on each additional thread/core.

I've been thinking about doing a separate refcount array, and this
thread finally motivated me to do it. First though, I did a little
poking with my existing hash table, and I got closer to linear results
than I remember. Maybe I never tested the options as thoroughly as I
should have when I switched to quad core, or maybe something else
changed (like the kernel or glibc). Dunno. Anyway, doing a separate
refcount array added about 13% throughput over what I was getting,
which is decent enough.

Python2.6: 57000 pystones
Python3.0: 55000 pystones
safethread normal pystone: 29000
safethread threaded pystone 1: 28000 # Test is in a child thread
safethread threaded pystone 2: 55000 # Test is in 2 child threads
safethread threaded pystone 3: 80000 # etc
safethread threaded pystone 4: 102000

Heh, more like you'd expect than a few days ago (the 3rd thread was
doing better than the 2nd!). Might be worth it on 16 cores with a lot
of work in C libraries (fixed overhead is better than a bottleneck).

However, without a clear transition plan towards a more flexible VM I
don't think it's worth patching CPython like this. Jython, for
example, would be a lot better, and I doubt it'd be that hard to build
a partial CPython API bridge...

Alex Gaynor

Dec 23, 2009, 3:09:43 PM
to Adam Olsen, Unladen Swallow

So it looks like (and correct me if I'm wrong): using a separate
refcount array and safethread (which implements the Recycler
algorithm), we get linear speedup with additional threads, but about
50% of the speed per thread. Are there any obvious points for
performance improvements within the patch itself (obviously the rest
of the VM has plenty of them ;))? Also, is this patch available
anywhere for others to look at?

Alex

--
"I disapprove of what you say, but I will defend to the death your
right to say it." -- Voltaire
"The people's good is the highest law." -- Cicero
"Code can always be simpler than you think, but never as simple as you
want" -- Me

Adam Olsen

Dec 23, 2009, 3:37:28 PM
to Alex Gaynor, Unladen Swallow
On Wed, Dec 23, 2009 at 13:09, Alex Gaynor <alex....@gmail.com> wrote:
> So it looks like (and correct me if I'm wrong): using a separate
> refcount array and safethread (which implements the Recycler
> algorithm), we get linear speedup with additional threads, but about
> 50% of the speed per thread.

Yup, except it's closer to 45%, and that's with a separate array
rather than a hash table. A hash table is closer to 40%.


> Are there any obvious points for
> performance improvements within the patch itself (obviously the rest
> of the VM has plenty of them ;))?  Also, is this patch available
> anywhere for others to look at?

I don't believe there's much more to be eked out. Refcounting is
simply a lot of work done far too often.

It should be available via https://launchpad.net/python-safethread . I
haven't yet pushed an update with the separate refcounting array. Note
that it's not a simple patch, but rather a huge fork, as the bigger
goal was language extensions to make threading safe.


--
Adam Olsen, aka Rhamphoryncus

Valery

Dec 26, 2009, 6:45:45 AM
to Unladen Swallow
Hi Claudio,

On Dec 16, 2:03 am, Claudio Freire <klaussfre...@gmail.com> wrote:
> Recently on the list someone touched the subject of sharing structured
> readonly data with multiprocessing - ie, not using multiprocessing's shared
> memory model, but rather simply accessing shared read-only structures from
> subprocesses.

[...]

Yep, I'm the one who touched the subject, and I can confirm that the
ugly packing of complex stuff into array.array isn't pleasant at all,
although it helps.

Your patch is aiming at a great goal. What I don't understand, though,
is how you could fail to see the impact of your great patch on
multiprocessing...

*** in CPython: any top-level parent process that allocates objects
taking >50% RAM means NO-GO for child processes against these objects
***

Here is the test for the manual benchmarking again:

#################################
import time
from multiprocessing import Pool

MAX_CHILDREN = 2

def f(sec_delay):
    '''
    1. Children are all started together.
    2. However, they start their read-access with a 1 sec delay one
       after another.
    3. Let's ensure that a time point exists when all children are
       keeping the refcounterS high.
    '''
    time.sleep(sec_delay)
    res = sum(huge_data)
    time.sleep(MAX_CHILDREN - sec_delay + 2)
    res += sum(huge_data)  # read-access again
    return res

if __name__ == '__main__':
    # my "huge" read-only data, about 1 Gb
    huge_data = [1./i for i in xrange(1, 40000000)]
    p = Pool(MAX_CHILDREN)
    res = list(p.map(f, xrange(1, MAX_CHILDREN+1)))
    print res
#################################

Such a test will eat up (1+MAX_CHILDREN) Gbytes of RAM on CPython. How
large is your RAM?

Of course, such a test will freeze your system during execution on the
"control implementation" (CPython), but not on your nicely patched one.

Got a huge amount of RAM on your box? Feel free to increase
MAX_CHILDREN and/or the size of huge_data :)

regards,
Valery

garyrob

Dec 28, 2009, 4:08:09 PM
to Unladen Swallow
If it's of any use to hear this, I'll say that I'd love to see this
patch in u-s. In FlyFi's recommendation processing, we're always
dealing with very large data structures which end up being duplicated
between processes (which are controlled with the multiprocessing
module). Sometimes we've had to use binary data and ctypes... it's a
pain. This patch would be a very significant help to us.

A 15% or so performance hit would be a lot, but would be worth it in
cases where there just isn't enough memory to duplicate the data
structures. Better still, of course, if u-s could use its magic to
greatly reduce that overhead. Or if this mode of operation could be
easily turned off and on (for instance with a command-line option when
python is invoked).

I understand that you probably couldn't incorporate this into mainline
u-s if you can't substantially reduce the performance hit. But maybe
it could at least be a standard compile-time option.

I agree that getting rid of the GIL would be great, but as a practical
matter, and from the point of view of the kind of code I write, this
patch would make it so that I wouldn't care much anymore about the
GIL. The multiprocessing module lets me do what I need, and using
separate processes is arguably less error-prone than using threads.
The only real problem I have with using the multiprocessing module is
very large, read-only structures that I would like to share between
the processes.

Anyway, those are my personal $.02 on this issue!

Gary

Adam Olsen

Dec 29, 2009, 2:50:21 AM
to Unladen Swallow
On Dec 28, 2:08 pm, garyrob <garyr...@gmail.com> wrote:
> If it's of any use to hear this, I'll say that I'd love to see this
> patch in u-s. In FlyFi's recommendation processing, we're always
> dealing with very large data structures which end up being duplicated
> between processes (which are controlled with the multiprocessing
> module). Sometimes we've had to use binary data and ctypes... it's a
> pain. This patch would be a very significant help to us.
>
> A 15% or so performance hit would be a lot, but would be worth it in
> cases where there just isn't enough memory to duplicate the data
> structures. Better still, of course, if u-s could use its magic to
> greatly reduce that overhead. Or if this mode of operation could be
> easily turned off and on (for instance with a command-line option when
> python is invoked).
>
> I understand that you probably couldn't incorporate this into mainline
> u-s if you can't substantially reduce the performance hit. But maybe
> it could at least be a standard compile-time option.

You can't eliminate the performance hit (or reduce it enough to be
acceptable, IMO) without resorting to a compile-time option. However,
that doesn't need to go upstream. Just use the patch, report to its
author on how well it works, etc. If it becomes mature and develops a
significant user base, then you can look at making it an upstream
compile-time option.

Reid Kleckner

Dec 29, 2009, 11:48:18 AM
to Adam Olsen, Unladen Swallow
On Mon, Dec 28, 2009 at 11:50 PM, Adam Olsen <rha...@gmail.com> wrote:
> You can't eliminate the performance hit (or reduce it enough to be
> acceptable, IMO) without resorting to a compile-time option. However,
> that doesn't need to go upstream. Just use the patch, report to its
> author on how well it works, etc. If it becomes mature and develops a
> significant user base, then you can look at making it an upstream
> compile-time option.

Right, it would be crazy to make the refcount behavior a command-line
option. Then every inc/decref would need to do a global variable
load, which would kill performance.
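
Concretely, every refcount operation would turn into something like this (a sketch; the flag and field names are hypothetical):

/* Sketch of what a runtime switch forces on every operation: a global
 * load plus a branch, and the object would have to carry both a direct
 * count and a slot pointer.  (Flag and field names are hypothetical.) */
extern int _Py_indirect_refcnt;    /* set once from the command line */

#define Py_INCREF(ob)                                   \
    do {                                                \
        if (_Py_indirect_refcnt)                        \
            ++*(((PyObject *)(ob))->ob_prefcnt);        \
        else                                            \
            ++((PyObject *)(ob))->ob_refcnt;            \
    } while (0)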

However, I think it would really be worthwhile to take a look at the
safethread stuff and try to shave down the work done for every incref
and decref. I think there could be more savings there.

Right now every refcount operation does two memory operations: loading
the refcount offset and storing the modified value.

If we use the recycler/safethread approach with separate refcount
data, that work is offloaded to the collector thread. It isn't the
same as free, but it's almost as good these days.

The main threads should be doing about three memory ops: loading the
queue head, storing the pointer to be refcounted, and storing the
incremented head. That sounds like more work, but those two memory
areas should be hot in the cache from other refcount operations.
Maybe this is how safethread works already, I haven't dove into it
yet.
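
Roughly, the fast path I have in mind looks like this (made-up names; a sketch, not safethread's actual code):

/* Buffered refcounting sketch (made-up names): the thread appends the
 * object pointer to a per-thread log that the collector drains.  A
 * real version would tag the pointer's low bit to distinguish increfs
 * from decrefs. */
typedef struct {
    PyObject **head;    /* next free slot in this thread's log */
    PyObject **limit;   /* one past the end of the log buffer */
} RefLog;

static __thread RefLog ref_log;     /* one log per thread */
void _PyRefLog_Flush(RefLog *log);  /* hands a full buffer to the collector */

#define Py_INCREF(ob)                                   \
    do {                                                \
        if (ref_log.head == ref_log.limit)              \
            _PyRefLog_Flush(&ref_log);                  \
        *ref_log.head++ = (PyObject *)(ob);             \
    } while (0)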

Using the safethread approach with separate refcounts would be good
for both GIL removal and multiprocessing performance, so I think
someone should take a serious look at it. To start, I'd look at
safethread and try to shave the Py_*REF macros down to as few
operations as possible.

One trick would be to find a way to access thread-local values for the
queue head as efficiently as possible. I don't think the naive
approach of putting them on the current PyThreadState would be fast
enough, but it might work if you have a good enough alias analysis so
the tstate is loaded once per C function call. In the JITed code, for
example, we could just use the tstate_ local variable.

Reid

Adam Olsen

Dec 29, 2009, 3:45:26 PM
to Unladen Swallow
On Dec 29, 9:48 am, Reid Kleckner <r...@mit.edu> wrote:
> However, I think it would really be worthwhile to take a look at the
> safethread stuff and try to shave down the work done for every incref
> and decref.  I think there could be more savings there.
>
> Right now every refcount operation does two memory operations: loading
> the refcount offset and storing the modified value.
>
> If we use the recycler/safethread approach with separate refcount
> data, that work is offloaded to the collector thread.  It isn't the
> same as free, but it's almost as good these days.

The current safethread approach is actually a fair bit more
complicated than that. Each object also has an "owner" field, which
can either be unowned, a specific thread, or asynchronous (shared).
If it's owned by a specific thread, I use normal refcounting directly
in the object, while asynchronous objects use a per-thread hash table
(or per-thread array). It's more costly, but it means under many
conditions I never
have to stop all the threads to run the tracing GC.
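
Simplified (with the names munged - this is a sketch, not the actual source), the incref path looks something like:

/* Sketch of the ownership check: objects owned by the current thread
 * get plain refcounting, shared ("asynchronous") objects fall back to
 * a per-thread table. */
void _Py_SharedIncref(PyObject *ob);   /* per-thread hash table / array */

#define Py_INCREF(ob)                                        \
    do {                                                     \
        PyObject *_o = (PyObject *)(ob);                     \
        if (_o->ob_owner == PyThreadState_GET())             \
            ++_o->ob_refcnt;          /* owned: direct */    \
        else                                                 \
            _Py_SharedIncref(_o);     /* shared: slow path */\
    } while (0)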

Alex Gaynor

Dec 29, 2009, 3:53:40 PM
to Adam Olsen, Unladen Swallow

Hrm, I'm wondering: if an object is shared/owned by another thread,
what are the chances you get a mis-predicted branch and blow your
instruction cache? It seems to me things like that are likely to
cause a performance penalty, and it's clear that the speed of the
INCREF and DECREF operations is paramount. Therefore I'm thinking it
would be a prudent experiment to try the recycler approach Reid
suggests, since its INCREF and DECREF are less expensive.

Valery

Dec 30, 2009, 4:55:22 AM
to Unladen Swallow
Hi Reid,

On Dec 29, 5:48 pm, Reid Kleckner <r...@mit.edu> wrote:
> Right, it would be crazy to make the refcount behavior a command line
> option.  Then every inc/decref would need to do a global variable
> load, which would kill performance.

I don't believe that Gary meant some kind of global var to be checked
at run-time. Rather, he meant a flag that you could pass to the
"configure" script - a flag that would influence the macros and/or
structure definitions at compile time of the u-s compiler
itself. This way, there's nothing to check at run-time.
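
I.e. something along these lines (the flag name is just an example):

/* Compile-time selection (flag name is just an example): configure
 * defines the macro when building the interpreter, so nothing is
 * checked at run-time. */
#ifdef Py_SHARED_REFCNT
#  define Py_INCREF(ob) ((void)(++*(((PyObject *)(ob))->ob_prefcnt)))
#else
#  define Py_INCREF(ob) ((void)(((PyObject *)(ob))->ob_refcnt++))
#endif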

P.S. I would be a happy user who doesn't care about a 15% performance
loss. Such a patch is a great *enabler*.

Regards
Valery

garyrob

Dec 31, 2009, 3:57:54 PM
to Unladen Swallow
> I don't believe that Gary meant some kind of global var to be checked
> during the run-time. Rather he meant a flag that you could pass to the
> "configure" script, a flag that will influence the macros and/or
> structure definitions during compilation time of the u-s compiler
> itself. This way, nothing to check in run-time.

I'm not sure what you're saying there when you mention the "configure
script," but if you mean a setting that could be given on the command
line when invoking u-s so that its JIT uses the separate refcount
array approach, then that's the kind of thing I was hoping might be
possible.

I think this would need to invoke "-j always" as well, so that
everything was done by the JIT compiler right from the start. Not sure
if this idea makes any practical sense, but if it happens that it
does, and was implemented in u-s, that would work very well for my
needs.

The reason it would be fine for me for this to require "-j always" is
the obvious one: if I have a job with a huge amount of memory shared
between multiple processes, clearly the compile time is not a big
deal.

Valery

Jan 4, 2010, 4:55:44 AM
to Unladen Swallow

On 31 Dez. 2009, 21:57, garyrob <garyr...@gmail.com> wrote:
> I'm not sure what you're saying there when you mention the "configure
> script," but if you mean a setting that could be given on the command
> line when invoking u-s so that its JIT uses the separate refcount
> array approach, then that's the kind of thing I was hoping might be
> possible.

nope :)

I did mean the "configure" script. It sits in the root of the
unladen-swallow source directory and is used to configure the
Makefile. In the end, it configures how the u-s compiler executable
should be built.
http://code.google.com/p/unladen-swallow/wiki/GettingStarted

regards
Valery

garyrob

Jan 5, 2010, 9:16:58 AM
to Unladen Swallow
> I did mean the "configure" script. It is situated in the root of
> unladen-swallow source directory and is used to configure the
> Makefile.

I had someone else do the install for me (in fact, I've never built
Python). That saves a little bit of time, but the disadvantage is
obviously that I am not familiar enough with certain basics!

But yes, it would be great if this were an option that could be used to
configure the Makefile if, as appears to be the case, a runtime
command-line option isn't feasible.

Collin Winter

Jan 5, 2010, 3:06:24 PM
to garyrob, Unladen Swallow, Claudio Freire

I'd be fine with including this as a configure-time option to switch
between the two modes.

Claudio, have you had a chance to experiment with applying your patch
to Unladen?

Collin
