Huge shared read-only data accessed in parallel (multithreading/multiprocessing)

Valery

Dec 8, 2009, 7:59:49 AM
to Unladen Swallow
Hi gurus,

Q: Will Unladen Swallow Q4+ be effective for parallel access to
read-only memory?

Details:

I often run parallel computations via a parallelized map(), where
some_side_effect_free() function is called from that map().
Such a side-effect-free function often reads a huge read-only
master data set.
Something like:

huge_global_data = [...]
traverse_results = map(my_side_effect_free_data_traverser,
                       traverse_arguments)

As far as I understand, CPython can't use multi-threading
effectively because of the GIL.
OTOH, the multiprocessing module does straightforward data copying,
which is not efficient either.

Finally, as far as I can see, Unladen Swallow Q3 has the same performance
with multiprocessing as CPython for this use case.

Here is an easy benchmark example:

######################
import time
from multiprocessing import Pool

def f(_):
    time.sleep(5)  # just to emulate the time used by my computation
    res = sum(parent_x)  # my sophisticated formula goes here
    return res

if __name__ == '__main__':
    parent_x = [1./i for i in xrange(1, 10000000)]  # my huge read-only data
    p = Pool(7)
    res = list(p.map(f, xrange(10)))
    print res
######################

Of course, I'd like to avoid any IPC...

Comments are very welcome,

kind regards,
Valery

Collin Winter

Dec 8, 2009, 1:15:54 PM
to Valery, Unladen Swallow
Hi Valery,

On Tue, Dec 8, 2009 at 4:59 AM, Valery <kham...@gmail.com> wrote:
> Hi gurus,
>
>  Q: Will Unladen Swallow Q4+ be effective for parallel accesses of
> read-only memory?
>
> Details:
>
> I often do the parallel computations via parallelized map(), where
> some_side_effect_free() function called from this map().
> Also often such a some_side_effect_free function uses a huge read-only
> master data.
> Something like:
>
>  huge_global_data = [...]
>  traverse_results = map(my_side_effect_free_data_traverser,
> traverse_arguments)
>
> To my poor understanding, CPython can't use multi-threading
> effectively, because of GIL.
> OTOH, the multiprocessing module does a straight-forward data copying
> -- also not effective way.

That's correct. We don't have any plans to improve this situation
during Q4. Because of the sensitive nature of GIL-related work, we'd
like to do that in mainline CPython (rather than Unladen Swallow) to
avoid introducing subtle bugs during merger.
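
A minimal sketch of the GIL effect confirmed above, assuming Python 3 on a standard (GIL-enabled) CPython build. The names count_down, run_sequential and run_threaded are made up for illustration. On such a build the threaded run is usually no faster than the sequential one, because only one thread executes Python bytecode at a time; exact timings vary by machine.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop; it holds the GIL while it runs."""
    while n > 0:
        n -= 1
    return n


def run_sequential(n, repeats):
    """Run the loop `repeats` times in the current thread; return seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, repeats):
    """Run the loop in `repeats` threads at once; return seconds."""
    threads = [threading.Thread(target=count_down, args=(n,))
               for _ in range(repeats)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2_000_000
    print("sequential: %.2fs" % run_sequential(n, 2))
    print("threaded:   %.2fs" % run_threaded(n, 2))
```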

Thanks,
Collin Winter

Valery Khamenya

Dec 8, 2009, 5:56:26 PM
to Collin Winter, Unladen Swallow
Hi Collin, 
 
> That's correct. We don't have any plans to improve this situation
> during Q4. Because of the sensitive nature of GIL-related work, we'd
> like to do that in mainline CPython (rather than Unladen Swallow) to
> avoid introducing subtle bugs during merger.

So 2.6-2.7 will certainly not address this.
That's a pity.

Sadder still is something like this:
http://www.grouplens.org/node/244

After all, Python itself has no evident need for such a counter-productive thing. It is a matter of implementation, not of the language specification. If the reference Python implementation is in some way too conservative, that's understandable. But then, I guess, some more advanced implementation should address such challenges.

Who, if not Unladen Swallow?
(The last comment at the above link was "Solution! Erlang! Erlang! Erlang!"...)

kind regards, 
Valery.

Alex Gaynor

Dec 8, 2009, 6:13:06 PM
to Valery Khamenya, Collin Winter, Unladen Swallow
IMO, that blog post is exceptionally trollish. The position of Guido
(and of every core committer, AFAIK) has always been that if there's a
solution to the GIL that doesn't compromise single-threaded performance
or make writing C extensions overly burdensome, then it would be
accepted. The simple fact is that this is a really hard task.

Alex

--
"I disapprove of what you say, but I will defend to the death your
right to say it." -- Voltaire
"The people's good is the highest law." -- Cicero
"Code can always be simpler than you think, but never as simple as you
want" -- Me

Collin Winter

Dec 8, 2009, 6:38:30 PM
to Valery Khamenya, Unladen Swallow
Hi Valery,

On Tue, Dec 8, 2009 at 2:56 PM, Valery Khamenya <kham...@gmail.com> wrote:
> Hi Collin,
>
>>
>> That's correct. We don't have any plans to improve this situation
>> during Q4. Because of the sensitive nature of GIL-related work, we'd
>> like to do that in mainline CPython (rather than Unladen Swallow) to
>> avoid introducing subtle bugs during merger.
>
> 2.6-2.7 will not address this for sure.
> It's a pity.

Correct. Our work would be done in the 3.x branch.

> More sadly is something like this:
> http://www.grouplens.org/node/244
> Finally, the Python itself has no evident requirement in such a
> counter-productive thing. It is about implementation, not about languaguage
> specification. If the reference Python's implementation is in some way too
> conservative -- that's understood. But then, I guess, some advanced
> implementation should address such challenges.
> Who if not Unladen Swallow?..

Say what you will about the GIL (and I've plenty to say), but it
*does* simplify CPython's internals, and that's no small thing when
CPython is maintained by an all-volunteer workforce.

Removing the GIL also seriously hurts the performance of individual
threads. One of the goals of Unladen Swallow is to increase the
performance of individual threads to the point where GIL removal might
be possible without a net loss of performance.

Unladen Swallow is constrained by a) the existing CPython codebase,
and b) not-unlimited engineering time. The benefits of removing the
GIL are long-term, in that it chiefly enables applications that have
not yet been written (or rewritten). Jython has no GIL, nor does
IronPython IIRC. If you need GIL-less Python *today*, I'd recommend
you use one of those implementations.

Thanks,
Collin Winter

Valery Khamenya

Dec 8, 2009, 6:58:16 PM
to Collin Winter, Unladen Swallow
Hi Collin,
 
> Unladen Swallow is constrained by a) the existing CPython codebase,
> and b) not-unlimited engineering time. The benefits of removing the
> GIL are long-term, in that it chiefly enables applications that have
> not yet been written (or rewritten). Jython has no GIL, nor does
> IronPython IIRC. If you need GIL-less Python *today*, I'd recommend
> you use one of those implementations.

Neither of these implementations is easy to adopt:
 - no multiprocessing module
 - distutils does not work straightforwardly (with all the consequences)

OK, anyway, I see the point and don't want to take up any more of your time.

Kind regards,
Valery.

Claudio Freire

Dec 9, 2009, 9:21:33 AM
to Unladen Swallow
On Tue, Dec 8, 2009 at 9:59 AM, Valery <kham...@gmail.com> wrote:
> Hi gurus,
>
>  Q: Will Unladen Swallow Q4+ be effective for parallel accesses of
> read-only memory?
>
> Details:
>
> <snip>
>
> To my poor understanding, CPython can't use multi-threading
> effectively, because of GIL.
> OTOH, the multiprocessing module does a straight-forward data copying
> -- also not effective way.

In your example:
 
import time
from multiprocessing import Pool

def f(_):
       time.sleep(5) # just to emulate the time used by my computation
       res = sum(parent_x) # my sofisticated formula goes here
       return res

if __name__ == '__main__':
       parent_x = [1./i for i in xrange(1,10000000)]# my huge read-only data
       p = Pool(7)
       res= list(p.map(f, xrange(10)))
       print res

Multiprocessing, in this case, does not copy anything other than the parameters sent to map(), i.e., the numbers 0-9.
That's very low overhead.

The problem, though, is that multiprocessing on *nix will fork(), and that mirrors the parent's process image into each child and flags the pages read-only in order to perform copy-on-write.
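
A minimal sketch of that sharing, assuming Python 3 on a Unix system where the "fork" start method is available. The helper names _child and sum_in_forked_child are hypothetical. The list is never passed to the child; it is simply present in the forked memory image, and only the one-float result travels back over the pipe.

```python
import multiprocessing as mp

# Stand-in for the huge read-only data; built once in the parent.
parent_x = [1.0 / i for i in range(1, 100000)]


def _child(conn):
    # parent_x is not an argument: the forked child inherits the
    # parent's memory image, so the list is already there.
    conn.send(sum(parent_x))
    conn.close()


def sum_in_forked_child():
    """Sum parent_x inside a forked child and return the result."""
    ctx = mp.get_context("fork")  # Unix only; raises ValueError elsewhere
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(child_conn,))
    proc.start()
    child_conn.close()  # the parent keeps only the receiving end
    result = parent_conn.recv()  # only this one float crosses the pipe
    proc.join()
    return result


if __name__ == "__main__":
    print(sum_in_forked_child() == sum(parent_x))
```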

So initially the parent_x structure *is* shared, but as Python handles it (and modifies the refcount on each object), pages get copied and copied and... well, there goes performance.
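
A small sketch of why merely using an object counts as a write, assuming Python 3 on a standard CPython build. The helper refcount_delta is a made-up name. Keeping one more reference to an element changes the count stored in that element's own header, and in a forked child that write is what forces the page holding the object to be copied.

```python
import sys

# Stand-in for the shared data structure.
parent_x = [1.0 / i for i in range(1, 1000)]


def refcount_delta(obj):
    """Return how much obj's refcount changes when one more reference is kept."""
    before = sys.getrefcount(obj)
    holder = [obj]  # a new reference: CPython updates the count inside obj
    after = sys.getrefcount(obj)
    del holder  # dropping the reference writes to obj's header again
    return after - before


if __name__ == "__main__":
    print(refcount_delta(parent_x[10]))
```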

I've been toying with a patch (mainline 2.6, not u-s) to solve that: move the refcounts to a packed pool, so that only the refcount pool gets copied, and the bulk of the structure remains shared.

I haven't finished the patch, nor have I done any benchmarks on it, but I thought I'd mention it.
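
The patch mentioned above is unfinished and not shown, so the following is not that patch. It is a sketch of a different workaround, assuming Python 3 and purely numeric data: pack the values into an array.array, so the bulk of the data is one raw buffer with no per-element Python objects and therefore no per-element refcounts. Reading an element creates a temporary float instead of writing into the stored data. The names as_list, as_array and payload_bytes are illustrative only.

```python
import sys
from array import array

# The same values stored two ways.
as_list = [1.0 / i for i in range(1, 100000)]  # one float object per element
as_array = array("d", as_list)  # one object; doubles live in a raw buffer


def payload_bytes(packed):
    """Size in bytes of the raw buffer holding the packed values."""
    return len(packed) * packed.itemsize


if __name__ == "__main__":
    print("list element object size:", sys.getsizeof(as_list[0]), "bytes each")
    print("array payload:", payload_bytes(as_array), "bytes total")
    print(sum(as_array) == sum(as_list))
```

This only helps when the data can be expressed as flat numeric arrays; nested Python structures still carry a refcount per object.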



Valery Khamenya

Dec 9, 2009, 10:14:55 AM
to Claudio Freire, Unladen Swallow
Hi Claudio, 
 
> Multiprocessing, in this case, does not copy anything other than the parameters sent to map(), ie, the numbers 0-9.
> That's very low overhead.
>
> The problem though is that multiprocessing in *nix will fork(), and that will mirror each process' image and flag them read-only, to perform copy-on-write.
>
> So initially the parent_x structure *is* shared, but as python handles it (and modifies the refcounts on each object), pages get copied and copied and... well there goes performance.

That's the first explanation that sounds plausible and clear to me, thanks!


> I've been toying with a patch (mainline 2.6, not u-s) to solve that: move the refcounts to a packed pool, so that only the refcount pool gets copied, and the bulk of the structure remains shared.
> I haven't finished the patch nor have I done any benchmarks on it, but I thought I'd mention.
 
Well, it sounds interesting -- especially if such a patch could reach a reliable level :o)

Since this is really an upstream (CPython) discussion, I've started a conversation in comp.lang.python:

Shall we move over there?

Anyway, I need a solution. Today I have a data structure that takes up more than 50% of RAM!

Kind regards, 
Valery
