On Mon, Aug 4, 2014 at 8:48 AM, Christian Stump
<
christi...@gmail.com> wrote:
> Thanks, William!
>
>> It absolutely will use two additional *processes*, as you might see by
>> watching with htop, top, or using ps.
>
> Is it right that the master process is creating all the subprocesses?
> I'd suspect I don't quite see the other processes in action simply
> because they are there only for milliseconds...
Yes, exactly.
I forgot to mention another approach, which is what multiprocessing
(a Python module) does: create n subprocesses once and keep feeding
them data, so they stay running. This might make more sense in your
situation. However, it has very significant drawbacks in some
situations as well.
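A minimal sketch of that pool-of-workers approach (the `square` worker is a hypothetical stand-in for your real per-input computation):

```python
# Instead of forking one subprocess per input, create a fixed pool of
# worker processes and keep feeding them inputs; the workers stay alive.
from multiprocessing import Pool

def square(n):
    # stand-in for your real per-input computation
    return n * n

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # imap_unordered yields results as workers finish, in any order
        results = sorted(pool.imap_unordered(square, range(10)))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```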
>
>> Yes-ish. Just to be clear there is one single fork that happens,
>> which means that
>> (almost) all state of the process is inherited by the subprocess.
>
> Yes, but every subprocess modified the data structure *within the
> subprocess*, the object in the main process is not modified, or am I
> missing something? (That's at least how I understand the
> documentation, and what I see happening in my computation output.)
That's exactly right and how it is documented to behave, and must behave
(unless one uses shared memory, which @parallel doesn't use).
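You can see the documented behavior directly with a bare os.fork (a small sketch, POSIX-only):

```python
# After fork(), the child gets its own copy of the parent's memory, so
# mutations made in the child are never visible in the parent (unless
# explicit shared memory is used).
import os

data = [1, 2, 3]
pid = os.fork()
if pid == 0:
    # child process: mutates its *own copy*, then exits immediately
    data.append(99)
    os._exit(0)
else:
    os.waitpid(pid, 0)   # wait for the child to finish
    print(data)          # [1, 2, 3] -- the parent's list is unchanged
```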
>
>> That fakes @parallel -- providing the same API -- but actually running
>> everything in serial in the parent process. No forking or anything
>> else happens. It's for testing and development purposes.
>
> I see.
>
>> Break up your computation into far less than 20,000 separate steps, then use @parallel.
>
> Okay, I'll do that.
>
> But I still don't see how I should handle the side effects that are
> supposed to effect objects in the main process.
You didn't ask about that explicitly before. It's impossible to do
that *implicitly* with arbitrary Python data structures: each forked
subprocess works on its own copy of memory, and the global interpreter
lock (GIL) rules out doing this with threads instead. To solve this
problem using @parallel (or multiprocessing) you have to work harder:
pass data back from your function, then insert that data into the data
structure in the parent. E.g., in my example below, with factor,
that's what I do.
With multiprocessing there is some limited support for shared memory,
but only for certain specific data types, e.g., a fixed array of C
ints.
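For completeness, here's a small sketch of that limited shared-memory support in multiprocessing (a shared array of C ints; the `fill` worker is just an illustration):

```python
# multiprocessing.Array allocates a fixed-size array of a C type in
# shared memory, visible to both the parent and child processes.
from multiprocessing import Process, Array

def fill(arr):
    # runs in the child; writes ARE visible to the parent
    for i in range(len(arr)):
        arr[i] = i * i

if __name__ == '__main__':
    shared = Array('i', 5)   # 'i' = C int; starts zeroed
    p = Process(target=fill, args=(shared,))
    p.start()
    p.join()
    print(list(shared))      # [0, 1, 4, 9, 16]
```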
>
> Or are you suggesting that I should (actually also for clarity of the
> code) completely remove all side effects, do the computations in
> parallel, but instead of the side effects return the stuff I need and
> then do the side effect stuff in serial. Sth. like
YES. It's the only solid way to do this sort of thing in Python. And
it can result in clearer and easier-to-debug code: functions without
side effects are often easier to reason about.
>
> @parallel
> def f(m):
>     return [factor(k) for k in range(1000*m, 1000*(m+1))]
>
> obj = MyObj()
> for x in f([1..20]):
>     print x[0]
>     for y in x:
>         obj.store_new_data(y)
>
> If I should do it this way: is the body of the for-loop executed in
> the main process *in parallel* to the subprocess computing the next
> element of the iterator f([1..20]) ?
Mostly. More precisely, when you execute

for x in f([1..20]):
    ...

it unleashes many subprocesses, which start running. When one
completes, its result is saved to a pickle; the calling parent
process notices that the subprocess has terminated, reads the pickle
(deleting the file), adds the result to the iterator, and starts
another subprocess going. So the x's in the example above can come
back in any order. Also, I think (not tested) that if you did:
@parallel
def f(m):
    return [factor(k) for k in range(1000*m, 1000*(m+1))]

obj = MyObj()
for x in f([1..20]):
    while True: pass

then it would compute k (= number of cores) values of f, then hang
and not even start computing more.
In other words, the code that forks off subprocesses has to run at
some point, and there's no threading involved here -- the forking off
of subprocesses is *caused* by iterating over f([1..20]). It's a
possibly annoying/surprising trick, but it makes the implementation
of @parallel really simple, by avoiding any use of threads or
asynchronous execution.
I encourage you to read the source code of this @parallel stuff --
it's only about 2 pages of actual code,
which I wrote at some Sage days as my project back in maybe 2008.
https://github.com/sagemath/sage/blob/master/src/sage/parallel/use_fork.py
It's critical to your question to understand how the Python yield
keyword works (on line 183 in the code).
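The key point about yield can be seen in plain Python, without any forking at all (a small sketch; the `started` list is just for illustration):

```python
# A generator body runs only far enough to produce the next value, so
# nothing happens until you actually iterate. In @parallel, the forks
# happen at exactly the analogous point inside the generator.
started = []

def gen(inputs):
    for m in inputs:
        started.append(m)   # in @parallel, this is where a fork happens
        yield m * m

g = gen([1, 2, 3])
print(started)   # [] -- calling gen() ran none of the body yet
print(next(g))   # 1
print(started)   # [1] -- only the first step has run
```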
>
> Thanks again for the clarification!
>
> Christian
>
--
William Stein
Professor of Mathematics
University of Washington
http://wstein.org
wst...@uw.edu