Embarrassingly parallel workload

Júlio Hoffimann

Aug 9, 2015, 5:41:42 PM
to julia-users
Hi,

Suppose I have a complicated but embarrassingly parallel loop, namely: https://github.com/juliohm/ImageQuilting.jl/blob/master/src/iqsim.jl#L167

How would you dispatch the iterations so that all cores on the client machine stay busy? Is there any construct in the language for this common scenario?

-Júlio

Tim Holy

Aug 9, 2015, 6:02:15 PM
to julia...@googlegroups.com
Completely possible with SharedArrays, see
http://docs.julialang.org/en/latest/manual/parallel-computing/#shared-arrays
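
For instance, a minimal sketch (untested; assumes you started Julia with some workers, e.g. julia -p 4, and the dimensions are just an example):

A = SharedArray(Float64, (9, 60))
@sync @parallel for i = 1:60
    A[:, i] = vec(rand(3, 3))   # each worker fills the columns it was assigned
end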

--Tim

Júlio Hoffimann

Aug 9, 2015, 6:11:29 PM
to Julia Users

Thank you Tim, will check it carefully.

-Júlio

Júlio Hoffimann

Aug 9, 2015, 7:09:12 PM
to Julia Users
Consider the following simplified example. There is an algorithm implemented as a function foo(N). This algorithm repeats the same recipe N times in a loop to fill in an array of arrays:

function foo(N)
  # bunch of auxiliary variables go here
  # ...

  result = []
  for i=1:N
    # complicated use of auxiliary variables to produce result[i]
    # ...

    push!(result, rand(3,3)) # push result[i] to the vector
  end

  return result
end

Each iteration is expensive and we would like to add an option foo(N; parallel=false) to use all available cores. I would start with:

function foo(N; parallel=false)
  if parallel
    addprocs(CPU_CORES - 1)
  end

  # bunch of auxiliary variables go here
  # ...

  result = SharedVector(???)
  @parallel for i=1:N
    # complicated use of auxiliary variables to produce result[i]
    # ...

    push!(result, rand(3,3))
  end

  return result
end

How should the SharedVector be initialized? Is this the correct approach? What about the auxiliary variables, will they be copied to all workers? If yes, can I just replace Arrays with SharedArrays so that they are visible everywhere without copying?
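
(My best guess so far, given that the output size 9xN is known in advance, would be to allocate the full shared matrix up front and write into columns instead of push!ing, since a SharedArray has a fixed size:

result = SharedArray(Float64, (9, N))

but I'm not sure that's right either.)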

-Júlio

Júlio Hoffimann

Aug 10, 2015, 6:23:02 PM
to Julia Users
What am I doing wrong in the following code?

function foo(N; parallel=false)
  if parallel && nprocs() < CPU_CORES
    addprocs(CPU_CORES - nprocs())
  end

  result = SharedArray(Float64, 9, N)
  @parallel for i=1:N
    sleep(1)
    result[:,i] = rand(3,3)[:]
  end

  result
end

If I call foo(60, parallel=true), result comes back all zeros. The expected behavior is a matrix of random values.

-Júlio

Kristoffer Carlsson

Aug 10, 2015, 10:00:04 PM
to julia-users
Something like this? Each worker fills only its own chunk of the array (via Base.localindexes), so the writes never overlap:

@everywhere function fill(A::SharedArray)
    for idx in Base.localindexes(A)   # the chunk of indices owned by this worker
        A[idx] = rand()
    end
end

function fill_array(m, n)
    A = SharedArray(Float64, (m, n))
    @sync begin
        for p in procs(A)   # every worker the shared array is mapped on
            @async remotecall_wait(p, fill, A)
        end
    end
    A
end

fill_array(9, 60)

Júlio Hoffimann

Aug 19, 2015, 2:03:59 PM
to Julia Users
Hi Kristoffer, sorry for the delay and thanks for the code.

What I want to do is very simple: I have an expensive loop for i=1:N such that each iteration is independent and produces a large array of size M. The result of this loop is a matrix of size MxN. I have many CPU cores at my disposal and want to distribute this work.

In the past I accomplished this with MPI in Python: https://github.com/juliohm/HUM/blob/master/pyhum/utils.py Whenever a process in the pool is free, it "consumes" an iteration of the loop. What exactly is the @parallel macro in Julia doing? How can I modify the code I posted earlier in this thread to achieve the same effect?
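
From the docs, pmap looks closest to that consume-when-free behavior. A sketch of what I have in mind (untested; assumes addprocs was already called):

@everywhere function iteration(i)
    sleep(1)          # stand-in for the expensive work
    vec(rand(3, 3))   # the length-M column for iteration i
end

cols = pmap(iteration, 1:60)   # a free worker picks up the next i, like the MPI pool
result = hcat(cols...)         # M x N matrix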

-Júlio

Ismael VC

Aug 19, 2015, 2:40:10 PM
to julia-users
There is an MPI wrapper for Julia; I don't know if it'll suit your needs though:

https://github.com/JuliaParallel/MPI.jl

Júlio Hoffimann

Aug 19, 2015, 2:46:48 PM
to Julia Users

Hi Ismael,

MPI is distributed memory; I'm trying to use all the cores of my single workstation with shared memory instead. Thanks for the link anyway.

-Júlio

Sebastian Nowozin

Aug 19, 2015, 4:42:59 PM
to julia-users

Hi Julio,

I believe this is a very common type of workload, especially in scientific computing.
In C++ one can use OpenMP for this type of computation, and in Matlab there is parfor. From the user's perspective, both "just work".

In Julia, I have not found an easy and convenient way to do such computation.
The difficulties I have experienced, trying to do this with distributed arrays and the Julia parallel operations:

1. Having to prepend @everywhere before every import/require so that all parallel processes have all definitions.
2. Working with the distributed arrays API has given me plenty of headaches; it is more like programming with local/global contexts in OpenCL.
3. (I believe this is fixed now.) There were garbage collection issues and crashes on Windows when using distributed arrays.

What would be very convenient is OpenMP-style parallelism, really anything that lets us simply write

function compute(X::Vector{Float64}, theta)
    n = length(X)
    A = zeros(Float64, n)
    @smp for i=1:n
        A[i] = embarrassing_parallel(X[i], theta)
    end
    A
end

where @smp would correspond to "#pragma omp parallel for".
I know this may be difficult to implement for a language as dynamic as Julia, but it is hard to argue against this simplicity from the user's point of view.

As Clang/LLVM now support OpenMP (https://clang-omp.github.io/), perhaps the same OpenMP runtime could be recycled for this kind of lightweight parallelism?

Thanks,
Sebastian

Júlio Hoffimann

Aug 19, 2015, 4:53:32 PM
to Julia Users

Hi Sebastian, thanks for sharing your experience parallelizing Julia code. I used OpenMP in the past too; it was very convenient in my C++ codebase. I also remember an initiative called OpenACC that was trying to bring OpenMP-style directives to GPU accelerators, though I don't know its current status. It may be of interest to the Julia devs.

-Júlio

Nils Gudat

Aug 20, 2015, 4:52:52 AM
to julia-users
Sebastian, I'm not sure I understand you correctly, but point (1) in your list can usually be taken care of by wrapping all the necessary usings/requires/includes and definitions in an @everywhere begin ... end block.
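
For example (the package and file names below are just placeholders):

@everywhere begin
    using Distributions
    include("mydefinitions.jl")
end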

Júlio, as for your original problem, I think Tim's advice about SharedArrays was perfectly reasonable. Without having looked at your problem in detail, you should be able to do something like the code below; note the @sync, which makes the caller wait for the @parallel loop to finish (a bare @parallel returns immediately, before the workers have written anything, which is why your earlier version read back all zeros). I also think this gets close enough to what Sebastian was talking about, and to Matlab's parfor, unless I'm completely misunderstanding your problem:

nprocs() == CPU_CORES || addprocs(CPU_CORES - 1)   # make sure the workers exist
results = SharedArray(Float64, (m, n))             # m, n: the output size from your problem

@sync @parallel for i = 1:n
    results[:, i] = complicatedfunction(inputs[i]) # placeholders for your per-iteration work
end

Ryan Cox

Aug 20, 2015, 11:49:08 AM
to julia...@googlegroups.com
Sebastian,

This talk from JuliaCon 2015 discusses progress on OpenMP-like threading:
Kiran Pamnany and Ranjan Anantharaman: Multi-threading Julia: http://youtu.be/GvLhseZ4D8M?a
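
A rough sketch of the loop-level interface that work is aiming at (hypothetical; the final API may differ from what the talk shows):

# requires a threading-enabled build, with e.g. JULIA_NUM_THREADS=4
function compute(X::Vector{Float64}, theta)
    A = zeros(Float64, length(X))
    Threads.@threads for i = 1:length(X)
        A[i] = X[i] * theta   # stand-in for the expensive per-element kernel
    end
    A
end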

Ryan

Sebastian Nowozin

Aug 20, 2015, 5:55:17 PM
to julia...@googlegroups.com

Thanks Ryan for the pointer, this is awesome work, I am looking forward to this becoming part of the Julia release in Q3.

Sebastian

Christopher Alexander

Jan 30, 2016, 2:31:40 PM
to julia-users
I have tried the construction below with no success. On v0.4.3 I end up getting a segmentation fault. On the latest v0.5.0 (development build), the run time is 3-4x that of the non-parallelized version, and the array constructed is vastly different from the one produced by the non-parallelized code. Below is the C++ code that I am essentially trying to emulate:

void TreeLattice<Impl>::stepback(Size i, const Array& values,
                                 Array& newValues) const {
    #pragma omp parallel for
    for (Size j=0; j<this->impl().size(i); j++) {
        Real value = 0.0;
        for (Size l=0; l<n_; l++) {
            value += this->impl().probability(i,j,l) *
                     values[this->impl().descendant(i,j,l)];
        }
        value *= this->impl().discount(i,j);
        newValues[j] = value;
    }
}


The calls to probability, descendant, and discount all end up accessing data in other objects, so I tried to prepend those function and type definitions with @everywhere. However, that started me down a long chain of eventually wrapping each file in my module in @everywhere, and there were still errors complaining about things not being defined. At this point I am really confused about how to construct what would appear to be a rather simple parallelized for loop that generates the same results as the non-parallelized code. I've pored over both this forum and other resources, and nothing has really worked.
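
For reference, the pattern I keep seeing recommended is to add the workers first and only then load the module, so the definitions reach every process (MyModule below is a stand-in for my package):

addprocs(CPU_CORES - 1)
@everywhere using MyModule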

Any help would be appreciated.

Thanks!

Chris

Christopher Alexander

Jan 30, 2016, 2:33:36 PM
to julia-users
By "construction below", I mean this:

results = SharedArray(Float64, (m,n))

@sync @parallel for i = 1:n
    results[:, i] = complicatedfunction(inputs[i])
end
