Implementing a MapReduce parallel model (not general multi-threading): easy and enough?

cheng wang

Oct 5, 2015, 4:52:21 PM

to julia-users

Hello everyone,

I am a Julia newbie, and I have been thrilled by Julia recently. It's an amazing language!

I notice that Julia currently does not have good support for multi-threaded programming.

So I am thinking that a Spark-like MapReduce parallel model plus multiple processes may be enough.

It is easy to make thread-safe, and it could handle most vector-based computation.

This idea might be too naive. However, I would be happy to hear your opinions.

Thanks in advance,

Cheng

Andrei Zh

Oct 6, 2015, 10:24:51 AM

to julia-users

Julia supports multiprocessing pretty well, including map-reduce-like jobs. For example, below I add 3 processes to a "workgroup", distribute a simulation among them, and then reduce the results with the (+) operator:

julia> addprocs(3)
3-element Array{Int64,1}:
 2
 3
 4


julia> nheads = @parallel (+) for i=1:200000000
         Int(rand(Bool))
       end
100008845

You can find the full example and a lot of other fun in the official documentation on parallel computing:

http://julia.readthedocs.org/en/latest/manual/parallel-computing/

Note, though, that this is not real (i.e. Hadoop/Spark-like) map-reduce: the original idea of MR concerns distributed systems and data-local computation, while here we do everything on the same machine. If you are looking for a big data solution, search this forum for some (dead or alive) projects for it.
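For readers on Julia 1.x, here is a minimal sketch of the same coin-flip reduction. It assumes a current Julia release, where the macro is named @distributed and lives in the Distributed standard library (the transcript above uses the 2015 name, @parallel). The helper name count_heads and the smaller iteration count are illustrative choices, not part of the original example.

```julia
using Distributed

# Start 3 local worker processes (assumption: a single multi-core machine).
addprocs(3)

# Hypothetical helper: count heads in n simulated coin flips.
function count_heads(n)
    # The iteration range is split across the workers; each worker sums
    # its share and the partial sums are combined with (+).
    @distributed (+) for i in 1:n
        Int(rand(Bool))
    end
end

nheads = count_heads(2_000_000)
println(nheads)
```

The result is random, but it should land close to half of the number of flips.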

Stefan Karpinski

Oct 6, 2015, 1:03:57 PM

to Julia Users

That works fine in a distributed setting if you start Julia workers on other machines, so it is actually a legitimate form of map reduce. It doesn't do anything for handling machine failures, however, which was arguably the major concern of the original MapReduce design.

Andrei Zh

Oct 6, 2015, 3:57:17 PM

to julia-users

Yet calling Julia processes on other machines via ssh doesn't address data locality. In big data systems (say, > 1TB) the main performance concern is not the number of CPUs but IO operations and data movement across the cluster, so map reduce tries to do as much as possible on local data without any movement (map phase) and then combines the results globally (reduce phase). This way a small program is sent to the data nodes instead of huge data being sent to the program's node(s).

As far as I know, Julia doesn't provide any tools for working with huge distributed datasets, which is why I say it doesn't give you Hadoop- (or Spark-, or Google-like) map-reduce. But it's quite easy to add these features of MR too. E.g. one can use Elly.jl to access HDFS (including the location of data blocks) and execute tasks using remotecall() on the Julia worker that is closest to the data.
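The "send the program to the data" idea can be sketched roughly as follows, assuming Julia 1.x and the Distributed standard library. This is not Elly.jl or HDFS code: the block names, the block_location table, and the load_block and block_sum helpers are all hypothetical stand-ins, and the workers here are local processes rather than real data nodes.

```julia
using Distributed

# Assumption: in a real cluster these workers would be started on the data nodes.
addprocs(2)
ws = workers()

# Hypothetical block-location table: block name => id of the worker that is
# closest to that block. A real system would ask the file system for this.
block_location = Dict("block-a" => ws[1], "block-b" => ws[2])

# Hypothetical stand-in for reading a block from local storage.
@everywhere load_block(name) = name == "block-a" ? collect(1:100) : collect(101:200)

# The "small program" that travels to the data: a per-block partial sum.
@everywhere block_sum(name) = sum(load_block(name))

# Map phase: run block_sum on the worker that owns each block.
futures = [remotecall(block_sum, w, name) for (name, w) in block_location]

# Reduce phase: only the small partial results travel back to the caller.
total = sum(fetch.(futures))
println(total)
```

Only the block name and one integer per block cross process boundaries; the block contents stay on the worker that loaded them.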

Tim Holy

Oct 6, 2015, 4:29:52 PM

to julia...@googlegroups.com

There's

https://github.com/JuliaParallel/DistributedArrays.jl
https://github.com/JuliaParallel/HDFS.jl

in case they help. (See the other packages in JuliaParallel, in case you have
missed that organization.)

--Tim

David van Leeuwen

Oct 6, 2015, 4:43:49 PM

to julia-users

See also an earlier discussion on a similar topic, for an out-of-core approach.

---david

Stefan Karpinski

Oct 6, 2015, 5:15:19 PM

to Julia Users

In my experience, Hadoop is pretty terrible about minimizing data movement; Spark seems to be significantly better. The only codes that really nail it are carefully handcrafted HPC codes.

Andrei Zh

Oct 6, 2015, 5:53:24 PM

to julia-users

> In my experience, Hadoop is pretty terrible about minimizing data movement; Spark seems to be significantly better.

If you mean MapReduce (the framework, version 1 or 2), it doesn't move data anywhere unless you tell it to do so in the reduce phase. You may have run into another issue with MR1: multiple reads and writes to disk in multistage jobs, which make them terrrrribly slow. (Recall that Hadoop was born to efficiently and reliably download and store millions of web pages obtained using Nutch, not to write iterative machine learning algorithms.)

> The only codes that really nail it are carefully handcrafted HPC codes.

Could you please elaborate on this? I think I know Spark code quite well, but can't connect it to the notion of handcrafted HPC code.

Steven Sagaert

Oct 7, 2015, 7:32:48 AM

to julia-users

I think what is meant is that in HPC this is typically done via MPI, a low-level approach where you have to explicitly specify all the data communication (compared to Hadoop & Spark, where it is implicit).

cheng wang

Oct 7, 2015, 8:34:02 AM

to julia-users

Thanks all for replying.

I read the parallel computing documentation before I posted this.

Actually, what I mean is a shared-memory model, not a distributed model.

My daily research involves extensive use of BLAS and parallel for-loops.

Julia has excellent support for BLAS, and parallel for-loops can be handled with multiple processes.

However, if I want a shared array that supports efficient BLAS operations and parallel for-loops at the same time, what is the best solution?

Jonathan Malmaud

Oct 7, 2015, 9:00:25 AM

to julia-users

Within the next few days, support for native threads will be merged into the development version of Julia (https://github.com/JuliaLang/julia/pull/13410).
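As a rough illustration of a threaded parallel for-loop, here is a minimal sketch that assumes a current Julia 1.x release and its Threads.@threads macro; the interface available in a 2015 development build may have differed. The function name squares_threaded is made up for the example, and Julia must be started with more than one thread (for instance via the JULIA_NUM_THREADS environment variable) to see any parallelism.

```julia
using Base.Threads

# Hypothetical example function: fill a vector with i^2 in parallel.
function squares_threaded(n)
    out = Vector{Float64}(undef, n)
    # Iterations are divided among the available threads.
    @threads for i in 1:n
        out[i] = i^2   # each iteration writes a distinct index, so no lock is needed
    end
    return out
end

result = squares_threaded(1_000)
println(nthreads(), " thread(s); sum = ", sum(result))
```

Because all threads share one address space, out is an ordinary Array and can be passed straight to BLAS-backed routines afterwards.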

You can also use the SharedArray type, which Julia already has and which lets multiple Julia processes running on the same machine share memory. You would use the standard Julia parallel tools (like @parallel, etc.) in that model.
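A minimal sketch of that combination, assuming Julia 1.x, where SharedArray lives in the SharedArrays standard library and the parallel loop macro is @distributed: worker processes fill a shared matrix in a parallel for-loop, and the master process then runs a dense matrix product on the same memory. The matrix size and the fill formula are arbitrary choices for illustration.

```julia
using Distributed
addprocs(2)                      # assumption: workers on the same machine
@everywhere using SharedArrays
using LinearAlgebra

n = 200
A = SharedArray{Float64}(n, n)   # memory visible to every local process

# Parallel for-loop: the column range is split across the workers, and
# @sync waits until all of them have finished writing.
@sync @distributed for j in 1:n
    for i in 1:n
        A[i, j] = i + j
    end
end

# sdata(A) exposes the backing Array, so the dense Float64 product below
# goes through Julia's BLAS-backed matrix multiplication.
B = sdata(A) * sdata(A)'
println(size(B))
```

Each worker writes a disjoint set of columns, so no synchronization beyond the final @sync is needed in this sketch.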

cheng wang

Oct 7, 2015, 9:06:44 AM

to julia-users

Thx a lot. You saved my life :)
