Can Julia function be serialized and sent by network?


Andrei Zh

Aug 9, 2015, 3:59:39 PM
to julia-users
I'm working on a wrapper for Apache Spark in Julia. Essentially, the workflow looks like this:

 1. The Julia driver creates an instance of the `JuliaRDD` class in the JVM and passes a serialized Julia function to it.
 2. Spark core copies the JuliaRDD to each machine in the cluster and runs its `.compute()` method.
 3. `JuliaRDD.compute()` starts a new Julia process and invokes the function `launch_worker`.
 4. The launched worker reads and deserializes the original function and applies it to a local chunk of data.

So the workers are not managed by any kind of Julia `ClusterManager` and in general know nothing about the definitions in the main driver program. The only two pieces of information they have are the serialized function and the data to process.
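For concreteness, the worker side (step 4) could be sketched roughly like this. The name `launch_worker` comes from the workflow above, but the two-message protocol (function first, then the data chunk) is my assumption for illustration, not the actual JuliaRDD wire format:

```julia
# Rough sketch of the worker side. The protocol (serialized function
# first, then the data chunk) is an assumed framing, not Spark.jl's.
function launch_worker(input::IO, output::IO)
    func = deserialize(input)       # the serialized Julia function
    data = deserialize(input)       # the local chunk of data
    for x in data
        serialize(output, func(x))  # stream results back one by one
    end
end
```

On a single machine this works because the sending process has the definition of `func`; the rest of this thread is about what happens when it doesn't.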

My question is: does Julia's serialization produce completely self-contained code that can be run on workers? In other words, is it possible to send a serialized function over the network to another host / Julia process and apply it there without any additional information from the first process?

I ran some tests on a single machine, and when I defined the function without `@everywhere`, the worker failed with the message "function myfunc not defined on process 1". With `@everywhere`, my code worked, but will it work on multiple hosts with essentially independent Julia processes?
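For reference, the single-machine test looks roughly like this (`myfunc` is just a stand-in name; the point is that these workers ARE managed by Julia, which is exactly what the Spark setup lacks):

```julia
# Sketch of the single-machine experiment with Julia-managed workers.
addprocs(2)                    # spawn two local worker processes

@everywhere myfunc(x) = x^2    # evaluates the definition on every process

fetch(@spawnat 2 myfunc(3))    # works: process 2 has its own copy -> 9
```

Without the `@everywhere`, only process 1 has the definition, and the remote call fails the way described above.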


Jeff Waller

Aug 10, 2015, 7:33:11 AM
to julia-users
 
> My question is: does Julia's serialization produce completely self-containing code that can be run on workers? In other words, is it possible to send serialized function over network to another host / Julia process and applied there without any additional information from the first process?
>
> I made some tests on a single machine, and when I defined function without `@everywhere`, worker failed with a message "function myfunc not defined on process 1". With `@everywhere`, my code worked, but will it work on multiple hosts with essentially independent Julia processes?
 
According to Jey here (https://groups.google.com/forum/#!searchin/julia-users/jey/julia-users/bolLGcSCrs0/fGGVLgNhI2YJ), `Base.serialize` does what we want; it's contained in serialize.jl (https://github.com/JuliaLang/julia/blob/master/base/serialize.jl).

Andrei Zh

Aug 10, 2015, 3:43:19 PM
to julia-users
I'm afraid that's not quite true, and I found a simple way to show it. In the following snippet I define a function `f` and serialize it to a file:

julia> f(x) = x + 1
f (generic function with 1 method)

julia> f(5)
6

julia> open("example.jld", "w") do io serialize(io, f) end

Then I close the Julia REPL and in a new session try to load and use this function:

julia> f2 = open("example.jld") do io deserialize(io) end
(anonymous function)

julia> f2(5)
ERROR: function f not defined on process 1
 in error at error.jl:21
 in anonymous at serialize.jl:398

So the deserialized function still refers to the old definition, which is not available in the new session.

Is there a better way to serialize a function and run it in an unrelated Julia process?

Stefan Karpinski

Aug 10, 2015, 3:45:35 PM
to Julia Users
JLD doesn't support serializing functions but Julia itself does.

Tony Kelman

Aug 10, 2015, 4:13:15 PM
to julia-users
The above code wasn't using the HDF5-based JLD package/format; it was just using .jld as a file extension to store the results of `serialize()`. It should probably use a different extension for that, .jls or something, to avoid confusion.

Tim Holy

Aug 10, 2015, 4:40:03 PM
to julia...@googlegroups.com
On Monday, August 10, 2015 01:13:15 PM Tony Kelman wrote:
> Should
> probably use some different extension for that, .jls or something, to avoid
> confusion.

Yes. That has been sufficiently confusing in the past that we even cover it here:
https://github.com/JuliaLang/JLD.jl#saving-and-loading-variables-in-julia-data-format-jld

--Tim

Andrei Zh

Aug 10, 2015, 4:48:55 PM
to julia-users
Yes, I incorrectly assumed that `serialize` / `deserialize` use the JLD format. But anyway, even when I saved the function to "example.jls" or even to a plain byte array (using an IOBuffer and `takebuf_array`), nothing changed. Am I missing something obvious?

Andrei Zh

Aug 13, 2015, 6:10:54 PM
to julia-users
Ok, after going through the serialization code, it's clear that the default implementation doesn't serialize a function's code, only its name. For example, here's the relevant section from `deserialize(::SerializationState, ::Function)`:
mod = deserialize(s)::Module
name = deserialize(s)::Symbol
if !isdefined(mod,name)
    return (args...)->error("function $name not defined on process $(myid())")
end


This doesn't fit my needs (essentially, the semantics of Spark), and I guess there's no existing solution for full function serialization, so I'm going to write one.

So far the best idea I have is to get the function's AST and serialize it recursively, catching calls to other non-Base functions and any bound variables. But this looks quite complicated. Is there a better / easier way to get a portable representation of a function?
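To make the "catching calls" part concrete, one starting point is a plain walk over a quoted definition, collecting the names it calls so their definitions could be shipped alongside. This is purely a sketch of the direction, not a working serializer (name resolution, closures, and non-`Symbol` callees are all ignored):

```julia
# Collect the names called inside a quoted definition. Illustrative
# only: a real serializer would also resolve modules, closures, etc.
function called_names(ex, acc = Set{Symbol}())
    if isa(ex, Expr)
        if ex.head == :call && isa(ex.args[1], Symbol)
            push!(acc, ex.args[1])   # direct call by name
        end
        for a in ex.args
            called_names(a, acc)     # recurse into sub-expressions
        end
    end
    acc
end

called_names(:(g(x) = helper(x) + 1))   # contains :helper and :+
```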

Tim Holy

Aug 14, 2015, 5:35:18 AM
to julia...@googlegroups.com
If you define the function with @everywhere, it will be defined on all existing
workers. Likewise, `using MyPackage` loads the package on all workers.

--Tim

Andrei Zh

Aug 14, 2015, 6:23:23 AM
to julia-users
Yes, but once again: I'm not using Julia workers, but completely independent Julia processes, running on different machines and managed by Spark, not by Julia's ClusterManager. I.e. the workflow looks like this:

1. Julia process 1 starts JVM and connects to Spark master node. 
2. Julia process 1 sends serialized function to Spark master node. 
3. Spark master node notifies Spark worker nodes (say, there are N of them) about upcoming computations. 
4. Each Spark worker node creates its own Julia process, independent from Julia process 1. 
5. Each Spark worker node receives serialized function and passes it to its local Julia process. 

So with N workers in the Spark cluster there are N+1 Julia processes in total, and when the function in question is created, Julia processes 2 to N+1 don't even exist yet.

Jake Bolewski

Aug 14, 2015, 10:49:55 AM
to julia-users
Andrei Zh

I'm confused. Have you actually tried?

julia> io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> foo(x) =  x + 1
foo (generic function with 1 method)

julia> serialize(io, foo)

julia> seekstart(io)
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=9, maxsize=Inf, ptr=1, mark=-1)

julia> baz = deserialize(io)
foo (generic function with 1 method)

julia> baz(1)
2

The serialization code won't recursively serialize all of the function's dependencies, so you will have to send/serialize the code that defines the environment (types, constants, packages, etc.).
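One concrete way to send such code, given the thread's findings, is to serialize the definition as an `Expr` rather than as a function object, and `eval` it on the receiving side. This is a sketch of the workaround, not necessarily what Spark.jl ended up doing:

```julia
# Sender: ship the definition itself. An Expr serializes by value,
# so it carries the whole body, not just a (module, name) pair.
io = IOBuffer()
serialize(io, :(foo(x) = x + 1))

# Receiver (in principle an unrelated process fed the same bytes):
seekstart(io)
eval(deserialize(io))   # defines foo in the receiving session
foo(5)                  # -> 6
```

Helper functions and free variables referenced by the body still have to be shipped the same way, which is exactly the recursive dependency problem discussed above.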

Andrei Zh

Aug 14, 2015, 11:06:30 AM
to julia-users

Hi Jake,

your example works because you don't leave the Julia session. `foo` is defined in this session, so the pair of module name and function name is enough to recover the function object. If you save the serialized function (or just retype it byte by byte), it won't work. Here's an example:

Session #1:

julia> io = IOBuffer()
IOBuffer(data=Uint8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> foo(x) = x + 1
foo (generic function with 1 method)

julia> serialize(io, foo)

julia> takebuf_array(io)
9-element Array{Uint8,1}:
 0x13
 0x02
 0x23
 0x2f
 0x02
 0x03
 0x66
 0x6f
 0x6f

Session #2:

julia> data = Uint8[0x13, 0x02, 0x23, 0x2f, 0x02, 0x03, 0x66, 0x6f, 0x6f]
9-element Array{Uint8,1}:
 0x13
 0x02
 0x23
 0x2f
 0x02
 0x03
 0x66
 0x6f
 0x6f

julia> io = IOBuffer(data)
IOBuffer(data=Uint8[...], readable=true, writable=false, seekable=true, append=false, size=9, maxsize=Inf, ptr=1, mark=-1)

julia> bar = deserialize(io)
(anonymous function)

julia> bar(1)
ERROR: function foo not defined on process 1
 in error at error.jl:21
 in anonymous at serialize.jl:398


edward zhang

Sep 21, 2015, 8:53:54 AM
to julia-users
Hi, have you fixed this problem already?



Andrei

Sep 21, 2015, 9:30:05 AM
to julia...@googlegroups.com
Hi, 

not yet. I did some initial research into serializing ASTs and reconstructing functions from them, but it seems to be quite a tricky procedure and I have very little time for this project now. I plan to come back to this issue around the beginning of next month.

edward zhang

Sep 21, 2015, 10:33:11 PM
to julia-users

Hmm, I'm very interested in a project like Julia-on-Spark (or Julia-on-C++?), and the ser/des issue is a big obstacle.

Andrei

Sep 22, 2015, 8:40:12 AM
to julia...@googlegroups.com
So far the best way to overcome it is to install all the needed Julia packages on every machine. This is not very convenient, but at least it's not a blocker, and you need to install some Julia packages manually anyway (though I'm thinking of creating an API for mass installation of packages on all Spark workers).

I'm not sure what you mean by "Julia-on-C++", though. 