Can Julia function be serialized and sent by network?


Andrei Zh

Aug 9, 2015, 3:59:39 PM
to julia-users
I'm working on a wrapper for Apache Spark in Julia. Essentially, the workflow looks like this:

 1. The Julia driver creates an instance of the `JuliaRDD` class in the JVM and passes a serialized Julia function to it.
 2. Spark core copies the JuliaRDD to each machine in the cluster and runs its `.compute()` method.
 3. `JuliaRDD.compute()` starts a new Julia process and invokes the function `launch_worker`.
 4. The launched worker reads and deserializes the original function and applies it to a local chunk of data.

So the workers are not managed by any kind of Julia `ClusterManager` and in general know nothing about the definitions in the main driver program. The only two pieces of information they have are the serialized function and the data to process.
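For concreteness, the worker side (step 4) could be sketched roughly like this. The name `launch_worker` comes from the workflow above, but the two-message protocol (function first, then the data chunk) is my assumption for illustration, not the actual JuliaRDD wire format:

```julia
# Rough sketch of the worker side. The protocol (serialized function
# first, then the data chunk) is an assumed framing, not Spark.jl's.
function launch_worker(input::IO, output::IO)
    func = deserialize(input)       # the serialized Julia function
    data = deserialize(input)       # the local chunk of data
    for x in data
        serialize(output, func(x))  # stream results back one by one
    end
end
```

On a single machine this works because the sending process has the definition of `func`; the rest of this thread is about what happens when it doesn't.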

My question is: does Julia's serialization produce completely self-contained code that can be run on workers? In other words, is it possible to send a serialized function over the network to another host / Julia process and apply it there without any additional information from the first process?

I ran some tests on a single machine, and when I defined the function without `@everywhere`, the worker failed with the message "function myfunc not defined on process 1". With `@everywhere`, my code worked, but will it work on multiple hosts with essentially independent Julia processes?
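For reference, the single-machine test looks roughly like this (`myfunc` is just a stand-in name; the point is that these workers ARE managed by Julia, which is exactly what the Spark setup lacks):

```julia
# Sketch of the single-machine experiment with Julia-managed workers.
addprocs(2)                    # spawn two local worker processes

@everywhere myfunc(x) = x^2    # evaluates the definition on every process

fetch(@spawnat 2 myfunc(3))    # works: process 2 has its own copy -> 9
```

Without the `@everywhere`, only process 1 has the definition, and the remote call fails the way described above.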


Jeff Waller

Aug 10, 2015, 7:33:11 AM
to julia-users
 
> My question is: does Julia's serialization produce completely self-containing code that can be run on workers? In other words, is it possible to send serialized function over network to another host / Julia process and applied there without any additional information from the first process?
>
> I made some tests on a single machine, and when I defined function without `@everywhere`, worker failed with a message "function myfunc not defined on process 1". With `@everywhere`, my code worked, but will it work on multiple hosts with essentially independent Julia processes?
 
According to Jey here (https://groups.google.com/forum/#!searchin/julia-users/jey/julia-users/bolLGcSCrs0/fGGVLgNhI2YJ), `Base.serialize` does what we want; it's contained in serialize.jl (https://github.com/JuliaLang/julia/blob/master/base/serialize.jl).

Andrei Zh

Aug 10, 2015, 3:43:19 PM
to julia-users
I'm afraid that's not quite true, and I found a simple way to show it. In the following snippet I define a function `f` and serialize it to a file:

julia> f(x) = x + 1
f (generic function with 1 method)

julia> f(5)
6

julia> open("example.jld", "w") do io serialize(io, f) end

Then I close the Julia REPL and in a new session try to load and use this function:

julia> f2 = open("example.jld") do io deserialize(io) end
(anonymous function)

julia> f2(5)
ERROR: function f not defined on process 1
 in error at error.jl:21
 in anonymous at serialize.jl:398

So the deserialized function still refers to the old definition, which is not available in the new session.

Is there a better way to serialize a function and run it in an unrelated Julia process?

Stefan Karpinski

Aug 10, 2015, 3:45:35 PM
to Julia Users
JLD doesn't support serializing functions but Julia itself does.

Tony Kelman

Aug 10, 2015, 4:13:15 PM
to julia-users
The above code wasn't using the HDF5-based JLD package/format; it was just using .jld as a file extension to store the results of `serialize()`. It should probably use a different extension for that, .jls or something, to avoid confusion.

Tim Holy

Aug 10, 2015, 4:40:03 PM
to julia...@googlegroups.com
On Monday, August 10, 2015 01:13:15 PM Tony Kelman wrote:
> Should
> probably use some different extension for that, .jls or something, to avoid
> confusion.

Yes. That has been sufficiently confusing in the past that we even cover it here:
https://github.com/JuliaLang/JLD.jl#saving-and-loading-variables-in-julia-data-format-jld

--Tim

Andrei Zh

Aug 10, 2015, 4:48:55 PM
to julia-users
Yes, I incorrectly assumed that `serialize` / `deserialize` use the JLD format. But anyway, even when I saved the function to "example.jls" or even to a plain byte array (using an IOBuffer and `takebuf_array`), nothing changed. Am I missing something obvious?

Andrei Zh

Aug 13, 2015, 6:10:54 PM
to julia-users
Ok, after going through the serialization code, it's clear that the default implementation doesn't serialize a function's code, only its name. For example, here's the relevant section from `deserialize(::SerializationState, ::Function)`:
mod = deserialize(s)::Module
name = deserialize(s)::Symbol
if !isdefined(mod,name)
    return (args...)->error("function $name not defined on process $(myid())")
end


This doesn't fit my needs (essentially, the semantics of Spark), and I guess there's no existing solution for full function serialization, so I'm going to write one.

So far the best idea I have is to get the function's AST and serialize it recursively, catching calls to other non-Base functions and any bound variables. But this looks quite complicated. Is there a better / easier way to get a portable representation of a function?
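To make the "catching calls" part concrete, one starting point is a plain walk over a quoted definition, collecting the names it calls so their definitions could be shipped alongside. This is purely a sketch of the direction, not a working serializer (name resolution, closures, and non-`Symbol` callees are all ignored):

```julia
# Collect the names called inside a quoted definition. Illustrative
# only: a real serializer would also resolve modules, closures, etc.
function called_names(ex, acc = Set{Symbol}())
    if isa(ex, Expr)
        if ex.head == :call && isa(ex.args[1], Symbol)
            push!(acc, ex.args[1])   # direct call by name
        end
        for a in ex.args
            called_names(a, acc)     # recurse into sub-expressions
        end
    end
    acc
end

called_names(:(g(x) = helper(x) + 1))   # contains :helper and :+
```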

Tim Holy

Aug 14, 2015, 5:35:18 AM
to julia...@googlegroups.com
If you define the function with @everywhere, it will be defined on all existing
workers. Likewise, `using MyPackage` loads the package on all workers.

--Tim

Andrei Zh

Aug 14, 2015, 6:23:23 AM
to julia-users
Yes, but once again: I'm not using Julia workers, but completely independent Julia processes, running on different machines and managed by Spark, not by Julia's ClusterManager. I.e. the workflow looks like this:

1. Julia process 1 starts JVM and connects to Spark master node. 
2. Julia process 1 sends serialized function to Spark master node. 
3. Spark master node notifies Spark worker nodes (say, there are N of them) about upcoming computations. 
4. Each Spark worker node creates its own Julia process, independent from Julia process 1. 
5. Each Spark worker node receives serialized function and passes it to its local Julia process. 

So with N workers in the Spark cluster there are N+1 Julia processes in total, and when the function in question is created, Julia processes 2 to N+1 don't even exist yet.

Jake Bolewski

Aug 14, 2015, 10:49:55 AM
to julia-users
Andrei Zh

I'm confused. Have you actually tried?

julia> io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> foo(x) =  x + 1
foo (generic function with 1 method)

julia> serialize(io, foo)

julia> seekstart(io)
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=9, maxsize=Inf, ptr=1, mark=-1)

julia> baz = deserialize(io)
foo (generic function with 1 method)

julia> baz(1)
2

The serialization code won't recursively serialize all of the function's dependencies, so you will have to send/serialize the code that defines the environment (types, constants, packages, etc.).
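One concrete way to send such code, given the thread's findings, is to serialize the definition as an `Expr` rather than as a function object, and `eval` it on the receiving side. This is a sketch of the workaround, not necessarily what Spark.jl ended up doing:

```julia
# Sender: ship the definition itself. An Expr serializes by value,
# so it carries the whole body, not just a (module, name) pair.
io = IOBuffer()
serialize(io, :(foo(x) = x + 1))

# Receiver (in principle an unrelated process fed the same bytes):
seekstart(io)
eval(deserialize(io))   # defines foo in the receiving session
foo(5)                  # -> 6
```

Helper functions and free variables referenced by the body still have to be shipped the same way, which is exactly the recursive dependency problem discussed above.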

Andrei Zh

Aug 14, 2015, 11:06:30 AM
to julia-users

Hi Jake,

your example works because you don't leave the Julia session. `foo` is defined in this session, so the pair of module name and function name is enough to recover the function object. If you save the serialized function (or just retype it byte by byte), it won't work. Here's an example:

Session #1:

julia> io = IOBuffer()
IOBuffer(data=Uint8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> foo(x) = x + 1
foo (generic function with 1 method)

julia> serialize(io, foo)

julia> takebuf_array(io)
9-element Array{Uint8,1}:
 0x13
 0x02
 0x23
 0x2f
 0x02
 0x03
 0x66
 0x6f
 0x6f

Session #2:

julia> data = Uint8[0x13, 0x02, 0x23, 0x2f, 0x02, 0x03, 0x66, 0x6f, 0x6f]
9-element Array{Uint8,1}:
 0x13
 0x02
 0x23
 0x2f
 0x02
 0x03
 0x66
 0x6f
 0x6f

julia> io = IOBuffer(data)
IOBuffer(data=Uint8[...], readable=true, writable=false, seekable=true, append=false, size=9, maxsize=Inf, ptr=1, mark=-1)

julia> bar = deserialize(io)
(anonymous function)

julia> bar(1)
ERROR: function foo not defined on process 1
 in error at error.jl:21
 in anonymous at serialize.jl:398


edward zhang

Sep 21, 2015, 8:53:54 AM
to julia-users
Hi, have you fixed this problem already?



Andrei

Sep 21, 2015, 9:30:05 AM
to julia...@googlegroups.com
Hi, 

not yet. I did some initial research into serializing ASTs and reconstructing functions from them, but it seems to be quite a tricky procedure and I have very little time for this project now. I plan to come back to this issue around the beginning of next month.

edward zhang

Sep 21, 2015, 10:33:11 PM
to julia-users

Hmm, I'm very interested in a project like Julia-on-Spark (or Julia-on-C++?), and the ser/des issue is a big obstacle.

Andrei

Sep 22, 2015, 8:40:12 AM
to julia...@googlegroups.com
So far the best way to overcome it is to install all the needed Julia packages on every machine. This is not very convenient, but at least it's not a blocker, and you need to install some Julia packages manually anyway (though I'm thinking of creating an API for mass installation of packages on all Spark workers).

I'm not sure what you mean by "Julia-on-C++", though. 