Strategy for OpenCL support

James Bergstra

May 8, 2013, 10:26:00 AM

to thean...@googlegroups.com

Hi Devs,

I've been playing with OpenCL recently as a means of getting

* lower-overhead CPU execution than Theano currently offers

* multi-core CPU support

* GPU support from same code base as CPU

* GPU support for range of dtypes

Of course, Theano doesn't have OpenCL support yet. One way to add it would be to redo what was done in the sandbox.cuda folder, but I think I have found a better way: post-process the result of theano.function. This strategy is better because (a) it deals more naturally with OpenCL contexts, (b) it works fine as a separate project from Theano, (c) a lot can be done with a few lines of code, and (d) it solves the mystery of how to pickle compiled functions.

The strategy is essentially to create a new class "Simulator" that runs theano functions:

# -- do standard graph optimizations, don't care about quality of VM
f = theano.function([x], y, linker='py')

# -- Create a VM-like thing *externally* from the function,
#    which works by allocating a NEW storage map and creating NEW
#    thunks for each of the apply nodes in f's optimized graph
#    (of course the simulator can also modify the graph even more)
#    The simulator either uses the original shared variables
#    or maybe creates copies... up to simulator.
sim = SimulatorOCL(f)

# -- simulator provides calling mechanism similar to the original function
#    (maybe simulator provides other calling protocols too
#    e.g. for running N times)
sim(xvalue)

# -- updates shared variables
sim.sync_to_theano()
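To make the mechanism concrete, here is a minimal pure-Python sketch of the idea. It is not the SimulatorOCL code and does not use Theano: the dict-based graph format, `ToySimulator`, `OPS`, and `toy_fn` are all hypothetical stand-ins, assumed only for illustration. It shows the three pieces described above: a new storage map, one new thunk per apply node, and an explicit sync of shared variables back to the original function.

```python
import operator

# Hypothetical op table; a real backend would generate OpenCL kernels instead.
OPS = {"add": operator.add, "mul": operator.mul}


class ToySimulator:
    """Toy stand-in for a simulator built externally from a compiled function."""

    def __init__(self, fn):
        self.fn = fn
        # NEW storage map: one single-element cell per variable, holding
        # private copies of the shared variables' values.
        self.storage = {name: [value] for name, value in fn["shared"].items()}
        for name in fn["inputs"]:
            self.storage.setdefault(name, [None])
        for _, _, out_name in fn["nodes"]:
            self.storage.setdefault(out_name, [None])
        # NEW thunks: one per apply node, in the graph's topological order.
        self.thunks = [self._make_thunk(*node) for node in fn["nodes"]]

    def _make_thunk(self, op_name, in_names, out_name):
        op = OPS[op_name]
        in_cells = [self.storage[n] for n in in_names]
        out_cell = self.storage[out_name]

        def thunk():
            out_cell[0] = op(*(cell[0] for cell in in_cells))

        return thunk

    def __call__(self, *args):
        for name, value in zip(self.fn["inputs"], args):
            self.storage[name][0] = value
        for thunk in self.thunks:
            thunk()
        outputs = [self.storage[n][0] for n in self.fn["outputs"]]
        # Apply updates to the simulator's private copies only.
        new_values = {s: self.storage[src][0]
                      for s, src in self.fn["updates"].items()}
        for name, value in new_values.items():
            self.storage[name][0] = value
        return outputs[0] if len(outputs) == 1 else outputs

    def sync_to_theano(self):
        # Copy the simulator's shared values back to the original function.
        for name in self.fn["shared"]:
            self.fn["shared"][name] = self.storage[name][0]


# Hypothetical "optimized graph" for y = x * w + w, with update w <- y.
toy_fn = {
    "inputs": ["x"],
    "outputs": ["y"],
    "shared": {"w": 3.0},
    "nodes": [("mul", ("x", "w"), "t"), ("add", ("t", "w"), "y")],
    "updates": {"w": "y"},
}

sim = ToySimulator(toy_fn)
y = sim(2.0)
w_before_sync = toy_fn["shared"]["w"]  # original is untouched until sync
sim.sync_to_theano()
w_after_sync = toy_fn["shared"]["w"]
```

Keeping the simulator's storage separate from the original function's is what makes the explicit sync step necessary; the trade-off is that the simulator is free to lay out memory however its device prefers.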

It solves the pickling issue by letting the original theano graph be just pure numpy, which was always picklable without problems, while still providing a way to evaluate a function really fast on a particular host. In this case, the function `f` can be unserialized anywhere, whereas the OpenCL-based simulator `sim` can only be unserialized on hosts that have OpenCL.
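A small sketch of that serialization split, under stated assumptions: the "function" here is a made-up plain-data description (a list of steps), and `run_portable`, `build_simulator`, `dump_simulator`, and `load_simulator` are hypothetical helpers, not Theano or PyOpenCL APIs. The point is only that the portable description pickles anywhere, while rebuilding the simulator depends on what the loading host has available.

```python
import pickle

# Hypothetical op table shared by the portable evaluator and the simulator.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def run_portable(fn, x):
    """Slow but portable evaluation of the plain-data function description."""
    for op, const in fn["steps"]:
        x = OPS[op](x, const)
    return x


def build_simulator(fn, backend, available_backends):
    """Rebuild host-specific thunks; fails if the backend is missing."""
    if backend not in available_backends:
        raise RuntimeError("backend %r is not available on this host" % backend)
    thunks = [(lambda v, f=OPS[op], c=const: f(v, c))
              for op, const in fn["steps"]]

    def sim(x):
        for thunk in thunks:
            x = thunk(x)
        return x

    return sim


def dump_simulator(fn, backend):
    # Only plain data is stored: the portable function plus the backend name.
    return pickle.dumps({"fn": fn, "backend": backend})


def load_simulator(blob, available_backends):
    state = pickle.loads(blob)
    return build_simulator(state["fn"], state["backend"], available_backends)


fn = {"steps": [("mul", 2.0), ("add", 1.0)]}  # y = 2 * x + 1
fn_blob = pickle.dumps(fn)
sim_blob = dump_simulator(fn, "opencl")

# The portable function can be restored and run on any host.
portable_result = run_portable(pickle.loads(fn_blob), 3.0)

# The simulator can be restored only where its backend exists.
fast_result = load_simulator(sim_blob, available_backends={"opencl"})(3.0)
try:
    load_simulator(sim_blob, available_backends=set())
    loaded_without_opencl = True
except RuntimeError:
    loaded_without_opencl = False
```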

I've been developing this mechanism in the context of a Theano port of the nengo brain simulator [1]. Would readers of this list be interested in making this available more generally? The OpenCL simulator could be factored out as a standalone project, or included in Theano directly.

- James

[1] https://github.com/jaberg/nengo_theano/blob/simulator-rebase1/nengo/nef_theano/simulator_ocl.py

James Bergstra

May 8, 2013, 10:40:47 AM

to thean...@googlegroups.com

Also, for anyone who was on board with the "workspace" idea I have talked about previously, this is the same type of thing. The "Simulator" here is a workspace. Workspace offered a slightly more powerful abstraction, but the way I was thinking of the Workspace meant more changes required of code that was already using theano.function.

The "Simulator" class here is just as powerful, but it's a better fit for existing code because it still uses shared variables and theano.function. It still makes sense to create a Simulator for multiple functions if you want them to share internal state:

s = SimulatorOCL(theano.function(...))   # -- Create it with a default __call__.
s.add_method('f', theano.function(...))  # -- Add some more methods ...
s.add_method('g', theano.function(...))  # -- that use / update common shared vars

Of course, we might want nicer syntax for doing that, but conceptually it makes sense. I haven't actually implemented this "add_method" business, but the basic data structures required to support it (i.e. self._ocl_vars, self._ocl_constants) are in place.
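Since "add_method" is not implemented, the following is only a guess at how it could behave, written as a self-contained toy. Each "compiled function" is assumed to be a plain Python callable that takes the simulator's variable dict and returns `(output, updates)`; `ToySimulator`, `accumulate`, `reset`, and `peek` are hypothetical names. The sketch shows several methods reading and updating one common set of shared values.

```python
class ToySimulator:
    """Toy simulator whose methods all share one private variable store."""

    def __init__(self, default_fn, shared):
        # Private copy of the shared variables, common to every method.
        self._vars = dict(shared)
        self._default = self._compile(default_fn)

    def _compile(self, fn):
        # Wrap fn so that it reads from, and applies its updates to,
        # the simulator's common variable store.
        def method(*args):
            out, updates = fn(self._vars, *args)
            self._vars.update(updates)
            return out

        return method

    def add_method(self, name, fn):
        setattr(self, name, self._compile(fn))

    def __call__(self, *args):
        return self._default(*args)


def accumulate(variables, x):
    total = variables["total"] + x
    return total, {"total": total}


def reset(variables):
    return None, {"total": 0.0}


def peek(variables):
    return variables["total"], {}


s = ToySimulator(accumulate, {"total": 0.0})  # default __call__
s.add_method("f", reset)                      # more methods ...
s.add_method("g", peek)                       # ... on the same shared vars

s(2.0)
s(3.0)
total_seen_by_g = s.g()
s.f()
total_after_reset = s.g()
```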

- James

Frédéric Bastien

May 14, 2013, 8:36:09 AM

to theano-dev

Hi,

I like the idea, but it moves work to the user. I think we need to keep the possibility of compiling in a single step; not all users will want to compile in two steps. But that isn't complicated.

Your URL is broken.

I don't agree when you say it doesn't need to do something like what is in sandbox.cuda. You will need an OpenCL ndarray or you won't be able to support all of Theano's functionality. Arnaud is making the corrections on his master. After that, he should be finishing his PR to make his new back-end work with Theano, and this back-end supports both CUDA and OpenCL. Did you just use PyOpenCL as your back-end?

Fred


James Bergstra

May 14, 2013, 9:20:58 AM

to thean...@googlegroups.com, Arnaud Bergeron

Good morning Fred, thanks for having a read through my proposal. Here's a working link to a "Simulator" that overrides theano's linker:

https://github.com/jaberg/nengo_theano/blob/simulator-concat/nengo/nef_theano/simulator_ocl.py

I'm glad you like the idea, but I don't see why it moves work to the user. It is still possible to compile in one step. The "Simulator" I describe could be provided as a Linker or VM that is inserted as part of the mode.

But I recommend thinking of it and documenting it in terms of two steps, because there is currently no recommended way to combine function serialization with the use of platform-specific compilation, and that's what I'm proposing.

I'd like to hear Arnaud's thoughts on this. I would suggest that there was a design flaw in how the GPU ops were introduced in theano. They should have been done in an explicitly distinct "instruction selection" step rather than simply yet another pass through the FunctionGraph. We should have left the type nodes storage-agnostic. There should have been something like this:

# build a numpy function, with dynamically generated C
f = theano.function([x], y)

# redo allocation of memory and compiles new thunks
f_gpu = f.optimize(cuda_backend('gpu'))

# transfers x to gpu and runs GPU-side computations
f_gpu(x)

# -- at this point f_gpu and f have shared variables out of sync
# but they can be put back into sync like this:
f_gpu.sync_shared_to_parent()

With this design, serialization of each object makes sense. In particular, f can be serialized for the long term and deserialized on any machine with numpy etc.

*Also* with this design the object which is passed to optimize() has free rein on how to go about producing a new work-a-like of f. It does not have to define new Type objects, and it does not have to use our make_thunk business, or the clinker, etc. All it has to do is produce a callable object that can compute the right thing, and sync shared variables back to the parent function that created it.
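That contract is small enough to sketch in plain Python. Everything below is hypothetical and assumed for illustration only (`ToyFunction`, `WorkAlike`, `toy_backend`, `sgd_step`); theano.function has no optimize() method today. A backend is just a callable that receives the parent function and returns a work-a-like with its own copies of the shared variables and a way to sync them back.

```python
class ToyFunction:
    """Hypothetical stand-in for a portable, numpy-only compiled function."""

    def __init__(self, compute, shared):
        self.compute = compute  # callable(shared, *args) -> (out, updates)
        self.shared = shared

    def __call__(self, *args):
        out, updates = self.compute(self.shared, *args)
        self.shared.update(updates)
        return out

    def optimize(self, backend):
        # The backend decides entirely how to build the work-a-like.
        return backend(self)


class WorkAlike:
    """Callable produced by a backend; owns its copies of shared variables."""

    def __init__(self, parent, device):
        self.parent = parent
        self.device = device
        self.shared = dict(parent.shared)  # stands in for device-side memory

    def __call__(self, *args):
        out, updates = self.parent.compute(self.shared, *args)
        self.shared.update(updates)
        return out

    def sync_shared_to_parent(self):
        self.parent.shared.update(self.shared)


def toy_backend(device):
    def build(parent):
        return WorkAlike(parent, device)

    return build


def sgd_step(shared, grad):
    w = shared["w"] - 0.5 * grad
    return w, {"w": w}


f = ToyFunction(sgd_step, {"w": 1.0})
f_fast = f.optimize(toy_backend("gpu"))

out = f_fast(1.0)
parent_w_before = f.shared["w"]  # parent and work-a-like are out of sync
f_fast.sync_shared_to_parent()
parent_w_after = f.shared["w"]
```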

My "Simulator" class in that link demonstrates a technique that I might expose by e.g.

f_ocl = f.optimize(fixed_size_opencl_backend(device='auto'))
