Multiple Nets in parallel on one GPU


Leif Blaese

Jun 17, 2015, 2:56:07 PM6/17/15
to caffe...@googlegroups.com
Hi,

My problem is the following: I have only one GPU and a relatively small network, but a lot of time series data, so it takes a long time to sift through it all. Since the GPU is not fully occupied by the one network, I thought I might try multithreading and push several networks onto the GPU that can then learn in parallel - or at least use the GPU better, because some computation can run concurrently with loading new data for other networks.

I want to do that with as few changes to the implementation as possible, because I am not sure whether the speedup will be that great. There are (I think) two possible ways to achieve parallelism through oversubscription:

A: We have N threads and each one creates its own network. Each thread performs a Step(1), but instead of updating its parameters it copies the gradients it computed (stored in cpu_diff/gpu_diff) into one part of an array visible to all threads. There the gradients get summed up and propagated back, so that each network ends up with the gradients of all the other networks. Then each network updates its parameters with the "global" gradients. This requires changing the Update() function as well as a lot of copying of data, which will probably introduce a lot of overhead.
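Roughly, this is what I have in mind for A - a minimal, untested sketch, assuming Caffe's Net<float>::params(), Blob<float>::count()/cpu_diff()/mutable_cpu_diff() and Net<float>::Update(); the global_diff buffers, the mutex and the barrier the caller would need are just illustrative plumbing, not anything from Caffe:

#include <mutex>
#include <vector>
#include "caffe/caffe.hpp"

// Option A: each thread runs forward/backward on its own net, adds its
// gradients into a shared host buffer, waits for the others, copies the
// summed gradients back and then applies them locally.
void AccumulateAndApply(caffe::Net<float>& net,
                        std::vector<std::vector<float> >& global_diff,  // one buffer per param blob, zeroed each iteration
                        std::mutex& mu) {
  const auto& params = net.params();
  {
    std::lock_guard<std::mutex> lock(mu);
    for (size_t i = 0; i < params.size(); ++i) {
      const float* d = params[i]->cpu_diff();
      for (int j = 0; j < params[i]->count(); ++j)
        global_diff[i][j] += d[j];            // add this net's gradient
    }
  }
  // ... the caller waits on a barrier here until all N threads have added ...
  for (size_t i = 0; i < params.size(); ++i) {
    float* d = params[i]->mutable_cpu_diff();
    for (int j = 0; j < params[i]->count(); ++j)
      d[j] = global_diff[i][j];               // copy the summed gradient back
  }
  net.Update();                               // data -= diff on every param blob
}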


B: We have N threads and each one creates a network. Then threads 1..N-1 use their ShareData and ShareDiff functions to give up their own parameters and map to the parameters of thread 0. This way we essentially have one set of parameters and diffs that all networks can access, and they can directly Update() the parameters without any code changes.
The problem is that race conditions can arise when two threads update the parameters concurrently. You have to lock the parameters while updating to prevent that.
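Something like this is what I mean for B - again only a rough, untested sketch, assuming both nets are built from the same prototxt so their params() line up one-to-one, and assuming Blob<float>::ShareData()/ShareDiff(); the mutex around Update() is the lock I am worried about:

#include <mutex>
#include "caffe/caffe.hpp"

// Option B: every net after the first aliases the data and diff blobs of
// net 0, so all threads literally work on one set of weights.
void ShareParamsWith(caffe::Net<float>& follower, caffe::Net<float>& owner) {
  const auto& src = owner.params();
  const auto& dst = follower.params();
  for (size_t i = 0; i < dst.size(); ++i) {
    dst[i]->ShareData(*src[i]);   // follower's data now points at owner's data
    dst[i]->ShareDiff(*src[i]);   // likewise for the gradients
  }
}

std::mutex update_mutex;          // serializes the shared weight update

void LockedUpdate(caffe::Net<float>& net) {
  std::lock_guard<std::mutex> lock(update_mutex);
  net.Update();                   // applies the (shared) diff to the (shared) data
}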


I am not sure which one is faster or easier to implement. I suspect B is the better choice here because it is easier to implement and there are fewer memory operations. On the other hand, I have heard that locks on a GPU are really slow. Has anyone done this before and has experience to share? Any problems with my train of thought here? How slow are locks? Would you do A or B? Or something completely different?


Thanks in advance,
Leif

Leif Blaese

Jun 18, 2015, 3:28:20 AM6/18/15
to caffe...@googlegroups.com
 
> B: We have N threads and each one creates a network. Then threads 1..N-1 use their ShareData and ShareDiff functions to give up their own parameters and map to the parameters of thread 0. This way we essentially have one set of parameters and diffs that all networks can access, and they can directly Update() the parameters without any code changes.
> The problem is that race conditions can arise when two threads update the parameters concurrently. You have to lock the parameters while updating to prevent that.

This is of course wrong - they should only share their data (their parameters), not their diffs. Each network has its own diff and then just updates the shared data using atomic functions.
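A rough sketch of this corrected scheme (untested), assuming ShareData() is called as in B but ShareDiff() is not, so the data is shared while each net keeps its own diff; the kernel and its launch are illustrative and not part of Caffe - it just applies one net's gradient to the shared weights with atomicAdd, so no explicit lock is needed:

#include "caffe/caffe.hpp"

// Applies data[i] -= lr * diff[i] with atomics so that several nets can
// update the shared weights concurrently without a lock.
__global__ void atomic_sgd_update(int n, float lr,
                                  const float* diff, float* data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    atomicAdd(&data[i], -lr * diff[i]);
  }
}

void AtomicUpdate(caffe::Net<float>& net, float lr) {
  const auto& params = net.params();
  for (size_t i = 0; i < params.size(); ++i) {
    const int n = params[i]->count();
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    atomic_sgd_update<<<blocks, threads>>>(
        n, lr, params[i]->gpu_diff(), params[i]->mutable_gpu_data());
  }
}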