How to return the solver when training Caffe with multiple GPUs in Python?

Z.L

Jul 17, 2017, 2:17:27 PM

to Caffe Users

Hi all,

I would like to train Caffe on multiple GPUs from Python and then use the solver directly. For example, with a single GPU I can use the following code:

import caffe

caffe.set_mode_gpu()
caffe.set_device(0)
caffe.set_solver_count(1)
caffe.set_solver_rank(0)
solver = caffe.SGDSolver('cifar10_quick_solver.prototxt')
solver.step(solver.param.max_iter)

After this, I can use the solver to run prediction and compute accuracy:

solver.test_nets[0].forward()
validation_acc = solver.test_nets[0].blobs['accuracy'].data
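Note that a single `forward()` call evaluates only one batch of the test net. A minimal sketch of averaging over several batches follows; `mean_accuracy` and `num_batches` are hypothetical names, and it assumes the test net exposes a scalar blob called `accuracy`, as in the snippet above.

```python
def mean_accuracy(net, num_batches, blob_name='accuracy'):
    """Average a scalar blob over num_batches forward passes of net."""
    total = 0.0
    for _ in range(num_batches):
        net.forward()  # each call runs one test batch
        # Assumes the blob holds a single value (e.g. an Accuracy layer output).
        total += float(net.blobs[blob_name].data)
    return total / num_batches
```

With the single-GPU solver above, this could be called as `mean_accuracy(solver.test_nets[0], num_batches)`, where `num_batches` would typically match the `test_iter` value in the solver prototxt.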

For multiple GPUs, I modified the code from https://github.com/BVLC/caffe/blob/master/python/train.py as follows:

from multiprocessing import Process

import caffe


def solve(proto, gpus, uid, rank):
    # Each process drives one GPU with its own solver instance.
    caffe.set_mode_gpu()
    caffe.set_device(gpus[rank])
    caffe.set_solver_count(len(gpus))
    caffe.set_solver_rank(rank)
    caffe.set_multiprocess(True)

    solver = caffe.SGDSolver(proto)

    # Synchronize weights and gradients across processes through NCCL.
    nccl = caffe.NCCL(solver, uid)
    nccl.bcast()
    solver.add_callback(nccl)
    if solver.param.layer_wise_reduce:
        solver.net.after_backward(nccl)

    solver.step(solver.param.max_iter)


if __name__ == '__main__':
    gpus = [0, 1]
    solver_file = "cifar10_quick_solver.prototxt"

    # NCCL uses a uid to identify a session
    uid = caffe.NCCL.new_uid()

    caffe.init_log()
    caffe.log('Using devices %s' % str(gpus))

    procs = []
    for rank in range(len(gpus)):
        p = Process(target=solve,
                    args=(solver_file, gpus, uid, rank))
        p.daemon = True
        p.start()
        procs.append(p)
    for p in procs:
        p.join()

Here, if I would like to compute accuracy, how do I get hold of the solver? One indirect way is to restore it from a solverstate. Is there a direct way to get the solver, as in the single-GPU case? Any suggestions?
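A minimal sketch of one possible workaround, not a confirmed answer: the solver is created inside each child process, so it may be simpler to compute the accuracy there and send only the number back to the parent. `report_accuracy` and its `queue` argument are hypothetical names; the queue is assumed to be a `multiprocessing.Queue` created in the main block and passed to `solve` as an extra argument.

```python
def report_accuracy(solver, rank, queue, blob_name='accuracy'):
    """Called in a worker after training: rank 0 sends one accuracy value to the parent."""
    if rank != 0:
        return  # only one process needs to report
    net = solver.test_nets[0]
    net.forward()  # evaluates a single test batch
    # Send a plain float; the solver object itself stays in the child process.
    queue.put(float(net.blobs[blob_name].data))
```

In `solve`, this would be called right after `solver.step(...)`, and the parent would call `queue.get()` before joining the processes. Whether a forward pass on rank 0 alone behaves well with the NCCL callbacks attached is untested here.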