Stuck before iteration 0 when training with multiple GPUs

GyeongHwan Hong

Feb 1, 2018, 1:08:04 AM

to Caffe Users

Hello, I am a Caffe beginner.

I have three GPUs (GeForce GTX 1080 Ti) and I want to train my CNN models on them.

I used the following command:

./build/tools/caffe train --solver=./models/bvlc_googlenet/solver.prototxt --gpu=all

I tried to train the model with the multi-GPU option, but it gets stuck before iteration 0 starts, after printing the following messages:

I0201 14:59:17.370656 8896 net.cpp:255] Network initialization done.
I0201 14:59:17.371352 8896 solver.cpp:56] Solver scaffolding done.
I0201 14:59:17.378583 8896 caffe.cpp:248] Starting Optimization
I0201 14:59:19.693693 8923 solver.cpp:172] Creating test net (#0) specified by net file: models/bvlc_googlenet/train_val.prototxt
I0201 14:59:19.790326 8922 solver.cpp:172] Creating test net (#0) specified by net file: models/bvlc_googlenet/train_val.prototxt
I0201 14:59:21.654232 8896 solver.cpp:272] Solving GoogleNet
I0201 14:59:21.654263 8896 solver.cpp:273] Learning Rate Policy: step

No further messages appear after "Learning Rate Policy: step".

My solver file (./models/bvlc_googlenet/solver.prototxt) is as follows:

net: "models/bvlc_googlenet/train_val.prototxt"
test_iter: 1000
test_interval: 2000
test_initialization: false
display: 2000
average_loss: 40
base_lr: 0.01
lr_policy: "step"
stepsize: 100000
gamma: 0.96
max_iter: 10000000
momentum: 0.9
weight_decay: 0.0002
snapshot: 40000
snapshot_prefix: "models/bvlc_googlenet/eslab_googlenet"
solver_mode: GPU

My training batch size is 128.

How can I solve this problem?

Thank you.

Gyeonghwan Hong.

Przemek D

Feb 2, 2018, 6:52:14 AM

to Caffe Users

Does it run on a single device normally? What about 2 cards?
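For reference, these checks could be launched along the following lines. This is only a minimal sketch: it assumes device IDs 0 and 1 exist, reuses the binary and solver paths from the original post, and the helper name build_train_command is hypothetical. It only assembles and prints the command lines; it does not run Caffe.

```python
import shlex

# Paths taken from the original post.
CAFFE_BIN = "./build/tools/caffe"
SOLVER = "./models/bvlc_googlenet/solver.prototxt"


def build_train_command(solver, gpus):
    """Return the argv list for `caffe train` on the given GPUs.

    `gpus` is either the string "all" or a list of integer device IDs.
    Hypothetical helper: it only builds the command, it does not execute it.
    """
    gpu_arg = "all" if gpus == "all" else ",".join(str(g) for g in gpus)
    return [CAFFE_BIN, "train", f"--solver={solver}", f"--gpu={gpu_arg}"]


if __name__ == "__main__":
    # One device, then two, then all: note where the hang first appears.
    for gpus in ([0], [0, 1], "all"):
        print(shlex.join(build_train_command(SOLVER, gpus)))
```

Running the printed commands one at a time shows whether the hang is specific to multi-GPU runs or also happens on a single card.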

GyeongHwan Hong

Feb 2, 2018, 7:51:44 AM

to Caffe Users

Hello,

I want to use multiple GPUs to speed up my training procedure.

Actually, I referred to Caffe's multigpu.md guide (https://github.com/BVLC/caffe/blob/master/docs/multigpu.md).

However, following it does not work.
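One point from the guide linked above is worth keeping in mind once training does start: each GPU runs the batch size specified in train_val.prototxt, so the effective batch size grows with the number of GPUs. A small sketch of that arithmetic, using the batch size of 128 and the three GPUs mentioned earlier in the thread (the function name is hypothetical):

```python
def effective_batch_size(per_gpu_batch, num_gpus):
    """Total samples per iteration when every GPU runs the full
    batch size from train_val.prototxt (hypothetical helper)."""
    return per_gpu_batch * num_gpus


if __name__ == "__main__":
    # Batch size 128 on three GPUs.
    print(effective_batch_size(128, 3))  # prints 384
```

The guide suggests adjusting the batch size or the solver parameters (the learning rate in particular) to account for this.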

--
You received this message because you are subscribed to a topic in the Google Groups "Caffe Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/caffe-users/yRteT3Hh8RQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to caffe-users+unsubscribe@googlegroups.com.
To post to this group, send email to caffe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/caffe-users/aa334ad7-386f-4f5c-b551-70b165004d06%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--

Gyeonghwan Hong (RedCarrottt)

Embedded Software Lab.

Sungkyunkwan University

RedCa...@gmail.com

Przemek D

Feb 2, 2018, 8:22:24 AM

to Caffe Users

I know, I'm trying to extract some more information about your problem so we can narrow down the root cause.

Does your network run on a single GPU, or on just two devices? Can you run any of the examples in multi-GPU mode?

