Stuck before iteration 0 when training with multiple GPUs


GyeongHwan Hong

Feb 1, 2018, 1:08:04 AM
to Caffe Users
Hello, I am a Caffe beginner.

I have three GPUs (GeForce GTX 1080 Ti) and I want to train my CNN models on them.

I used the following command:

./build/tools/caffe train --solver=./models/bvlc_googlenet/solver.prototxt --gpu=all

I tried to train the model with the multi-GPU option, but it gets stuck before iteration 0 starts. The last messages in the log are:

I0201 14:59:17.370656  8896 net.cpp:255] Network initialization done.
I0201 14:59:17.371352  8896 solver.cpp:56] Solver scaffolding done.
I0201 14:59:17.378583  8896 caffe.cpp:248] Starting Optimization
I0201 14:59:19.693693  8923 solver.cpp:172] Creating test net (#0) specified by net file: models/bvlc_googlenet/train_val.prototxt
I0201 14:59:19.790326  8922 solver.cpp:172] Creating test net (#0) specified by net file: models/bvlc_googlenet/train_val.prototxt
I0201 14:59:21.654232  8896 solver.cpp:272] Solving GoogleNet
I0201 14:59:21.654263  8896 solver.cpp:273] Learning Rate Policy: step

No further messages appear after "Learning Rate Policy: step".

My solver file (./models/bvlc_googlenet/solver.prototxt) is as follows:
net: "models/bvlc_googlenet/train_val.prototxt"
test_iter: 1000
test_interval: 2000
test_initialization: false
display: 2000
average_loss: 40
base_lr: 0.01
lr_policy: "step"
stepsize: 100000
gamma: 0.96
max_iter: 10000000
momentum: 0.9
weight_decay: 0.0002
snapshot: 40000
snapshot_prefix: "models/bvlc_googlenet/eslab_googlenet"
solver_mode: GPU

My training batch size is 128.

How can I solve this problem?

Thank you.

Gyeonghwan Hong.
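
A note on the batch size, since it changes once more than one GPU is involved: per the Caffe multi-GPU documentation, each GPU processes the batch size given in train_val.prototxt, so the effective batch size scales with the number of devices. The Python sketch below only does that arithmetic. It assumes the 128 mentioned above is the `batch_size` of the training data layer (the post does not say where it is set), and `effective_batch_size` is a hypothetical helper name, not part of Caffe.

```python
def effective_batch_size(per_gpu_batch, num_gpus):
    """Images consumed per solver iteration when every GPU runs its own batch."""
    if per_gpu_batch <= 0 or num_gpus <= 0:
        raise ValueError("batch size and GPU count must be positive")
    return per_gpu_batch * num_gpus


# Assumption: 128 is the per-GPU batch_size from train_val.prototxt.
PER_GPU_BATCH = 128

for num_gpus in (1, 2, 3):
    total = effective_batch_size(PER_GPU_BATCH, num_gpus)
    print(f"{num_gpus} GPU(s): effective batch size = {total}")
```

With three cards this comes to 384 images per iteration, which may matter for the learning rate and for the time each iteration takes. It is not necessarily related to the hang itself.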

Przemek D

Feb 2, 2018, 6:52:14 AM
to Caffe Users
Does it run on a single device normally? What about 2 cards?

GyeongHwan Hong

Feb 2, 2018, 7:51:44 AM
to Caffe Users
Hello,

I want to use multiple GPUs to speed up my training procedure.

However, training with multiple GPUs does not work.

--
Gyeonghwan Hong (RedCarrottt)
Embedded Software Lab.
Sungkyunkwan University

Przemek D

Feb 2, 2018, 8:22:24 AM
to Caffe Users
I know, I'm trying to extract some more information about your problem so we can narrow down the root cause.
Your network, does it run on a single GPU or only on two devices? Can you run any of the examples in multi-gpu mode?
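
The narrowing-down steps asked about here can be written out as concrete commands. The Python sketch below only prints them (it does not launch Caffe). It reuses the binary and solver paths from the original post and assumes that device ids 0 and 1 exist on the machine. `train_command` and `GPU_SETS` are illustrative names, not part of Caffe.

```python
import shlex

# Paths taken from the original post.
CAFFE = "./build/tools/caffe"
SOLVER = "./models/bvlc_googlenet/solver.prototxt"

# Assumption: devices 0 and 1 are present; "all" is the setting that hangs.
GPU_SETS = ["0", "0,1", "all"]


def train_command(gpus):
    """Build the argument list for one training run on the given GPU set."""
    return [CAFFE, "train", f"--solver={SOLVER}", f"--gpu={gpus}"]


for gpus in GPU_SETS:
    # Print a copy-pasteable shell command for each configuration.
    print(shlex.join(train_command(gpus)))
```

Running the printed commands one at a time, starting with the single device, would show whether the hang appears only once a second GPU is involved. Pointing `SOLVER` at one of the examples bundled with Caffe would cover the last question.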