How to train a model on multiple GPUs?


Qingze Wang

Nov 2, 2015, 11:43:50 PM
to Caffe Users
I cloned the latest code from GitHub. When I run a model on multiple GPUs, the loss goes to 87.33 and does not decrease, like this:

I1103 12:28:06.969892 67367 parallel.cpp:394] GPUs pairs 1:2
I1103 12:28:07.489945 67367 data_layer.cpp:41] output data size: 64,3,32,32
I1103 12:28:07.760704 67367 parallel.cpp:422] Starting Optimization
I1103 12:28:07.760788 67367 solver.cpp:287] Solving binary.prototxt
I1103 12:28:07.760807 67367 solver.cpp:288] Learning Rate Policy: poly
I1103 12:28:07.987860 67367 solver.cpp:236] Iteration 0, loss = 1.27674
I1103 12:28:07.987900 67367 solver.cpp:252]     Train net output #0: loss = 1.27674 (* 1 = 1.27674 loss)
I1103 12:28:08.248898 73036 blocking_queue.cpp:50] Data layer prefetch queue empty
I1103 12:28:08.541195 67367 sgd_solver.cpp:106] Iteration 0, lr = 0.01
I1103 12:28:41.554708 67367 solver.cpp:236] Iteration 100, loss = 87.3366
I1103 12:28:41.554788 67367 solver.cpp:252]     Train net output #0: loss = 87.3365 (* 1 = 87.3365 loss)
I1103 12:28:41.691866 67367 sgd_solver.cpp:106] Iteration 100, lr = 0.009995
I1103 12:29:13.268553 67367 solver.cpp:236] Iteration 200, loss = 87.3366
I1103 12:29:13.268658 67367 solver.cpp:252]     Train net output #0: loss = 87.3365 (* 1 = 87.3365 loss)
I1103 12:29:13.413436 67367 sgd_solver.cpp:106] Iteration 200, lr = 0.00998999
I1103 12:29:48.620781 67367 solver.cpp:236] Iteration 300, loss = 87.3366
I1103 12:29:48.620867 67367 solver.cpp:252]     Train net output #0: loss = 87.3365 (* 1 = 87.3365 loss)
I1103 12:29:48.755810 67367 sgd_solver.cpp:106] Iteration 300, lr = 0.00998499
I1103 12:30:24.096528 67367 solver.cpp:236] Iteration 400, loss = 87.3366
I1103 12:30:24.096607 67367 solver.cpp:252]     Train net output #0: loss = 87.3365 (* 1 = 87.3365 loss)
I1103 12:30:24.238646 67367 sgd_solver.cpp:106] Iteration 400, lr = 0.00997998
I1103 12:30:56.876447 67367 solver.cpp:236] Iteration 500, loss = 87.3366
I1103 12:30:56.883353 67367 solver.cpp:252]     Train net output #0: loss = 87.3365 (* 1 = 87.3365 loss)
I1103 12:30:57.014220 67367 sgd_solver.cpp:106] Iteration 500, lr = 0.00997497
I1103 12:31:28.930490 67367 solver.cpp:236] Iteration 600, loss = 87.3366
I1103 12:31:28.930589 67367 solver.cpp:252]     Train net output #0: loss = 87.3365 (* 1 = 87.3365 loss)
I1103 12:31:29.077127 67367 sgd_solver.cpp:106] Iteration 600, lr = 0.00996995
I1103 12:31:59.410114 67367 solver.cpp:236] Iteration 700, loss = 87.3366
I1103 12:31:59.410212 67367 solver.cpp:252]     Train net output #0: loss = 87.3365 (* 1 = 87.3365 loss)
I1103 12:31:59.556262 67367 sgd_solver.cpp:106] Iteration 700, lr = 0.00996494
I1103 12:32:30.065392 67367 solver.cpp:236] Iteration 800, loss = 87.3366
I1103 12:32:30.065515 67367 solver.cpp:252]     Train net output #0: loss = 87.3365 (* 1 = 87.3365 loss)
I1103 12:32:30.065556 67367 sgd_solver.cpp:106] Iteration 800, lr = 0.00995992
I1103 12:33:02.699904 67367 solver.cpp:236] Iteration 900, loss = 87.3366
I1103 12:33:02.700036 67367 solver.cpp:252]     Train net output #0: loss = 87.3365 (* 1 = 87.3365 loss)
I1103 12:33:02.840323 67367 sgd_solver.cpp:106] Iteration 900, lr = 0.0099549


But when I run it on only one GPU, it looks fine:

I1103 11:50:41.117281 130072 caffe.cpp:212] Starting Optimization
I1103 11:50:41.117300 130072 solver.cpp:287] Solving binary.prototxt
I1103 11:50:41.117306 130072 solver.cpp:288] Learning Rate Policy: poly
I1103 11:50:41.364779 130072 solver.cpp:236] Iteration 0, loss = 1.14704
I1103 11:50:41.364820 130072 solver.cpp:252]     Train net output #0: loss = 1.14704 (* 1 = 1.14704 loss)
I1103 11:50:41.364838 130072 sgd_solver.cpp:106] Iteration 0, lr = 0.01
I1103 11:51:12.484623 130072 solver.cpp:236] Iteration 100, loss = 0.0939418
I1103 11:51:12.484840 130072 solver.cpp:252]     Train net output #0: loss = 0.0880857 (* 1 = 0.0880857 loss)
I1103 11:51:12.484854 130072 sgd_solver.cpp:106] Iteration 100, lr = 0.009995
I1103 11:51:43.605880 130072 solver.cpp:236] Iteration 200, loss = 0.0432996
I1103 11:51:43.605942 130072 solver.cpp:252]     Train net output #0: loss = 0.0425499 (* 1 = 0.0425499 loss)
I1103 11:51:43.605952 130072 sgd_solver.cpp:106] Iteration 200, lr = 0.00998999
I1103 11:52:14.750869 130072 solver.cpp:236] Iteration 300, loss = 0.0317657
I1103 11:52:14.750993 130072 solver.cpp:252]     Train net output #0: loss = 0.0228157 (* 1 = 0.0228157 loss)
I1103 11:52:14.751004 130072 sgd_solver.cpp:106] Iteration 300, lr = 0.00998499
I1103 11:52:45.893976 130072 solver.cpp:236] Iteration 400, loss = 0.0267928
I1103 11:52:45.894088 130072 solver.cpp:252]     Train net output #0: loss = 0.0528598 (* 1 = 0.0528598 loss)
I1103 11:52:45.894098 130072 sgd_solver.cpp:106] Iteration 400, lr = 0.00997998
I1103 11:53:17.043527 130072 solver.cpp:236] Iteration 500, loss = 0.0292804
I1103 11:53:17.043645 130072 solver.cpp:252]     Train net output #0: loss = 0.0221649 (* 1 = 0.0221649 loss)
I1103 11:53:17.043658 130072 sgd_solver.cpp:106] Iteration 500, lr = 0.00997497
I1103 11:53:48.184842 130072 solver.cpp:236] Iteration 600, loss = 0.0205087
I1103 11:53:48.184963 130072 solver.cpp:252]     Train net output #0: loss = 0.0288153 (* 1 = 0.0288153 loss)
I1103 11:53:48.184975 130072 sgd_solver.cpp:106] Iteration 600, lr = 0.00996995
I1103 11:54:19.328132 130072 solver.cpp:236] Iteration 700, loss = 0.0211321
I1103 11:54:19.328251 130072 solver.cpp:252]     Train net output #0: loss = 0.0201308 (* 1 = 0.0201308 loss)
I1103 11:54:19.328263 130072 sgd_solver.cpp:106] Iteration 700, lr = 0.00996494
I1103 11:54:50.473924 130072 solver.cpp:236] Iteration 800, loss = 0.0231302
I1103 11:54:50.474036 130072 solver.cpp:252]     Train net output #0: loss = 0.0145986 (* 1 = 0.0145986 loss)
I1103 11:54:50.474047 130072 sgd_solver.cpp:106] Iteration 800, lr = 0.00995992
I1103 11:55:21.623100 130072 solver.cpp:236] Iteration 900, loss = 0.0216888
I1103 11:55:21.623239 130072 solver.cpp:252]     Train net output #0: loss = 0.00979382 (* 1 = 0.00979382 loss)
I1103 11:55:21.623251 130072 sgd_solver.cpp:106] Iteration 900, lr = 0.0099549
I1103 11:55:52.462342 130072 solver.cpp:340] Iteration 1000, Testing net (#0)
I1103 11:55:54.574981 130072 solver.cpp:408]     Test net output #0: accuracy = 0.9961
I1103 11:55:54.575027 130072 solver.cpp:408]     Test net output #1: loss = 0.0120374 (* 1 = 0.0120374 loss)
I1103 11:55:54.806218 130072 solver.cpp:236] Iteration 1000, loss = 0.0168279
I1103 11:55:54.806255 130072 solver.cpp:252]     Train net output #0: loss = 0.0137241 (* 1 = 0.0137241 loss)
I1103 11:55:54.806267 130072 sgd_solver.cpp:106] Iteration 1000, lr = 0.00994987
I1103 11:56:25.956745 130072 solver.cpp:236] Iteration 1100, loss = 0.0242743
I1103 11:56:25.956876 130072 solver.cpp:252]     Train net output #0: loss = 0.017992 (* 1 = 0.017992 loss)
I1103 11:56:25.956898 130072 sgd_solver.cpp:106] Iteration 1100, lr = 0.00994485
I1103 11:56:57.099609 130072 solver.cpp:236] Iteration 1200, loss = 0.0154625
I1103 11:56:57.099723 130072 solver.cpp:252]     Train net output #0: loss = 0.0159056 (* 1 = 0.0159056 loss)
I1103 11:56:57.099735 130072 sgd_solver.cpp:106] Iteration 1200, lr = 0.00993982
I1103 11:57:28.240137 130072 solver.cpp:236] Iteration 1300, loss = 0.0150156


Why does this happen? How should multi-GPU training be used correctly?
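For context, this is how I launch the two runs. This is a sketch of the usual Caffe command-line invocation; the solver file name `solver.prototxt` is a placeholder (the logs above only show the net file `binary.prototxt`), and the GPU indices are assumptions:

```shell
# Single-GPU run (the one that converges in the second log)
caffe train --solver=solver.prototxt --gpu=0

# Multi-GPU run (the one where the loss sticks at 87.3365);
# Caffe's data-parallel training splits the work across the listed devices
caffe train --solver=solver.prototxt --gpu=0,1
```

Note that with multiple GPUs, Caffe keeps the per-GPU batch size from the data layer, so the effective batch size is multiplied by the number of GPUs.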