How to save a snapshot randomly (I mean, all of a sudden create a snapshot)?


Hossein Hasanpour
Apr 27, 2016, 1:56:08 PM
to Caffe Users
Is this possible in Caffe?
I have set it to create a snapshot every 50K iterations, but sometimes I get a very good result in between that I would like to save right away.
How can I save it without using Ctrl+C, which terminates the training? (I want the training to continue.)

Jan
Apr 28, 2016, 3:43:49 AM
to Caffe Users
I don't think 'randomly' describes your problem very well ;-).

But no, I am not aware of anything like that. You'd probably have to adjust the solver (or write a new one) and maybe add a parameter to the solver proto to do it.
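
If you drive training from pycaffe instead of the command line, you can decide yourself when to snapshot. A minimal sketch (the paths, step size, and iteration count are made up, and it assumes your pycaffe build exposes solver.snapshot()):

import caffe

caffe.set_mode_gpu()
solver = caffe.get_solver('examples/cifar10/cifar10_full_solver.prototxt')

best_acc = 0.0
while solver.iter < 60000:                    # made-up total iteration count
    solver.step(100)                          # train for 100 iterations
    # evaluate one batch of the test net (average over more batches
    # for a stable estimate)
    acc = float(solver.test_nets[0].forward()['accuracy'])
    if acc > best_acc:
        best_acc = acc
        solver.snapshot()                     # writes .caffemodel + .solverstate now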

Jan

Eli Gibson
Apr 28, 2016, 6:27:02 AM
to Caffe Users
If you are running from the command line, I think your best bet is to Ctrl+C and then restart from that point using the --snapshot command-line parameter with the snapshot you just saved.
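
With the CIFAR-10 example that would look something like this (adjust the snapshot path to your own prefix), with both arguments on the same line:

caffe train --solver=examples/cifar10/cifar10_full_solver.prototxt --snapshot=examples/cifar10/cifar10_full_iter_50000.solverstate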

Hossein Hasanpour
Apr 28, 2016, 7:41:49 AM
to Caffe Users
@Jan: I recently read about SIGHUP handling being added to Caffe, which can be used for exactly this purpose. It seems it's not merged yet; I hope they add it soon :)

@Eli Gibson: Well, I'm actually stuck here now! I don't know whether I'm doing this wrong or this is normal.
I tried to resume training with the following command:
REM go to the caffe root
cd ../../
set BIN=build/x64/Release
"%BIN%/caffe.exe" train --solver=examples/cifar10/cifar10_full_solver.prototxt
--snapshot=examples/cifar10/test/cifar10_full_iter_50000.solverstate > caffe.log & type c:\directory.txt

and it seems to be starting from the beginning! I say this because when I hit 50K my accuracy was around, let's say, 85%, and now when it starts its output looks like this (0.1!):

I0428 15:45:05.757570 15392 net.cpp:217] conv1 needs backward computation.
I0428 15:45:05.757570 15392 net.cpp:219] label_cifar_1_split does not need backward computation.
I0428 15:45:05.757570 15392 net.cpp:219] cifar does not need backward computation.
I0428 15:45:05.757570 15392 net.cpp:261] This network produces output accuracy
I0428 15:45:05.757570 15392 net.cpp:261] This network produces output loss
I0428 15:45:05.757570 15392 net.cpp:274] Network initialization done.
I0428 15:45:05.757570 15392 solver.cpp:60] Solver scaffolding done.
I0428 15:45:05.760571 15392 caffe.cpp:220] Starting Optimization
I0428 15:45:05.760571 15392 solver.cpp:279] Solving CIFAR10_full
I0428 15:45:05.760571 15392 solver.cpp:280] Learning Rate Policy: multistep
I0428 15:45:05.769582 15392 solver.cpp:337] Iteration 0, Testing net (#0)
I0428 15:45:17.689569 15392 solver.cpp:404]     Test net output #0: accuracy = 0.1
I0428 15:45:17.689569 15392 solver.cpp:404]     Test net output #1: loss = 78.6029 (* 1 = 78.6029 loss)
I0428 15:45:18.045231 15392 solver.cpp:228] Iteration 0, loss = 2.69506
I0428 15:45:18.045231 15392 solver.cpp:244]     Train net output #0: loss = 2.69506 (* 1 = 2.69506 loss)
I0428 15:45:18.045231 15392 sgd_solver.cpp:106] Iteration 0, lr = 0.01
I0428 15:45:42.583858 15392 solver.cpp:228] Iteration 100, loss = 2.14817
I0428 15:45:42.584359 15392 solver.cpp:244]     Train net output #0: loss = 2.14817 (* 1 = 2.14817 loss)
I0428 15:45:42.584359 15392 sgd_solver.cpp:106] Iteration 100, lr = 0.01
I0428 15:46:07.756189 15392 solver.cpp:228] Iteration 200, loss = 1.9271
I0428 15:46:07.756189 15392 solver.cpp:244]     Train net output #0: loss = 1.9271 (* 1 = 1.9271 loss)
I0428 15:46:07.756189 15392 sgd_solver.cpp:106] Iteration 200, lr = 0.01
I0428 15:46:33.268297 15392 solver.cpp:228] Iteration 300, loss = 1.89921
I0428 15:46:33.268297 15392 solver.cpp:244]     Train net output #0: loss = 1.89921 (* 1 = 1.89921 loss)
I0428 15:46:33.268297 15392 sgd_solver.cpp:106] Iteration 300, lr = 0.01
I0428 15:46:59.068845 15392 solver.cpp:228] Iteration 400, loss = 2.04308
I0428 15:46:59.068845 15392 solver.cpp:244]     Train net output #0: loss = 2.04308 (* 1 = 2.04308 loss)
I0428 15:46:59.068845 15392 sgd_solver.cpp:106] Iteration 400, lr = 0.01
I0428 15:47:25.201395 15392 solver.cpp:228] Iteration 500, loss = 1.71861
I0428 15:47:25.201395 15392 solver.cpp:244]     Train net output #0: loss = 1.71861 (* 1 = 1.71861 loss)
I0428 15:47:25.201894 15392 sgd_solver.cpp:106] Iteration 500, lr = 0.01
I0428 15:47:50.755519 15392 solver.cpp:228] Iteration 600, loss = 1.5223
I0428 15:47:50.755519 15392 solver.cpp:244]     Train net output #0: loss = 1.5223 (* 1 = 1.5223 loss)
I0428 15:47:50.755519 15392 sgd_solver.cpp:106] Iteration 600, lr = 0.01
I0428 15:48:16.238324 15392 solver.cpp:228] Iteration 700, loss = 1.62518
I0428 15:48:16.238324 15392 solver.cpp:244]     Train net output #0: loss = 1.62518 (* 1 = 1.62518 loss)
I0428 15:48:16.238324 15392 sgd_solver.cpp:106] Iteration 700, lr = 0.01
I0428 15:48:42.066558 15392 solver.cpp:228] Iteration 800, loss = 1.42534
I0428 15:48:42.066558 15392 solver.cpp:244]     Train net output #0: loss = 1.42534 (* 1 = 1.42534 loss)
I0428 15:48:42.066558 15392 sgd_solver.cpp:106] Iteration 800, lr = 0.01
I0428 15:49:07.671883 15392 solver.cpp:228] Iteration 900, loss = 1.2936
I0428 15:49:07.671883 15392 solver.cpp:244]     Train net output #0: loss = 1.2936 (* 1 = 1.2936 loss)
I0428 15:49:07.671883 15392 sgd_solver.cpp:106] Iteration 900, lr = 0.01
I0428 15:49:32.767840 15392 solver.cpp:337] Iteration 1000, Testing net (#0)
I0428 15:49:44.879137 15392 solver.cpp:404]     Test net output #0: accuracy = 0.1286
I0428 15:49:44.879137 15392 solver.cpp:404]     Test net output #1: loss = 2.88058 (* 1 = 2.88058 loss)
I0428 15:49:44.947199 15392 solver.cpp:228] Iteration 1000, loss = 1.64134
I0428 15:49:44.947199 15392 solver.cpp:244]     Train net output #0: loss = 1.64134 (* 1 = 1.64134 loss)
I0428 15:49:44.947199 15392 sgd_solver.cpp:106] Iteration 1000, lr = 0.01
I0428 15:50:10.078454 15392 solver.cpp:228] Iteration 1100, loss = 1.54176
I0428 15:50:10.078454 15392 solver.cpp:244]     Train net output #0: loss = 1.54176 (* 1 = 1.54176 loss)
I0428 15:50:10.078454 15392 sgd_solver.cpp:106] Iteration 1100, lr = 0.01
I0428 15:50:35.211081 15392 solver.cpp:228] Iteration 1200, loss = 1.46021
I0428 15:50:35.211081 15392 solver.cpp:244]     Train net output #0: loss = 1.46021 (* 1 = 1.46021 loss)
I0428 15:50:35.211580 15392 sgd_solver.cpp:106] Iteration 1200, lr = 0.01
I0428 15:51:00.423518 15392 solver.cpp:228] Iteration 1300, loss = 1.42154
I0428 15:51:00.423518 15392 solver.cpp:244]     Train net output #0: loss = 1.42154 (* 1 = 1.42154 loss)
I0428 15:51:00.424020 15392 sgd_solver.cpp:106] Iteration 1300, lr = 0.01
I0428 15:51:25.844864 15392 solver.cpp:228] Iteration 1400, loss = 1.61946
I0428 15:51:25.844864 15392 solver.cpp:244]     Train net output #0: loss = 1.61946 (* 1 = 1.61946 loss)
I0428 15:51:25.844864 15392 sgd_solver.cpp:106] Iteration 1400, lr = 0.01
I0428 15:51:51.479568 15392 solver.cpp:228] Iteration 1500, loss = 1.33445
I0428 15:51:51.479568 15392 solver.cpp:244]     Train net output #0: loss = 1.33445 (* 1 = 1.33445 loss)
I0428 15:51:51.479568 15392 sgd_solver.cpp:106] Iteration 1500, lr = 0.01
I0428 15:52:16.933943 15392 solver.cpp:228] Iteration 1600, loss = 1.24857
I0428 15:52:16.933943 15392 solver.cpp:244]     Train net output #0: loss = 1.24857 (* 1 = 1.24857 loss)
I0428 15:52:16.933943 15392 sgd_solver.cpp:106] Iteration 1600, lr = 0.01
I0428 15:52:42.341366 15392 solver.cpp:228] Iteration 1700, loss = 1.26117
I0428 15:52:42.341366 15392 solver.cpp:244]     Train net output #0: loss = 1.26117 (* 1 = 1.26117 loss)
I0428 15:52:42.341366 15392 sgd_solver.cpp:106] Iteration 1700, lr = 0.01
I0428 15:53:07.583760 15392 solver.cpp:228] Iteration 1800, loss = 1.08346
I0428 15:53:07.583760 15392 solver.cpp:244]     Train net output #0: loss = 1.08346 (* 1 = 1.08346 loss)
I0428 15:53:07.583760 15392 sgd_solver.cpp:106] Iteration 1800, lr = 0.01
I0428 15:53:32.460234 15392 solver.cpp:228] Iteration 1900, loss = 1.052
I0428 15:53:32.460736 15392 solver.cpp:244]     Train net output #0: loss = 1.052 (* 1 = 1.052 loss)
I0428 15:53:32.460736 15392 sgd_solver.cpp:106] Iteration 1900, lr = 0.01
I0428 15:53:57.960438 15392 solver.cpp:337] Iteration 2000, Testing net (#0)
I0428 15:54:10.105379 15392 solver.cpp:404]     Test net output #0: accuracy = 0.1444
I0428 15:54:10.105379 15392 solver.cpp:404]     Test net output #1: loss = 3.42501 (* 1 = 3.42501 loss)
I0428 15:54:10.173928 15392 solver.cpp:228] Iteration 2000, loss = 1.31026
I0428 15:54:10.174428 15392 solver.cpp:244]     Train net output #0: loss = 1.31026 (* 1 = 1.31026 loss)
I0428 15:54:10.174428 15392 sgd_solver.cpp:106] Iteration 2000, lr = 0.01
I0428 15:54:35.595716 15392 solver.cpp:228] Iteration 2100, loss = 1.18611
I0428 15:54:35.595716 15392 solver.cpp:244]     Train net output #0: loss = 1.18611 (* 1 = 1.18611 loss)
I0428 15:54:35.595716 15392 sgd_solver.cpp:106] Iteration 2100, lr = 0.01
I0428 15:55:00.997709 15392 solver.cpp:228] Iteration 2200, loss = 1.23393
I0428 15:55:00.997709 15392 solver.cpp:244]     Train net output #0: loss = 1.23393 (* 1 = 1.23393 loss)
I0428 15:55:00.997709 15392 sgd_solver.cpp:106] Iteration 2200, lr = 0.01
I0428 15:55:26.298413 15392 solver.cpp:228] Iteration 2300, loss = 1.08047
I0428 15:55:26.299414 15392 solver.cpp:244]     Train net output #0: loss = 1.08047 (* 1 = 1.08047 loss)
I0428 15:55:26.299414 15392 sgd_solver.cpp:106] Iteration 2300, lr = 0.01
I0428 15:55:51.647357 15392 solver.cpp:228] Iteration 2400, loss = 1.31164
I0428 15:55:51.647357 15392 solver.cpp:244]     Train net output #0: loss = 1.31164 (* 1 = 1.31164 loss)
I0428 15:55:51.647357 15392 sgd_solver.cpp:106] Iteration 2400, lr = 0.01
I0428 15:56:17.579519 15392 solver.cpp:228] Iteration 2500, loss = 1.11762
I0428 15:56:17.580019 15392 solver.cpp:244]     Train net output #0: loss = 1.11762 (* 1 = 1.11762 loss)
I0428 15:56:17.580019 15392 sgd_solver.cpp:106] Iteration 2500, lr = 0.01
I0428 15:56:43.195716 15392 solver.cpp:228] Iteration 2600, loss = 1.0478
I0428 15:56:43.195716 15392 solver.cpp:244]     Train net output #0: loss = 1.0478 (* 1 = 1.0478 loss)
I0428 15:56:43.195716 15392 sgd_solver.cpp:106] Iteration 2600, lr = 0.01
I0428 15:57:08.695070 15392 solver.cpp:228] Iteration 2700, loss = 1.0907
I0428 15:57:08.695070 15392 solver.cpp:244]     Train net output #0: loss = 1.0907 (* 1 = 1.0907 loss)
I0428 15:57:08.695070 15392 sgd_solver.cpp:106] Iteration 2700, lr = 0.01
I0428 15:57:34.256702 15392 solver.cpp:228] Iteration 2800, loss = 0.992899
I0428 15:57:34.256702 15392 solver.cpp:244]     Train net output #0: loss = 0.992899 (* 1 = 0.992899 loss)
I0428 15:57:34.256702 15392 sgd_solver.cpp:106] Iteration 2800, lr = 0.01
I0428 15:57:59.773699 15392 solver.cpp:228] Iteration 2900, loss = 0.80717
I0428 15:57:59.773699 15392 solver.cpp:244]     Train net output #0: loss = 0.80717 (* 1 = 0.80717 loss)
I0428 15:57:59.773699 15392 sgd_solver.cpp:106] Iteration 2900, lr = 0.01
I0428 15:58:25.240093 15392 solver.cpp:337] Iteration 3000, Testing net (#0)
I0428 15:58:37.420219 15392 solver.cpp:404]     Test net output #0: accuracy = 0.1613
I0428 15:58:37.420219 15392 solver.cpp:404]     Test net output #1: loss = 3.49334 (* 1 = 3.49334 loss)
I0428 15:58:37.488767 15392 solver.cpp:228] Iteration 3000, loss = 1.13374
I0428 15:58:37.488767 15392 solver.cpp:244]     Train net output #0: loss = 1.13374 (* 1 = 1.13374 loss)
I0428 15:58:37.488767 15392 sgd_solver.cpp:106] Iteration 3000, lr = 0.01

Well, what should I do now?

Hossein Hasanpour
Apr 28, 2016, 7:56:11 AM
to Caffe Users
Thank God! I found the cause.
I need to put --snapshot right after the previous argument, not beneath it (the newline after the first argument messed the whole thing up!).
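
For the record, the working command puts everything on one line (or uses the batch continuation character ^), something like:

"%BIN%/caffe.exe" train --solver=examples/cifar10/cifar10_full_solver.prototxt --snapshot=examples/cifar10/test/cifar10_full_iter_50000.solverstate > caffe.log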

Jan
Apr 28, 2016, 8:25:52 AM
to Caffe Users
Ah yes, now that you mention it: I had heard about SIGHUP earlier, but at the time I only associated it with Ctrl+C, which is why I didn't think of it here. But of course you can also send only SIGHUP (without SIGINT); you should look into that. It might solve your problem quite well. Thanks for mentioning it ;-).
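
For reference, in builds where that change is merged, the caffe binary accepts --sigint_effect and --sighup_effect flags (each one of stop/snapshot/none; SIGHUP defaults to snapshot). On Linux, something like this should snapshot without stopping training; plain Windows has no SIGHUP, though, so it may not help the original setup:

caffe train --solver=examples/cifar10/cifar10_full_solver.prototxt &
kill -HUP $!    # $! is the PID of the backgrounded caffe; triggers a snapshot, training continues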

Jan