training process is killed without any error information


Qian Yang

Oct 12, 2017, 5:10:29 AM
to Caffe Users
Hi,

While I was training a customized VGG16 network in Caffe, the process was killed partway through without any further information. Here is part of the training log.

I1012 16:02:20.142881 16357 solver.cpp:218] Iteration 600 (1.45809 iter/s, 68.5831s/100 iters), loss = 0.307288
I1012 16:02:20.142946 16357 solver.cpp:237]     Train net output #0: loss = 0.188233 (* 1 = 0.188233 loss)
I1012 16:02:20.142953 16357 sgd_solver.cpp:105] Iteration 600, lr = 0.001
I1012 16:03:28.003197 16357 solver.cpp:218] Iteration 700 (1.47367 iter/s, 67.8577s/100 iters), loss = 0.495435
I1012 16:03:28.003325 16357 solver.cpp:237]     Train net output #0: loss = 0.215491 (* 1 = 0.215491 loss)
I1012 16:03:28.003334 16357 sgd_solver.cpp:105] Iteration 700, lr = 0.001
I1012 16:04:35.836854 16357 solver.cpp:218] Iteration 800 (1.47425 iter/s, 67.831s/100 iters), loss = 0.47339
I1012 16:04:35.836949 16357 solver.cpp:237]     Train net output #0: loss = 0.00221339 (* 1 = 0.00221339 loss)
I1012 16:04:35.836957 16357 sgd_solver.cpp:105] Iteration 800, lr = 0.001
I1012 16:06:31.388617 16357 solver.cpp:218] Iteration 900 (0.865515 iter/s, 115.538s/100 iters), loss = 0.425094
I1012 16:06:31.449453 16357 solver.cpp:237]     Train net output #0: loss = 0.21385 (* 1 = 0.21385 loss)
I1012 16:06:31.472916 16357 sgd_solver.cpp:105] Iteration 900, lr = 0.001
./examples/weighted_bilinear/ft_last_layer3.sh: line 9: 16357 Killed

The file ft_last_layer3.sh is given below.

#!/bin/bash

# first fine tune the last layer only
GLOG_logtostderr=0 GLOG_log_dir=/home/qy/documents/caffe/examples/weighted_bilinear/log/ \
./build/tools/caffe train \
   -model "examples/weighted_bilinear/ft_last_layer3.prototxt" \
   -solver "examples/weighted_bilinear/ft_last_layer3.solver" \
   -weights "/home/qy/documents/CaffeModel/VGG_ILSVRC_16_layers.caffemodel" \
   -gpu 0

Line 9 contains just the single argument "-gpu 0".

I am also monitoring the GPU state, as shown below. There seems to be enough GPU memory.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
| 52%   79C    P2   127W / 180W |   5120MiB /  8112MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1127    G   /usr/lib/xorg/Xorg                             240MiB |
|    0      2049    G   compiz                                         152MiB |
|    0      2420    G   ...el-token=4D887FF09714CDAAFA04F7E91E9C165A    54MiB |
|    0     15970    G   /usr/lib/firefox/firefox                         2MiB |
|    0     16357    C   ./build/tools/caffe                           4664MiB |
+-----------------------------------------------------------------------------+


So I am rather confused about this issue. Can anyone give me some advice?

Thanks.
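
A note on the symptom above, offered as an assumption rather than a diagnosis: a bare "Killed" from bash means the process received SIGKILL, and nvidia-smi reports only GPU memory, not host RAM. On Linux, one common source of such a kill is the kernel's out-of-memory killer reacting to host RAM exhaustion, and the kernel log (for example via dmesg) usually records it when that happens. Nothing in the log confirms that this is the cause here. The sketch below shows one way to watch the host-side resident memory of the training process by reading /proc/<pid>/status. The helper names parse_rss_kib, read_rss_kib, and watch are hypothetical and not part of Caffe.

```python
import time


def parse_rss_kib(status_text):
    """Return the VmRSS value (KiB) from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     2048 kB"
            return int(line.split()[1])
    return None


def read_rss_kib(pid):
    """Read host resident memory of a live process; None if it is gone."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return parse_rss_kib(f.read())
    except (FileNotFoundError, ProcessLookupError):
        return None


def watch(pid, interval_s=10.0):
    """Print host RSS periodically until the process disappears."""
    while (rss := read_rss_kib(pid)) is not None:
        print(f"pid {pid}: host RSS = {rss / 1024:.0f} MiB")
        time.sleep(interval_s)
```

Running watch(16357) in a second terminal (16357 being the PID from the log above) while training would show whether host memory keeps growing before the kill. A steady climb would point toward host RAM rather than the GPU.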

张欣彧

Nov 18, 2017, 9:07:28 PM
to Caffe Users
Hi,

I ran into this problem too. Have you solved it?

Xinyu

On Thursday, October 12, 2017 at 5:10:29 PM UTC+8, Qian Yang wrote: