Resuming from solverstate fails with an error!

Hossein Hasanpour

Feb 24, 2016, 11:57:07 AM
to Caffe Users
Hello all, I'm working with the cifar10 example in Caffe.
I changed some parameters, and when I run the TrainFull script shown below, I get an error when it tries to resume training from the solverstate files.

#!/usr/bin/env sh

TOOLS=./build/tools

$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_full_relu_solver_bn.prototxt

# reduce learning rate by factor of 10
$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_full_solver_bn_lr1.prototxt \
    --snapshot=examples/cifar10/cifar10_full_relu_bn_60000.solverstate.h5

# reduce learning rate by factor of 10
$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_full_solver_bn_lr2.prototxt \
    --snapshot=examples/cifar10/cifar10_full_relu_bn_65000.solverstate.h5



And this is cifar10_full_solver_bn_lr1.prototxt:
# reduce learning rate after 120 epochs (60000 iters) by factor of 10
# then another factor of 10 after 10 more epochs (5000 iters)

# The train/test net protocol buffer definition
net: "examples/cifar10/cifar10_full_relu_train_test_bn.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of CIFAR10, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 1000 training iterations.
test_interval: 1000
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.0001
momentum: 0.9
weight_decay: 0.004
# The learning rate policy
lr_policy: "fixed"
# Display every 200 iterations
display: 200
# The maximum number of iterations
max_iter: 65000
# snapshot intermediate results
snapshot: 5000
snapshot_format: HDF5
snapshot_prefix: "examples/cifar10/cifar10_full_bn_relu"
# solver mode: CPU or GPU
solver_mode: GPU



This is the second one (cifar10_full_solver_bn_lr2.prototxt):
# reduce learning rate after 120 epochs (60000 iters) by factor of 10
# then another factor of 10 after 10 more epochs (5000 iters)

# The train/test net protocol buffer definition
net: "examples/cifar10/cifar10_full_relu_train_test_bn.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of CIFAR10, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 1000 training iterations.
test_interval: 1000
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.00001
momentum: 0.9
weight_decay: 0.004
# The learning rate policy
lr_policy: "fixed"
# Display every 200 iterations
display: 200
# The maximum number of iterations
max_iter: 70000
# snapshot intermediate results
snapshot: 5000
snapshot_format: HDF5
snapshot_prefix: "examples/cifar10/cifar10_full_bn_relu"
# solver mode: CPU or GPU
solver_mode: GPU



And this is the error output I get when it reaches the second and third commands in the script above:
./build/tools/caffe: /home/hossein/anaconda2/lib/liblzma.so.5: no version information available (required by /usr/lib/x86_64-linux-gnu/libunwind.so.8)
I0224 19:56:31.536526  3611 caffe.cpp:185] Using GPUs 0
I0224 19:56:31.544860  3611 caffe.cpp:190] GPU 0: GeForce GTX 750
I0224 19:56:31.763121  3611 solver.cpp:48] Initializing solver from parameters:
test_iter: 100
test_interval: 1000
base_lr: 0.0001
display: 200
max_iter: 65000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 5000
snapshot_prefix: "examples/cifar10/cifar10_full_bn_relu"
solver_mode: GPU
device_id: 0
net: "examples/cifar10/cifar10_full_relu_train_test_bn.prototxt"
snapshot_format: HDF5
I0224 19:56:31.763766  3611 solver.cpp:91] Creating training net from net file: examples/cifar10/cifar10_full_relu_train_test_bn.prototxt
I0224 19:56:31.764195  3611 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0224 19:56:31.764253  3611 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0224 19:56:31.764387  3611 net.cpp:49] Initializing net from parameters:

...

I0224 19:56:31.768781  3611 layer_factory.hpp:77] Creating layer cifar
I0224 19:56:31.769311  3611 net.cpp:106] Creating Layer cifar
I0224 19:56:31.769327  3611 net.cpp:411] cifar -> data
I0224 19:56:31.769354  3611 net.cpp:411] cifar -> label
I0224 19:56:31.769367  3611 data_transformer.cpp:25] Loading mean file from: examples/cifar10/mean.binaryproto
I0224 19:56:31.783421  3614 db_lmdb.cpp:38] Opened lmdb examples/cifar10/cifar10_train_lmdb
I0224 19:56:31.784562  3611 data_layer.cpp:41] output data size: 100,3,32,32
I0224 19:56:31.787253  3611 net.cpp:150] Setting up cifar

...

I0224 19:56:31.965220  3611 net.cpp:283] Network initialization done.
I0224 19:56:31.968299  3611 solver.cpp:181] Creating test net (#0) specified by net file: examples/cifar10/cifar10_full_relu_train_test_bn.prototxt
I0224 19:56:31.968397  3611 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer cifar
I0224 19:56:31.968688  3611 net.cpp:49] Initializing net from parameters:

...

I0224 19:56:31.970953  3611 layer_factory.hpp:77] Creating layer cifar
I0224 19:56:31.971557  3611 net.cpp:106] Creating Layer cifar
I0224 19:56:31.971572  3611 net.cpp:411] cifar -> data
I0224 19:56:31.971586  3611 net.cpp:411] cifar -> label
I0224 19:56:31.971596  3611 data_transformer.cpp:25] Loading mean file from: examples/cifar10/mean.binaryproto
I0224 19:56:31.972451  3616 db_lmdb.cpp:38] Opened lmdb examples/cifar10/cifar10_test_lmdb
I0224 19:56:31.972777  3611 data_layer.cpp:41] output data size: 1000,3,32,32
I0224 19:56:32.010258  3611 net.cpp:150] Setting up cifar

...

I0224 19:56:32.043560  3611 solver.cpp:60] Solver scaffolding done.
I0224 19:56:32.044262  3611 caffe.cpp:209] Resuming from examples/cifar10/cifar10_full_relu_bn_60000.solverstate.h5
HDF5-DIAG: Error detected in HDF5 (1.8.15-patch1) thread 0:
  #000: H5F.c line 604 in H5Fopen(): unable to open file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 990 in H5F_open(): unable to open file: time = Wed Feb 24 19:56:32 2016
, name = 'examples/cifar10/cifar10_full_relu_bn_60000.solverstate.h5', tent_flags = 0
    major: File accessibilty
    minor: Unable to open file
  #002: H5FD.c line 993 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
  #003: H5FDsec2.c line 343 in H5FD_sec2_open(): unable to open file: name = 'examples/cifar10/cifar10_full_relu_bn_60000.solverstate.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0
    major: File accessibilty
    minor: Unable to open file
F0224 19:56:32.047998  3611 sgd_solver.cpp:327] Check failed: file_hid >= 0 (-1 vs. 0) Couldn't open solver state file examples/cifar10/cifar10_full_relu_bn_60000.solverstate.h5
*** Check failure stack trace: ***
    @     0x7faaa9e7edaa  (unknown)
    @     0x7faaa9e7ece4  (unknown)
    @     0x7faaa9e7e6e6  (unknown)
    @     0x7faaa9e81687  (unknown)
    @     0x7faaaa5b0f9b  caffe::SGDSolver<>::RestoreSolverStateFromHDF5()
    @     0x7faaaa594f19  caffe::Solver<>::Restore()
    @           0x408038  train()
    @           0x405a0c  main
    @     0x7faaa918cec5  (unknown)
    @           0x406141  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)
./build/tools/caffe: /home/hossein/anaconda2/lib/liblzma.so.5: no version information available (required by /usr/lib/x86_64-linux-gnu/libunwind.so.8)
I0224 19:56:32.368366  3619 caffe.cpp:185] Using GPUs 0
I0224 19:56:32.373322  3619 caffe.cpp:190] GPU 0: GeForce GTX 750
I0224 19:56:32.578726  3619 solver.cpp:48] Initializing solver from parameters:
test_iter: 100
test_interval: 1000
base_lr: 1e-05
display: 200
max_iter: 70000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 5000
snapshot_prefix: "examples/cifar10/cifar10_full_bn_relu"
solver_mode: GPU
device_id: 0
net: "examples/cifar10/cifar10_full_relu_train_test_bn.prototxt"
snapshot_format: HDF5
I0224 19:56:32.579169  3619 solver.cpp:91] Creating training net from net file: examples/cifar10/cifar10_full_relu_train_test_bn.prototxt
I0224 19:56:32.581733  3619 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0224 19:56:32.581759  3619 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0224 19:56:32.581876  3619 net.cpp:49] Initializing net from parameters:

...

I0224 19:56:32.582489  3619 layer_factory.hpp:77] Creating layer cifar
I0224 19:56:32.583034  3619 net.cpp:106] Creating Layer cifar
I0224 19:56:32.583048  3619 net.cpp:411] cifar -> data
I0224 19:56:32.583076  3619 net.cpp:411] cifar -> label
I0224 19:56:32.583092  3619 data_transformer.cpp:25] Loading mean file from: examples/cifar10/mean.binaryproto
I0224 19:56:32.598793  3622 db_lmdb.cpp:38] Opened lmdb examples/cifar10/cifar10_train_lmdb

...

I0224 19:56:33.062932  3619 net.cpp:283] Network initialization done.
I0224 19:56:33.063343  3619 solver.cpp:181] Creating test net (#0) specified by net file: examples/cifar10/cifar10_full_relu_train_test_bn.prototxt
I0224 19:56:33.063377  3619 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer cifar
I0224 19:56:33.063491  3619 net.cpp:49] Initializing net from parameters:

...

I0224 19:56:33.064108  3619 layer_factory.hpp:77] Creating layer cifar
I0224 19:56:33.064637  3619 net.cpp:106] Creating Layer cifar
I0224 19:56:33.064649  3619 net.cpp:411] cifar -> data
I0224 19:56:33.064661  3619 net.cpp:411] cifar -> label
I0224 19:56:33.064671  3619 data_transformer.cpp:25] Loading mean file from: examples/cifar10/mean.binaryproto
I0224 19:56:33.065486  3625 db_lmdb.cpp:38] Opened lmdb examples/cifar10/cifar10_test_lmdb
I0224 19:56:33.070000  3619 data_layer.cpp:41] output data size: 1000,3,32,32
I0224 19:56:33.129371  3619 net.cpp:150] Setting up cifar

...

I0224 19:56:33.174549  3619 net.cpp:226] bn1 needs backward computation.
I0224 19:56:33.174556  3619 net.cpp:226] pool1 needs backward computation.
I0224 19:56:33.174561  3619 net.cpp:226] conv1 needs backward computation.
I0224 19:56:33.174585  3619 net.cpp:228] label_cifar_1_split does not need backward computation.
I0224 19:56:33.174593  3619 net.cpp:228] cifar does not need backward computation.
I0224 19:56:33.174617  3619 net.cpp:270] This network produces output accuracy
I0224 19:56:33.174624  3619 net.cpp:270] This network produces output loss
I0224 19:56:33.174643  3619 net.cpp:283] Network initialization done.
I0224 19:56:33.174739  3619 solver.cpp:60] Solver scaffolding done.
I0224 19:56:33.175525  3619 caffe.cpp:209] Resuming from examples/cifar10/cifar10_full_relu_bn_65000.solverstate.h5
HDF5-DIAG: Error detected in HDF5 (1.8.15-patch1) thread 0:
  #000: H5F.c line 604 in H5Fopen(): unable to open file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 990 in H5F_open(): unable to open file: time = Wed Feb 24 19:56:33 2016
, name = 'examples/cifar10/cifar10_full_relu_bn_65000.solverstate.h5', tent_flags = 0
    major: File accessibilty
    minor: Unable to open file
  #002: H5FD.c line 993 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
  #003: H5FDsec2.c line 343 in H5FD_sec2_open(): unable to open file: name = 'examples/cifar10/cifar10_full_relu_bn_65000.solverstate.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0
    major: File accessibilty
    minor: Unable to open file
F0224 19:56:33.185070  3619 sgd_solver.cpp:327] Check failed: file_hid >= 0 (-1 vs. 0) Couldn't open solver state file examples/cifar10/cifar10_full_relu_bn_65000.solverstate.h5
*** Check failure stack trace: ***
    @     0x7f63c0056daa  (unknown)
    @     0x7f63c0056ce4  (unknown)
    @     0x7f63c00566e6  (unknown)
    @     0x7f63c0059687  (unknown)
    @     0x7f63c0788f9b  caffe::SGDSolver<>::RestoreSolverStateFromHDF5()
    @     0x7f63c076cf19  caffe::Solver<>::Restore()
    @           0x408038  train()
    @           0x405a0c  main
    @     0x7f63bf364ec5  (unknown)
    @           0x406141  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)


What is causing this?

Jan C Peters

Feb 25, 2016, 2:42:33 AM
to Caffe Users
The error basically says that it cannot open the solverstate file. Is it really there? Is the filename completely correct? Could the relative paths be a problem?

Jan
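
One way to act on the questions above is to build the snapshot path from the solver's snapshot_prefix and check that the file exists before resuming. The sketch below assumes Caffe's usual naming, <snapshot_prefix>_iter_<N>.solverstate with ".h5" appended for HDF5 snapshots, and assumes the first solver (not shown in the thread) uses the same snapshot_prefix as the two posted ones. snapshot_path and check_snapshot are hypothetical helper names, not part of Caffe. Note that under this naming the 60000-iteration state would be cifar10_full_bn_relu_iter_60000.solverstate.h5, which differs from the name passed to --snapshot in the script.

```sh
#!/usr/bin/env sh
# Sketch only: derive the solver state path and verify it before resuming.

TOOLS=./build/tools

# Hypothetical helper. Assumed naming: <prefix>_iter_<N>.solverstate.h5
# $1 = snapshot_prefix from the solver prototxt, $2 = iteration number
snapshot_path() {
    printf '%s_iter_%s.solverstate.h5\n' "$1" "$2"
}

# Hypothetical helper. Returns 0 if the file exists; otherwise reports the
# missing path, lists any solver states found in the same directory, returns 1.
check_snapshot() {
    if [ -f "$1" ]; then
        return 0
    fi
    echo "Missing solver state: $1" >&2
    for f in "$(dirname "$1")"/*.solverstate*; do
        if [ -e "$f" ]; then
            echo "  found instead: $f" >&2
        fi
    done
    return 1
}

# snapshot_prefix as posted in cifar10_full_solver_bn_lr1.prototxt
PREFIX=examples/cifar10/cifar10_full_bn_relu
SNAPSHOT=$(snapshot_path "$PREFIX" 60000)
echo "$SNAPSHOT"
# prints examples/cifar10/cifar10_full_bn_relu_iter_60000.solverstate.h5

# Resume only when the solver state is really there.
if check_snapshot "$SNAPSHOT"; then
    "$TOOLS"/caffe train \
        --solver=examples/cifar10/cifar10_full_solver_bn_lr1.prototxt \
        --snapshot="$SNAPSHOT"
fi
```

Run it from the Caffe root directory, since all paths in the thread are relative to it. If the check fails, the "found instead" lines show which solver state files actually exist.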