resuming from a trained snapshot fails! (Failed to load int dataset with name iter)


Hossein Hasanpour

Mar 28, 2016, 5:20:18 AM
to Caffe Users
Hello all, I trained my network with some configuration and then saved a snapshot of it.
Now I am trying to resume from the last snapshot, and it fails with this error message:
I0328 13:44:30.756110 24238 net.cpp:283] Network initialization done.
I0328 13:44:30.756206 24238 solver.cpp:60] Solver scaffolding done.
I0328 13:44:30.757062 24238 caffe.cpp:209] Resuming from /media/hossein/tmpstore/caffe_new/examples/cifar10/cifar10_full_relu_bn_iter_60000.caffemodel.h5
HDF5-DIAG: Error detected in HDF5 (1.8.15-patch1) thread 0:
  #000: H5D.c line 358 in H5Dopen2(): not found
    major: Dataset
    minor: Object not found
  #001: H5Gloc.c line 430 in H5G_loc_find(): can't find object
    major: Symbol table
    minor: Object not found
  #002: H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #003: H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #004: H5Gloc.c line 385 in H5G_loc_find_cb(): object 'iter' doesn't exist
    major: Symbol table
    minor: Object not found
F0328 13:44:30.786376 24238 hdf5.cpp:153] Check failed: status >= 0 (-1 vs. 0) Failed to load int dataset with name iter
*** Check failure stack trace: ***
    @     0x7f2d6e635daa  (unknown)
    @     0x7f2d6e635ce4  (unknown)
    @     0x7f2d6e6356e6  (unknown)
    @     0x7f2d6e638687  (unknown)
    @     0x7f2d6ed74acd  caffe::hdf5_load_int()
    @     0x7f2d6ed678d0  caffe::SGDSolver<>::RestoreSolverStateFromHDF5()
    @     0x7f2d6ed4bf19  caffe::Solver<>::Restore()
    @           0x408038  train()
    @           0x405a0c  main
    @     0x7f2d6d943ec5  (unknown)
    @           0x406141  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)


This is how I'm trying to resume it:
#!/usr/bin/env sh

TOOLS=./build/tools

$TOOLS/caffe train \
    --solver=examples/cifar10/cifar10_full_solver_bn_lr2.prototxt \
    --snapshot=/media/hossein/tmpstore/caffe_new/examples/cifar10/cifar10_full_relu_bn_iter_60000.caffemodel.h5


Hossein Hasanpour

Mar 28, 2016, 8:08:22 AM
to Caffe Users
I tried to use "BINARYPROTO" instead of HDF5, but I get this error:
I0328 16:35:34.721277 27243 net.cpp:283] Network initialization done.
I0328 16:35:34.721369 27243 solver.cpp:60] Solver scaffolding done.
I0328 16:35:34.722338 27243 caffe.cpp:209] Resuming from /media/hossein/tmpstore/caffe_new/examples/cifar10_full_relu_bn_iter_60000.caffemodel
F0328 16:35:39.143900 27243 sgd_solver.cpp:316] Check failed: state.history_size() == history_.size() (0 vs. 28) Incorrect length of history blobs.
*** Check failure stack trace: ***
    @     0x7fd1c2cbbdaa  (unknown)
    @     0x7fd1c2cbbce4  (unknown)
    @     0x7fd1c2cbb6e6  (unknown)
    @     0x7fd1c2cbe687  (unknown)
    @     0x7fd1c33ef097  caffe::SGDSolver<>::RestoreSolverStateFromBinaryProto()
    @     0x7fd1c33d1ed3  caffe::Solver<>::Restore()
    @           0x408038  train()
    @           0x405a0c  main
    @     0x7fd1c1fc9ec5  (unknown)
    @           0x406141  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)


What should I do? This is getting on my nerves now! ;(

Jan

Mar 30, 2016, 6:24:26 AM
to Caffe Users
You need to pass the saved solverstate file to the -snapshot command-line parameter, not the caffemodel. The caffemodel can be given to caffe with the -weights parameter when fine-tuning. Note that -weights and -snapshot are mutually exclusive. Whether you use binary proto or HDF5 shouldn't matter; caffe should do the right thing.

Jan
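
For illustration, here is a minimal sketch of the resume script with that change applied. It assumes the solverstate was written next to the caffemodel with the same prefix, ending in .solverstate.h5 (HDF5) or .solverstate (binary proto); that file name is an assumption, not something shown in the thread, so check what is actually in your snapshot directory. MODEL and STATE are names introduced here only for the example.

```shell
#!/usr/bin/env sh

TOOLS=./build/tools

# The weights file that was (incorrectly) passed to --snapshot in the question.
MODEL=/media/hossein/tmpstore/caffe_new/examples/cifar10/cifar10_full_relu_bn_iter_60000.caffemodel.h5

# Assumption: the solverstate shares the caffemodel's prefix and differs only
# in its suffix. Verify the real file name in the snapshot directory.
case "$MODEL" in
    *.caffemodel.h5) STATE="${MODEL%.caffemodel.h5}.solverstate.h5" ;;
    *.caffemodel)    STATE="${MODEL%.caffemodel}.solverstate" ;;
    *)               STATE="$MODEL" ;;
esac

# Resume only if the solverstate file is actually there.
if [ -f "$STATE" ]; then
    "$TOOLS/caffe" train \
        --solver=examples/cifar10/cifar10_full_solver_bn_lr2.prototxt \
        --snapshot="$STATE"
else
    echo "solverstate not found: $STATE" >&2
fi
```

To fine-tune from the trained weights instead of resuming, pass the caffemodel with --weights and leave out --snapshot, since the two options cannot be combined.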