After adding two outcomes to my network, the network crashes when trying to read the dataset. I'm not sure why this suddenly happens; below is a description of the error and my setup. I don't think it's related to the bug https://github.com/BVLC/caffe/issues/1726, which addresses output data, but I'm currently not sure even where to start looking for the problem. I'm running the latest git version on an Ubuntu 14.04 machine with two K40c cards.
Layer setup
I'm using HDF5 containers as I have 7 labels associated with each image. In the prototxt I use a simple Slice layer to separate the label blob from the HDF5 file:
layer {
  name: "Label_slicer"
  type: "Slice"
  bottom: "label"
  top: "label_var1"
  top: "label_var2"
  top: "label_var3"
  top: "label_var4"
  top: "label_var5"
  top: "label_var6"
  top: "label_var7"
  slice_param {
    slice_point: 1
    slice_point: 2
    slice_point: 3
    slice_point: 4
    slice_point: 5
    slice_point: 6
    axis: 1
  }
}
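For reference, the slice_point semantics above can be checked with a quick NumPy sketch (batch size and shapes are placeholders; this is just my mental model of what the Slice layer does, not Caffe's code):

```python
import numpy as np

# Hypothetical batch of labels, shape (batch, 7), matching the "label" blob.
labels = np.arange(28, dtype=np.float32).reshape(4, 7)

# The six slice_points (1..6) on axis 1 split the blob into seven
# (batch, 1) tops, one per label variable.
slice_points = [1, 2, 3, 4, 5, 6]
tops = np.split(labels, slice_points, axis=1)

assert len(tops) == 7
assert all(t.shape == (4, 1) for t in tops)
```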
On top of each output I have an Accuracy and a SoftmaxWithLoss layer:
layer {
  name: "accuracy_prev_fracture"
  type: "Accuracy"
  bottom: "fc8_l1"
  bottom: "label_var1"
  top: "accuracy_var1"
  include {
    phase: TEST
  }
  accuracy_param {
    ignore_label: 0
  }
}
layer {
  name: "loss_var1"
  type: "SoftmaxWithLoss"
  bottom: "fc8_l1"
  bottom: "label_var1"
  top: "loss_prev_fracture"
  loss_param {
    ignore_label: 0
  }
}
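For clarity, here is a rough NumPy sketch of what I understand each SoftmaxWithLoss output with ignore_label: 0 to compute (my own approximation, not Caffe's code): samples whose label equals ignore_label are dropped, and the loss is averaged over the remaining ones. Labels in the sketch must be non-negative class indices.

```python
import numpy as np

def softmax_loss(scores, labels, ignore_label=0):
    """Approximate SoftmaxWithLoss with ignore_label.

    scores: (N, C) raw outputs, labels: (N,) class indices.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    # Numerically stable softmax over the class axis.
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Drop samples carrying the ignored label before averaging.
    keep = labels != ignore_label
    if not keep.any():
        return 0.0
    return float(-np.log(probs[keep, labels[keep]]).mean())
```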
Each output has its own fully connected layer, i.e. the fc8. At the bottom I have two simple HDF5Data layers:
layer {
  name: "Wrists"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  hdf5_data_param {
    source: "/media/max/Encrypted/Processed/hdf5/all_train.txt"
    batch_size: 256
  }
}
layer {
  name: "Wrists"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  hdf5_data_param {
    source: "/media/max/Encrypted/Processed/hdf5/all_validation.txt"
    batch_size: 50
  }
}
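For completeness, the source file above is just a plain text list with one .h5 path per line. A sketch of how I generate it (the paths and output filename here are placeholders, not my real ones):

```python
# The HDF5Data "source" is a plain text file listing one container per line.
h5_paths = [
    "/media/max/Encrypted/Processed/hdf5/img_train_00001.h5",
    "/media/max/Encrypted/Processed/hdf5/img_train_00002.h5",
]
with open("all_train_example.txt", "w") as f:
    f.write("\n".join(h5_paths) + "\n")
```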
The error
The rest is basically a standard AlexNet setup. I had previously run this setup with 5 output layers; after adding the additional outputs I get the cryptic HDF5 error:
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 140690729749056:
#000: ../../../src/H5Dio.c line 182 in H5Dread(): can't read data
major: Dataset
minor: Read failed
#001: ../../../src/H5Dio.c line 550 in H5D__read(): can't read data
major: Dataset
minor: Read failed
#002: ../../../src/H5Dchunk.c line 1837 in H5D__chunk_read(): unable to read raw data chunk
major: Low-level I/O
minor: Read failed
#003: ../../../src/H5Dchunk.c line 2868 in H5D__chunk_lock(): data pipeline read failed
major: Data filters
minor: Filter operation failed
#004: ../../../src/H5Z.c line 1175 in H5Z_pipeline(): filter returned failure during read
major: Data filters
minor: Read failed
#005: ../../../src/H5Zdeflate.c line 125 in H5Z_filter_deflate(): inflate() failed
major: Data filters
minor: Unable to initialize object
F1230 15:28:18.565651 23368 hdf5.cpp:72] Check failed: status >= 0 (-1 vs. 0) Failed to read float dataset data
*** Check failure stack trace: ***
@ 0x7ff51c22ddaa (unknown)
@ 0x7ff51c22dce4 (unknown)
@ 0x7ff51c22d6e6 (unknown)
@ 0x7ff51c230687 (unknown)
@ 0x7ff51c783dfb caffe::hdf5_load_nd_dataset<>()
@ 0x7ff51c87ecee caffe::HDF5DataLayer<>::LoadHDF5FileData()
@ 0x7ff51c9209f5 caffe::HDF5DataLayer<>::Forward_gpu()
@ 0x7ff51c7f2a81 caffe::Net<>::ForwardFromTo()
@ 0x7ff51c7f2e07 caffe::Net<>::ForwardPrefilled()
@ 0x7ff51c7a6c41 caffe::Solver<>::Step()
@ 0x7ff51c7a7645 caffe::Solver<>::Solve()
@ 0x7ff51c7bdb95 caffe::P2PSync<>::run()
@ 0x40a6e1 train()
@ 0x408421 main
@ 0x7ff51ad30ec5 (unknown)
@ 0x408bdd (unknown)
@ (nil) (unknown)
Aborted (core dumped)
Some debugging that I've done
I'm not aware of any change in the formatting of my data. When debugging the HDF5 container files:
import random

import h5py
import numpy as np

# h5_files is the list of container paths from all_train.txt
for h5_file_name in random.sample(h5_files, 4):
    with h5py.File(h5_file_name, "r") as h5_file:
        data = h5_file["data"]
        label = h5_file["label"]
        print(h5_file_name)
        print(np.array(label))
        print("The current range is %.2f to %.2f with a mean of %.2f" %
              (np.min(data), np.max(data), np.mean(data)))
I get a reasonable output (there are 10 rows since the image is oversampled):
/media/max/Encrypted/Processed/hdf5/img_train_36943.h5
[[ 0. 0. 1. -1. 1. -1. -1.]
[ 0. 0. 1. -1. 1. -1. -1.]
[ 0. 0. 1. -1. 1. -1. -1.]
[ 0. 0. 1. -1. 1. -1. -1.]
[ 0. 0. 1. -1. 1. -1. -1.]
[ 0. 0. 1. -1. 1. -1. -1.]
[ 0. 0. 1. -1. 1. -1. -1.]
[ 0. 0. 1. -1. 1. -1. -1.]
[ 0. 0. 1. -1. 1. -1. -1.]
[ 0. 0. 1. -1. 1. -1. -1.]]
The current range is -0.70 to 0.33 with a mean of 0.00
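One more check I could run: since the failure comes from H5Z_filter_deflate (the gzip filter), my guess is that one container's compressed chunks are corrupt on disk, which a random sample of 4 files would easily miss. A brute-force scan that fully decompresses every dataset in every file should pinpoint the bad container (find_corrupt is my own helper sketch, not part of Caffe):

```python
import h5py
import numpy as np

def find_corrupt(h5_files):
    """Return (path, error) pairs for containers that fail to read fully."""
    bad = []
    for name in h5_files:
        try:
            with h5py.File(name, "r") as f:
                for key in f:
                    np.asarray(f[key])  # forces full chunk decompression
        except Exception as e:
            bad.append((name, str(e)))
    return bad
```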
I guess changing to a different input type would be fine, but I haven't found any good examples of how to set up a vector input - this doesn't seem to be available in the standard leveldb. Any suggestions on how to address the issue, or on how to set up a different input structure, are appreciated.
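If recreating the containers turns out to be the fix, a minimal h5py sketch for writing one is below. The shapes and filename are placeholders (10 oversampled crops, 7 labels per image as in my data); note that no compression= filter is passed, which avoids the deflate path that fails above:

```python
import h5py
import numpy as np

# Placeholder blobs: 10 oversampled 3x227x227 crops and their 7 labels.
data = np.zeros((10, 3, 227, 227), dtype=np.float32)
label = np.tile([0, 0, 1, -1, 1, -1, -1], (10, 1)).astype(np.float32)

with h5py.File("img_train_rewritten.h5", "w") as f:
    f.create_dataset("data", data=data)    # no compression filter
    f.create_dataset("label", data=label)
```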