Segmentation fault on large training sets


Tassilo Klein

Apr 16, 2015, 4:27:47 AM
to caffe...@googlegroups.com
Hi,

I am experiencing segmentation faults and/or "Check failed: status == CUBLAS_STATUS_SUCCESS (14 vs. 0) CUBLAS_STATUS_INTERNAL_ERROR" on OS X when using a larger number of training samples, e.g. > 500 (MNIST, 28x28). I am using Python and a MemoryData layer for input.

For a smaller number of samples, e.g. 100, it runs fine.

Any ideas?

Thanks,
 Tassilo


train_set_x = train_set_x.reshape((num_elements_train, 1, 28, 28))
train_set_y = train_set_y.reshape((num_elements_train, 1, 1, 1))

test_set_x = test_set_x.reshape((num_elements_test, 1, 28, 28))
test_set_y = test_set_y.reshape((num_elements_test, 1, 1, 1))

caffe.set_device(0)
caffe.set_mode_gpu()
solver = caffe.SGDSolver('/Users/TJKlein/caffe/examples/mnist/tjk.prototxt')

# MemoryData layers are fed directly from numpy arrays (must be float32)
solver.net.set_input_arrays(train_set_x.astype(np.float32), train_set_y.astype(np.float32))
solver.test_nets[0].set_input_arrays(test_set_x.astype(np.float32), test_set_y.astype(np.float32))

niter = 1000
test_interval = 1
train_loss = np.zeros(niter)
test_acc = np.zeros(int(np.ceil(niter / test_interval)))
output = np.zeros((niter, 8, 10))  # store the first 8 ip2 outputs per iteration

for it in range(niter):
    solver.step(1)  # SGD by Caffe
    train_loss[it] = solver.net.blobs['loss'].data
    solver.test_nets[0].forward(start='conv1')
    output[it] = solver.test_nets[0].blobs['ip2'].data[:8]
  [....]


Network definition file:


layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  memory_data_param {
    batch_size: 50
    channels: 1
    height: 28
    width: 28
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20
    kernel_size: 5
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    num_output: 50
    kernel_size: 5
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "ip1"
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}
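
One pitfall worth ruling out with set_input_arrays (a hedged guess, not a confirmed diagnosis for this thread): the MemoryData layer holds a raw pointer to the arrays it is handed, and the sample count must be a multiple of batch_size. If the Python bindings do not retain their own reference to the buffers (an assumption here), inline temporaries such as train_set_x.astype(np.float32) can be garbage-collected while the net still points at them, and reading freed memory would surface exactly as sporadic segfaults or CUBLAS internal errors. A minimal defensive sketch:

# Bind the float32, C-contiguous copies to names so they outlive the call
# (assumption: the bindings keep only a raw pointer to the numpy buffers).
train_x = np.ascontiguousarray(train_set_x, dtype=np.float32)
train_y = np.ascontiguousarray(train_set_y, dtype=np.float32)

# MemoryDataLayer::Reset requires n % batch_size == 0 (batch_size is 50 above)
assert train_x.shape[0] % 50 == 0

solver.net.set_input_arrays(train_x, train_y)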


Andriy Lysak

Apr 16, 2015, 12:44:34 PM
to caffe...@googlegroups.com
You might be running out of memory.

When you run your code, keep an eye on memory usage.

Best Regards,
Andriy
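
One way to watch host-side usage from inside the script, sketched here with the third-party psutil package (an assumption; nvidia-smi covers the GPU side), is to log the resident set size as training runs:

import psutil

proc = psutil.Process()  # the current Python process

for it in range(niter):
    solver.step(1)
    if it % 100 == 0:
        # resident set size in MB; a steady climb would point to a leak
        print('iter %d: RSS %.1f MB' % (it, proc.memory_info().rss / 1e6))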

Tassilo Klein

Apr 17, 2015, 2:36:41 AM
to caffe...@googlegroups.com
Actually, it should all fit comfortably into memory.
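
For reference, a quick back-of-the-envelope check (using the ~500-sample figure from the original post) bears this out:

# N x C x H x W x sizeof(float32)
data_bytes = 500 * 1 * 28 * 28 * 4
print(data_bytes / 1e6)  # ~1.6 MB, far below any host or GPU limit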

Tassilo Klein

Apr 17, 2015, 2:41:09 AM
to caffe...@googlegroups.com
I just tested it on Linux; there I get CUBLAS_STATUS_INTERNAL_ERROR.

sri

Jul 17, 2015, 12:34:59 AM
to caffe...@googlegroups.com
Were you able to figure out a solution?