NVIDIA autoencoder example hangs on fresh caffe install


Dixon Dick

Sep 3, 2015, 6:14:56 PM
to Caffe Users
We have installed caffe branch 0.12 and are testing DIGITS on a 980 GPU. Running 'make runtest' passes all tests, and ./examples/mnist/train_lenet.sh succeeds.

However, the train_mnist_autoencoder.sh example hangs at this step:

I0903 13:46:54.967397 22723 layer_factory.hpp:75] Creating layer data

I0903 13:46:54.967449 22723 net.cpp:99] Creating Layer data

I0903 13:46:54.967455 22723 net.cpp:409] data -> data

I0903 13:46:54.967463 22723 net.cpp:131] Setting up data


<hangs>


We believe it is about to access the ./mnist/mnist_train_lmdb/data.mdb file.
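
One way to rule out a corrupt database (an extra sanity check; this assumes the lmdb-utils package is installed and the command is run against whatever path the data layer's source points at) is:

mdb_stat -e ./mnist/mnist_train_lmdb

mdb_stat opens the environment read-only; if the database is intact it should report on the order of 60000 entries for the MNIST training set.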


We've checked our memory and disk space (more than enough of both) and looked at the running threads. Under strace, one thread is stuck in a FUTEX_WAIT_PRIVATE and another is polling a set of file descriptors:


ubuntu@ubuntu:~$ sudo strace -p 18722
strace: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf

ubuntu@ubuntu:~$ sudo strace -p 18723
Process 18723 attached
restart_syscall(<... resuming interrupted call ...>

ubuntu@ubuntu:~$ sudo strace -p 18725
Process 18725 attached
restart_syscall(<... resuming interrupted call ...>) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 252919189}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 253116985}) = 0
poll([{fd=17, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}], 6, 100) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 353748503}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 353858253}) = 0
poll([{fd=17, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}], 6, 100) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 454261969}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 454410259}) = 
...continues POLLING...

ubuntu@ubuntu:~$ sudo strace -p 18726
Process 18726 attached

futex(0xe360ca4, FUTEX_WAIT_PRIVATE, 577, NULL
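
(A side note for anyone retracing this: the "Operation not permitted" on the first attach above is Yama's ptrace hardening. Assuming a stock Ubuntu setup, it can be relaxed temporarily so strace or gdb can attach:

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

Setting it back to 1 restores the default; a permanent change goes in /etc/sysctl.d/10-ptrace.conf.)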

Has anyone seen this kind of livelock/deadlock behavior before?

dcd

christi...@gmail.com

Sep 16, 2015, 4:16:02 AM
to Caffe Users
Hi,

I can confirm that behavior on a fresh master checkout. Something seems to go wrong
with the TEST data source initialization. Here's a gdb stack trace:

(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007ffff797dddb in boost::condition_variable::wait (this=0x42b0b18, m=...) at /usr/include/boost/thread/pthread/condition_variable.hpp:73
#2  0x00007ffff797f492 in caffe::BlockingQueue<caffe::Datum*>::peek (this=0x42b11d0) at code/caffe/src/caffe/util/blocking_queue.cpp:77
#3  0x00007ffff790e0eb in caffe::DataLayer<float>::DataLayerSetUp (this=0x42b0490, bottom=..., top=...) at code/caffe/src/caffe/layers/data_layer.cpp:33
#4  0x00007ffff79500df in caffe::BaseDataLayer<float>::LayerSetUp (this=0x42b0490, bottom=..., top=...) at code/caffe/src/caffe/layers/base_data_layer.cpp:29
#5  0x00007ffff795050c in caffe::BasePrefetchingDataLayer<float>::LayerSetUp (this=0x42b0490, bottom=..., top=...) at code/caffe/src/caffe/layers/base_data_layer.cpp:45
#6  0x00007ffff7868e50 in caffe::Layer<float>::SetUp (this=0x42b0490, bottom=..., top=...) at code/caffe/include/caffe/layer.hpp:71
#7  0x00007ffff78976b8 in caffe::Net<float>::Init (this=0x3db4770, in_param=...) at code/caffe/src/caffe/net.cpp:152
#8  0x00007ffff7895811 in caffe::Net<float>::Net (this=0x3db4770, param=..., root_net=0x0) at code/caffe/src/caffe/net.cpp:27
#9  0x00007ffff7877f13 in caffe::Solver<float>::InitTestNets (this=0x752c40) at code/caffe/src/caffe/solver.cpp:190
#10 0x00007ffff7876a2c in caffe::Solver<float>::Init (this=0x752c40, param=...) at code/caffe/src/caffe/solver.cpp:65
#11 0x00007ffff78764fa in caffe::Solver<float>::Solver (this=0x752c40, param=..., root_solver=0x0) at code/caffe/src/caffe/solver.cpp:38
#12 0x000000000042eba5 in caffe::SGDSolver<float>::SGDSolver (this=0x752c40, param=...) at code/caffe/include/caffe/solver.hpp:159
#13 0x000000000042ed73 in caffe::NesterovSolver<float>::NesterovSolver (this=0x752c40, param=...) at code/caffe/include/caffe/solver.hpp:191
#14 0x000000000042c301 in caffe::GetSolver<float> (param=...) at code/caffe/include/caffe/solver.hpp:288
#15 0x0000000000427c7b in train () at code/caffe/tools/caffe.cpp:196
#16 0x000000000042a03c in main (argc=2, argv=0x7fffffffdda0) at code/caffe/tools/caffe.cpp:394

The deadlock goes away when I disable testing during training. It seems the same data source cannot be opened by both the train net and the test net at once.

Christian

Dixon Dick

Sep 16, 2015, 4:21:38 AM
to Caffe Users
This is fantastic work, thank you very much! I will disable testing and see how it goes.

dcd

Dixon Dick

Sep 16, 2015, 5:16:47 AM
to Caffe Users
Hi Christian,

To put it as simply as possible: what do you mean by disabling the tests during training? Suddenly I am not sure I understand what to change in the prototxt.

dcd

christi...@gmail.com

Sep 16, 2015, 5:47:27 AM
to Caffe Users
Hi Dixon,

I managed to disable testing with the following solver.prototxt (all test settings commented out):

net: "mnist_autoencoder.prototxt"
#test_state: { stage: 'test-on-train' }
#test_iter: 500
#test_state: { stage: 'test-on-test' }
#test_iter: 100
#test_interval: 500
#test_compute_loss: true
base_lr: 0.005
display: 100
max_iter: 120000
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
stepsize: 80000
snapshot: 50000
snapshot_prefix: "mnist_autoencoder"
solver_mode: GPU
solver_type: NESTEROV
delta: 1e-08

Christian

christi...@gmail.com

Sep 16, 2015, 5:48:17 AM
to Caffe Users
Can anyone reproduce this?

Mohamed Omran

Sep 16, 2015, 9:29:24 AM
to christi...@gmail.com, Caffe Users
Hey Christian, the problem occurs when, during training, you test on the same lmdb that is used for training, i.e. commenting out just the following lines should be sufficient:

#test_state: { stage: 'test-on-train' }
#test_iter: 500

A better workaround for now, though, one that keeps testing on the training data possible, is to change the source in the "test-on-train" data layer to:

source: "./examples/mnist/mnist_train_lmdb"

vs. 

source: "examples/mnist/mnist_train_lmdb"

I have localised the source of the problem and will send in a bug report.


Evan Shelhamer

Sep 16, 2015, 1:45:28 PM
to Caffe Users, christi...@gmail.com
Please follow up at this issue https://github.com/BVLC/caffe/issues/3037

There was once a locking issue with lmdb on certain platforms, but that was fixed long ago (I can't find the issue at the moment). The data layer / data reader changes from the parallel PR might have stirred up a new problem, though. If anyone traces it down, please comment.
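
To illustrate the sharing pattern in question: if data readers are cached in a map keyed on the layer name plus the literal source string (an assumption about how the reader cache works, not a quote of the actual code), then two layers pointing at the same LMDB share one reader, while any difference in the path string, such as a leading ./, yields a separate reader. A minimal self-contained C++ sketch of that pattern:

#include <iostream>
#include <map>
#include <memory>
#include <string>

// Stand-in for a prefetching reader that would own a DB cursor.
struct Reader {
  explicit Reader(const std::string& src) : source(src) {}
  std::string source;
};

// Readers cached by "layer_name:source"; weak_ptr lets a reader be
// destroyed once the last layer using it goes away.
static std::map<std::string, std::weak_ptr<Reader> > bodies;

std::shared_ptr<Reader> GetReader(const std::string& name,
                                  const std::string& source) {
  const std::string key = name + ":" + source;
  std::shared_ptr<Reader> body = bodies[key].lock();
  if (!body) {  // first layer with this key creates the reader
    body = std::make_shared<Reader>(source);
    bodies[key] = body;
  }
  return body;
}

int main() {
  std::shared_ptr<Reader> train =
      GetReader("data", "examples/mnist/mnist_train_lmdb");
  std::shared_ptr<Reader> test =
      GetReader("data", "./examples/mnist/mnist_train_lmdb");
  // Different key strings, so the layers get distinct readers even
  // though both paths name the same database on disk.
  std::cout << (train == test ? "shared" : "distinct") << std::endl;
  return 0;
}

Under that assumption, Mohamed's ./ rename works because the cache key changes, and commenting out the test-on-train stage works because the second consumer of the shared reader disappears.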

christi...@gmail.com

Sep 16, 2015, 1:47:41 PM
to Caffe Users, christi...@gmail.com
Hi Mohamed,

thanks for your workaround! Seems the bug is already filed here: https://github.com/BVLC/caffe/issues/3037

Best,
Christian

Dixon Dick

Sep 16, 2015, 2:13:22 PM
to Caffe Users, christi...@gmail.com
I have marked this complete; it seems there is both an understanding of the problem and a set of workarounds. I am deeply grateful, thank you all for posting!

dcd