NVIDIA autoencoder example hangs on fresh caffe install


Dixon Dick

Sep 3, 2015, 6:14:56 PM
to Caffe Users
We have installed caffe branch 0.12 and are testing DIGITS on a 980 GPU. Running 'make runtest' passes all tests, and ./examples/mnist/train_lenet.sh succeeds.

However, the train_mnist_autoencoder.sh example hangs at this step:

I0903 13:46:54.967397 22723 layer_factory.hpp:75] Creating layer data

I0903 13:46:54.967449 22723 net.cpp:99] Creating Layer data

I0903 13:46:54.967455 22723 net.cpp:409] data -> data

I0903 13:46:54.967463 22723 net.cpp:131] Setting up data


<hangs>


We believe it is about to access the ./mnist/mnist_train_lmdb/data.mdb file.
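
One way to rule out a corrupt database (an extra sanity check; this assumes the lmdb-utils package is installed and the command is run against whatever path the data layer's source points at) is:

mdb_stat -e ./mnist/mnist_train_lmdb

mdb_stat opens the environment read-only; if the database is intact it should report on the order of 60000 entries for the MNIST training set.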


We've checked our memory and disk space (more than enough of both) and looked at the running threads. Under strace, one thread is stuck in a FUTEX_WAIT_PRIVATE and another is polling a set of file descriptors:


ubuntu@ubuntu:~$ sudo strace -p 18722
strace: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf

ubuntu@ubuntu:~$ sudo strace -p 18723
Process 18723 attached
restart_syscall(<... resuming interrupted call ...>

ubuntu@ubuntu:~$ sudo strace -p 18725
Process 18725 attached
restart_syscall(<... resuming interrupted call ...>) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 252919189}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 253116985}) = 0
poll([{fd=17, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}], 6, 100) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 353748503}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 353858253}) = 0
poll([{fd=17, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}], 6, 100) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 454261969}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {1789, 454410259}) = 
...continues POLLING...

ubuntu@ubuntu:~$ sudo strace -p 18726
Process 18726 attached

futex(0xe360ca4, FUTEX_WAIT_PRIVATE, 577, NULL
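
(A side note for anyone retracing this: the "Operation not permitted" on the first attach above is Yama's ptrace hardening. Assuming a stock Ubuntu setup, it can be relaxed temporarily so strace or gdb can attach:

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

Setting it back to 1 restores the default; a permanent change goes in /etc/sysctl.d/10-ptrace.conf.)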

Has anyone seen this kind of livelock/deadlock behavior before?

dcd

christi...@gmail.com

Sep 16, 2015, 4:16:02 AM
to Caffe Users
Hi,

I can confirm that behavior on a fresh master checkout. Something seems to go wrong
with the TEST data source initialization. Here's a gdb stack trace:

(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007ffff797dddb in boost::condition_variable::wait (this=0x42b0b18, m=...) at /usr/include/boost/thread/pthread/condition_variable.hpp:73
#2  0x00007ffff797f492 in caffe::BlockingQueue<caffe::Datum*>::peek (this=0x42b11d0) at code/caffe/src/caffe/util/blocking_queue.cpp:77
#3  0x00007ffff790e0eb in caffe::DataLayer<float>::DataLayerSetUp (this=0x42b0490, bottom=..., top=...) at code/caffe/src/caffe/layers/data_layer.cpp:33
#4  0x00007ffff79500df in caffe::BaseDataLayer<float>::LayerSetUp (this=0x42b0490, bottom=..., top=...) at code/caffe/src/caffe/layers/base_data_layer.cpp:29
#5  0x00007ffff795050c in caffe::BasePrefetchingDataLayer<float>::LayerSetUp (this=0x42b0490, bottom=..., top=...) at code/caffe/src/caffe/layers/base_data_layer.cpp:45
#6  0x00007ffff7868e50 in caffe::Layer<float>::SetUp (this=0x42b0490, bottom=..., top=...) at code/caffe/include/caffe/layer.hpp:71
#7  0x00007ffff78976b8 in caffe::Net<float>::Init (this=0x3db4770, in_param=...) at code/caffe/src/caffe/net.cpp:152
#8  0x00007ffff7895811 in caffe::Net<float>::Net (this=0x3db4770, param=..., root_net=0x0) at code/caffe/src/caffe/net.cpp:27
#9  0x00007ffff7877f13 in caffe::Solver<float>::InitTestNets (this=0x752c40) at code/caffe/src/caffe/solver.cpp:190
#10 0x00007ffff7876a2c in caffe::Solver<float>::Init (this=0x752c40, param=...) at code/caffe/src/caffe/solver.cpp:65
#11 0x00007ffff78764fa in caffe::Solver<float>::Solver (this=0x752c40, param=..., root_solver=0x0) at code/caffe/src/caffe/solver.cpp:38
#12 0x000000000042eba5 in caffe::SGDSolver<float>::SGDSolver (this=0x752c40, param=...) at code/caffe/include/caffe/solver.hpp:159
#13 0x000000000042ed73 in caffe::NesterovSolver<float>::NesterovSolver (this=0x752c40, param=...) at code/caffe/include/caffe/solver.hpp:191
#14 0x000000000042c301 in caffe::GetSolver<float> (param=...) at code/caffe/include/caffe/solver.hpp:288
#15 0x0000000000427c7b in train () at code/caffe/tools/caffe.cpp:196
#16 0x000000000042a03c in main (argc=2, argv=0x7fffffffdda0) at code/caffe/tools/caffe.cpp:394

The deadlock goes away when I disable testing during training. It seems the same data source cannot be opened by both the train net and the test net at once.

Christian

Dixon Dick

Sep 16, 2015, 4:21:38 AM
to Caffe Users
This is fantastic work, thank you very much! I will disable testing and see how it goes.

dcd

Dixon Dick

Sep 16, 2015, 5:16:47 AM
to Caffe Users
Hi Christian,

To put it as simply as possible: what do you mean by disabling the tests during training? Suddenly I am not sure I understand what to change in the prototxt.

dcd

christi...@gmail.com

Sep 16, 2015, 5:47:27 AM
to Caffe Users
Hi Dixon,

I managed to disable testing with the following solver.prototxt (all test settings commented out):

net: "mnist_autoencoder.prototxt"
#test_state: { stage: 'test-on-train' }
#test_iter: 500
#test_state: { stage: 'test-on-test' }
#test_iter: 100
#test_interval: 500
#test_compute_loss: true
base_lr: 0.005
display: 100
max_iter: 120000
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
stepsize: 80000
snapshot: 50000
snapshot_prefix: "mnist_autoencoder"
solver_mode: GPU
solver_type: NESTEROV
delta: 1e-08

Christian

christi...@gmail.com

Sep 16, 2015, 5:48:17 AM
to Caffe Users
Can anyone reproduce this?

Mohamed Omran

Sep 16, 2015, 9:29:24 AM
to christi...@gmail.com, Caffe Users
Hey Christian, the problem occurs when, during training, you test on the same lmdb that is used for training, i.e. commenting out just the following lines should be sufficient:

#test_state: { stage: 'test-on-train' }
#test_iter: 500

A better workaround for now, though, one that keeps testing on the training data possible, is to change the source in the "test-on-train" data layer to:

source: "./examples/mnist/mnist_train_lmdb"

vs. 

source: "examples/mnist/mnist_train_lmdb"

I have localised the source of the problem and will send in a bug report.


Evan Shelhamer

Sep 16, 2015, 1:45:28 PM
to Caffe Users, christi...@gmail.com
Please follow up at this issue https://github.com/BVLC/caffe/issues/3037

There was once a locking issue with lmdb on certain platforms, but that was fixed long ago (I can't find the issue at the moment). The data layer / data reader changes from the parallel PR might have stirred up a new problem, though. If anyone traces it down, please comment.
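
To illustrate the sharing pattern in question: if data readers are cached in a map keyed on the layer name plus the literal source string (an assumption about how the reader cache works, not a quote of the actual code), then two layers pointing at the same LMDB share one reader, while any difference in the path string, such as a leading ./, yields a separate reader. A minimal self-contained C++ sketch of that pattern:

#include <iostream>
#include <map>
#include <memory>
#include <string>

// Stand-in for a prefetching reader that would own a DB cursor.
struct Reader {
  explicit Reader(const std::string& src) : source(src) {}
  std::string source;
};

// Readers cached by "layer_name:source"; weak_ptr lets a reader be
// destroyed once the last layer using it goes away.
static std::map<std::string, std::weak_ptr<Reader> > bodies;

std::shared_ptr<Reader> GetReader(const std::string& name,
                                  const std::string& source) {
  const std::string key = name + ":" + source;
  std::shared_ptr<Reader> body = bodies[key].lock();
  if (!body) {  // first layer with this key creates the reader
    body = std::make_shared<Reader>(source);
    bodies[key] = body;
  }
  return body;
}

int main() {
  std::shared_ptr<Reader> train =
      GetReader("data", "examples/mnist/mnist_train_lmdb");
  std::shared_ptr<Reader> test =
      GetReader("data", "./examples/mnist/mnist_train_lmdb");
  // Different key strings, so the layers get distinct readers even
  // though both paths name the same database on disk.
  std::cout << (train == test ? "shared" : "distinct") << std::endl;
  return 0;
}

Under that assumption, Mohamed's ./ rename works because the cache key changes, and commenting out the test-on-train stage works because the second consumer of the shared reader disappears.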

christi...@gmail.com

Sep 16, 2015, 1:47:41 PM
to Caffe Users, christi...@gmail.com
Hi Mohamed,

thanks for your workaround! Seems the bug is already filed here: https://github.com/BVLC/caffe/issues/3037

Best,
Christian

Dixon Dick

Sep 16, 2015, 2:13:22 PM
to Caffe Users, christi...@gmail.com
I have marked this complete; it seems there is both an understanding of the problem and a set of workarounds. I am deeply grateful, thank you all for posting!

dcd