Hi, here are the logs with the error: "run.log" is the main log file, and "train.0.3.log" is the one generated by train.py. The training parameters are visible in the logs as well.
The error seems to happen when the memory usage of one of the GPUs approaches the total available memory (16276 MiB); in this run it happened on GPU 3 (see CUDA.png, a screenshot taken just before the error). The percentage of system RAM in use doesn't change much, although the "VIRT" column in MEM.png shows some huge numbers. I'm not sure whether that is a problem in itself (will virtual memory always grow as large as the available memory allows?).
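In case it helps reproduce this, here is a minimal sketch of how the per-GPU memory trajectory could be logged over time instead of relying on a one-off screenshot. It assumes nvidia-smi is on the PATH and uses its standard --query-gpu options; it is just a monitoring aid, not part of the training setup:

import subprocess
import time

# Poll nvidia-smi once per second and print per-GPU memory usage, so the
# moment a GPU approaches its 16276 MiB limit is captured in a log.
QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

while True:
    out = subprocess.check_output(QUERY, text=True)
    for line in out.strip().splitlines():
        idx, used, total = (field.strip() for field in line.split(","))
        print(f"GPU {idx}: {used} / {total} MiB")
    time.sleep(1)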
The key lines in the log seem to be the following:
WARNING (nnet3-chain-train[5.4.240~1403-c60f2]:MallocPitchInternal():cu-allocator.cc:97) Allocation of 49836032 x 51 region failed: freeing some memory and trying again.
LOG (nnet3-chain-train[5.4.240~1403-c60f2]:MallocPitchInternal():cu-allocator.cc:102) To avoid future problems like this, changing memory_factor from 1.3 to 1.1
After that, the allocation is retried a couple more times and then the error occurs. The problematic call seems to be in nnet3-chain-copy-egs.
For this run I set the minibatch_size parameter (num-chunk-per-minibatch in train.py) to 1024, so the error occurs in a matter of seconds. Usually minibatch_size starts at 16 and is then lowered to 8 and 4 if (or rather when) needed; a sketch of that back-off pattern follows.
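To be clear about what I mean by lowering the minibatch size, here is a rough sketch of the pattern (this is not Kaldi's actual logic; run_iteration is a hypothetical callable standing in for one training iteration, and MemoryError stands in for the CUDA allocation failure):

def run_iteration_with_backoff(run_iteration, minibatch_size=16, floor=4):
    """Halve the minibatch size whenever an allocation failure is hit,
    down to a floor (16 -> 8 -> 4), then give up."""
    size = minibatch_size
    while True:
        try:
            return run_iteration(size)
        except MemoryError:
            if size <= floor:
                raise  # already at the floor; re-raise the failure
            size //= 2
            print(f"Allocation failed; retrying with minibatch_size={size}")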
Any help is welcome.