I get an out-of-range error when running distributed training with MultiWorkerMirroredStrategy (TensorFlow version: 2.1.0).
I'm using two nodes, each with 6 NVIDIA V100 GPUs, and the Keras MNIST model for this experiment.
The strategy is created with strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(), but training aborts with the out-of-range error shown in the log below.
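For reference, the core of my script (model/mnist_multi_node_multi_gpu.py) looks roughly like the sketch below; the TF_CONFIG values, the preprocessing, and the model definition are simplified placeholders rather than my exact code.

import json
import os
import tensorflow as tf

# TF_CONFIG must be set before the strategy is created; hostnames/ports here are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node1:12345", "node2:12345"]},
    "task": {"type": "worker", "index": 0}  # index 1 on the second node
})

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))

BUFFER_SIZE = 60000
BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync  # 768 with 12 GPUs

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0  # yields float64 inputs, which matches DT_DOUBLE in the log

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)) \
    .shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE, drop_remainder=True)

with strategy.scope():
    # Placeholder MNIST model; my real model may differ, but it uses Adam and an accuracy metric.
    dmodel = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    dmodel.compile(loss="sparse_categorical_crossentropy",
                   optimizer="adam",
                   metrics=["accuracy"])

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs")
dmodel.fit(train_dataset, epochs=10, callbacks=[tensorboard_callback])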
I have the following questions.
1. Does MultiWorkerMirroredStrategy work correctly with TensorFlow 2.1.0?
2. I'm using the data pipeline below to load the data:
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE, drop_remainder=True)
Do I need to change it for MultiWorkerMirroredStrategy? I was hoping it would automatically split the data across the 12 GPUs, analogous to single-node MirroredStrategy. (A sketch of what I'm considering is after the questions.)
3. I would like to know more about MultiWorkerMirroredStrategy. How does it perform communication across nodes and within a node? Is there any documentation that provides more details about MultiWorkerMirroredStrategy? (See the communication sketch after the questions.)
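For question 2, the workaround I'm considering (untested, based on my reading of the tf.data and distribution-strategy docs) is to repeat the dataset and pass steps_per_epoch so no replica hits end-of-sequence mid-epoch, and to pin the auto-shard policy to DATA sharding, since in-memory tensors cannot be file-sharded:

# Untested sketch for question 2: repeat so no worker runs out of data, and shard by data.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA

train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                 .shuffle(BUFFER_SIZE)
                 .batch(GLOBAL_BATCH_SIZE, drop_remainder=True)
                 .repeat()
                 .with_options(options))

steps_per_epoch = len(x_train) // GLOBAL_BATCH_SIZE  # 60000 // 768 = 78, matching the 78 steps in the log
dmodel.fit(train_dataset, epochs=10, steps_per_epoch=steps_per_epoch,
           callbacks=[tensorboard_callback])

For question 3, I see that the strategy constructor takes a CollectiveCommunication argument that appears to select the all-reduce implementation (RING vs NCCL), but I'm not sure how that maps onto intra-node vs cross-node communication:

# My understanding (possibly wrong): use NCCL for the collective ops instead of the default AUTO/RING.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)

Here is the full log from the failing run: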
2020-02-21 00:17:45.951027: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:12345
Number of devices: 12
BUFFER_SIZE = 60000, BATCH_SIZE_PER_REPLICA = 64, GLOBAL_BATCH_SIZE = 768
WARNING:tensorflow:`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
WARNING:tensorflow:ModelCheckpoint callback is not provided. Workers will need to restart training if any fails.
2020-02-21 00:17:55.604934: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:428] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
key: "Toutput_types"
value {
list {
type: DT_DOUBLE
type: DT_UINT8
}
}
}
attr {
key: "output_shapes"
value {
list {
shape {
dim {
size: 28
}
dim {
size: 28
}
}
shape {
}
}
}
}
Epoch 1/10
2020-02-21 00:17:59.803252: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
1/Unknown - 7s 7s/step - loss: 2.4605 - accuracy: 0.0872
2020-02-21 00:18:02.338251: I tensorflow/core/profiler/lib/profiler_session.cc:225] Profiler session started.
2020-02-21 00:18:02.338343: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1259] Profiler found 6 GPUs
2020-02-21 00:18:02.339507: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.1
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.144901). Check your callbacks.
2/Unknown - 9s 4s/step - loss: 2.3538 - accuracy: 0.1263
2020-02-21 00:18:04.644431: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1378] CUPTI activity buffer flushed
2020-02-21 00:18:04.644494: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:88] GpuTracer has collected 660 callback api events and 660 activity events.
78/Unknown - 10s 128ms/step - loss: 0.7817 - accuracy: 0.7819
2020-02-21 00:18:05.648949: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.648949: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_5/Identity_4/_423]]
2020-02-21 00:18:05.648954: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[replica_2/metrics/accuracy/AssignAddVariableOp_1/_67]]
2020-02-21 00:18:05.648973: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_239]]
2020-02-21 00:18:05.648983: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[Adam/ReadVariableOp_2/_10]]
2020-02-21 00:18:05.648991: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_2/Adam/Adam/update_0/Const/_291]]
2020-02-21 00:18:05.649055: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649274: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649396: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649653: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649675: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649727: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649796: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649798: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649863: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649944: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650043: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650057: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650062: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650162: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650322: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650342: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650374: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650498: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650614: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
78/78 [==============================] - 10s 128ms/step - loss: 0.7817 - accuracy: 0.7819
Epoch 2/10
2020-02-21 00:18:05.692339: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692487: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692610: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692701: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692718: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692717: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
[[allreduce_1/CollectiveReduce]]
[[allreduce_1/CollectiveReduce/_354]]
2020-02-21 00:18:05.692858: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692798: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692850: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692755: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692993: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693013: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693048: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693106: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693161: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693121: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_5}}]]
2020-02-21 00:18:05.693146: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_2}}]]
2020-02-21 00:18:05.693292: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693167: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_4}}]]
2020-02-21 00:18:05.693153: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_3}}]]
2020-02-21 00:18:05.693224: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_1}}]]
2020-02-21 00:18:05.693114: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693399: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693418: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.694962: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
[[allreduce_1/CollectiveReduce]]
1/78 [..............................] - ETA: 1s
Traceback (most recent call last):
File "model/mnist_multi_node_multi_gpu.py", line 59, in <module>
dmodel.fit(train_dataset, epochs=10, callbacks=[tensorboard_callback])
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
use_multiprocessing=use_multiprocessing)
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 790, in fit
*args, **kwargs)
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 777, in wrapper
mode=dc.CoordinatorMode.INDEPENDENT_WORKER)
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 853, in run_distribute_coordinator
task_id, session_config, rpc_layer)
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
return worker_fn(strategy)
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 772, in _worker_fn
return method(model, **kwargs)
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
total_epochs=epochs)
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
batch_outs = execution_function(iterator)
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
distributed_function(input_fn))
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
result = self._call(*args, **kwds)
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 599, in _call
return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
self.captured_inputs)
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
ctx=ctx)
File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.CancelledError: [_Derived_]Iterator was cancelled
[[node IteratorGetNext_2 (defined at model/mnist_multi_node_multi_gpu.py:59) ]] [Op:__inference_distributed_function_5054]
Function call stack:
distributed_function
2020-02-21 00:18:05.817692: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.818577: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.819643: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.820546: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.821401: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.822531: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled