Distributed training - MultiWorkerMirroredStrategy


Arup De

Feb 21, 2020, 12:30:57 PM
to TensorFlow Developers
Hi,
    I got an out-of-range error when using distributed training with MultiWorkerMirroredStrategy (TensorFlow version: 2.1.0).
I'm using two nodes, each with 6 NVIDIA V100 GPUs, and the Keras MNIST model for this experiment.
Initially, I tried single-node MirroredStrategy with NVIDIA V100 GPUs. It worked fine for me.
For MultiWorkerMirroredStrategy, I just changed the strategy to
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(). However, it produced an out-of-range error.
I have the following questions.

1. Does MultiWorkerMirroredStrategy work fine with TensorFlow 2.1.0?
2. I'm using the data pipeline below for loading the data.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE, drop_remainder=True)
Do I need to change it for MultiWorkerMirroredStrategy? I'm hoping it will automatically split the data across the 12 GPUs, analogous to single-node MirroredStrategy.
3. I would like to know more about MultiWorkerMirroredStrategy. How does it perform communication across nodes and within a node?
  Is there any document that provides more details about MultiWorkerMirroredStrategy?



Error logs:
2020-02-21 00:17:45.951027: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:12345
Number of devices: 12
BUFFER_SIZE = 60000, BATCH_SIZE_PER_REPLICA = 64, GLOBAL_BATCH_SIZE = 768
WARNING:tensorflow:`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
WARNING:tensorflow:ModelCheckpoint callback is not provided. Workers will need to restart training if any fails.
2020-02-21 00:17:55.604934: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:428] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_DOUBLE
      type: DT_UINT8
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 28
        }
        dim {
          size: 28
        }
      }
      shape {
      }
    }
  }
}

Epoch 1/10
2020-02-21 00:17:59.803252: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
      1/Unknown - 7s 7s/step - loss: 2.4605 - accuracy: 0.08722020-02-21 00:18:02.338251: I tensorflow/core/profiler/lib/profiler_session.cc:225] Profiler session started.
2020-02-21 00:18:02.338343: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1259] Profiler found 6 GPUs
2020-02-21 00:18:02.339507: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.1
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.144901). Check your callbacks.
      2/Unknown - 9s 4s/step - loss: 2.3538 - accuracy: 0.12632020-02-21 00:18:04.644431: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1378] CUPTI activity buffer flushed
2020-02-21 00:18:04.644494: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:88]  GpuTracer has collected 660 callback api events and 660 activity events.
     78/Unknown - 10s 128ms/step - loss: 0.7817 - accuracy: 0.78192020-02-21 00:18:05.648949: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.648949: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_5/Identity_4/_423]]
2020-02-21 00:18:05.648954: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[replica_2/metrics/accuracy/AssignAddVariableOp_1/_67]]
2020-02-21 00:18:05.648973: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_239]]
2020-02-21 00:18:05.648983: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[Adam/ReadVariableOp_2/_10]]
2020-02-21 00:18:05.648991: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_2/Adam/Adam/update_0/Const/_291]]
2020-02-21 00:18:05.649055: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649274: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649396: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649653: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649675: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649727: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649796: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649798: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649863: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649944: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650043: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650057: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650062: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650162: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650322: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650342: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650374: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650498: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650614: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
78/78 [==============================] - 10s 128ms/step - loss: 0.7817 - accuracy: 0.7819
Epoch 2/10
2020-02-21 00:18:05.692339: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692487: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692610: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692701: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692718: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692717: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
[[allreduce_1/CollectiveReduce]]
[[allreduce_1/CollectiveReduce/_354]]
2020-02-21 00:18:05.692858: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692798: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692850: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692755: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692993: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693013: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693048: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693106: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693161: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693121: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_5}}]]
2020-02-21 00:18:05.693146: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_2}}]]
2020-02-21 00:18:05.693292: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693167: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_4}}]]
2020-02-21 00:18:05.693153: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_3}}]]
2020-02-21 00:18:05.693224: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_1}}]]
2020-02-21 00:18:05.693114: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693399: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693418: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.694962: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
[[allreduce_1/CollectiveReduce]]
 1/78 [..............................] - ETA: 1sTraceback (most recent call last):
  File "model/mnist_multi_node_multi_gpu.py", line 59, in <module>
    dmodel.fit(train_dataset, epochs=10, callbacks=[tensorboard_callback])
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 790, in fit
    *args, **kwargs)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 777, in wrapper
    mode=dc.CoordinatorMode.INDEPENDENT_WORKER)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 853, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 772, in _worker_fn
    return method(model, **kwargs)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.CancelledError:  [_Derived_]Iterator was cancelled
[[node IteratorGetNext_2 (defined at model/mnist_multi_node_multi_gpu.py:59) ]] [Op:__inference_distributed_function_5054]

Function call stack:
distributed_function

2020-02-21 00:18:05.817692: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.818577: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.819643: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.820546: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.821401: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.822531: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
 

Priya Gupta

Apr 6, 2020, 2:21:17 AM
to Arup De, Rick Chao, Ran Chen, TensorFlow Developers
+Rick Chao +Ran Chen 

Hi,

You can follow the tutorial here to learn more about MultiWorkerMirroredStrategy with Keras in TF2. 
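(For reference, a rough sketch of the multi-worker Keras setup that tutorial walks through; the worker addresses and the small model below are placeholders, not your exact script:)

import json
import os
import tensorflow as tf

# Each worker sets TF_CONFIG before creating the strategy; only the task
# index differs per worker (0 for the chief, 1 for the second worker).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Model creation and compilation happen inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])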

Regarding the error you're seeing, I think this is likely due to not passing `steps_per_epoch` in the model.fit call. As mentioned here, MultiWorkerMirroredStrategy currently doesn't handle the last partial batch correctly, so we require passing steps_per_epoch for now. In fact, if you use tf-nightly, it will now ask you to pass this argument when using MultiWorkerMirroredStrategy.
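(For illustration, a minimal sketch of that fix, reusing the names from your traceback and assuming the 60,000-sample MNIST training set with the GLOBAL_BATCH_SIZE of 768 shown in your logs:)

NUM_TRAIN_SAMPLES = 60000                                  # x_train.shape[0] for MNIST
steps_per_epoch = NUM_TRAIN_SAMPLES // GLOBAL_BATCH_SIZE   # 60000 // 768 = 78

# Passing steps_per_epoch keeps every worker running the same number of
# steps per epoch, so no worker runs off the end of its data.
dmodel.fit(train_dataset,
           epochs=10,
           steps_per_epoch=steps_per_epoch,
           callbacks=[tensorboard_callback])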

On Fri, Feb 21, 2020 at 9:31 AM Arup De <ard...@gmail.com> wrote:
Hi,
    I got an out-of-range error when using distributed training with MultiWorkerMirroredStrategy (TensorFlow version: 2.1.0).
I'm using two nodes, each with 6 NVIDIA V100 GPUs, and the Keras MNIST model for this experiment.
Initially, I tried single-node MirroredStrategy with NVIDIA V100 GPUs. It worked fine for me.
For MultiWorkerMirroredStrategy, I just changed the strategy to
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(). However, it produced an out-of-range error.
I have the following questions.

1. Does MultiWorkerMirroredStrategy work fine with TensorFlow 2.1.0?
Yes, but there are still bugs and issues we are working through. Many have been fixed in the nightlies. 
2. I'm using the data pipeline below for loading the data.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE, drop_remainder=True)
Do I need to change it for MultiWorkerMirroredStrategy? I'm hoping it will automatically split the data across the 12 GPUs, analogous to single-node MirroredStrategy.
You don't need to change it. From the logs you shared, you can see that we are not able to shard the data in a performant way since it's not being read from files, but it will be sharded nevertheless. For better performance, though, you may want to read from files.
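(For example, a rough sketch of a file-based pipeline that auto-sharding can split by file; the TFRecord file pattern and parse_fn here are hypothetical:)

file_pattern = "/data/mnist/train-*.tfrecord"     # hypothetical shard files
files = tf.data.Dataset.list_files(file_pattern, shuffle=False)

train_dataset = (tf.data.TFRecordDataset(files)
                 .map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
                 .shuffle(BUFFER_SIZE)
                 .batch(GLOBAL_BATCH_SIZE, drop_remainder=True)
                 .prefetch(tf.data.experimental.AUTOTUNE))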

3. I would like to know more about MultiWorkerMirroredStrategy. How does it perform communication across nodes and within a node?
  Is there any document that provides more details about MultiWorkerMirroredStrategy?

Arup De

Apr 7, 2020, 3:47:07 PM
to Rick Chao, Priya Gupta, Ran Chen, TensorFlow Developers
Yes. Thanks Priya and Rick.

On Mon, Apr 6, 2020 at 10:29 AM Rick Chao <rc...@google.com> wrote:
Hello Arup,

Let us know if Priya's suggestion solved the issue, and we'll update the tutorial to make it clearer. Thanks for reaching out!

Best,
Rick

Arup De

Apr 8, 2020, 3:26:40 PM
to Rick Chao, Priya Gupta, Ran Chen, TensorFlow Developers
Hi Priya, Chen and Rick,
        I'm looking at the multi-worker mirrored strategy estimator example at https://www.tensorflow.org/tutorials/distribute/multi_worker_with_estimator. Could you please confirm the batch size in input_fn(): is it the global batch size or the batch size per replica (global_batch / number_of_replicas)?

Thanks,
Arup

Priya Gupta

Apr 8, 2020, 3:31:09 PM
to Arup De, Ayush Dubey, Rick Chao, Ran Chen, TensorFlow Developers
For the estimator input_fn, it should be the batch size per replica (global_batch / number_of_replicas).

Although maybe the BATCH_SIZE being used for loss scaling in this code snippet is wrong, and should be the global batch size. cc @Ayush Dubey
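(To make that concrete, a rough sketch of an input_fn that batches with the per-replica size while the loss is scaled by the global batch size; the dataset and constants are placeholders rather than the tutorial's exact code:)

BATCH_SIZE_PER_REPLICA = 64
NUM_REPLICAS = 12
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * NUM_REPLICAS   # use this when scaling the loss in model_fn

def input_fn(mode, input_context=None):
    ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    if input_context:
        # Each worker's input pipeline reads only its own shard of the data.
        ds = ds.shard(input_context.num_input_pipelines,
                      input_context.input_pipeline_id)
    return ds.shuffle(60000).repeat().batch(BATCH_SIZE_PER_REPLICA)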

Arup De

Apr 8, 2020, 4:59:46 PM
to Priya Gupta, Ayush Dubey, Rick Chao, Ran Chen, TensorFlow Developers
Thanks Priya.

-Arup

Arup De

Apr 8, 2020, 7:39:35 PM
to Priya Gupta, Ayush Dubey, Rick Chao, Ran Chen, TensorFlow Developers
Hi Priya,
    What would be the official release date for TF 2.2?

Thanks,
Arup

Priya Gupta

Apr 9, 2020, 2:12:11 AM
to Arup De, Goldie Gadde, Ayush Dubey, Rick Chao, Ran Chen, TensorFlow Developers
+Goldie Gadde can you help answer this question?

Arup De

Apr 9, 2020, 4:21:44 PM
to Priya Gupta, Goldie Gadde, Ayush Dubey, Rick Chao, Ran Chen, TensorFlow Developers
Hi Priya and Rick,
       I'm migrating a model from TF 1.14 to TF 2. I used the tf_upgrade_v2 script to convert the model. The upgraded model uses `tf.compat.v1.get_variable()` and `tf.compat.v1.train.AdagradOptimizer()`.
It ran successfully on TF 2.0 using the parameter server strategy. We used the estimator API for the distributed training. But when I changed the strategy to multi-worker mirrored strategy, it didn't work for me.
There is some issue with the Adagrad initializer's CollectiveBcastRecv. Please see the error below. Are those calls (`tf.compat.v1.get_variable()` and
`tf.compat.v1.train.AdagradOptimizer()`) compatible with multi-worker mirrored strategy using the estimator API? Please let me know if there is a quick fix for this issue.

API calls:
x = tf.compat.v1.get_variable(
    name=x,
    initializer=tf.random.truncated_normal([tensor_len, num_classes],
                                           stddev=1.0 / math.sqrt(float(tensor_len))),
    regularizer=tf.keras.regularizers.l2(l2_reg_weight))

optimizer = tf.compat.v1.train.AdagradOptimizer(0.01)


Error Log:
2020-04-09 01:57:51.499271: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: RecvBufResponse returned 2408 bytes where to_tensor expected 808
2020-04-09 01:57:51.499305: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: RecvBufResponse returned 808 bytes where to_tensor expected 2408
2020-04-09 01:57:51.499272: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: RecvBufResponse returned 2408 bytes where to_tensor expected 808
2020-04-09 01:57:51.499371: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: RecvBufResponse returned 808 bytes where to_tensor expected 2408
2020-04-09 01:57:51.499365: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.500466: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Cancelled: [_Derived_]Cancelled
Additional GRPC error information:
{"created":"@1586397471.499604492","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
2020-04-09 01:57:51.501131: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.501135: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.501170: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.501175: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.501192: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.501223: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
Traceback (most recent call last):
  File "<>/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "<>/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "<>/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]


Thanks,
Arup





   

Arup De

Apr 10, 2020, 9:04:08 PM
to Rick Chao, Priya Gupta, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Rick,
   I would definitely use Keras with a MultiWorkerMirroredStrategy scope for new models. However, existing models may still be inclined to use the estimator API.
What is the long-term plan for MultiWorkerMirroredStrategy? Will it only be supported by Keras with tf.distribute?

About the above issue regarding distributed training with MultiWorkerMirroredStrategy using the estimator API (https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator): the input_fn and
model_fn both run in graph mode. Unfortunately, in graph mode the multi-worker mirrored strategy didn't handle Python dicts properly (unorderedness). I solved it by maintaining
consistent feature ordering using an OrderedDict (rough sketch below). I found debugging extremely difficult in graph mode. Could you please share a methodology to detect any graph-mode discrepancies early for
model_fn, and any modeling guidelines/supported constructs?
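(A rough sketch of the OrderedDict approach, with hypothetical feature names and sizes; the point is only that every worker walks the features, and therefore creates the corresponding variables, in the same order:)

import collections
import tensorflow as tf

VOCAB_SIZE = 10000                                  # hypothetical vocabulary size
EMBED_DIMS = collections.OrderedDict([              # hypothetical feature -> embedding dim
    ("memberFeatures_geoRegion", 64),
    ("memberFeatures_industry", 32),
])

def build_embedding_weights():
    weights = collections.OrderedDict()
    # Iterating the OrderedDict yields the same order on every worker, so
    # the collective broadcasts of the variables line up across workers.
    for name, dim in EMBED_DIMS.items():
        weights[name] = tf.compat.v1.get_variable(
            name=name + "_weights",
            shape=[VOCAB_SIZE, dim],
            initializer=tf.compat.v1.truncated_normal_initializer(
                stddev=1.0 / (VOCAB_SIZE ** 0.5)))
    return weights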

Thanks,
Arup


On Thu, Apr 9, 2020 at 9:24 PM Rick Chao <rc...@google.com> wrote:
Hello Arup,

I was wondering if it'd be possible to use a Keras model with a MultiWorkerMirroredStrategy scope as opposed to an estimator - Keras with tf.distribute is a better supported path. Let us know if that's a possibility and thanks!

Best,
Rick

Priya Gupta

Apr 11, 2020, 1:40:29 PM
to Arup De, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
On Fri, Apr 10, 2020 at 6:04 PM Arup De <ard...@gmail.com> wrote:
Hi Rick,
   I would definitely use Keras with a MultiWorkerMirroredStrategy scope for new models. However, existing models may still be inclined to use the estimator API.
What is the long-term plan for MultiWorkerMirroredStrategy? Will it only be supported by Keras with tf.distribute?
Yes, MultiWorkerMirroredStrategy support with Estimator is experimental and not being improved. All future improvements are targeted towards the Keras integration.
 
 
About the above issue regarding distributed training with MultiWorkerMirroredStrategy using the estimator API (https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator): the input_fn and
model_fn both run in graph mode. Unfortunately, in graph mode the multi-worker mirrored strategy didn't handle Python dicts properly (unorderedness). I solved it by maintaining
consistent feature ordering using an OrderedDict.

Do you mean you were able to fix the above error "RecvBufResponse returned 2408 bytes where to_tensor expected 808" by using an ordered dict? 

Priya Gupta

Apr 11, 2020, 1:41:28 PM
to Arup De, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
BTW, which version of TF were you using when you ran into the "RecvBufResponse returned 2408 bytes where to_tensor expected 808" error?

Arup De

Apr 13, 2020, 1:57:22 PM
to Priya Gupta, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Priya,
       I'm using TensorFlow 2.0. 

Thanks,
Arup

Arup De

Apr 13, 2020, 1:57:36 PM
to Priya Gupta, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Priya,
       That's correct. It was fixed by using an ordered dict.

Thanks,
Arup
     

On Sat, Apr 11, 2020 at 10:40 AM Priya Gupta <pri...@google.com> wrote:

Priya Gupta

Apr 13, 2020, 2:03:12 PM
to Arup De, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Thanks, Arup. This was a dict that you created and maintained? MultiWorkerMirroredStrategy does require variables to be created in the same order on all workers, and using a plain dict can be error prone, so that is the right fix. Good to know this worked for you!

Arup De

Apr 13, 2020, 2:16:51 PM
to Priya Gupta, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Yes, that dict holds the feature information.

Thanks,
Arup

Arup De

May 7, 2020, 8:47:21 PM
to Priya Gupta, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Priya,
      I'm getting the error below when I use MultiWorkerMirroredStrategy for distributed training. The model works fine with the parameter server strategy. I'm using TF 2.0.
Error:
2020-05-07 23:12:21.596480: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: Inconsistent output shapes, got [4], but expected is [6].
         [[{{node allreduce_2/CollectiveGather_43}}]]
2020-05-07 23:12:21.597835: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingGather with Internal: [_Derived_]Inconsistent output shapes, got [4], but expected is [6].
         [[{{node allreduce_2/CollectiveGather_43}}]]
2020-05-07 23:12:21.598012: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: [_Derived_]Inconsistent output shapes, got [4], but expected is [6].
         [[{{node allreduce_2/CollectiveGather_43}}]]
2020-05-07 23:12:21.598126: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingGather with Internal: [_Derived_]Inconsistent output shapes, got [4], but expected is [6].


The model has an embedding layer on top that uses tf.compat.v1.nn.safe_embedding_lookup_sparse for the embedding lookup.


sp_ids = tf.SparseTensor(indices=feature.indices, values=feature.indices[:, -1],
                         dense_shape=feature.dense_shape)

embeddings = tf.compat.v1.nn.safe_embedding_lookup_sparse(embedding_weights=weights,
                                                          sparse_ids=sp_ids,
                                                          sparse_weights=feature,
                                                          combiner=combiner,
                                                          partition_strategy="mod")


Could you please confirm whether MultiWorkerMirroredStrategy supports the above API, and let me know how to fix this issue.

Do you think upgrading to TF 2.2 would help?


Thanks,

Arup 



Priya Gupta

May 8, 2020, 1:48:15 AM
to Arup De, Chenkai Kuang, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Arup,

Yes, this has been fixed in TF 2.2 - please test with that and let us know if it works.

Paul Cox

May 10, 2020, 3:10:54 AM
to Priya Gupta, Arup De, Chenkai Kuang, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers

Arup De

Oct 16, 2020, 8:18:19 PM
to Priya Gupta, Chenkai Kuang, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Priya and Rick,
       I was running TensorFlow distributed training with the parameter server strategy using the estimator API.
I used TensorFlow 2.2 for this experiment with 6 workers (1 V100 GPU each) and 1 ps node. I observed that the training performance
improved significantly after adding a GPU to the ps node (4.6675 global_step/sec to 41.0646 global_step/sec).

Performance summary in terms of global_step/sec:
6 workers (1 V100 GPU each) + ps node (without GPU): global_step/sec: 4.6675
6 workers (1 V100 GPU each) + ps node (1 V100 GPU):  global_step/sec: 41.0646

Could you please explain why the performance improves after adding a GPU to the ps node? What operations get accelerated by the GPU on the ps node?
Currently, the TensorFlow profiler doesn't show the ps node trace. How can we see the ps node trace, or is there any other debugging methodology to investigate this issue?

Thanks,
Arup