Distributed training - MultiWorkerMirroredStrategy


Arup De

Feb 21, 2020, 12:30:57 PM
to TensorFlow Developers
Hi,
    I got an out-of-range error when using distributed training with MultiWorkerMirroredStrategy (TensorFlow version: 2.1.0).
I'm using two nodes, each with 6 NVIDIA V100 GPUs, and the Keras MNIST model for this experiment.
Initially, I tried single-node MirroredStrategy with NVIDIA V100 GPUs. It worked fine for me.
For MultiWorkerMirroredStrategy, I just changed the strategy to
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(). However, it produced an out-of-range error.
I have the following questions.

1. Does MultiWorkerMirroredStrategy work fine with TensorFlow 2.1.0?
2. I'm using the data pipeline below for loading the data.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE, drop_remainder=True)
Do I need to change it for MultiWorkerMirroredStrategy? I'm hoping it will automatically split the data across the 12 GPUs, analogous to single-node MirroredStrategy.
3. I would like to know more about MultiWorkerMirroredStrategy. How does it perform communication across nodes and within a node?
  Is there any document that provides more details about MultiWorkerMirroredStrategy?



Error logs:
2020-02-21 00:17:45.951027: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:12345
Number of devices: 12
BUFFER_SIZE = 60000, BATCH_SIZE_PER_REPLICA = 64, GLOBAL_BATCH_SIZE = 768
WARNING:tensorflow:`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
WARNING:tensorflow:ModelCheckpoint callback is not provided. Workers will need to restart training if any fails.
2020-02-21 00:17:55.604934: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:428] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_DOUBLE
      type: DT_UINT8
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 28
        }
        dim {
          size: 28
        }
      }
      shape {
      }
    }
  }
}

Epoch 1/10
2020-02-21 00:17:59.803252: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
      1/Unknown - 7s 7s/step - loss: 2.4605 - accuracy: 0.08722020-02-21 00:18:02.338251: I tensorflow/core/profiler/lib/profiler_session.cc:225] Profiler session started.
2020-02-21 00:18:02.338343: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1259] Profiler found 6 GPUs
2020-02-21 00:18:02.339507: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.1
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.144901). Check your callbacks.
      2/Unknown - 9s 4s/step - loss: 2.3538 - accuracy: 0.12632020-02-21 00:18:04.644431: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1378] CUPTI activity buffer flushed
2020-02-21 00:18:04.644494: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:88]  GpuTracer has collected 660 callback api events and 660 activity events.
     78/Unknown - 10s 128ms/step - loss: 0.7817 - accuracy: 0.78192020-02-21 00:18:05.648949: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.648949: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_5/Identity_4/_423]]
2020-02-21 00:18:05.648954: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[replica_2/metrics/accuracy/AssignAddVariableOp_1/_67]]
2020-02-21 00:18:05.648973: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_239]]
2020-02-21 00:18:05.648983: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[Adam/ReadVariableOp_2/_10]]
2020-02-21 00:18:05.648991: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_2/Adam/Adam/update_0/Const/_291]]
2020-02-21 00:18:05.649055: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649274: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649396: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649653: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649675: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649727: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649796: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649798: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649863: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.649944: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650043: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650057: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650062: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650162: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650322: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650342: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650374: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650498: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.650614: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext_5}}]]
78/78 [==============================] - 10s 128ms/step - loss: 0.7817 - accuracy: 0.7819
Epoch 2/10
2020-02-21 00:18:05.692339: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692487: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692610: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692701: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692718: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692717: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
[[allreduce_1/CollectiveReduce]]
[[allreduce_1/CollectiveReduce/_354]]
2020-02-21 00:18:05.692858: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692798: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692850: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692755: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.692993: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693013: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693048: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693106: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693161: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693121: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_5}}]]
2020-02-21 00:18:05.693146: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_2}}]]
2020-02-21 00:18:05.693292: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693167: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_4}}]]
2020-02-21 00:18:05.693153: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_3}}]]
2020-02-21 00:18:05.693224: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Cancelled: Iterator was cancelled
[[{{node IteratorGetNext_1}}]]
2020-02-21 00:18:05.693114: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693399: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.693418: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at collective_ops.cc:253 : Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
2020-02-21 00:18:05.694962: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
[[{{node IteratorGetNext_5}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_0/Const/_259]]
[[allreduce_1/CollectiveReduce]]
 1/78 [..............................] - ETA: 1sTraceback (most recent call last):
  File "model/mnist_multi_node_multi_gpu.py", line 59, in <module>
    dmodel.fit(train_dataset, epochs=10, callbacks=[tensorboard_callback])
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 790, in fit
    *args, **kwargs)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 777, in wrapper
    mode=dc.CoordinatorMode.INDEPENDENT_WORKER)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 853, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 772, in _worker_fn
    return method(model, **kwargs)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/export/home/arde/code/tf2-venv/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.CancelledError:  [_Derived_]Iterator was cancelled
[[node IteratorGetNext_2 (defined at model/mnist_multi_node_multi_gpu.py:59) ]] [Op:__inference_distributed_function_5054]

Function call stack:
distributed_function

2020-02-21 00:18:05.817692: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.818577: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.819643: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.820546: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.821401: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-21 00:18:05.822531: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
 

Priya Gupta

Apr 6, 2020, 2:21:17 AM
to Arup De, Rick Chao, Ran Chen, TensorFlow Developers
+Rick Chao +Ran Chen 

Hi,

You can follow the tutorial here to learn more about MultiWorkerMirroredStrategy with Keras in TF2. 
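(For reference, a rough sketch of the multi-worker Keras setup that tutorial walks through; the worker addresses and the small model below are placeholders, not your exact script:)

import json
import os
import tensorflow as tf

# Each worker sets TF_CONFIG before creating the strategy; only the task
# index differs per worker (0 for the chief, 1 for the second worker).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Model creation and compilation happen inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])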

Regarding the error you're seeing, I think this is likely due to not passing `steps_per_epoch` in the model.fit call. As mentioned here, MultiWorkerMirroredStrategy currently doesn't handle the last partial batch correctly, so we require passing steps_per_epoch for now. In fact, if you use tf-nightly, it will now ask you to pass this argument when using MultiWorkerMirroredStrategy.
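(For illustration, a minimal sketch of that fix, reusing the names from your traceback and assuming the 60,000-sample MNIST training set with the GLOBAL_BATCH_SIZE of 768 shown in your logs:)

NUM_TRAIN_SAMPLES = 60000                                  # x_train.shape[0] for MNIST
steps_per_epoch = NUM_TRAIN_SAMPLES // GLOBAL_BATCH_SIZE   # 60000 // 768 = 78

# Passing steps_per_epoch keeps every worker running the same number of
# steps per epoch, so no worker runs off the end of its data.
dmodel.fit(train_dataset,
           epochs=10,
           steps_per_epoch=steps_per_epoch,
           callbacks=[tensorboard_callback])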

On Fri, Feb 21, 2020 at 9:31 AM Arup De <ard...@gmail.com> wrote:
Hi,
    I got an out-of-range error when using distributed training with MultiWorkerMirroredStrategy (TensorFlow version: 2.1.0).
I'm using two nodes, each with 6 NVIDIA V100 GPUs, and the Keras MNIST model for this experiment.
Initially, I tried single-node MirroredStrategy with NVIDIA V100 GPUs. It worked fine for me.
For MultiWorkerMirroredStrategy, I just changed the strategy to
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(). However, it produced an out-of-range error.
I have the following questions.

1. Does MultiWorkerMirroredStrategy work fine with TensorFlow 2.1.0?
Yes, but there are still bugs and issues we are working through. Many have been fixed in the nightlies. 
2. I'm using the data pipeline below for loading the data.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE, drop_remainder=True)
Do I need to change it for MultiWorkerMirroredStrategy? I'm hoping it will automatically split the data across the 12 GPUs, analogous to single-node MirroredStrategy.
You don't need to change it. From the logs you shared, you can see that we are not able to shard the data in a performant way since it's not being read from files, but it will be sharded nevertheless. For better performance, though, you may want to read from files.
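(For example, a rough sketch of a file-based pipeline that auto-sharding can split by file; the TFRecord file pattern and parse_fn here are hypothetical:)

file_pattern = "/data/mnist/train-*.tfrecord"     # hypothetical shard files
files = tf.data.Dataset.list_files(file_pattern, shuffle=False)

train_dataset = (tf.data.TFRecordDataset(files)
                 .map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
                 .shuffle(BUFFER_SIZE)
                 .batch(GLOBAL_BATCH_SIZE, drop_remainder=True)
                 .prefetch(tf.data.experimental.AUTOTUNE))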

3. I would like to know more about MultiWorkerMirroredStrategy. How does it perform communication across nodes and within a node?
  Is there any document that provides more details about MultiWorkerMirroredStrategy?

Arup De

Apr 7, 2020, 3:47:07 PM
to Rick Chao, Priya Gupta, Ran Chen, TensorFlow Developers
Yes. Thanks Priya and Rick.

On Mon, Apr 6, 2020 at 10:29 AM Rick Chao <rc...@google.com> wrote:
Hello Arup,

Let us know if Priya's suggestion solved the issue, and we'll update the tutorial to make it clearer. Thanks for reaching out!

Best,
Rick

Arup De

Apr 8, 2020, 3:26:40 PM
to Rick Chao, Priya Gupta, Ran Chen, TensorFlow Developers
Hi Priya, Chen and Rick,
        I'm looking at the multi-worker mirrored strategy estimator example at https://www.tensorflow.org/tutorials/distribute/multi_worker_with_estimator. Could you please confirm the batch size in input_fn(): is it the global batch size or the batch size per replica (global_batch / number_of_replicas)?

Thanks,
Arup

Priya Gupta

Apr 8, 2020, 3:31:09 PM
to Arup De, Ayush Dubey, Rick Chao, Ran Chen, TensorFlow Developers
For the estimator input_fn, it should be the batch size per replica (global_batch / number_of_replicas).

Although maybe the BATCH_SIZE being used for loss scaling in this code snippet is wrong, and should be the global batch size. cc @Ayush Dubey
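(To make that concrete, a rough sketch of an input_fn that batches with the per-replica size while the loss is scaled by the global batch size; the dataset and constants are placeholders rather than the tutorial's exact code:)

BATCH_SIZE_PER_REPLICA = 64
NUM_REPLICAS = 12
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * NUM_REPLICAS   # use this when scaling the loss in model_fn

def input_fn(mode, input_context=None):
    ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    if input_context:
        # Each worker's input pipeline reads only its own shard of the data.
        ds = ds.shard(input_context.num_input_pipelines,
                      input_context.input_pipeline_id)
    return ds.shuffle(60000).repeat().batch(BATCH_SIZE_PER_REPLICA)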

Arup De

Apr 8, 2020, 4:59:46 PM
to Priya Gupta, Ayush Dubey, Rick Chao, Ran Chen, TensorFlow Developers
Thanks Priya.

-Arup

Arup De

Apr 8, 2020, 7:39:35 PM
to Priya Gupta, Ayush Dubey, Rick Chao, Ran Chen, TensorFlow Developers
Hi Priya,
    What would be the official release date for TF 2.2?

Thanks,
Arup

Priya Gupta

Apr 9, 2020, 2:12:11 AM
to Arup De, Goldie Gadde, Ayush Dubey, Rick Chao, Ran Chen, TensorFlow Developers
+Goldie Gadde can you help answer this question?

Arup De

Apr 9, 2020, 4:21:44 PM
to Priya Gupta, Goldie Gadde, Ayush Dubey, Rick Chao, Ran Chen, TensorFlow Developers
Hi Priya and Rick,
       I'm migrating a model from TF 1.14 to TF 2. I used the tf_upgrade_v2 script to convert the model. The upgraded model uses `tf.compat.v1.get_variable()` and `tf.compat.v1.train.AdagradOptimizer()`.
It ran successfully on TF 2.0 using the parameter server strategy. We used the estimator API for the distributed training. But when I changed the strategy to multi-worker mirrored strategy, it didn't work for me.
There is some issue with the Adagrad initializer's CollectiveBcastRecv. Please see the error below. Are those calls (`tf.compat.v1.get_variable()` and
`tf.compat.v1.train.AdagradOptimizer()`) compatible with multi-worker mirrored strategy using the estimator API? Please let me know if there is a quick fix for this issue.

API calls:
x = tf.compat.v1.get_variable(
    name=x,
    initializer=tf.random.truncated_normal([tensor_len, num_classes],
                                           stddev=1.0 / math.sqrt(float(tensor_len))),
    regularizer=tf.keras.regularizers.l2(l2_reg_weight))

optimizer = tf.compat.v1.train.AdagradOptimizer(0.01)


Error Log:
2020-04-09 01:57:51.499271: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: RecvBufResponse returned 2408 bytes where to_tensor expected 808
2020-04-09 01:57:51.499305: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: RecvBufResponse returned 808 bytes where to_tensor expected 2408
2020-04-09 01:57:51.499272: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: RecvBufResponse returned 2408 bytes where to_tensor expected 808
2020-04-09 01:57:51.499371: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: RecvBufResponse returned 808 bytes where to_tensor expected 2408
2020-04-09 01:57:51.499365: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.500466: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Cancelled: [_Derived_]Cancelled
Additional GRPC error information:
{"created":"@1586397471.499604492","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Cancelled","grpc_status":1}
2020-04-09 01:57:51.501131: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.501135: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.501170: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.501175: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.501192: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
2020-04-09 01:57:51.501223: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: [_Derived_]RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]
Traceback (most recent call last):
  File "<>/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "<>/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "<>/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
RecvBufResponse returned 2408 bytes where to_tensor expected 808
	 [[{{node memberFeatures_geoRegion_weights/Adagrad/Initializer/CollectiveBcastRecv}}]]


Thanks,
Arup





   

Arup De

Apr 10, 2020, 9:04:08 PM
to Rick Chao, Priya Gupta, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Rick,
   I would definitely use Keras with a MultiWorkerMirroredStrategy scope for new models. However, existing models may still be inclined to use the estimator API.
What is the long-term plan for MultiWorkerMirroredStrategy? Will it only be supported by Keras with tf.distribute?

About the above issue regarding distributed training with MultiWorkerMirroredStrategy using the estimator API (https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator): the input_fn and
model_fn both run in graph mode. Unfortunately, in graph mode the multi-worker mirrored strategy didn't handle Python dicts properly (unorderedness). I solved it by maintaining
consistent feature ordering using an OrderedDict (rough sketch below). I found debugging extremely difficult in graph mode. Could you please share a methodology to detect any graph-mode discrepancies early for
model_fn, and any modeling guidelines/supported constructs?
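(A rough sketch of the OrderedDict approach, with hypothetical feature names and sizes; the point is only that every worker walks the features, and therefore creates the corresponding variables, in the same order:)

import collections
import tensorflow as tf

VOCAB_SIZE = 10000                                  # hypothetical vocabulary size
EMBED_DIMS = collections.OrderedDict([              # hypothetical feature -> embedding dim
    ("memberFeatures_geoRegion", 64),
    ("memberFeatures_industry", 32),
])

def build_embedding_weights():
    weights = collections.OrderedDict()
    # Iterating the OrderedDict yields the same order on every worker, so
    # the collective broadcasts of the variables line up across workers.
    for name, dim in EMBED_DIMS.items():
        weights[name] = tf.compat.v1.get_variable(
            name=name + "_weights",
            shape=[VOCAB_SIZE, dim],
            initializer=tf.compat.v1.truncated_normal_initializer(
                stddev=1.0 / (VOCAB_SIZE ** 0.5)))
    return weights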

Thanks,
Arup


On Thu, Apr 9, 2020 at 9:24 PM Rick Chao <rc...@google.com> wrote:
Hello Arup,

I was wondering if it'd be possible to use a Keras model with a MultiWorkerMirroredStrategy scope as opposed to an estimator - Keras with tf.distribute is a better supported path. Let us know if that's a possibility and thanks!

Best,
Rick

Priya Gupta

Apr 11, 2020, 1:40:29 PM
to Arup De, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
On Fri, Apr 10, 2020 at 6:04 PM Arup De <ard...@gmail.com> wrote:
Hi Rick,
   I would definitely use Keras with a MultiWorkerMirroredStrategy scope for new models. However, existing models may still be inclined to use the estimator API.
What is the long-term plan for MultiWorkerMirroredStrategy? Will it only be supported by Keras with tf.distribute?
Yes, MultiWorkerMirroredStrategy support with Estimator is experimental and not being improved. All future improvements are targeted towards the Keras integration.
 
 
About the above issue regarding distributed training with MultiWorkerMirroredStrategy using the estimator API (https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator): the input_fn and
model_fn both run in graph mode. Unfortunately, in graph mode the multi-worker mirrored strategy didn't handle Python dicts properly (unorderedness). I solved it by maintaining
consistent feature ordering using an OrderedDict.

Do you mean you were able to fix the above error "RecvBufResponse returned 2408 bytes where to_tensor expected 808" by using an ordered dict? 

Priya Gupta

Apr 11, 2020, 1:41:28 PM
to Arup De, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
BTW, which version of TF were you using when you ran into the "RecvBufResponse returned 2408 bytes where to_tensor expected 808" error?

Arup De

Apr 13, 2020, 1:57:22 PM
to Priya Gupta, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Priya,
       I'm using TensorFlow 2.0. 

Thanks,
Arup

Arup De

Apr 13, 2020, 1:57:36 PM
to Priya Gupta, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Priya,
       That's correct. It was fixed by using an ordered dict.

Thanks,
Arup
     

On Sat, Apr 11, 2020 at 10:40 AM Priya Gupta <pri...@google.com> wrote:

Priya Gupta

Apr 13, 2020, 2:03:12 PM
to Arup De, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Thanks, Arup. This was a dict that you created and maintained? MultiWorkerMirroredStrategy does require variables to be created in the same order on all workers, and using a plain dict can be error prone, so that is the right fix. Good to know this worked for you!

Arup De

Apr 13, 2020, 2:16:51 PM
to Priya Gupta, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Yes, that dict holds the feature information.

Thanks,
Arup

Arup De

May 7, 2020, 8:47:21 PM
to Priya Gupta, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Priya,
      I'm getting the error below when I use MultiWorkerMirroredStrategy for distributed training. The model works fine with the parameter server strategy. I'm using TF 2.0.
Error:
2020-05-07 23:12:21.596480: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: Inconsistent output shapes, got [4], but expected is [6].
         [[{{node allreduce_2/CollectiveGather_43}}]]
2020-05-07 23:12:21.597835: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingGather with Internal: [_Derived_]Inconsistent output shapes, got [4], but expected is [6].
         [[{{node allreduce_2/CollectiveGather_43}}]]
2020-05-07 23:12:21.598012: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: [_Derived_]Inconsistent output shapes, got [4], but expected is [6].
         [[{{node allreduce_2/CollectiveGather_43}}]]
2020-05-07 23:12:21.598126: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingGather with Internal: [_Derived_]Inconsistent output shapes, got [4], but expected is [6].


The model has an embedding layer on top that uses tf.compat.v1.nn.safe_embedding_lookup_sparse for the embedding lookup.


sp_ids = tf.SparseTensor(indices=feature.indices, values=feature.indices[:, -1],
                         dense_shape=feature.dense_shape)

embeddings = tf.compat.v1.nn.safe_embedding_lookup_sparse(embedding_weights=weights,
                                                          sparse_ids=sp_ids,
                                                          sparse_weights=feature,
                                                          combiner=combiner,
                                                          partition_strategy="mod")


Could you please confirm whether MultiWorkerMirroredStrategy supports the above API, and let me know how to fix this issue.

Do you think upgrading to TF 2.2 would help?


Thanks,

Arup 



Priya Gupta

May 8, 2020, 1:48:15 AM
to Arup De, Chenkai Kuang, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Arup,

Yes, this has been fixed in TF 2.2 - please test with that and let us know if it works.

Paul Cox

May 10, 2020, 3:10:54 AM
to Priya Gupta, Arup De, Chenkai Kuang, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers

Arup De

Oct 16, 2020, 8:18:19 PM
to Priya Gupta, Chenkai Kuang, Rick Chao, Goldie Gadde, Ayush Dubey, Ran Chen, TensorFlow Developers
Hi Priya and Rick,
       I was running TensorFlow distributed training with the parameter server strategy using the estimator API.
I used TensorFlow 2.2 for this experiment with 6 workers (1 V100 GPU each) and 1 ps node. I observed that the training performance
improved significantly after adding a GPU to the ps node (4.6675 global_step/sec to 41.0646 global_step/sec).

Performance summary in terms of global_step/sec:
6 workers (1 V100 GPU each) + ps node (without GPU): global_step/sec: 4.6675
6 workers (1 V100 GPU each) + ps node (1 V100 GPU):  global_step/sec: 41.0646

Could you please explain why the performance improves after adding a GPU to the ps node? What operations get accelerated by the GPU on the ps node?
Currently, the TensorFlow profiler doesn't show the ps node trace. How can we see the ps node trace, or is there any other debugging methodology to investigate this issue?

Thanks,
Arup