Probabilistic inference using the best distributed strategy


Angel Berihuete

Mar 2, 2020, 12:07:28 PM
to TensorFlow Probability
Dear TensorFlow Probability community.

Recently our research group finished the code for a hierarchical model using TensorFlow Probability and MCMC (NUTS) to do the inference. The inference works pretty well on one machine with 8 CPUs and a small sample size (1e4). Our intention is to increase the sample size to 1e8 and distribute batches of data across 8 GPUs on one machine.

We are wondering what the best type of TensorFlow distribution strategy is for spreading our data (model?) across 8 GPUs on one machine while using NUTS. Is it possible to do the inference with several chains while distributing the data across the 8 GPUs? Could you please share some links to examples of how to do this?

Best regards,
Ángel

Brian Patton 🚀

Mar 2, 2020, 8:35:46 PM
to Angel Berihuete, TensorFlow Probability, Colin Carroll
That sounds like a neat problem! The first thing I would try would be to split your target_log_prob computation across GPUs using MirroredStrategy. We don't have any examples of doing this, but I would love for us to have one. I will talk tomorrow with Colin (on cc) to gauge his interest in coming up with a basic demo.
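
Roughly, I'm imagining something along these lines (an untested sketch with a toy normal model, not your hierarchical model; whether gradients for NUTS flow cleanly through the reduce is something a real demo would need to check):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

strategy = tf.distribute.MirroredStrategy()
num_replicas = strategy.num_replicas_in_sync

observations = tf.random.normal([100000], mean=3.0)    # toy data
shards = tf.reshape(observations, [num_replicas, -1])  # assumes an even split

def replica_log_lik(loc):
    # Each replica scores only its own shard of the data.
    rid = tf.distribute.get_replica_context().replica_id_in_sync_group
    shard = tf.gather(shards, rid)
    return tf.reduce_sum(tfd.Normal(loc, 1.).log_prob(shard))

def target_log_prob_fn(loc):
    # Prior evaluated once; likelihood split across replicas and summed.
    per_replica = strategy.experimental_run_v2(replica_log_lik, args=(loc,))
    log_lik = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)
    return tfd.Normal(0., 10.).log_prob(loc) + log_lik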


Angel Berihuete

Mar 4, 2020, 2:42:22 AM
to TensorFlow Probability, angel.b...@gm.uca.es, colca...@google.com
Many thanks Brian!

First of all, we don't understand the increase in computation time when the inference is done on one GPU (136 minutes) instead of the CPU (4 minutes). It's the same code, nothing about MirroredStrategy etc. Maybe we need to assign the target_log_prob computation to the GPU? Something like

with tf.device('/device:GPU:0'):
    ...  # target_log_prob computation

So if I understood correctly, you suggest including something like

tf.debugging.set_log_device_placement(True)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    ...  # target_log_prob computation

and sampling with NUTS *outside* of this scope, i.e.,

@tf.function(autograph=False)
def do_sampling(burnin_steps=1000,sample_steps=100,step_size=0.01,target_accept=0.8):

    sampler = tfp.mcmc.TransformedTransitionKernel(
                    tfp.mcmc.NoUTurnSampler(
                        target_log_prob_fn=target_log_prob_fn,
                        step_size=step_size),
                    bijector=unconstraining_bijectors)

    adaptive_sampler = tfp.mcmc.DualAveragingStepSizeAdaptation(
        inner_kernel=sampler,
        num_adaptation_steps=int(0.8 * burnin_steps),
        target_accept_prob=target_accept,
        # NUTS inside of a TTK requires custom getter/setter functions.
        step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(
            inner_results=pkr.inner_results._replace(step_size=new_step_size)),
        step_size_getter_fn=lambda pkr: pkr.inner_results.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.inner_results.log_accept_ratio,
    )
    return tfp.mcmc.sample_chain(
        kernel=adaptive_sampler,
        current_state=initial_state,
        num_results=sample_steps,
        num_burnin_steps=burnin_steps,
        trace_fn=lambda _, pkr: pkr.inner_results.inner_results.is_accepted)

#===================== Sampling ==========================

chains, is_accepted = do_sampling(burnin_steps=100,sample_steps=100)



Angel Berihuete

Mar 4, 2020, 6:11:23 AM
to TensorFlow Probability, angel.b...@gm.uca.es, colca...@google.com
Hi Brian.

I've tested the following 

with tf.device('/device:GPU:0'):
    def target_log_prob_fn(*arguments):
        return model.log_prob(arguments + (observations,))


The inference is running 

node14                  Wed Mar  4 12:05:14 2020  418.87.00
[0] GeForce RTX 2080 Ti | 47'C,   9 % |   693 / 10989 MB | abm(683M)
[1] GeForce RTX 2080 Ti | 31'C,   0 % |   165 / 10989 MB | abm(155M)
[2] GeForce RTX 2080 Ti | 30'C,   0 % |   165 / 10989 MB | abm(155M)
[3] GeForce RTX 2080 Ti | 33'C,   0 % |   165 / 10989 MB | abm(155M)
[4] GeForce RTX 2080 Ti | 30'C,   0 % |   165 / 10989 MB | abm(155M)
[5] GeForce RTX 2080 Ti | 32'C,   0 % |   165 / 10989 MB | abm(155M)
[6] GeForce RTX 2080 Ti | 28'C,   0 % |   165 / 10989 MB | abm(155M)
[7] GeForce RTX 2080 Ti | 29'C,   0 % |   165 / 10989 MB | abm(155M)

but some ops are still assigned to the CPU:

2020-03-04 11:02:31.608948: I tensorflow/core/common_runtime/placer.cc:54] JointDistributionCoroutine/log_prob/JointDistributionCoroutine_log_prob_IndependentJointDistributionCoroutine_log_prob_CholeskyLKJ_CholeskyLKJ/log_prob/JointDistributionCoroutine_log_prob_CholeskyLKJ_CholeskyLKJ/log_prob/assert_near/Assert/AssertGuard/else/_19/Assert/data_4: (Const): /job:localhost/replica:0/task:0/device:CPU:0

I think it is better to test these things in a simple example ... maybe the "Bayesian modeling with JointDistribution" example? What do you think?

Brian Patton 🚀

Mar 4, 2020, 6:36:34 AM
to Angel Berihuete, TensorFlow Probability, Colin Carroll
You should be able to disable asserts by setting validate_args=False; if that doesn't remove that CPU node, please file an issue on GitHub.

Disabling validation should speed things up a lot! 
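
For instance, the assert_near op in your log comes from CholeskyLKJ validation; something like this (a generic sketch, not your exact parameters) skips those assertion ops:

import tensorflow_probability as tfp

tfd = tfp.distributions

# With validate_args=True the distribution inserts runtime assertion ops
# (e.g. the assert_near placed on CPU above); validate_args=False skips them.
chol_lkj = tfd.CholeskyLKJ(dimension=3, concentration=1.5, validate_args=False)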

Colin and I started a proof of concept yesterday. I would consider using strategy.experimental_run_v2 and possibly strategy.experimental_distribute_dataset.
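
Very roughly, for the data side, something like this (untested sketch; `observations`, `GLOBAL_BATCH_SIZE` and `log_lik_fn` are placeholders):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Let the strategy split each global batch of observations across the GPUs.
dataset = tf.data.Dataset.from_tensor_slices(observations).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

for per_replica_batch in dist_dataset:
    # log_lik_fn runs once per replica, each on its own slice of the batch.
    per_replica_lp = strategy.experimental_run_v2(log_lik_fn, args=(per_replica_batch,))
    batch_lp = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_lp, axis=None)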

Can you share what your joint distribution looks like?



Angel Berihuete

Mar 4, 2020, 10:08:03 AM
to TensorFlow Probability, angel.b...@gm.uca.es, colca...@google.com
Great! I'll set validate_args=False. 

Is it possible to share with you our joint distribution privately? Maybe using a private repo on GitHub?


Angel Berihuete

Mar 5, 2020, 5:38:49 AM
to TensorFlow Probability, angel.b...@gm.uca.es, colca...@google.com
I'm checking where the parts of our model are executing, and something weird happens with tfd.Gamma.

In the following simple code, RandomGamma executes on the CPU even though I specifically set it to execute on GPU:7. tfd.Gamma is part of our model, and every time I try to run inference, RandomGamma executes on the CPU.


from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

tf.debugging.set_log_device_placement(True)

with tf.device('/device:GPU:7'):
    a = tfd.Gamma(concentration=2.0, rate=tf.ones(2))
    c = a.sample(10)

print(c)


2020-03-05 11:16:17.687364: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op Fill in device /job:localhost/replica:0/task:0/device:GPU:7
2020-03-05 11:16:18.429198: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op Shape in device /job:localhost/replica:0/task:0/device:GPU:7
2020-03-05 11:16:18.429527: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op BroadcastArgs in device /job:localhost/replica:0/task:0/device:GPU:7
2020-03-05 11:16:18.429670: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op BroadcastTo in device /job:localhost/replica:0/task:0/device:GPU:7
2020-03-05 11:16:18.430082: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op RandomGamma in device /job:localhost/replica:0/task:0/device:CPU:0
2020-03-05 11:16:18.430822: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op RealDiv in device /job:localhost/replica:0/task:0/device:GPU:7
2020-03-05 11:16:18.431462: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op Maximum in device /job:localhost/replica:0/task:0/device:GPU:7
2020-03-05 11:16:18.432204: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:GPU:7
2020-03-05 11:16:18.432512: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op ConcatV2 in device /job:localhost/replica:0/task:0/device:GPU:7
2020-03-05 11:16:18.432714: I tensorflow/core/common_runtime/eager/execute.cc:573] Executing op Reshape in device /job:localhost/replica:0/task:0/device:GPU:7
tf.Tensor(
[[3.9978743  1.64144   ]
 [0.6349332  0.4437775 ]
 [1.1139497  3.6174998 ]
 [0.7444977  0.7803583 ]
 [3.0109859  2.126362  ]
 [3.0622323  1.7784046 ]
 [0.80915487 3.2489107 ]
 [1.2188517  0.26362702]
 [0.50049067 1.0800475 ]
 [1.2107674  2.3682237 ]], shape=(10, 2), dtype=float32)

Brian Patton 🚀

Mar 5, 2020, 5:42:49 AM
to Angel Berihuete, TensorFlow Probability, Colin Carroll
TF only has a CPU impl of RandomGamma. We recently made a change that makes it possible to XLA-compile gamma sampling, which should put more ops on GPU, but I'm not certain it will necessarily be faster, since the rejection sampling control flow still necessitates a few round-trips to the CPU. It would be worth seeing if tfp-nightly gets you a speedup.
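
For example, something along these lines (untested sketch) is one way to try the XLA path with tfp-nightly:

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Compiling the sampling function with XLA is one way to try the new
# code path for gamma sampling from tfp-nightly.
@tf.function(experimental_compile=True)
def sample_gamma(n):
    return tfd.Gamma(concentration=2.0, rate=tf.ones(2)).sample(n)

with tf.device('/device:GPU:7'):
    samples = sample_gamma(10)
print(samples)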

Brian Patton | Software Engineer | b...@google.com




Angel Berihuete

Mar 5, 2020, 7:40:57 AM
to TensorFlow Probability, angel.b...@gm.uca.es, colca...@google.com
I understand. Thanks for the information; I'll check tfp-nightly and try to put the whole dataset and the model on a single GPU.

rif

Mar 5, 2020, 10:44:24 AM
to Angel Berihuete, TensorFlow Probability
In my benchmarks, the new gamma sampler gives a very large speedup (~10x) for large numbers of samples (10000). Is gamma sampling a bottleneck for you? Note that if this were really a bottleneck, we could experiment with unrolling the loop some.
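
A quick way to check is to time the sampling in isolation, e.g. (rough sketch):

import time
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

gamma = tfd.Gamma(concentration=2.0, rate=tf.ones(2))

start = time.time()
samples = gamma.sample(10000)
_ = samples.numpy()  # force execution before stopping the clock
print("Drawing 10000 gamma samples took {:.3f}s".format(time.time() - start))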


Angel Berihuete

Mar 26, 2020, 4:55:14 AM
to TensorFlow Probability, angel.b...@gm.uca.es

Hi Rif, Brian,


Many thanks for your response. I am so sorry for the long delay in replying, but here in Spain the situation is complicated due to COVID-19.


The latest tests of my code do not show problems with the Gamma distribution, but I still wonder how to run the NUTS calculations on multiple GPUs.


I have a hierarchical model and I need to include the uncertainties of the observations. The first attempt was to use tfd.JointDistributionCoroutine following the advice in the TFP notebook Modeling_with_JointDistribution.ipynb. The problem with coding it this way was how to include uncertainties and observations in batches in order to replicate them on the GPUs. I tried something like


def probabilistic_model(uncertainties, hyperparameters):
    ...

concrete_model = functools.partial(probabilistic_model, batch_uncertainties, hyperparameters)
model = tfd.JointDistributionCoroutine(concrete_model)


but this seemed pointless to me, because I would need to build a model for each batch of uncertainties. My first question: do you have examples dealing with uncertainties using tfd.JointDistributionCoroutine?


Then my second attempt (which I am working on now) was to follow Brendan Hasz's notebook (https://github.com/brendanhasz/svi-gaussian-mixture-model/blob/master/BayesianGaussianMixtureModel.ipynb) in order to wrap my model in a tf.Module and move the model, with batches of observations and uncertainties, onto multiple GPUs properly. I wrote


class Mymodel(tf.Module):

    def __init__(self, hyperparameters):
        # store the hyperparameters and build the priors
        ...

    def __call__(self, x):
        # x: a batch of observations and uncertainties
        ...
        return log_joint_prob


and then I followed the recommendations for a [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training#training_loop) within a mirrored strategy:


model = Mymodel(hyperparameters)
optimizer = tf.keras.optimizers.Adam(lr=1e-3)

def train_step(inputs):
    ...

def distributed_train_step(dataset_inputs):
    ...  # run train_step on the batches of dataset_inputs via the mirrored strategy

for epoch in range(EPOCHS):
    ...


and it works properly on multiple GPUs!! … but now I do not know how to translate this in terms of NUTS. Any clue/help would be welcome.


My best wishes to you.



Brian Patton 🚀

Mar 30, 2020, 10:43:34 PM
to Angel Berihuete, TensorFlow Probability
Sorry for the slow response. I did cook up an example of splitting the likelihood calculation across GPUs but I was not happy with the speed and had wanted to do some further work (which has gotten derailed by higher priority stuff lately).

That said, maybe I can get it committed and we can improve speed later. I'll try to get something in under discussion/examples this week.


Angel Berihuete

Apr 1, 2020, 5:14:51 AM
to TensorFlow Probability, angel.b...@gm.uca.es
Hi Brian! It sounds great.

I've tried to split the likelihood calculation across GPUs without success. I've stopped developing the tf.Module version and gone back to the model coded with tfd.JointDistributionCoroutine. Then I followed the docs on using tf.distribute.Strategy (Mirrored) with custom training loops. Something weird happens. I have defined the model and the likelihood calculations inside the strategy scope

with mirrored_strategy.scope():
    mymodel = ...      # the model, using JointDistributionCoroutine

    log_prob_fn = ...  # log_prob_prior + log_prob_likelihood

    @tf.function
    def distributed_target_log_prob_fn(*args):
        per_replica_log_prob = mirrored_strategy.experimental_run_v2(log_prob_fn, args)
        return mirrored_strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_log_prob, axis=None)

    def target_log_prob_fn(*args):
        total_log_prob = 0.0
        num_batches = 0
        for x in train_dist_dataset:
            total_log_prob += distributed_target_log_prob_fn(*args, x)
            num_batches += 1
        return total_log_prob / num_batches


and then, OUTSIDE of the mirrored strategy scope, the typical NUTS code

@tf.function(autograph=False, experimental_compile=True)
def run_chain(
        init_state,
        step_size,
        target_log_prob_fn,
        unconstraining_bijectors,
        num_steps=5,
        burnin=2):

  def trace_fn(_, pkr):
    return (
        pkr.inner_results.inner_results.target_log_prob,
        pkr.inner_results.inner_results.leapfrogs_taken,
        pkr.inner_results.inner_results.has_divergence,
        pkr.inner_results.inner_results.energy,
        pkr.inner_results.inner_results.log_accept_ratio
           )

  kernel = tfp.mcmc.TransformedTransitionKernel(
    inner_kernel=tfp.mcmc.NoUTurnSampler(
      target_log_prob_fn,
      step_size=step_size),
    bijector=unconstraining_bijectors)

  hmc = tfp.mcmc.DualAveragingStepSizeAdaptation(
    inner_kernel=kernel,
    num_adaptation_steps=burnin,
    step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(
        inner_results=pkr.inner_results._replace(step_size=new_step_size)),
    step_size_getter_fn=lambda pkr: pkr.inner_results.step_size,
    log_accept_prob_getter_fn=lambda pkr: pkr.inner_results.log_accept_ratio
  )

  # Sampling from the chain.
  chain_state, sampler_stat = tfp.mcmc.sample_chain(
      num_results=num_steps,
      num_burnin_steps=burnin,
      current_state=init_state,
      kernel=hmc,
      trace_fn=trace_fn)
  return chain_state, sampler_stat


#step_size = [tf.cast(i, dtype=dtype) for i in 8*[.1]]
step_size=1.

print("="*50," Sampling ",50*"=")
t0 = time.time()
samples, sampler_stat = run_chain(
        init_state,
        step_size,
        target_log_prob_fn,
        constraining_bijectors)
t1 = time.time()
print("Inference ran in {:.2f}s.".format(t1-t0))
print("="*100)

The functions log_prob_fn, distributed_target_log_prob_fn and target_log_prob_fn work properly, but when I try to execute run_chain I get this error:

 LMC_BSC-v2.py:318 distributed_target_log_prob_fn  *
        per_replica_log_prob = mirrored_strategy.experimental_run_v2(log_prob_fn, args)
    /home/angel/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py:763 experimental_run_v2
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    LMC_BSC-v2.py:289 log_prob_fn  *
        ((*params, true_data), (obs_data, uncerts)) = args

    ValueError: too many values to unpack (expected 2)


I think run_chain is passing not only the chain proposals to target_log_prob_fn but also some other parameters produced by the sampler ...
It would be very useful to see how you split the likelihood across the GPUs.

Angel Berihuete

Apr 3, 2020, 3:33:32 AM
to TensorFlow Probability, angel.b...@gm.uca.es
Hi Brian! 

Do not worry. I understand the exceptional circumstances nowadays.

When you can, please share with us your proposal for splitting the likelihood calculation across GPUs.
 


Brian Patton 🚀

Apr 7, 2020, 11:55:59 AM
to Angel Berihuete, TensorFlow Probability
Here's the notebook I cooked up. There is probably a fair amount of headroom for performance improvement, but it's a start.


Brian Patton | Software Engineer | b...@google.com



Angel Berihuete

Apr 9, 2020, 2:42:04 AM
to TensorFlow Probability, angel.b...@gm.uca.es
Many thanks Brian!  This would be a great help to us.

Angel Berihuete

Apr 11, 2020, 1:31:15 PM
to TensorFlow Probability, angel.b...@gm.uca.es
Hi Brian. 

Your notebook works fine on Google Colab, but when I try to run the code on our cluster, on one node using 4 GPUs, the execution stops while trying to evaluate

```python
singleton_vals = tfp.math.value_and_gradient(target_log_prob, (loc, scale_tril))
```

When I print `singleton_vals` using your notebook on Google Colab I obtain:

```bash
WARNING:tensorflow:From <timed exec>:24: StrategyBase.experimental_run_v2 (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating: renamed to `run`
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/linalg/linear_operator_lower_triangular.py:158: calling LinearOperator.__init__ (from tensorflow.python.ops.linalg.linear_operator) with graph_parents is deprecated and will be removed in a future version.
Instructions for updating: Do not pass `graph_parents`. They will no longer be used.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
(<tf.Tensor: shape=(), dtype=float32, numpy=-16.259338>,
 [<tf.Tensor: shape=(3,), dtype=float32, numpy=array([ 1.6021107 ,  0.26656806, -0.55235565], dtype=float32)>,
  <tf.Tensor: shape=(3, 3), dtype=float32, numpy=
  array([[-1.4498465 ,  0.25800592, -1.9094281 ],
         [-1.9821234 , -0.6860708 , -0.76569635],
         [ 1.1947346 , -0.50319874, -1.886622  ]], dtype=float32)>])
CPU times: user 3.76 s, sys: 218 ms, total: 3.98 s
Wall time: 7.51 s
```
When I run the code on my cluster, it stops here:

```bash 
Do not pass `graph_parents`. They will no longer be used.
```

Do you think this issue is because I use TF 2.1 and not TF 2.2 as you do? Of course I do not use logical GPUs but physical GPUs. The resources (4 GPUs) are allocated using a Slurm queue.

Below is the final part of the output:

```bash
2020-04-11 19:22:33.164543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1, 2, 3
2020-04-11 19:22:33.164604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-11 19:22:33.164615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 1 2 3
2020-04-11 19:22:33.164623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N Y Y Y
2020-04-11 19:22:33.164631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1:   Y N Y Y
2020-04-11 19:22:33.164639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 2:   Y Y N Y
2020-04-11 19:22:33.164648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 3:   Y Y Y N
2020-04-11 19:22:33.171516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14978 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2020-04-11 19:22:33.172906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14978 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0004:05:00.0, compute capability: 7.0)
2020-04-11 19:22:33.174283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14978 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0035:03:00.0, compute capability: 7.0)
2020-04-11 19:22:33.175654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14978 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)
Number of devices: 4
2020-04-11 19:22:35.107488: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
tf.Tensor([-1.0574996   0.24829748  1.0737331 ], shape=(3,), dtype=float32)
tf.Tensor(
[[ 1.1475685   0.          0.        ]
 [ 1.9094281   0.5724521   0.        ]
 [-1.1899896   0.49813363  1.5088601 ]], shape=(3, 3), dtype=float32)
tf.Tensor([-0.9953702  0.3626416  1.0675195], shape=(3,), dtype=float32)
WARNING:tensorflow:From /apps/PYTHON/3.7.4_ML/lib/python3.7/site-packages/tensorflow_core/python/ops/linalg/linear_operator_lower_triangular.py:158: calling LinearOperator.__init__ (from tensorflow.python.ops.linalg.linear_operator) with graph_parents is deprecated and
will be removed in a future version.
Instructions for updating:
Do not pass `graph_parents`.  They will  no longer be used.
```

Any clue?

Many thanks.


Angel Berihuete

Apr 13, 2020, 2:43:39 AM
to TensorFlow Probability, angel.b...@gm.uca.es
Update. I found the problem.

I have a machine with 4 GPUs. When I try to run the code without logical GPUs, or with one logical GPU per physical GPU, the code hangs at

```singleton_vals = tfp.math.value_and_gradient(target_log_prob, (loc, scale_tril))```
 
When I set two or more logical GPUs per physical GPU, the code runs perfectly.

```python

physical_gpus = tf.config.experimental.list_physical_devices('GPU')
print(physical_gpus)

for gpu in physical_gpus:
    tf.config.experimental.set_virtual_device_configuration(gpu,
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2000)] * 2)
gpus = tf.config.list_logical_devices('GPU')
print(gpus)

```

Does this kind of configuration (two or more logical GPUs per physical GPU) have an impact on performance?

Brian Patton 🚀

Apr 13, 2020, 8:03:20 AM
to Angel Berihuete, TensorFlow Probability
I would skip running that configuration block, as long as all 4 are in the results of list_physical_devices.

I expect there is a performance penalty for using logical devices.
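
I.e., just something like this (sketch), with no virtual device configuration:

import tensorflow as tf

# Confirm all 4 physical GPUs are visible and let MirroredStrategy use them
# directly, without carving them into logical devices.
print(tf.config.list_physical_devices('GPU'))  # expect 4 entries

strategy = tf.distribute.MirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)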


Angel Berihuete

Apr 16, 2020, 4:01:50 AM
to TensorFlow Probability, angel.b...@gm.uca.es
Looking at each step in detail to understand why the process hangs at

singleton_vals = tfp.math.value_and_gradient(target_log_prob, (loc, scale_tril))

when using 4 GPUs, I have two questions:

1) Is the TF version important? When we use Google Colab we have TF 2.2.0-rc2 and TFP 0.9.0, but on the machine with 4 GPUs we have TF 2.1 and TFP 0.9.0.

We have seen that newer versions of TF rename experimental_run_v2 to run (per the deprecation warning above).
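
A minimal sketch of the rename (assuming a strategy and log_prob_fn are already defined):

# TF 2.1 / 2.2 name used in the notebook:
per_replica_lp = strategy.experimental_run_v2(log_prob_fn, args=(loc, scale_tril))
# Newer TF name, per the deprecation warning:
per_replica_lp = strategy.run(log_prob_fn, args=(loc, scale_tril))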

2) Why does the function target_log_prob return lp.values[0] and g.values[0]? Why not lp.values[1] or any other index in range(4)?

@tf.function(autograph=False)
@tf.custom_gradient
def target_log_prob(loc, scale_tril):
    lp, grads = st.experimental_run_v2(log_prob_and_grad, (loc, scale_tril, observations))
    return lp.values[0], lambda grad_lp: [grad_lp * g.values[0] for g in grads]



