Convolution2DFlipout Layer with Non-Diagonal Covariance Prior


Haley Jennings

Aug 4, 2023, 10:44:00 AM
to TensorFlow Probability
Good Morning!

My thesis advisor and I are trying to train models on data with well-understood texture, and we think the model could benefit from imposing a prior that enforces the covariance structure of those textures.

Is it currently possible to use a Convolution2DFlipout layer kernel with a prior that has a non-diagonal covariance structure?  I've set the kernel_prior_fn argument of the layer to sample from a tfp.distributions.MultivariateNormalTriL distribution, in which I set the scale_tril appropriately for the covariance matrix that we'd like to use (see attached images).
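
(For context, since the attached images aren't reproduced here: a kernel_prior_fn along those lines might look like the sketch below, where `texture_scale_tril` is a hypothetical placeholder for the Cholesky factor of the desired texture covariance.)

```
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def kernel_prior_fn(dtype, shape, name, trainable, add_variable_fn):
  # `texture_scale_tril` is a hypothetical [k, k] Cholesky factor of the
  # texture covariance; `shape` is the conv kernel shape, and the trailing
  # kernel dimension becomes the MVN event dimension.
  return tfd.MultivariateNormalTriL(
      loc=tf.zeros(shape, dtype=dtype),
      scale_tril=tf.cast(texture_scale_tril, dtype))
```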

However, while the layer appears to be correctly implemented, I get the following error upon running my code:

NotImplementedError: No KL(distribution_a || distribution_b) registered for distribution_a type Independent and distribution_b type MultivariateNormalTriL

It would appear that there is no built-in KL(distribution_a || distribution_b) registration for what I am trying to do.  Is there a way around this or some modification I can make that I'm not seeing?  Any assistance would be greatly appreciated.

Thanks!

Haley


Image 8-4-23 at 7.36 AM.jpeg
Image 8-4-23 at 7.37 AM.jpeg

Christopher Suter

Aug 4, 2023, 11:45:06 AM
to Haley Jennings, TensorFlow Probability
What are you passing as `kernel_posterior_fn`? Or, more to the point, what is the distribution it's creating? It sounds like it's an Independent wrapped around something else. If that something else is another (mv)normal, you should be able to write a custom kernel_divergence_fn that computes this KL without too much trouble. A bit more context will help us help you further. Thanks for the well-posed question so far, though!


Andrew and Haley Jennings

Aug 4, 2023, 12:51:07 PM
to Christopher Suter, TensorFlow Probability
Hi Chris,

The kernel_posterior_fn is currently as follows: 
F41E0C7F-4683-4B20-BE33-177A904C7070_4_5005_c.jpeg

Thanks,

Haley

Christopher Suter

Aug 4, 2023, 1:12:42 PM
to Andrew and Haley Jennings, TensorFlow Probability
Thanks, can you try this:

```
# Assumes: import tensorflow as tf; import tensorflow_probability as tfp;
# tfd = tfp.distributions; mean_norm and stddev_norm defined elsewhere.
loc_scale_fn = tfp.layers.default_loc_scale_fn(
    loc_initializer=tf.keras.initializers.HeNormal(),
    untransformed_scale_initializer=tf.keras.initializers.TruncatedNormal(
        mean=mean_norm,
        stddev=stddev_norm))

def kernel_posterior_fn(dtype, shape, name, trainable, add_variable_fn):
  # loc_scale_fn returns a (loc, scale) tuple; unpack it into the MVN.
  return tfd.MultivariateNormalDiag(
      *loc_scale_fn(dtype, shape, name, trainable, add_variable_fn))
```

Andrew and Haley Jennings

Aug 4, 2023, 2:31:13 PM
to Christopher Suter, TensorFlow Probability
Chris,

I added the above code, and now I think we're getting to the root of the issue... The error I get now is:

TypeError: `Conv2DFlipout` requires `kernel_posterior_fn` produce an instance of `tfd.Independent(tfd.Normal)` (saw: "conv2d_flipout_MultivariateNormalDiag").

- Haley

Christopher Suter

Aug 7, 2023, 9:25:33 AM
to Andrew and Haley Jennings, TensorFlow Probability
Right, I see. It's dependent on the mean field posterior assumption. In that case I think we need to use the old posterior_fn and rewrite the KL divergence instead:

```
kl_divergence_fn = lambda q, p, _: (1. / num_training_samples) * tfp.distributions.kl_divergence(
    q, tfd.MultivariateNormalDiag(p.distribution.loc, scale_diag=p.distribution.scale)
)
```

I *think* that should work. I'm just reaching into the Independent(Normal), pulling out the params, and constructing an MVN that has a registered KL with your prior MVN.

If you get shape errors (or any errors, really) let me know and I'll try and help debug.

HTH!

Andrew and Haley Jennings

Aug 7, 2023, 1:53:28 PM
to Christopher Suter, TensorFlow Probability
Chris,

I tried the above code, and am getting an error similar to our original error again: 
"NotImplementedError: No KL(distribution_a || distribution_b) registered for distribution_a type Independent and distribution_b type MultivariateNormalDiag"

Unless I'm mistaken, p is generally the prior and q is the posterior, correct?  If so, that would make me think it's the (currently independent) posterior (q) that we need to pull the params out of, in order to match to the prior.  In that vein, I tried flipping your code around, as follows:

```
kl_divergence_fn = lambda q, p, _: (1. / num_training_samples) * tfp.distributions.kl_divergence(
    tfp.distributions.MultivariateNormalDiag(
        loc=q.distribution.loc, scale_diag=q.distribution.scale),
    p)
```

That gave me a new error, as follows:

File "/home/haley.j...@ern.nps.edu/prob_sas/trainer/BayesianResNet_hj.py", line 254, in <lambda>
    kernel_divergence_fn_v2 = lambda q, p, _: (1. / num_training_samples) * tfp.distributions.kl_divergence(tfp.distributions.MultivariateNormalDiag(loc=q.distribution.loc, scale_diag=q.distribution.scale), p)
  File "/home/haley.j...@ern.nps.edu/miniconda3/envs/thesis/lib/python3.8/site-packages/tensorflow_probability/python/distributions/kullback_leibler.py", line 101, in kl_divergence
    kl_t = kl_fn(distribution_a, distribution_b, name=name)
  File "/home/haley.j...@ern.nps.edu/miniconda3/envs/thesis/lib/python3.8/site-packages/tensorflow_probability/python/distributions/mvn_linear_operator.py", line 383, in _kl_brute_force
    b_inv_a = b.scale.solve(a.scale.to_dense())
  File "/home/haley.j...@ern.nps.edu/miniconda3/envs/thesis/lib/python3.8/site-packages/tensorflow/python/ops/linalg/linear_operator.py", line 873, in solve
    tensor_shape.dimension_at_index(
  File "/home/haley.j...@ern.nps.edu/miniconda3/envs/thesis/lib/python3.8/site-packages/tensorflow/python/framework/tensor_shape.py", line 281, in assert_is_compatible_with
    raise ValueError("Dimensions %s and %s are not compatible" %
ValueError: Dimensions 9 and 16 are not compatible

Any ideas?

Thanks again for all your help!

Haley Jennings

Christopher Suter

Aug 7, 2023, 2:07:31 PM
to Andrew and Haley Jennings, TensorFlow Probability
Yes, you're right re: prior vs posterior.

Would need more info about your model to help with the remaining shape issue. Can you share some kind of minimal repro in a colab notebook or something so I can debug?

Andrew and Haley Jennings

Aug 7, 2023, 6:47:04 PM
to Christopher Suter, TensorFlow Probability
Chris,

Just shared a colab notebook with you that has my model.  I can add any other files you need as well - just can't share the actual data.

Thanks again!
Haley Jennings

Christopher Suter

Aug 8, 2023, 10:51:26 AM
to Andrew and Haley Jennings, TensorFlow Probability
Thanks for sharing this Haley. In order to help I will need to be able to reproduce the error. The colab contains the model but no indication of how to run it so as to produce the error. Maybe you can just pass in some fake data of the correct shape? Let me know if you can update the colab to produce the error you're seeing, and I'm happy to dig into it.

Andrew and Haley Jennings

Aug 8, 2023, 12:42:02 PM
to Christopher Suter, TensorFlow Probability
*resending, as I hit reply instead of reply-all*

Hi Chris,

I've updated the colab with a few cells that create some random data of the correct shape and build the model, and I've managed to reproduce the error I was getting on my own system.  Hopefully you're able to see those changes and maybe figure out where I'm going wrong now!

It appears to be hanging up in the first layer where the shape changes from (None, 300, 300, 16) to (None, 150, 150, 32), so my best guess is that something about how we're initializing the Convolution2DFlipout kernel is causing the error when the shape changes.  I tried a couple different methods, but I can't figure out what I need to do to get it through that transition.

I appreciate your help!

Haley Jennings

Andrew and Haley Jennings

Aug 8, 2023, 3:07:41 PM
to Christopher Suter, TensorFlow Probability
Update: I've managed to fix the shape error and the model compiles now.  However, I'm getting a new error that I'm not sure what to do with... I've shared a colab notebook with the fixed code.

Thanks again!
Haley

Christopher Suter

Aug 9, 2023, 2:35:20 PM
to Andrew and Haley Jennings, TensorFlow Probability
I managed to fix it by wrapping the return value of the modified KL divergence with tf.reduce_sum(...). I edited the colab heavily to make it easier for me to navigate and understand; hope that's OK. Just take a look at kl_divergence_fn_v2 to see the only salient change I made.
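
(In code, that change might look something like the following sketch, built on the flipped kl_divergence_fn from earlier in the thread, with num_training_samples as before; the colab itself isn't reproduced here.)

```
kl_divergence_fn_v2 = lambda q, p, _: tf.reduce_sum(
    # Reduce the batch of per-kernel-slice KLs down to the scalar loss
    # that Keras expects.
    (1. / num_training_samples) * tfp.distributions.kl_divergence(
        tfp.distributions.MultivariateNormalDiag(
            loc=q.distribution.loc, scale_diag=q.distribution.scale),
        p))
```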

Andrew and Haley Jennings

Aug 9, 2023, 2:47:26 PM
to Christopher Suter, TensorFlow Probability
Chris,

Just implemented your change in my original code, and my model is training now!  I really appreciate all your help!

Thanks,
Haley

Christopher Suter

Aug 9, 2023, 2:50:25 PM
to Andrew and Haley Jennings, TensorFlow Probability
🙌

Andrew and Haley Jennings

Aug 31, 2023, 10:38:52 AM
to Christopher Suter, TensorFlow Probability
Hi Chris!

Thanks again for all your help with my last issue. I have a follow-up question for you. Do you know if it's possible to use a DistributionLambda layer as the posterior for a Convolution2DFlipout layer? From my understanding, the DistributionLambda layer should allow me to pick a tfp distribution and then sample from that distribution in a similar manner to default_mean_field_normal_fn(), which is the default posterior for Convolution2DFlipout.

Basically, I'm looking for a way to try out various distributions as posteriors and priors on our texture data, one that simplifies editing the KL divergence equation by ensuring the posterior/prior distributions are compatible. I'm currently looking at the Weibull distribution, but would theoretically like the option to use any of them.

I've attached some screenshots of what I've tried and the errors I'm getting. I can update the code in the colab too if that would be easier to troubleshoot (this is the only real change since before).

Thanks again,
Haley Jennings
Image 8-31-23 at 7.32 AM.jpeg
Image 8-31-23 at 7.32 AM.jpeg
Image 8-31-23 at 7.33 AM.jpeg
Image 8-31-23 at 7.36 AM.jpeg

Christopher Suter

Aug 31, 2023, 10:44:04 AM
to Andrew and Haley Jennings, TensorFlow Probability
Is there a reason you want to use DistributionLambda, instead of just a callable (Python lambda or function)? It might be possible to make this work, but it seems like Keras is unhappy with this nesting of Layers. I'm not very conversant in Keras, so I'm not sure how deep the issue is. I'd recommend just returning the Weibull directly from your kernel_posterior_fn (et al.), rather than returning a DistributionLambda. Does that make sense/would that work for you?

Andrew and Haley Jennings

Aug 31, 2023, 11:46:21 AM
to Christopher Suter, TensorFlow Probability
That's a good point - a callable should do the trick.  I changed it to a lambda that returns the distribution and now I'm running into some new errors: 

Any ideas?

Thanks,
Haley Jennings
Image 8-31-23 at 8.43 AM.jpeg
Image 8-31-23 at 8.45 AM.jpeg

Christopher Suter

Aug 31, 2023, 12:17:28 PM
to Andrew and Haley Jennings, TensorFlow Probability
The function you pass as the `kernel_posterior_fn` argument should have this interface: https://github.com/tensorflow/probability/blob/v0.21.0/tensorflow_probability/python/layers/util.py#L174

Dig in a bit there and hit me with more questions? Sorry, this API is a bit old and has not gotten a lot of usability-love...basically ever!
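
(For reference, a skeleton of that interface, mirroring the signature used earlier in the thread; the body here is just a placeholder:)

```
def kernel_posterior_fn(dtype, shape, name, trainable, add_variable_fn):
  # Must return a tfd.Distribution whose samples match `shape` and `dtype`.
  # `add_variable_fn` is the layer's add_weight; create any trainable
  # parameters through it so Keras tracks them.
  ...
```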

Andrew and Haley Jennings

Aug 31, 2023, 3:56:11 PM
to Christopher Suter, TensorFlow Probability
Chris,

For the most part, that worked! Only issue now is what we ran into last time: 
     TypeError: `Conv2DFlipout` requires `kernel_posterior_fn` produce an instance of `tfd.Independent(tfd.Normal)` (saw: "conv2d_flipout_Independentconv2d_flipout_Weibull").

Do you have any ideas for how to get around this?  Or am I running into an issue with the theory itself that necessitates a Normal posterior?

Thanks again!
Haley Jennings

Christopher Suter

Sep 1, 2023, 12:17:26 AM
to Andrew and Haley Jennings, TensorFlow Probability
Oof, sorry about that. From §3.1 of the flipout paper, it sounds like the only real requirement is a surrogate posterior that is symmetric around zero, but the library hard-codes a normality assumption. You might be able to make a copy of the layer code and hack it up to relax this assumption. I'm not sure there's an easier workaround than that.

Andrew and Haley Jennings

Sep 3, 2023, 9:39:12 PM
to Christopher Suter, TensorFlow Probability
Chris,

I've been playing with this for a couple of days now and making some progress. I reconstructed the layer itself to remove the Normal distribution assumption; however, I keep running into the following error:

ValueError: Shape must be rank 4 but is rank 0 for '{{node conv2d_flipout_new/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], explicit_paddings=[], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true](Placeholder, conv2d_flipout_new/Conv2D/Identity)' with input shapes: [?,300,300,1], [].

I've attached the new version of the kernel posterior function, as well as the edits I made to the _apply_variational_kernel function in my version of the ConvVariational layer.  I'm pretty sure the issue is in my implementation of the posterior function rather than the layer edits, because it seems that the layer isn't getting the shape it expects, but I could be wrong.  Anything sticking out to you that might be causing my issue?

Thanks again!
Haley Jennings

Image 9-3-23 at 6.33 PM.jpeg
Image 9-3-23 at 6.37 PM.jpeg

Christopher Suter

Sep 3, 2023, 10:38:00 PM
to Andrew and Haley Jennings, TensorFlow Probability
I haven't looked closely, but I think the distribution you produce, Weibull here, needs to have events whose shape and dtype equal the arguments passed to the posterior fn. You are only producing scalar-shaped parameters, and hence the resulting distribution's events are scalar-shaped. IIUC, the learnable parameters will be one per weight, so they need to match the conv kernel shapes. Does that make sense? Look closely at the code generating the normal distribution and see if you can spot that pattern and replicate it.

Andrew and Haley Jennings

Sep 4, 2023, 8:43:39 PM
to Christopher Suter, TensorFlow Probability
Yep! That was definitely the culprit - I rebuilt the posterior to behave the same way as default_mean_field_normal_fn does, and got some models to train. Thanks again for all the help!

Haley Jennings

Christopher Suter

Sep 4, 2023, 9:04:25 PM
to Andrew and Haley Jennings, TensorFlow Probability
🎉🎉🎉

Christopher Suter

Sep 5, 2023, 3:35:21 PM
to Andrew and Haley Jennings, TensorFlow Probability
Btw, you probably also want to constrain the Weibull params to be positive. Just call tf.math.softplus on them before passing them into the constructor.
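
(Putting the shape advice and the positivity constraint together, a hypothetical posterior fn might look like the sketch below; names and initializers are illustrative, not taken from the thread's colab.)

```
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def weibull_posterior_fn(dtype, shape, name, trainable, add_variable_fn):
  # One learnable parameter per kernel weight, so the event shape matches
  # `shape`, mirroring default_mean_field_normal_fn.
  pre_concentration = add_variable_fn(
      name=name + '_pre_concentration', shape=shape, dtype=dtype,
      initializer=tf.keras.initializers.TruncatedNormal(mean=0.5, stddev=0.1),
      trainable=trainable)
  pre_scale = add_variable_fn(
      name=name + '_pre_scale', shape=shape, dtype=dtype,
      initializer=tf.keras.initializers.TruncatedNormal(mean=0.5, stddev=0.1),
      trainable=trainable)
  # Softplus keeps both Weibull parameters positive, per the note above.
  return tfd.Independent(
      tfd.Weibull(concentration=tf.math.softplus(pre_concentration),
                  scale=tf.math.softplus(pre_scale)),
      reinterpreted_batch_ndims=len(shape))
```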

Andrew and Haley Jennings

Sep 7, 2023, 10:58:05 AM
to Christopher Suter, TensorFlow Probability
That's a very good point - thanks!

Haley Jennings

Andrew and Haley Jennings

Nov 7, 2023, 12:48:24 PM
to Christopher Suter, TensorFlow Probability
Hi Chris,

Thanks for all your help in the past! I've got a follow-up question for you, going back to the multivariate model (with a full, non-diagonal covariance matrix) we were discussing a couple of months back. The current structure is a ResNet20 base model with three stages, where the feature map size is halved (32x32 to 16x16 to 8x8) and the number of filters is doubled (16 to 32 to 64) between stages. The issue I'm running into is that I can implement my 16x16 full covariance matrix prior in the first stage (where the number of filters is 16), but not at the other two stages (if I try, I get a value error because the 16x16 shape is not compatible with the number of filters: 32 at stage 2 and 64 at stage 3).

My question is: why does the number of filters need to match the size of the prior's covariance matrix?  

Any help you can provide would be great - thanks again!

Haley Jennings

Christopher Suter

Nov 7, 2023, 1:24:42 PM
to Andrew and Haley Jennings, TensorFlow Probability
Hi Haley, hope you're well. The prior for each conv layer must be a distribution whose samples are shaped like the filters of that layer. If you have, say, 32 outputs and 5x5 filters with 3 channels, your filter kernel will have shape [5, 5, 3, 32], and we'll need to populate all of those parameters with samples from our prior distribution. I'm a little unclear on what's done about the [5, 5, 3] bit of that shape, but you'll either need a covariance of shape [5, 5, 3, 32, 32] or something that can broadcast up to that shape (e.g., just 32x32). But 16 won't work for a layer with such a shape. The code that populates the kernel_shape parameter of your prior/posterior distribution maker fns, found here: https://github.com/tensorflow/probability/blob/v0.22.0/tensorflow_probability/python/layers/conv_variational.py#L179, is:

    kernel_shape = self.kernel_size + (input_dim, self.filters)

where kernel_size is built up from the kernel_size parameter, repeated however many times is appropriate for your conv dimension (probably 2). The base kernel_size in my example above was 5, which for conv2d becomes [5, 5].
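
(For concreteness, the shape arithmetic from the example above:)

```
# 5x5 kernel, 3 input channels, 32 filters:
kernel_size = (5, 5)
input_dim, filters = 3, 32
kernel_shape = kernel_size + (input_dim, filters)
print(kernel_shape)  # (5, 5, 3, 32)
```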

So in short, IIUC, you probably need to learn priors for each layer separately. LMK if this makes sense (or doesn't).

Andrew and Haley Jennings

Nov 7, 2023, 3:30:47 PM
to Christopher Suter, TensorFlow Probability
Chris,

Thanks for the explanation!  I think it mostly makes sense, but I'm still a little confused about why the covariance matrix shape needs to match the filter rather than the kernel size.  Maybe that's from a lack of understanding of exactly what a filter vs. a kernel is?  Here's what I'm understanding from what you said above:
- Assuming a kernel size of 5x5, and 3 channels, that's where we get the first three pieces of the shape.
- In your example above, is number of outputs the same as number of filters?  So you'd have (5,5,3,32) due to (kernel_height, kernel_width, num_channels, num_filters)?
- Where is the second 32 coming from in your example (5,5,3,32,32) for the covariance shape that we need to give?  Is that just because the covariance matrix needs to be square? 

Sorry for my lack of understanding here - just trying to figure out exactly what's happening so I can implement it in the most correct way.
Thanks again!

Haley Jennings

Christopher Suter

Nov 7, 2023, 3:56:01 PM
to Andrew and Haley Jennings, TensorFlow Probability
Yeah, it's all a bit confusing. My understanding in this context (partly from the docstrings, like this one for the `filters` argument to the Conv2D layer init fn) is:
  - filters is the number of outputs of the layer
  - kernel refers to all the weights in the layer (excluding bias terms). So this is a big tensor of shape (kernel_height, kernel_width, num_channels, num_filters), as you say in bullet 2.

The second 32 is, yeah, just because it's a covariance over a 32-dimensional thing.
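
(The shape requirement can be seen with the distribution alone, outside the layer; a sketch using the shapes discussed above:)

```
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Event dim 16 matches the 16x16 scale_tril: batch [4, 4, 3], event [16].
ok = tfd.MultivariateNormalTriL(
    loc=tf.zeros([4, 4, 3, 16]), scale_tril=tf.eye(16))

# Trailing kernel dim 32 vs. a 16x16 scale_tril: raises an
# incompatible-dimensions error, as in the stage-2/stage-3 layers.
bad = tfd.MultivariateNormalTriL(
    loc=tf.zeros([4, 4, 3, 32]), scale_tril=tf.eye(16))
```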

Andrew and Haley Jennings

Nov 15, 2023, 11:46:41 AM
to Christopher Suter, TensorFlow Probability
Chris,

Thanks for your explanation!  If you don't mind, I'm going to walk through the structure of my model so that I can make sure I understand exactly what's happening here.

Assume we have a ResNet model, a kernel size of 4x4, and 3-channel images.  The model has three stages:
    - Stage 1: Filter shape is (4,4,3,16) due to 4x4 kernel, 3 channels, 16 filters
    - Stage 2: Filter shape is (4,4,3,32) due to number of filters doubling from stage 1 to 2
    - Stage 3: Filter shape is (4,4,3,64) due to number of filters doubling from stage 2 to 3

When I have the code print the shape of the prior function at stage 1, I get (4,4,3,16,16).  When it prints at stage 2, I get (4,4,3,16,32), which is where it errors out.  Looking at the code you referenced before (line 193), is that because the kernel prior adopts the same shape as the kernel from line 179, which would be (4,4,3,32), and can't reconcile 16 vs 32?

Is that what's going on here or am I missing something?

Thanks again!
Haley Jennings

Christopher Suter

Nov 15, 2023, 12:29:28 PM
to Andrew and Haley Jennings, TensorFlow Probability
Can you link me to an updated colab where I can reproduce the error?