Recommendations on how to improve sampling for a sparse regression model

jwei...@umich.edu

unread,

Mar 12, 2019, 3:11:37 PM3/12/19

to TensorFlow Probability

Hi TFP,

I'm seeking some advice on how best to sample from a sparse regression model with a horseshoe prior. I'm thinking of a toy example like this:

P = 200

N = 1000

num_true_effects = 10

zero_effects = P - num_true_effects

true_effects = np.random.normal(loc=0, scale = .5, size= num_true_effects)

effects = np.append(true_effects, np.zeros(zero_effects))

x = np.random.multivariate_normal(mean = np.zeros([P]), cov = np.eye(P), size = N)

y = np.matmul(x, effects) + np.random.normal(loc = 0., scale = 1., size = N)

def model():

sigma = ed.HalfNormal(scale=1., name = "sigma")

tau = ed.HalfCauchy(loc = 0., scale = sigma, name = "tau")

beta = ed.Horseshoe(scale = tf.ones([P]) * tau, name = "beta")

residual_sigma = ed.HalfCauchy(loc=0., scale=1., name = "residual_sigma")

y = ed.Normal(loc = tf.tensordot(x.astype(np.float32), beta, [[1], [0]]), scale = 1., name = "y")

return y

log_joint = ed.make_log_joint_fn(model)

def target_log_prob_fn(sigma, tau, beta, residual_sigma):

return log_joint(

sigma = sigma, tau = tau,

beta = beta,

residual_sigma = residual_sigma,

y = y

)

num_results = 3000

num_burnin_steps = 4000

step_size = tf.get_variable(

initializer = .01,

name = "step_size"

)

states, kernel_results = tfp.mcmc.sample_chain(

num_results=num_results,

num_burnin_steps=num_burnin_steps,

num_steps_between_results=5,

current_state=[

tf.ones([], name = "init_sigma"),

tf.ones([], name = "init_tau"),

tf.random_normal([P], name = "init_beta"),

tf.ones([], name = "residual_sigma")

],

kernel=tfp.mcmc.HamiltonianMonteCarlo(

target_log_prob_fn=target_log_prob_fn,

num_leapfrog_steps=10,

step_size = step_size,

step_size_update_fn=tfp.mcmc.make_simple_step_size_update_policy(num_adaptation_steps=None)

)

I found the above to be very sensitive to the initialization points for beta, and to produce posterior samples with high auto-correlation, leading to very low effective sample size values for the beta values (e.g., 15 for the first beta value).

I recognize that the horseshoe prior is not that easy to sample from, but any suggestions for how to improve sampling here with TFP tools? Perhaps an alternative parameterization, or maybe this is a matter of cranking up the leapfrog steps?

Thanks,

Josh

Dave Moore

unread,

Mar 12, 2019, 4:48:32 PM3/12/19

to jwei...@umich.edu, TensorFlow Probability

Hi Josh, that's a great question! Our (very limited) internal testing agrees with your experience that HMC struggles with the marginalized Horseshoe prior (`tfd/ed.Horseshoe`), likely because the tails are really fat.

I've gotten decent results from the formulation proposed by https://arxiv.org/abs/1610.05559 (Piironen and Vehtari, 2017), specifically the Stan code in the appendix. They start with the HalfCauchy->Normal representation, then unpack the HalfCauchy into an InverseGamma->HalfNormal compound (to eliminate the fat Cauchy tails), so you ultimately get a three-level InverseGamma->HalfNormal->Normal hierachy, and can additionally write the Normals in a noncentered parameterization which seems to help. Here's a slightly adapted, mostly untested version of my Edward2 implementation:

```python

def horseshoe_prior_in_compound_representation(num_features, global_scale=1.):

local_scale_variance = ed.InverseGamma(

0.5 * tf.ones([num_features]),

name='local_scale_scale')

local_scale_noncentered = ed.HalfNormal(

scale=tf.ones([num_features]),

name='local_scale_noncentered')

local_scale = local_scale_noncentered * tf.sqrt(local_scale_variance)

weights_noncentered = ed.Normal(loc=tf.zeros([num_features]),

scale=tf.ones([num_features]),

name='weights_noncentered')

weights = weights_noncentered * local_scale * global_scale

return weights

```

(where `weights` here would correspond to your `beta`, `global_scale` to `tau` in the conventional notation, and `local_scale` collects the `lambda_i` variables).

I never tried inferring tau, but the Piironen and Vehtari paper also recommends the InverseGamma->HalfNormal representation for the HalfCauchy prior, so that's probably worth trying for your model.

Tuning the number of leapfrog steps is also likely pretty important. We're hoping to have better answers for this in TFP, but for tuning by hand, Matt Hoffman's recommendation is (probably misquoting here) to use "as many leapfrog steps as you can, but no more". That is, keep increasing the number of leapfrog steps until your trace plots look like a random walk, i.e., have essentially zero autocorrelation, and then stop. The optimal number of leapfrog steps can be much higher than you might expect: hundreds or even 1-2k is not unheard of; NUTS chooses length-1024 trajectories quite often. The intuition is that early trajectories are dominated by random-walk behavior from the random initial momentum, while later in the trajectory the sampler is able to accumulate gradient information and behave asymptotically more efficiently, so you prefer to lengthen the trajectories as much as you can until you're essentially drawing an independent sample on each step.

Sorry this is all somewhat vague at the moment. :-) We'd love to hear more about your experience and what works for you!

Dave

--
You received this message because you are subscribed to the Google Groups "TensorFlow Probability" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tfprobabilit...@tensorflow.org.
Visit this group at https://groups.google.com/a/tensorflow.org/group/tfprobability/.

jwei...@umich.edu

unread,

Mar 14, 2019, 9:50:44 AM3/14/19

to TensorFlow Probability, jwei...@umich.edu

Thanks Dave. Yea, moving to the regularized horseshoe or other horseshoe versions is something that I'm exploring. This seems to help a little bit. However, I still find that even for toy problems, 300 leapfrog steps still is insufficient (out of 2000 posterior samples I'll get an ESS of like 20 or so). I'll keep increasing the number of leapfrog steps and trying other horseshoe variants.

Josh

Matt Hoffman

unread,

Mar 14, 2019, 12:30:04 PM3/14/19

to jwei...@umich.edu, TensorFlow Probability

It looks like you're working in the constrained space, i.e., sigma, tau, and residual_sigma are all constrained to be positive. Our HMC implementation doesn't like this kind of constraint, so that could be part of the problem. You might look into applying some kind of unconstraining transformation; for this kind of heavy-tailed prior, I'd recommend log (implemented as Invert(Exp())); the log-transform makes very heavy-tailed distributions like the half-Cauchy a bit better behaved.

Also, I agree with pretty much everything Dave said.

Matt

jwei...@umich.edu

unread,

Mar 16, 2019, 10:11:08 PM3/16/19

to TensorFlow Probability, jwei...@umich.edu

Thanks so much for the helpful responses, Matt and Dave. Just to clarify about mapping to an unconstrained space - you're recommending something like:

    kernel=tfp.mcmc.TransformedTransitionKernel(
        inner_kernel = tfp.mcmc.HamiltonianMonteCarlo(
            target_log_prob_fn=target_log_prob_fn,
            num_leapfrog_steps=num_leapfrog_steps,
            step_size = step_size,
            step_size_update_fn=tfp.mcmc.make_simple_step_size_update_policy(num_adaptation_steps=None)
        ),
        bijector = [
            tfp.bijectors.Invert(tfp.bijectors.Exp()),
            tfp.bijectors.Invert(tfp.bijectors.Exp()),
            tfp.bijectors.Identity(),
            tfp.bijectors.Invert(tfp.bijectors.Exp())
        ] 
)

Sadly, my initial experiments suggest that this isn't helping too much.

Josh

Josh Weinstock

unread,

Jan 15, 2020, 2:22:59 PM1/15/20

to TensorFlow Probability, jwei...@umich.edu

Hi Dave,

Is there a recommended way to express the compound horseshoe distribution using the new JointDistribution* API? I have had some trouble porting old horseshoe Edward2 code to the new API. Specifically, what is the best way to express the

local_scale = local_scale_noncentered * tf.sqrt(local_scale_variance)

step with JointDistributionSequential? Returning the product of the three distributions with the non-centered parameterization is also a bit challenging.

Thanks,

Josh

On Tuesday, March 12, 2019 at 4:48:32 PM UTC-4, Dave Moore wrote:

To unsubscribe from this group and stop receiving emails from it, send an email to tfprob...@tensorflow.org.

Dave Moore

unread,

Jan 25, 2020, 3:46:02 AM1/25/20

to Josh Weinstock, TensorFlow Probability

Hi Josh,

Sorry for the slow reply; I've been on vacation the past couple of weeks.

In general, I'd expect that E2 code would port more-or-less directly to JointDistributionCoroutine just by replacing `ed.Normal(...)` with `yield Normal(...)`, etc, everywhere. If you want to factor out the horseshoe to a reusable subroutine, then in the calling model that *uses* the horseshoe prior you'd write

weights = yield from horseshoe_prior_in_compound_representation(...)

where Python requires 'yield from' to pass through the nested random variables. I believe `return weights` inside the nested generator will do the right thing here; if not, please let us know. :-)

For JDSequential, there's not really a notion of deterministic intermediate values, so you unfortunately have to do the substitution manually: instead of explicitly defining the weights, you'd just define all the relevant precursor values and then use them to build the weights 'on the fly' as you do the regression. That is, you could write something like

joint_dist_including_regression_outputs = tfd.JointDistributionSequential([

ed.InverseGamma(0.5 * tf.ones([num_features]), 0.5 * tf.ones([num_features]), name='local_scale_scale'),

ed.HalfNormal(scale=tf.ones([num_features]), name='local_scale_noncentered'),

ed.Normal(loc=tf.zeros([num_features]), scale=tf.ones([num_features]), name='weights_noncentered'),

lambda weights_noncentered, local_scale_noncentered, local_scale_scale: tfd.Normal(

# Do the regression with implicit defn of `weights`:

loc=tf.matmul(weights_noncentered * local_scale_noncentered *

tf.sqrt(local_scale_variance) * global_scale),

regression_inputs),

scale=regression_noise_stddev,

name='regression_outputs')])

My take is that JDS wasn't really designed to make this use-case clean, so I'd probably prefer to use JDCoroutine here, but YMMV.

Dave

To unsubscribe from this group and stop receiving emails from it, send an email to tfprobabilit...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/tfprobability/eb42fa6f-1401-4df0-b6b1-2a37f989da88%40tensorflow.org.

Josh Weinstock

unread,

Jan 27, 2020, 4:58:53 PM1/27/20

to TensorFlow Probability, jwei...@umich.edu

Thanks Dave, greatly appreciate your help here. The following seems to work - does this seem right to you? Agree that this seems more natural than JointDistributionSequential

Root = tfd.JointDistributionCoroutine.Root
def horseshoe_prior_in_compound_representation(num_features=100, global_scale=.1):
  local_scale_variance = yield Root(tfd.Independent(
      tfd.InverseGamma(

        0.5 * tf.ones([num_features]),
        0.5 * tf.ones([num_features]),

        ),
      reinterpreted_batch_ndims = 1
      )
    )
  local_scale_noncentered = yield Root(tfd.Independent(
      tfd.HalfNormal(scale=tf.ones([num_features])),
      reinterpreted_batch_ndims = 1
      ))

  local_scale = local_scale_noncentered * tf.sqrt(local_scale_variance)

  weights_noncentered = yield Root(tfd.Independent(
          tfd.Normal(
              loc=tf.zeros([num_features]),
              scale=tf.ones([num_features])
            ),
            reinterpreted_batch_ndims = 1
        ))

  weights = weights_noncentered * local_scale * global_scale
  return weights

To unsubscribe from this group and stop receiving emails from it, send an email to tfprob...@tensorflow.org.

Reply all

Reply to author

Forward