Stan HMC vs TFP HMC ?


Shira Mitchell

Apr 17, 2021, 3:15:17 PM4/17/21
to TensorFlow Probability
Dear TFP community,

I'm a statistician and Stan user, new to this group and to TFP.

I tried to compare Stan HMC vs TFP HMC for the same model, data generated from that model, and number of iterations. Stan takes 2 minutes, TFP takes 20+ minutes. Changing to Colab GPU reduces TFP to 5 minutes. However, this isn't really a fair comparison to Stan, and I would like to run these models on CPUs. 

Is this dramatic runtime difference expected, or might there be something inefficient about the way I’m using the library?

Apologies if this is entirely covered by other conversations or documentation.


Thanks so much,
Shira

Pavel Sountsov

Apr 19, 2021, 9:48:23 PM4/19/21
to Shira Mitchell, TensorFlow Probability
There are at least four sources of inefficiency in the TFP implementation:

1. TensorFlow will only parallelize within atomic operations (e.g. it'll run matrix multiplies/einsums in parallel), but not much else. This is unlike Stan, where each chain sits completely independently on its own core, achieving near-perfect parallelization (cache issues aside).
2. TFP's NUTS is batch aware and will run all 4 chains in a SIMD fashion, which implies that each NUTS step takes as long as the maximum number of leapfrog steps across the chains.
3. TFP's NUTS has quite a bit of control flow, which TensorFlow struggles with.
4. You're using XLA compilation, which compiles the program the first time you run a @tf.function-decorated function. This compilation can be surprisingly slow at times. In Stan, you're measuring this cost in a separate cell.

I can believe all those points multiplied together could result in the 10x slowdown you're seeing.
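To isolate point 4, you can time the first call to a jit-compiled function (which includes XLA compilation) separately from subsequent calls. A minimal sketch, with a hypothetical toy function standing in for the sampler:

```python
import time

import tensorflow as tf

# Toy stand-in for the sampling function; jit_compile=True triggers XLA
# compilation the first time the function is called.
@tf.function(jit_compile=True)
def f(x):
  return tf.reduce_sum(x * x)

x = tf.ones([1000])

start = time.time()
first = f(x)  # includes XLA compilation
compile_and_run = time.time() - start

start = time.time()
second = f(x)  # already compiled; measures pure run time
run_only = time.time() - start

print(compile_and_run, run_only)
```

Measuring the two calls separately makes the comparison to Stan (where compilation happens in its own cell) more apples-to-apples.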

1 and 2 could potentially be addressed by using TensorFlow's SPMD facilities, which will explicitly run computations on different threads. Sadly, the API is a little byzantine, but here's a rather large snippet to get you started:

import tensorflow as tf
import tensorflow_probability as tfp

NUM_CORES = 2
NUM_CHAINS = 4
NUM_CHAINS_PER_CORE = NUM_CHAINS // NUM_CORES
assert NUM_CORES * NUM_CHAINS_PER_CORE == NUM_CHAINS

# Split the single physical CPU into NUM_CORES logical devices.
physical_devices = tf.config.experimental.list_physical_devices()
tf.config.experimental.set_virtual_device_configuration(
    physical_devices[0],
    [tf.config.experimental.VirtualDeviceConfiguration()] * NUM_CORES)

print(tf.config.list_logical_devices())

# MirroredStrategy replicates the computation across the logical devices.
strategy = tf.distribute.MirroredStrategy(
    devices=tf.config.list_logical_devices())

def target_log_prob_fn(x):
  return -tf.reduce_sum((x / tf.linspace(0.1, 1., 10))**2, -1)

@tf.function(autograph=False, jit_compile=True)
def sample(seed):
  return tfp.mcmc.sample_chain(
      1000,
      tf.zeros([NUM_CHAINS_PER_CORE, 10]),
      kernel=tfp.mcmc.NoUTurnSampler(target_log_prob_fn, step_size=0.1),
      trace_fn=None,
      seed=seed)

# Give each replica its own PRNG seed.
seeds = tfp.random.split_seed((0, 0), NUM_CORES)
seeds = strategy.experimental_distribute_values_from_function(
    lambda ctx: seeds[ctx.replica_id_in_sync_group])

# Run on all replicas and concatenate the per-replica chains along the
# chain (batch) axis.
chain = tf.nest.map_structure(lambda x: tf.concat(x.values, 1),
                              strategy.run(sample, (seeds,)))


Dave Moore

Apr 19, 2021, 11:01:38 PM4/19/21
to Pavel Sountsov, Shira Mitchell, TensorFlow Probability
Another general observation is that with TFP it's often efficient to run tens or hundreds of simultaneous chains, because you're limited by the hardware's SIMD capability rather than the number of CPU cores (of course this is especially powerful if you're using a GPU). That is, the point in the num_chains vs num_steps tradeoff space that works best for Stan may not be optimal for TFP, and vice versa.

I'd guess that Pavel's advice is more likely to be useful to you specifically, since it looks like you're using a decent-sized data set (N=10000) where computing the likelihood is probably already saturating your CPU's vectorized instructions. But for others reading this thread: increasing the number of chains is sometimes a free win that's worth being aware of.

Dave

Junpeng Lao

Apr 20, 2021, 5:06:04 AM4/20/21
to TensorFlow Probability, dav...@google.com, Shira Mitchell, TensorFlow Probability, Pavel Sountsov
Hi Shira,
A few things I would like to add to the great points raised by Pavel and Dave: benchmarking is difficult to do right, as there are so many moving parts in different environments.
In your particular case, there are a few specific reasons why TFP is slow:
- the sampling routine in TFP is suboptimal; a poorly tuned NUTS suffers even more speed-wise in SIMD
- the model is different in Stan and TFP
  - the implementation in TFP uses einsum to multiply a pretty large array

So I rewrote the model a bit and changed the TFP sampling routine (this is something I usually use, and it gives fairly good answers in a reasonable time). It now runs on CPU for 4 chains, 1000 samples + 1000 tuning:
Stan: 39s
TFP: 35s

(Note that the TFP sampling routine uses a simple tuning scheme, so the posterior is not as high quality as with a more careful tuning scheme like window adaptation.)

Happy to follow up with you if you have questions about the changes in the model/sampling routine.
Best,
Junpeng
