Scaling data and parameters

Cameron Bracken

Jun 1, 2015, 8:11:33 PM
to stan-...@googlegroups.com
My question is in regard to this statement in the Stan manual: "we found that normalizing the data to unit sample mean and variance sped up the fits by an order of magnitude"

Does this apply to normalizing the parameter values too? I have a model where the actual values of different parameters vary over a few orders of magnitude. Could this be slowing my model down?
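For reference, the normalization the manual describes can be done in the transformed data block; a minimal sketch, assuming a simple regression with predictor x and outcome y (illustrative names, current Stan syntax):

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  // standardize predictor and outcome to unit sample mean and variance
  vector[N] x_std = (x - mean(x)) / sd(x);
  vector[N] y_std = (y - mean(y)) / sd(y);
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y_std ~ normal(alpha + beta * x_std, sigma);
}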

Thanks,
Cameron

Bob Carpenter

Jun 1, 2015, 9:58:42 PM
to stan-...@googlegroups.com
The parameters are actually what matter. Stan tries
to fit the scale of each during adaptation (technically by
estimating a diagonal mass matrix used to scale each parameter
during the Hamiltonian updates) but it helps immensely to keep
everything on roughly a unit scale.
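Concretely, if you know a parameter's rough magnitude in advance, you can sample it on a unit scale and rescale it afterward; a minimal sketch (the 1e4 scale here is purely illustrative):

parameters {
  real theta_raw;  // sampled on roughly unit scale
}
transformed parameters {
  real theta = 1e4 * theta_raw;  // actual parameter, known to be O(1e4)
}
model {
  theta_raw ~ normal(0, 1);  // implies theta ~ normal(0, 1e4)
  // ... likelihood written in terms of theta ...
}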

I don't know how much this matters for parameters that are
small, say if you have some regression parameters that are
pulled very close to zero by pooling.

- Bob

Andrew Gelman

Jun 1, 2015, 10:08:04 PM
to stan-...@googlegroups.com
From a Bayesian point of view, keeping things on a unit scale is a good idea because then it’s generally easy to include prior information and to batch parameters in hierarchical models. I discuss this a bit in my 2004 paper.
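For instance, with everything on unit scale, a batch of group effects in a hierarchical model can share a single scale parameter; a sketch, with illustrative names, in the non-centered form:

data {
  int<lower=1> J;  // number of groups
}
parameters {
  vector[J] eta;      // group effects on unit scale
  real<lower=0> tau;  // group-level scale
}
transformed parameters {
  vector[J] theta = tau * eta;  // actual group effects
}
model {
  eta ~ normal(0, 1);
  tau ~ normal(0, 1);  // half-normal via the lower bound
  // ... likelihood written in terms of theta ...
}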
A

Michael Betancourt

Jun 1, 2015, 11:30:41 PM
to stan-...@googlegroups.com
To add another data point: scaling helps with both modeling
and computation. As Andrew notes, scaling essentially
identifies the natural units of the problem, which makes it
much easier to define weakly informative priors (a weakly
informative prior essentially identifies the expected order of
magnitude of a parameter; if the parameters are all scaled
appropriately then these are all O(1)). Additionally, when all
of the parameters are scaled, the posterior becomes more or
less isotropic, which makes any computational algorithm much
better conditioned. For example, in HMC this means a much more
stable and efficient integrator, as the cost of the leapfrog
integrator goes as largest_scale / smallest_scale, which is
minimized when largest_scale ~ smallest_scale. Moreover,
floating-point arithmetic is most accurate when the parameters
are all O(1), further avoiding computational problems.
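To make that concrete: once the data and parameters are on unit scale, the weakly informative priors can all be written as O(1); a sketch, assuming the standardized regression from earlier in the thread (illustrative names):

data {
  int<lower=0> N;
  vector[N] x_std;  // standardized predictor
  vector[N] y_std;  // standardized outcome
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  // with standardized data, all parameters are expected to be O(1),
  // so unit-scale weakly informative priors apply across the board
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);  // half-normal via the lower bound
  y_std ~ normal(alpha + beta * x_std, sigma);
}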

Bob Carpenter

Jun 1, 2015, 11:35:41 PM
to stan-...@googlegroups.com
Thanks --- that's a really nice summary, Michael.

- Bob

Cameron Bracken

Jun 2, 2015, 1:48:21 PM
to stan-...@googlegroups.com
Thanks for the excellent explanations; I will definitely work on scaling my parameters.