just a thought on fitting big hierarchical regression models


Andrew Gelman

Oct 27, 2015, 3:16:08 PM
to stan development mailing list
Hi, as we’ve discussed, it can be slow to fit big hierarchical models in Stan. In practice we might be reduced to something yucky (unstable, inflexible) such as lme4. An example is the project I’ve been doing with the National Board of Medical Examiners, in which I have something like 200,000 data points and some hierarchical factors (one factor with 50 levels, another with 15 levels, and possibly their interaction with 50*15 levels). So, what to do? Stan is fine, but it could take 30 minutes to run, while lme4 takes just 2 minutes.

One option we’re looking at is ADVI, which might solve our problems. Another option is MML, once we implement it. Another option is to make the model non-hierarchical: often we have some prior info on the group-level variance parameters, and if we treat these group-level variance parameters as known, we can simply run Stan-optimize with big bad BFGS. But what if we have only partial info on the group-level variances? Then some iterative approach could work? Maybe MML could be really useful in such settings. Or EP?
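To make that last option concrete, here is a minimal sketch using rstan (this is not the actual NBME model; the variable names, the plugged-in sd values, and the stan_data list are all made up for illustration): the group-level sds are passed in as data, so only the intercept and group effects are estimated, and optimizing() with BFGS finds the mode.

library(rstan)

# Toy two-factor logistic regression with *known* group-level sds.
# Because sigma_state and sigma_spec come in as data rather than
# parameters, the posterior mode can be found with BFGS instead of NUTS.
code <- "
data {
  int<lower=1> N;
  int<lower=1> J_state;
  int<lower=1> J_spec;
  int<lower=1,upper=J_state> state[N];
  int<lower=1,upper=J_spec> spec[N];
  int<lower=0,upper=1> y[N];
  real<lower=0> sigma_state;  // treated as known
  real<lower=0> sigma_spec;   // treated as known
}
parameters {
  real alpha;
  vector[J_state] a_state;
  vector[J_spec] a_spec;
}
model {
  a_state ~ normal(0, sigma_state);
  a_spec ~ normal(0, sigma_spec);
  y ~ bernoulli_logit(alpha + a_state[state] + a_spec[spec]);
}
"
fit <- optimizing(stan_model(model_code = code),
                  data = c(stan_data,  # N, y, state, spec: assumed already built
                           list(sigma_state = 0.5, sigma_spec = 0.3)),
                  algorithm = "BFGS")

With the sds fixed, the log posterior is concave in the remaining parameters, so BFGS should be fast and stable; the open question above is what to do when the sds are only partially known.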

I guess what I’m getting at is that this is really annoying. When N is large, we can learn a lot about the variance parameters, so really things should go fast, not slow. It’s great to have full Nuts around as a backup, but something’s really bothering me about how we’re fitting these models now.

A

Ben Goodrich

Oct 27, 2015, 10:54:01 PM
to stan development mailing list, gel...@stat.columbia.edu
On Tuesday, October 27, 2015 at 3:16:08 PM UTC-4, Andrew Gelman wrote:
I guess what I’m getting at is that this is really annoying.  When N is large, we can learn a lot about the variance parameters, so really things should go fast, not slow.  It’s great to have full Nuts around as a backup, but something’s really bothering me about how we’re fitting these models now.

Are you talking about this example?

https://groups.google.com/d/msg/stan-dev/4cHQoaaS8GQ/_-ebxNDm6mUJ

Analyzing that model under rstanarm (which uses a different parameterization than the .stan file you wrote, which is where the csr_matrix_times_vector stuff comes from), the functions that appeared at least 3 times when the profiler sampled a time slice are shown below. The columns are:
  1. Number of profiling samples in this function
  2. Percentage of profiling samples in this function
  3. Percentage of profiling samples in the functions printed so far
  4. Number of profiling samples in this function and its callees
  5. Percentage of profiling samples in this function and its callees
  6. Function name
goodrich@T540p:/tmp/nbme_fake$ google-pprof --text /usr/lib/R/bin/exec/R nbme_fake_rstanarm.log
Using local file /usr/lib/R/bin/exec/R.
Using local file nbme_fake_rstanarm.log.
Total: 6730 samples

     (1)    (2)   (3)       (4)   (5)  (6)

    2482  36.9%  36.9%     2506  37.2% _IO_str_seekoff
    1272  18.9%  55.8%     2185  32.5% __libc_malloc
     574   8.5%  64.3%      574   8.5% __signbitl
     290   4.3%  68.6%      299   4.4% stan::math::add
     269   4.0%  72.6%     3168  47.1% stan::math::csr_matrix_times_vector
     199   3.0%  75.6%      199   3.0% Eigen::internal::redux_impl::run
     174   2.6%  78.2%     1610  23.9% stan::math::dot_product
     144   2.1%  80.3%     1474  21.9% stan::math::multiply
     120   1.8%  82.1%      120   1.8% stan::math::::add_vv_vari::chain [clone .lto_priv.1660]
     119   1.8%  83.8%      119   1.8% stan::math::::dot_product_vari::chain
     112   1.7%  85.5%      112   1.7% cfree
      87   1.3%  86.8%      103   1.5% RunGenCollect (inline)
      54   0.8%  87.6%      662   9.8% stan::math::bernoulli_logit_log
      52   0.8%  88.4%     1509  22.4% Eigen::internal::conditional_aligned_new_auto
      48   0.7%  89.1%       48   0.7% std::vector::emplace_back [clone .constprop.1413]
      38   0.6%  89.7%       38   0.6% __sched_yield
      37   0.5%  90.2%       37   0.5% stan::math::::dot_product_vari::initialize [clone .isra.2247] [clone .part.2248] [clone .constprop.1273]
      33   0.5%  90.7%       77   1.1% model_binomial_namespace::make_b
      33   0.5%  91.2%       33   0.5% stan::math::stored_gradient_vari::chain
      32   0.5%  91.7%     6039  89.7% stan::math::gradient
      29   0.4%  92.1%       29   0.4% stan::math::check_size_match
      29   0.4%  92.5%      280   4.2% stan::math::grad [clone .constprop.1445]
      28   0.4%  92.9%       28   0.4% _init@64680
      28   0.4%  93.4%      214   3.2% exp
      23   0.3%  93.7%       26   0.4% Eigen::PlainObjectBase::lazyAssign
      23   0.3%  94.0%       23   0.3% stan::math::assign
      22   0.3%  94.4%       22   0.3% strerror_l
      21   0.3%  94.7%       21   0.3% stan::math::check_range [clone .constprop.1394]
      18   0.3%  94.9%       18   0.3% stan::math::initialize [clone .constprop.1294]
      16   0.2%  95.2%       16   0.2% stan::math::csr_u_to_z
      15   0.2%  95.4%       15   0.2% stan::interface_callbacks::writer::sum_values::operator
      15   0.2%  95.6%       15   0.2% stan::math::::not_nan::check [clone .lto_priv.1527]
      14   0.2%  95.8%       26   0.4% CONS_NR.localalias.14
      14   0.2%  96.0%     6453  95.9% bcEval
      13   0.2%  96.2%       13   0.2% cov_na_1 (inline)
      13   0.2%  96.4%       13   0.2% stan::math::log1p
      12   0.2%  96.6%       12   0.2% Eigen::internal::assign_impl::run
      12   0.2%  96.8%       12   0.2% SortNodes (inline)
       9   0.1%  96.9%        9   0.1% R_qsort
       9   0.1%  97.1%     5721  85.0% model_bernoulli_namespace::model_bernoulli::log_prob
       8   0.1%  97.2%        8   0.1% getc
       8   0.1%  97.3%       62   0.9% model_bernoulli_namespace::model_bernoulli::write_array
       7   0.1%  97.4%        8   0.1% Rf_mkPROMISE
       6   0.1%  97.5%     6457  95.9% Rf_eval
       6   0.1%  97.6%        6   0.1% _IO_feof
       6   0.1%  97.7%        6   0.1% memset
       5   0.1%  97.7%        5   0.1% stan::math::OperandsAndPartials::OperandsAndPartials [clone .constprop.1316]
       4   0.1%  97.8%        4   0.1% GLIBC_2.15
       4   0.1%  97.9%        4   0.1% R_HashGet
       4   0.1%  97.9%       95   1.4% Rf_allocVector3
       4   0.1%  98.0%       22   0.3% Rf_matchArgs
       4   0.1%  98.0%        4   0.1% boost::random::detail::unit_normal_distribution::operator
       4   0.1%  98.1%        4   0.1% rcmp
       4   0.1%  98.2%        5   0.1% rstan::rstan_sample_writer::operator [clone .constprop.572]
       4   0.1%  98.2%        4   0.1% stan::math::::multiply_vv_vari::chain [clone .lto_priv.1646]
       4   0.1%  98.3%        7   0.1% stan::math::normal_log
       4   0.1%  98.3%     5905  87.7% stan::mcmc::base_leapfrog::evolve
       3   0.0%  98.4%        3   0.0% AgeNodeAndChildren
       3   0.0%  98.4%       16   0.2% Eigen::Matrix::Matrix [clone .constprop.1302]
       3   0.0%  98.5%        3   0.0% Eigen::PlainObjectBase::_set_noalias
       3   0.0%  98.5%        5   0.1% Rf_copyVector
       3   0.0%  98.6%        7   0.1% Rf_findFun
       3   0.0%  98.6%        3   0.0% Rf_findVarInFrame3
       3   0.0%  98.6%        3   0.0% Rf_length (inline)
       3   0.0%  98.7%        5   0.1% Rf_mkCharLenCE
       3   0.0%  98.7%        3   0.0% __hypot_finite
       3   0.0%  98.8%        3   0.0% _init@64df0
       3   0.0%  98.8%        3   0.0% do_array
       3   0.0%  98.9%        5   0.1% stan::io::reader::vector

I don't know exactly what _IO_str_seekoff does, but it is some low-level C library function. The other stuff looks pretty typical in the sense that 89.7% of the samples are in things called when evaluating the gradient. If you are doing a hierarchical model with this many groups (specialty:state, 433; state, 49; specialty, 16), and hence this many group-level parameters, I don't know where we are going to get much extra speed without using an algorithm that makes fewer gradient calls.

Ben



Andrew Gelman

Oct 28, 2015, 1:09:33 AM
to Ben Goodrich, stan development mailing list
Hi, yes, I’m thinking we make fewer gradient calls by doing optimization rather than Nuts.
A

Andrew Gelman

Oct 28, 2015, 4:42:16 PM
to Seth Flaxman, stan development mailing list
Sounds cool, I’d like to hear about this. Our current approach seems wrong, somehow. We have full Nuts but then when it’s slow, we resort to all sorts of crude approximations. There’s gotta be a bridge.


> On Oct 28, 2015, at 7:02 AM, Seth Flaxman <fla...@gmail.com> wrote:
>
> One idea might be to investigate stochastic gradient Langevin dynamics
> (http://www.icml-2011.org/papers/398_icmlpaper.pdf). Has anyone used
> it? I'm starting to use it for some GP models, will be happy to report
> back in a few weeks.
>
> Seth
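
For reference, the basic SGLD update is a stochastic-gradient step on the log posterior with Gaussian noise injected at each iteration. A toy sketch in R (the function and argument names are made up for illustration, and this is not Seth's GP code):

# One SGLD update: half a step of size eps along the gradient of the log
# prior plus an (N / n_batch)-rescaled minibatch estimate of the
# log-likelihood gradient, plus Gaussian noise with variance eps.
sgld_step <- function(theta, eps, data, n_batch, grad_log_prior, grad_log_lik) {
  idx <- sample(nrow(data), n_batch)
  g <- grad_log_prior(theta) +
    (nrow(data) / n_batch) * grad_log_lik(theta, data[idx, , drop = FALSE])
  theta + 0.5 * eps * g + rnorm(length(theta), mean = 0, sd = sqrt(eps))
}

The N/n_batch rescaling keeps the minibatch gradient an unbiased estimate of the full-data gradient; with small (or decreasing) step sizes the injected noise dominates the discretization error and the iterates behave like posterior samples.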


Dustin Tran

Oct 28, 2015, 4:48:35 PM
to stan...@googlegroups.com, Seth Flaxman
more directly, we can do sgd in hmc following papers from emily fox, e.g., http://arxiv.org/abs/1506.04696; it performs better than sgd + langevin dynamics. i think michael would be against having data subsampling in our monte carlo methods though

another (radical?) alternative is using variational distributions for mcmc http://arxiv.org/abs/1410.6460

Dustin

Andrew Gelman

Oct 28, 2015, 4:51:32 PM
to stan...@googlegroups.com, Seth Flaxman
Data subsampling is great but it would be good to have an algorithm that does not require data subsampling.  I say this not out of any sort of purism but simply because data subsampling requires some structure on the data representation, and I’d prefer to do the algorithm more generically if possible.
Again, I think we gotta do something.  I say this because we already are doing something.  It’s just a shitty something.
A

Michael Betancourt

Oct 28, 2015, 4:56:21 PM
to stan...@googlegroups.com
I wrote a paper explaining exactly why these methods don’t work,
and I’ll be damned if they get into Stan. (╯°□°)╯︵ ┻━┻

Michael Betancourt

Oct 28, 2015, 5:01:11 PM
to stan...@googlegroups.com

Again, I think we gotta do something.  I say this because we already are doing something.  It’s just a shitty something.

It takes 30 minutes to fit a giant model?  That’s incredibly advanced
compared to where we were even a few years ago.  Any more approximate
method is going to sacrifice the accuracy of the fit, which won’t be
easy to verify.  And I’d consider that pretty shitty.

What is the ultimate utility of being able to run models so quickly if
it compromises the fit? I get the desire to prototype models, but that 
can be done with fewer data and then scaled up once the model is
complete.  

I look forward to 10 years from now when we finally understand
these methods enough to diagnose pathologies and see just how
bad the fits were (as we’re seeing for lots of BUGS models now)…

Andrew Gelman

Oct 28, 2015, 5:05:12 PM
to stan...@googlegroups.com
Here’s the deal.  If it takes 30 minutes in Stan, I will typically wait the 30 min, but sometimes I’ll run the 1-minute version using lme4.  I think Stan should have something that’s as good as lme4 and as fast as lme4.

That is, I want us to own the efficient frontier.  We own full Bayes, we own optimizing, I want to own some more points along that frontier.  I really really don’t like having to go outside of Stan to fit my models.

Michael Betancourt

Oct 28, 2015, 6:02:58 PM
to stan...@googlegroups.com

That is, I want us to own the efficient frontier.  We own full Bayes, we own optimizing, I want to own some more points along that frontier.  I really really don’t like having to go outside of Stan to fit my models.

_That_ is something I can get behind.  I just want users to be aware that
these methods aren’t competing with HMC — they have a different purpose
(exploratory data analysis, prototyping, etc) and can coexist with Stan.  

Let’s destroy lme4 while teaching users to say “I only needed an approximation
so I used fastStan instead of hmcStan” instead of “hmcStan was too slow but
fastStan gave me the same answer in much less time”.  If only for my sanity.

Also, we absolutely need a 10 second video of Andrew saying "I want us to own 
the efficient frontier.  We own full Bayes, we own optimizing, I want to own some 
more points along that frontier.” to Patriotic music.

Andrew Gelman

Oct 28, 2015, 6:04:25 PM
to stan...@googlegroups.com
Well, we own much of the efficient frontier.  For really hard problems, you’ll need a Turing-complete language like Church.

Michael Betancourt

Oct 28, 2015, 6:16:32 PM
to stan...@googlegroups.com
I heard particle methods are the future of all statistics.