EP and data partitioning


Andrew Gelman

Jul 9, 2015, 4:10:22 PM
to stan...@googlegroups.com
Hi all . . . just thinking aloud . . . I was just fitting a model in Stan with something like 10^6 data points, and I realized I'd want to subset the data to debug the program first. And this comes up all the time, which makes me think we should have some sort of data-subsetting widget . . . and this reminds me of EP-like algorithms . . . and . . . I'm not sure how best to do this. I'm thinking the way to start would be not to program it into Stan, but to try it on a couple of examples to get a workflow going. Kinda like what we just did with LOO/WAIC.

Anyway, no specific suggestions here, I just wanted to share these thoughts.
A


Daniel Lee

Jul 9, 2015, 4:15:26 PM
to stan...@googlegroups.com
I'm with you. The conceptual issue I had was how to describe subsetting unambiguously. It's never been clear to me how to specify it: which variables do you subset over? How do you do the subsetting? Etc. We need some sort of language that can describe it. Trying to read the mind of the user is going to be impossible, because the same data can be subset in many ways.





Andrew Gelman

Jul 9, 2015, 4:26:41 PM
to stan...@googlegroups.com
Yes, I agree, that’s why it probably makes sense to do some examples first.

But I think it can be done, in the same way that LOO and WAIC can be done. LOO and WAIC require a likelihood to be divided into N parts; they don't work with any arbitrary posterior.

Daniel Lee

Jul 9, 2015, 4:30:27 PM
to stan...@googlegroups.com
On Thu, Jul 9, 2015 at 4:26 PM, Andrew Gelman <gel...@stat.columbia.edu> wrote:
Yes, I agree, that’s why it probably makes sense to do some examples first.

Yes. If you think of any, let me know. I'm happy to help work on it so I can get a better grasp of what needs to be subset and how we need to describe that generally.



But I think it can be done, in the same way that LOO and WAIC can be done. LOO and WAIC require a likelihood to be divided into N parts; they don't work with any arbitrary posterior.

Once we restrict how we can subset, I think it'll be feasible to do this well. I still want to build an EP implementation in Stan... once I get some time. That definitely needs some subsetting of data. I think Alp and Dustin are also interested in subsetting for other purposes.



Dustin Tran

Jul 9, 2015, 6:18:44 PM
to stan...@googlegroups.com
Alp, Michael (Betancourt), and I had lunch today and talked about this, in fact. Separate from the problem of the language specification, there was mention of being able to implement a default subsampling with a var_context or so, but it went over my head.

Yes, it's still unclear to me how to specify the subsampling properly. In the mini-batch/stochastic variational inference framework, the subsampling scheme is left completely open. The standard is to sample uniformly without replacement, taking a fixed number M of data points per iteration; once you've done a full pass over the data, you repeat, i.e., keep sampling uniformly without replacement over the data set until convergence. However, there are a number of directions I can see this going in, e.g., you do importance sampling and actively change the probability weights of your sampling procedure as the algorithm runs.
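
For concreteness, a minimal standalone sketch of that standard scheme (plain C++, illustrative names only, nothing that currently exists in Stan): shuffle the indices, hand out M of them per iteration, and reshuffle after each full pass.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Uniform subsampling without replacement: M data-point indices per
// iteration, reshuffled after each full pass over the data.
class epoch_subsampler {
 public:
  epoch_subsampler(std::size_t N, std::size_t M, unsigned seed = 1234)
      : indices_(N), M_(M), pos_(0), rng_(seed) {
    std::iota(indices_.begin(), indices_.end(), std::size_t{0});
    std::shuffle(indices_.begin(), indices_.end(), rng_);
  }

  // Indices of the next minibatch of (at most) M data points.
  std::vector<std::size_t> next_minibatch() {
    if (pos_ >= indices_.size()) {  // full pass done: reshuffle, restart
      std::shuffle(indices_.begin(), indices_.end(), rng_);
      pos_ = 0;
    }
    std::size_t end = std::min(pos_ + M_, indices_.size());
    std::vector<std::size_t> batch(indices_.begin() + pos_,
                                   indices_.begin() + end);
    pos_ = end;
    return batch;
  }

 private:
  std::vector<std::size_t> indices_;
  std::size_t M_;
  std::size_t pos_;
  std::mt19937 rng_;
};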

While we may not want to give the user this complete freedom, the internals in Stan should at least allow a developer to write a function that does the subsampling on their own terms, by directly choosing which data points to access. There are a variety of subsampling schemes that I imagine will be invented in the future, so I feel like a good interface, at least for stochastic variational inference, is to specify the subsampling at runtime rather than at compile time. For instance, you specify the mini-batch size subsample=M as an argument, or you specify subsample=intelligent_importance_resampling_thing. Then the internals would simply find out the number of data points available and run that subsampling scheme.
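
A hypothetical sketch of what choosing the scheme at runtime could look like on the inside, assuming a small subsampler interface plus a factory keyed on the argument. None of these names exist in Stan; an importance-weighted scheme (or the without-replacement sampler above) would just be another implementation of the same interface.

#include <cstddef>
#include <memory>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

// Each iteration the algorithm asks for minibatch indices and does not
// care how they were chosen.
struct subsampler {
  virtual std::vector<std::size_t> next_minibatch(std::size_t N) = 0;
  virtual ~subsampler() {}
};

// Simplest concrete scheme: M indices drawn uniformly with replacement.
class uniform_subsampler : public subsampler {
 public:
  explicit uniform_subsampler(std::size_t M) : M_(M), rng_(1234) {}
  std::vector<std::size_t> next_minibatch(std::size_t N) {
    std::uniform_int_distribution<std::size_t> pick(0, N - 1);
    std::vector<std::size_t> batch(M_);
    for (std::size_t i = 0; i < M_; ++i)
      batch[i] = pick(rng_);
    return batch;
  }
 private:
  std::size_t M_;
  std::mt19937 rng_;
};

// A runtime argument like subsample=M (or subsample=<some clever scheme>)
// would map onto a factory like this.
std::unique_ptr<subsampler> make_subsampler(const std::string& scheme,
                                            std::size_t M) {
  if (scheme == "uniform")
    return std::unique_ptr<subsampler>(new uniform_subsampler(M));
  throw std::invalid_argument("unknown subsampling scheme: " + scheme);
}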

Dustin

Michael Betancourt

Jul 9, 2015, 6:47:04 PM
to stan...@googlegroups.com
What I was suggesting at lunch was the following:

In the model specify all of your sizes as data,

data {
  int N1;
  int N2;
  ...
}

Normally these would be the full size.  For subsampling
just change your data file to make these sizes smaller,
and then implement a new var_context that does something
like

vector_t subsample_var_context::get_vector(const std::string& name,
                                           size_t size) {
  vector_t v = base_var_context.get_vector(name);

  // The model asks for the (smaller) size declared in its data block;
  // if the stored vector is larger, subsample it down to that size.
  if (v.size() > size)
    return subsample(v, size);
  // Asking for more data than is stored is still an error.
  if (v.size() < size)
    throw std::runtime_error("Size mismatch for " + name);
  return v;
}

In other words, if you ask the var_context for a smaller
object it doesn’t complain but rather subsamples.  This
will require a bunch of other stuff to get done (think Stan3)
but would be straightforward to add then.

Alp Kucukelbir

Jul 10, 2015, 9:40:22 AM
to stan...@googlegroups.com
to follow up on dustin and michael's comments, i implemented this with a hack for the ADVI paper. in terms of variational inference, the adjustment to the Stan model is quite straightforward. compare figure 10 to figure 11 in the paper (http://arxiv.org/pdf/1506.03431.pdf)

this is the simplest version i can think of. S_in_minibatch is fixed. outside of Stan i randomly selected S_in_minibatch rows from the full data matrix without replacement.
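
the outside-of-Stan step is essentially the following (a sketch, not the actual code used for the paper; Eigen is assumed only because Stan already depends on it):

#include <Eigen/Dense>
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Pick S_in_minibatch rows of the full data matrix uniformly at random
// without replacement; only these rows (plus S_in_minibatch itself) get
// passed to Stan as data.
Eigen::MatrixXd subsample_rows(const Eigen::MatrixXd& X,
                               int S_in_minibatch,
                               std::mt19937& rng) {
  std::vector<int> rows(X.rows());
  std::iota(rows.begin(), rows.end(), 0);
  std::shuffle(rows.begin(), rows.end(), rng);

  Eigen::MatrixXd X_sub(S_in_minibatch, X.cols());
  for (int i = 0; i < S_in_minibatch; ++i)
    X_sub.row(i) = X.row(rows[i]);
  return X_sub;
}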

i would be eager to work on this once i'm back from traipsing around europe.

Bob Carpenter

Jul 10, 2015, 12:58:13 PM
to stan...@googlegroups.com
What are the data scalability and model generality goals here?

Is the only need to handle subsetting an array y of i.i.d. data?
What about predictors or group identifiers for multilevel structure?

Ideally, we could figure out how to subset without having to
reload the data each minibatch if all the data fit in memory. Reading
in non-binary data from disk is horrendously expensive.
And if we blow memory locality by choosing random subbatches, it
gets that much more expensive either to do more I/O or even to grab
the data out of RAM.

As to the var_context idea, maybe two different methods or a trait as
to whether to allow subsetting? I worry that otherwise it defeats
all the size consistency testing. I'm very picky that way!
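
To make the "two different methods or a trait" option concrete, one hypothetical shape for it (just a sketch, nothing like this exists in var_context today) would be to keep the strict accessor exactly as it is and put any size-relaxing behavior behind a separate, explicitly named accessor:

#include <cstddef>
#include <stdexcept>
#include <vector>

// The strict accessor keeps today's exact-size check; subsampling is
// only possible through a method the caller explicitly opts into.
class subsetting_reader {
 public:
  explicit subsetting_reader(const std::vector<double>& full)
      : full_(full) {}

  // Strict accessor: any size mismatch is an error, as today.
  std::vector<double> get(std::size_t expected_size) const {
    if (full_.size() != expected_size)
      throw std::invalid_argument("size mismatch");
    return full_;
  }

  // Permissive accessor: the caller explicitly asks for a subsample
  // (here just a prefix, to keep the sketch short).
  std::vector<double> get_subsampled(std::size_t size) const {
    if (size > full_.size())
      throw std::invalid_argument("requested more data than available");
    return std::vector<double>(full_.begin(), full_.begin() + size);
  }

 private:
  std::vector<double> full_;
};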

We could also broadcast up if the user didn't provide enough data and
make everything very robust in the sense of never stopping to say "hey, I
don't have any data". Given the way we're setting up var_context, we might
even be able to pretty easily allow random generation of data if the user
specified constraints appropriately.

- Bob

Michael Betancourt

Jul 10, 2015, 4:21:41 PM
to stan...@googlegroups.com
Given the giant mess that adding general subsampling into the language
would be, I was just offering the subsampling var_context as an
intermediate solution to allow for algorithm development.

Andrew Gelman

Jul 12, 2015, 9:41:26 PM
to stan...@googlegroups.com
Just to be clear, for subsetting we want some sort of “N” which represents individual data points. No need for these data to be “iid.” The iid thing is a red herring here.