a question from jeff alstott


Alp Kucukelbir

Jan 27, 2016, 10:10:39 PM
to stan development mailing list, jeffa...@gmail.com
jeff asks:

I'm a long-time Python user just starting with Stan via PyStan. I've come to Stan because I want to do a large out-of-core calculation: read large datasets (16-30 million data points) into RAM, fit a particular kind of model to them, then do it again, updating the model with more data.

I'm using a conditional logistic regression model, which someone else has implemented here.

I'm using a computer with 64GB of RAM and 16 CPUs.

I load in 15 million data points, and Stan seems to choke. I wait around for ages and while the machine whirs, the calculation doesn't complete. Jim told me to do "optimizing" first, but I haven't found any clear guide on that (there are lots of references to "optimizing" with Stan, but no how-to guide for PyStan).

This is where I am right now:

from pystan import StanModel
sm = StanModel(model_code=clogit_stan) #clogit_stan is from here. This compiles without errors/warnings.

### data_dict is boring and just sets up stuff for the model
data_dict = {'N': data.shape[0],
             'n_grp': data['Agent_Entry'].nunique(),
             'n_coef': len(data_cols),
             'x': data[data_cols].values,
             'y': data['Entered'].astype('int').values,
             'grp': data['Agent_Entry'].values + 1}

### what am I doing wrong?
op = sm.optimizing(data=data_dict)
fit = sm.sampling(data=data_dict, init=op)


If I were smart I would start with smaller data sets :-D Unfortunately I just want to jump right into the deep end, but it seems I'm missing some understanding.

Thanks!

Michael Betancourt

Jan 27, 2016, 11:28:23 PM
to stan...@googlegroups.com
This belongs on the Users’ List.

> I load in 15 million data points, and Stan seems to choke. I wait around for ages and while the machine whirs, the calculation doesn't complete.

The problem is likely not the data but the number of parameters and the memory
needed to hold that state plus the history of the Markov chain. As has
been recommended many times on the Users’ List, Cmdstan’s streaming I/O is
typically much more appropriate for large problems like this.
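The streaming pattern in question doesn't need anything fancy on the consumer side. A sketch in plain Python (stdlib only; the column names and draws below are invented stand-ins for real CmdStan CSV output, not actual results) of folding over a draws file without ever holding the whole chain in memory:

```python
import csv
import io

# Stand-in for a CmdStan output file: header row, then one draw per line.
fake_output = io.StringIO(
    "lp__,beta.1,beta.2\n"
    "-7.1,0.5,1.2\n"
    "-6.9,0.4,1.1\n"
    "-7.0,0.6,1.3\n"
)

# Fold over the draws one at a time; memory stays O(number of columns),
# no matter how long the chain is.
reader = csv.reader(fake_output)
header = next(reader)
count, sums = 0, [0.0] * len(header)
for row in reader:
    for j, v in enumerate(row):
        sums[j] += float(v)
    count += 1

means = [s / count for s in sums]
print(dict(zip(header, means)))
```

The same one-pass fold works for variances, quantile sketches, or thinning, which is why writing draws to disk as they arrive scales where keeping the chain in RAM doesn't.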

> Jim told me to do "optimizing" first, but I haven't found any clear guide on that (there are lots of references to "optimizing" with Stan, but no how-to guide for PyStan).

Please don’t make statements like this when there is doc literally
on the front page of the PyStan website, https://pystan.readthedocs.org/en/latest/.

> If I were smart I would start with smaller data sets :-D Unfortunately I just want to jump right into the deep end, but it seems I'm missing some understanding.

No, you have the right understanding; you're just not using it.

Bob Carpenter

Jan 28, 2016, 12:24:56 PM
to stan...@googlegroups.com
Who's Jim here?

We can't be held responsible for people jumping into
the deep end before learning to swim!

If we don't already, we should have a way to stream
output from RStan and PyStan. Maybe if we ever get
through this refactor.

- Bob

Ben Goodrich

Jan 28, 2016, 12:36:51 PM
to stan development mailing list
On Thursday, January 28, 2016 at 12:24:56 PM UTC-5, Bob Carpenter wrote:
If we don't already, we should have a way to stream
output from RStan and PyStan.  Maybe if we ever get
through this refactor.

We do already in RStan; it is the sample_file argument which specifies the CSV file to write to. But you still have to read the CSV file in at some point, which is very slow.

Krzysztof Sakrejda

Jan 28, 2016, 1:19:04 PM
to stan development mailing list
I've been looking into alternative formats that might be useful and so far my recommendation is HDF5 (https://www.hdfgroup.org/HDF5/doc/cpplus_RM/create_8cpp-example.html). Pros are:

1) friendly C++/C interfaces
2) interfaces in R, Python, etc., though no guarantee about completeness for those.
3) really solid spec for the binary format---as far as I can tell, yes I did read much of it (https://www.hdfgroup.org/HDF5/doc/Specs.html)
4) they take large file sizes seriously, with options to fragment one "file" in a transparent way so the filesystem doesn't choke
5) the library takes care of locking and pre-allocation of filesystem space
6) seems like somebody has taken a shot at writing Eigen structures into HDF5 (https://github.com/garrison/eigen3-hdf5), so maybe they made some of the mistakes for us already.

It's not optimized for streaming specifically, but it does look like we could do read/write by blocks in a way that made it scalable for the interfaces (and faster than parsing csv). I have a few things to take care of first but I'd like to get a writer done for this.
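The block-wise read/write pattern described above can be sketched without HDF5 at all, using only the standard library. The file layout here (fixed-width rows of packed doubles, no header) is an invented stand-in for what an HDF5 dataset would give us; the point is just that appending blocks and seeking to a draw are cheap:

```python
import os
import struct
import tempfile

N_PARAMS = 3
ROW = struct.Struct("<%dd" % N_PARAMS)  # one draw = N_PARAMS little-endian doubles

path = os.path.join(tempfile.mkdtemp(), "draws.bin")

# Writer: append draws one block at a time as they arrive.
with open(path, "ab") as f:
    for block in ([(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)], [(0.7, 0.8, 0.9)]):
        for draw in block:
            f.write(ROW.pack(*draw))

# Reader: jump straight to draw i without parsing everything before it.
def read_draw(i):
    with open(path, "rb") as f:
        f.seek(i * ROW.size)
        return ROW.unpack(f.read(ROW.size))

print(read_draw(2))  # the third draw
```

HDF5 adds the metadata, chunking, and compression layers on top of this; the sketch is only the access pattern that makes block I/O faster than re-parsing CSV.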



Bob Carpenter

Jan 28, 2016, 3:01:55 PM
to stan...@googlegroups.com
Their C++ implementation uses a home-grown modified BSD license:

https://www.hdfgroup.org/products/licenses.html

Might as well shoot your project in the foot from the get-go. Or
maybe they wanted to thumb their noses at the GPL.
Because it's not GPL compatible. See:

http://www.gnu.org/licenses/license-list.en.html#GPLCompatibleLicenses

Both the MPL that Eigen uses and the Boost license are
on that list of compatible licenses, so Stan's compatible.

- Bob

Krzysztof Sakrejda

Jan 28, 2016, 3:21:12 PM
to stan development mailing list
On Thursday, January 28, 2016 at 3:01:55 PM UTC-5, Bob Carpenter wrote:
> Their C++ implementation uses a home-grown modified BSD license:
>
> https://www.hdfgroup.org/products/licenses.html
>
> Might as well shoot your project in the foot from the get-go. Or
> maybe they wanted to thumb their noses at the GPL.
> Because it's not GPL compatible. See:
>
> http://www.gnu.org/licenses/license-list.en.html#GPLCompatibleLicenses
>
> Both the MPL that Eigen uses and the Boost license are
> on that list of compatible licenses, so Stan's compatible.

Looking at their license, it doesn't seem like we can't use it---or am I missing something?

Krzysztof

Bob Carpenter

Jan 28, 2016, 3:25:54 PM
to stan...@googlegroups.com
Licenses are a swamp.

You're missing GPL compatibility. Follow the link.

It's not that the license is bad, it's that only a finite
set of officially approved licenses are approved as compatible
with GPL. So we couldn't distribute anything with this C++
library and RStan, I don't think.

The problem we have is that R is GPL-ed.

Or maybe I'm missing something here? I'm not an IP lawyer.

- Bob

Krzysztof Sakrejda

Jan 28, 2016, 3:33:05 PM
to stan development mailing list
On Thursday, January 28, 2016 at 3:25:54 PM UTC-5, Bob Carpenter wrote:
> Licenses are a swamp.
>
> You're missing GPL compatibility. Follow the link.
>
> It's not that the license is bad, it's that only a finite
> set of officially approved licenses are approved as compatible
> with GPL. So we couldn't distribute anything with this C++
> library and RStan, I don't think.
>
> The problem we have is that R is GPL-ed.
>
> Or maybe I'm missing something here? I'm not an IP lawyer.

Well, these guys haven't been sued yet: http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html

I find that encouraging.

Bob Carpenter

Jan 28, 2016, 3:48:55 PM
to stan...@googlegroups.com

> On Jan 28, 2016, at 3:33 PM, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:
>
> On Thursday, January 28, 2016 at 3:25:54 PM UTC-5, Bob Carpenter wrote:
>> Licenses are a swamp.
>>
>> You're missing GPL compatibility. Follow the link.
>>
>> It's not that the license is bad, it's that only a finite
>> set of officially approved licenses are approved as compatible
> >> with GPL. So we couldn't distribute anything with this C++
>> library and RStan, I don't think.
>>
>> The problem we have is that R is GPL-ed.
>>
>> Or maybe I'm missing something here? I'm not an IP lawyer.
>
> Well, these guys haven't been sued yet: http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html
>
> I find that encouraging.

I don't. Just because someone else gets away with a crime
doesn't mean you won't get caught.

What makes you think they use that custom-licensed C++ source?
It points to something called zlibc, whose license they can't quite
wrap into the "artistic license" they think they can.

One possible way to get around these things is to have
them be standalone services that get called separately rather
than linked. Might that be what the package is doing?

- Bob

Krzysztof Sakrejda

Jan 28, 2016, 3:55:45 PM
to stan development mailing list
On Thursday, January 28, 2016 at 3:48:55 PM UTC-5, Bob Carpenter wrote:
> > On Jan 28, 2016, at 3:33 PM, Krzysztof Sakrejda <krzysztof...@gmail.com> wrote:
> >
> > On Thursday, January 28, 2016 at 3:25:54 PM UTC-5, Bob Carpenter wrote:
> >> Licenses are a swamp.
> >>
> >> You're missing GPL compatibility. Follow the link.
> >>
> >> It's not that the license is bad, it's that only a finite
> >> set of officially approved licenses are approved as compatible
> >> with GPL. So we couldn't distribute anything with this C++
> >> library and RStan, I don't think.
> >>
> >> The problem we have is that R is GPL-ed.
> >>
> >> Or maybe I'm missing something here? I'm not an IP lawyer.
> >
> > Well, these guys haven't been sued yet: http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html
> >
> > I find that encouraging.
>
> I don't. Just because someone else gets away with a crime
> doesn't mean you won't get caught.

More seriously, I will look at it more carefully before doing any coding; it's not
something I want to get bitten by. I did look at the license links, but I didn't see
that there was something specific to the C++ code. Do you have a better
pointer than the generic link? The overall HDF5 license looks fine.

Allen B. Riddell

Jan 28, 2016, 4:24:23 PM
to stan...@googlegroups.com
I've been hearing more criticisms of HDF5 lately (although I use it
rather often and haven't had much of a problem). I think there are some
promising new, permissively-licensed alternatives on the horizon.

Krzysztof Sakrejda

Jan 28, 2016, 4:29:47 PM
to stan development mailing list, a...@ariddell.org
On Thursday, January 28, 2016 at 4:24:23 PM UTC-5, Allen B. Riddell wrote:
> I've been hearing more criticisms of HDF5 lately (although I use it
> rather often and haven't had much of a problem). I think there are some
> promising new, permissively-licensed alternatives on the horizon.

Got any suggestions on what those alternatives might be or links for
criticisms?

Bob Carpenter

Jan 28, 2016, 6:21:55 PM
to stan...@googlegroups.com
Yes:

https://www.hdfgroup.org/products/licenses.html

It's not that the HDF5 C++ isn't permissively licensed, it's
just that it's not a standard license, so not explicitly
OK-ed by the Free Software Foundation to be compatible with GPL.

At least that's what it looks like to me.

And wasn't the plan to use protocol buffers anyway?

- Bob

Krzysztof Sakrejda

Jan 28, 2016, 6:57:43 PM
to stan development mailing list
Protocol buffers has very different features. I think it's great for transmitting data between processes but not so much for storage of big-ish data.

Allen B. Riddell

Jan 28, 2016, 7:33:27 PM
to Krzysztof Sakrejda, stan development mailing list
Some recent discussion: https://news.ycombinator.com/item?id=10858189

I think one recurring concern is the risk of data corruption in
multithreaded situations. Seems like this could be relevant for Stan.

Bob Carpenter

Jan 29, 2016, 12:21:56 PM
to stan...@googlegroups.com
Thanks --- the OP blog post was an awesome rundown.

- Bob

Krzysztof Sakrejda

Jan 29, 2016, 2:26:27 PM
to stan development mailing list, krzysztof...@gmail.com, a...@ariddell.org
On Thursday, January 28, 2016 at 7:33:27 PM UTC-5, Allen B. Riddell wrote:
> Some recent discussion: https://news.ycombinator.com/item?id=10858189
>
> I think one recurring concern is the risk of data corruption in
> multithreaded situations. Seems like this could be relevant for Stan.

Interesting, I read through the comments as well as the post and all in all
it doesn't look too bad (in the sense that I don't know that there's a format
that handles all the same stuff and comes out looking better).

Bob Carpenter

Jan 29, 2016, 2:40:05 PM
to stan...@googlegroups.com, krzysztof...@gmail.com, a...@ariddell.org
Were we reading the same thing?

http://cyrille.rossant.net/moving-away-hdf5/

The summary is:

• High risks of data corruption
• Bugs and crashes in the HDF5 library and in the wrappers
• Poor performance in some situations
• Limited support for parallel access
• Impossibility to explore datasets with standard Unix/Windows tools
• Hard dependence on a single implementation of the library
• High complexity of the specification and the implementation
• Opacity of the development and slow reactivity of the development team

And it looks like it offers much more than we need.
As far as I know, we need to store/retrieve

* [ONE SHOT] mass matrix for adaptation (usually vector)

* [ONE SHOT] step size for adaptation

* [ONE SHOT] config

* [STREAM] sequences of parameter values (draws for sample; steps for optimization)

* [STREAM] diagnostics

None of that should require mutable containers that act
like little file systems. And each is pretty simple. As
the OP said, "We've moved from writing a monolithic application
to writing a library."
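A sketch of how little interface that list implies, using only the standard library (the class and method names are invented for illustration, not a proposed API): one-shot values are written once as JSON; streamed values are appended row by row.

```python
import io
import json

class RunWriter:
    """Hypothetical minimal writer: one-shot config vs. streamed draws."""

    def __init__(self, config_sink, draws_sink):
        self.config_sink = config_sink
        self.draws_sink = draws_sink

    def write_once(self, key, value):
        # Mass matrix, step size, config: written exactly once, as JSON lines.
        self.config_sink.write(json.dumps({key: value}) + "\n")

    def append_draw(self, draw):
        # Parameter values / diagnostics: appended as they are produced.
        self.draws_sink.write(",".join("%g" % v for v in draw) + "\n")

config, draws = io.StringIO(), io.StringIO()
w = RunWriter(config, draws)
w.write_once("step_size", 0.01)
w.write_once("inv_metric", [1.0, 1.0, 1.0])
w.append_draw([0.5, 1.2, -0.3])
w.append_draw([0.4, 1.1, -0.2])
print(config.getvalue())
print(draws.getvalue())
```

Because the sinks are just file-like objects, the same two-method surface could be backed by CSV, protocol buffers, or HDF5 without the core caring which.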

- Bob

Allen B. Riddell

Jan 29, 2016, 2:40:39 PM
to Krzysztof Sakrejda, stan development mailing list
I think there's active work on alternatives. For example, it seems like
there are some Apache projects that are stable/approaching stability.
The Spark/Hadoop folks/corporations do not lack energetic developers.

Bob Carpenter

Jan 29, 2016, 2:50:56 PM
to stan...@googlegroups.com
I'm confused --- Spark and Hadoop are for distributed
processing (like map/reduce). How does that relate to
serialization formats?

- Bob

Krzysztof Sakrejda

Jan 29, 2016, 3:47:52 PM
to stan development mailing list, krzysztof...@gmail.com, a...@ariddell.org
On Friday, January 29, 2016 at 2:40:05 PM UTC-5, Bob Carpenter wrote:
> Were we reading the same thing?

Yeah, I commented inline on how I read it:

>
> http://cyrille.rossant.net/moving-away-hdf5/
>
> The summary is:
>
> • High risks of data corruption

It looks like your metadata can get wrecked if the program crashes while writing to the file. I give it some slack for this because a) there's a solution on what looks like the near horizon (journaling); and b) you can set an option to keep the metadata in a separate file, which protects it from this issue. You still lose the dataset you are writing, but without some type of transactions that's just going to happen.

> • Bugs and crashes in the HDF5 library and in the wrappers

rstan crashes too, and we've had stability problems with ADVI (though if I do say so myself, HMC/optimization/CmdStan seem rock-solid). The real issues are whether a) bugs get fixed after getting reported; and b) work-arounds are available from the dev team. I don't see that they do worse than any other software with lots of users.

> • Poor performance in some situations

The original performance comparison was wrong, and the second comparison was wrong, and... I think on the third or fourth try the OP, with the help of a commenter, came up with a comparison that I read as "meh", although the OP didn't fully update the post.

> • Limited support for parallel access

Yes, parallel write they don't do. Parallel read seems to be fine. (?)

> • Impossibility to explore datasets with standard Unix/Windows tools

Yes, but it's gloriously easy from C/C++/Python, etc., without messing around with heterogeneous file-system-based interfaces. That's one trade-off.

> • Hard dependence on a single implementation of the library

Yup.

> • High complexity of the specification and the implementation

Yup, but there is a spec and it is readable.

> • Opacity of the development and slow reactivity of the development team

Come on, any Stan power user who gets frustrated by rstan bugging out on a large data set could make the same claim. At the moment our answer for big data sets/models is "use CmdStan". Given the OP's thoughtless performance complaint/comparison, I don't give this a lot of weight. A lot of the other material I could find from the dev team looks like a good effort (though maybe not up to the task given how widely the format is used).

>
> And it looks like it offers much more than we need.

This I agree with, and that's why I'm on the fence about the format. If it were rock solid I would say the complexity is worth it. At this point I have serious doubts.

> As far as I know, we need to store/retrieve
>
> * [ONE SHOT] mass matrix for adaptation (usually vector)
>
> * [ONE SHOT] step size for adaptation
>
> * [ONE SHOT] config
>
> * [STREAM] sequences of parameter values (draws for sample; steps for optimization)

At the moment this is the one I want a different solution for. We need to do more than just stream. We need to stream to something that's accessible by iteration/parameter without loading the whole thing into memory. I realize I'm not the usual Stan user, but at the moment I often can't open output files on my laptop because CmdStan is the only thing that reliably churns through the models, and the .csv output is hard to process. Yes, I could post-process and chunk the files, but instead I've been putting some time into reading about alternative formats with well-established libraries---I don't think we should mess around with file formats unless we've explored the other alternatives.
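The by-parameter access being asked for can be sketched with a memory-mapped binary file, stdlib only. The flat row-major layout of doubles is an assumption for the demo (not a proposed Stan format); the point is that one parameter's full trace comes out without reading or parsing the rest of the file:

```python
import mmap
import os
import struct
import tempfile

N_PARAMS = 4
# Fake draws: draw i, parameter j holds the value i * N_PARAMS + j.
draws = [[float(i * N_PARAMS + j) for j in range(N_PARAMS)] for i in range(5)]

path = os.path.join(tempfile.mkdtemp(), "draws.bin")
with open(path, "wb") as f:
    for row in draws:
        f.write(struct.pack("<%dd" % N_PARAMS, *row))

def param_trace(j):
    """All values of parameter j, touching only that column's bytes."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        stride = 8 * N_PARAMS  # one row = N_PARAMS 8-byte doubles
        return [struct.unpack_from("<d", m, i * stride + 8 * j)[0]
                for i in range(len(m) // stride)]

print(param_trace(1))  # second parameter across all iterations
```

The OS pages in only the touched bytes, so this stays cheap even when the draws file is far larger than RAM, which is exactly the laptop-vs-CmdStan-output situation described above.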

>
> * [STREAM] diagnostics
>
> None of that should require mutable containers that act
> like little file systems. And each is pretty simple. As
> the OP said, "We've moved from writing a monolithic application
> to writing a library."

Yes, and if somebody else writes a nice library for portable IO that deals with large datasets we should use it.

Krzysztof

Krzysztof Sakrejda

Jan 29, 2016, 3:50:29 PM
to stan development mailing list
On Thursday, January 28, 2016 at 6:21:55 PM UTC-5, Bob Carpenter wrote:
> Yes:
>
> https://www.hdfgroup.org/products/licenses.html
>
> It's not that the HDF5 C++ isn't permissively licensed, it's
> just that it's not a standard license, so not explicitly
> OK-ed by the Free Software Foundation to be compatible with GPL.
>
> At least that's what it looks like to me.
>
> And wasn't the plan to use protocol buffers anyway?

BTW, I'm not totally discounting protocol buffers for storage; it just seems like it's not their main goal. ... unfortunately, "cap'n proto" looks like it might be closer to a bare-bones storage format (it writes binary without a separate wire format, so you _can_, for example, mmap the files from Python).

K

Bob Carpenter

Jan 29, 2016, 5:29:57 PM
to stan...@googlegroups.com
Thanks for the closer read. And the comments on the Stan
family. Point taken. All software is buggy, brittle, and
has less-than-ideally-responsive dev teams.

I think if you want to do this, all of the I/O layers people
want should be layered on top of the refactored C++ API:

* R dump format (CmdStan uses, optional in RStan)
* JSON (what Python uses now, I think)
* protocol buffers (TBD)
* HDF5 (TBD)
* SQL database of some sort (TBD)

They could arguably be in Stan, but I think given their
complexity and dependencies, it might make sense to make
them separate packages that depend on Stan C++.

Then the format used by RStan, PyStan, CmdStan, etc.
would be up to those interface developers.

I have a very strong preference for keeping the core of
Stan in C++ as simple as possible so that it's both easy to
doc and easy to test.

- Bob

Bob Carpenter

Jan 29, 2016, 5:38:57 PM
to stan...@googlegroups.com
I should've mentioned that RStan uses R's I/O, not
the dump reader we wrote for CmdStan.

PyStan just uses the JSON built into Python. The
interface is then at the memory level to call Stan.
I don't know what PyStan does with the output.

Krzysztof Sakrejda

Jan 29, 2016, 6:20:39 PM
to stan development mailing list
On Friday, January 29, 2016 at 5:29:57 PM UTC-5, Bob Carpenter wrote:
> Thanks for the closer read. And the comments on the Stan
> family. Point taken. All software is buggy, brittle, and
> has less-than-ideally-responsive dev teams.

I gotta say, I think we do pretty well compared to a lot of groups, and I appreciate
that 1) we're welcoming to new devs; 2) we're generous with our
time to the users (esp. the mailing list); and 3) we build out from solid core
functionality. Responding to your comments made me go back
and rethink more of the cons rather than just listing pros, so it was
very helpful to do. I'm hoping to scrounge a few minutes (fingers crossed) to do a wiki page
so we have a place to keep this kind of exchange in summary form for future reference.

> I think if you want to do this, all of the I/O layers people
> want should be layered on top of the refactored C++ API:

Agreed.

> * R dump format (CmdStan uses, optional in RStan)
> * JSON (what Python uses now, I think)
> * protocol buffers (TBD)
> * HDF5 (TBD)
> * SQL database of some sort (TBD)
>
> They could arguably be in Stan, but I think given their
> complexity and dependencies, it might make sense to make
> them separate packages that depend on Stan C++.

I really value that we try to keep core Stan simple and well
tested, so I'm 100% on board with keeping more complex
output format stuff out of core Stan. Whatever more complex output
layer we do build, I'd like it to also be well tested and as simple
as possible.

> Then the format used by RStan, PyStan, CmdStan, etc.
> would be up to those interface developers.

Agreed.

>
> I have a very strong preference for keeping the core of
> Stan in C++ as simple as possible so that it's both easy to
> doc and easy to test.

+1