I'm a long-time Python user just starting with Stan via PyStan. I've come to Stan because I want to do a large out-of-core calculation: read a large dataset (16-30 million data points) into RAM, fit a particular kind of model to it, then do it again, updating the model with more data.
I'm using a conditional logistic regression model, which someone else has implemented here.
I'm using a computer with 64GB of RAM and 16 CPUs.
I load in 15 million data points, and Stan seems to choke. I wait around for ages and, while the machine whirs, the calculation doesn't complete. Jim told me to do "optimizing" first, but I haven't found any clear guide on that (there are lots of references to "optimizing" with Stan, but no how-to guide for PyStan).
This is where I am right now:
from pystan import StanModel
sm = StanModel(model_code=clogit_stan) #clogit_stan is from here. This compiles without errors/warnings.
### data_dict is boring and just sets up stuff for the model
data_dict = {
    'N': data.shape[0],
    'n_grp': data['Agent_Entry'].nunique(),
    'n_coef': len(data_cols),
    'x': data[data_cols].values,
    'y': data['Entered'].astype('int').values,
    'grp': data['Agent_Entry'].values + 1,  # +1 because Stan indices are 1-based
}
### what am I doing wrong?
op = sm.optimizing(data=data_dict)
fit = sm.sampling(data=data_dict, init=op)
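One thing worth checking in the last two lines: in PyStan 2.x, optimizing() returns by default a dict of point estimates, while sampling() expects explicit initial values as a list with one dict per chain. A minimal sketch of bridging the two, assuming PyStan 2.x; the helper name inits_from_optimum is hypothetical, and the PyStan calls appear only in comments:

```python
import copy


def inits_from_optimum(point_estimate, chains=4):
    """Return a list holding one independent copy of the estimate per chain."""
    if chains < 1:
        raise ValueError("chains must be a positive integer")
    # Deep copies so that changing one chain's start never touches another.
    return [copy.deepcopy(dict(point_estimate)) for _ in range(chains)]


# Hypothetical usage with the objects from the post (assumes PyStan 2.x):
#   op = sm.optimizing(data=data_dict)
#   fit = sm.sampling(data=data_dict, chains=4,
#                     init=inits_from_optimum(op, chains=4))
```

Starting every chain at the same point makes convergence diagnostics such as R-hat less informative, so this is probably best treated as a way to get the sampler moving rather than as a default.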
If I were smart I would start with smaller data sets :-D Unfortunately I just want to jump right into the deep end, but it seems I'm missing some understanding.
Thanks!
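On starting with smaller data sets: for a grouped model like this one, it may be safer to subsample whole groups than individual rows, so each retained group keeps all of its observations. A minimal sketch using only the standard library; the function name sample_groups and the relabelling to contiguous 1-based ids are assumptions made for illustration, not part of the original model code:

```python
import random


def sample_groups(group_ids, n_groups, seed=0):
    """Pick n_groups whole groups at random.

    Returns (rows, new_ids): the row positions to keep and, for each kept
    row, a contiguous 1-based group label usable as a Stan index.
    """
    distinct = sorted(set(group_ids))
    if n_groups > len(distinct):
        raise ValueError("asked for more groups than exist")
    rng = random.Random(seed)
    chosen = rng.sample(distinct, n_groups)
    # Relabel the chosen groups as 1..n_groups so the index has no gaps.
    relabel = {g: i + 1 for i, g in enumerate(sorted(chosen))}
    rows = [i for i, g in enumerate(group_ids) if g in relabel]
    new_ids = [relabel[group_ids[i]] for i in rows]
    return rows, new_ids


# Hypothetical usage with the DataFrame from the post:
#   rows, grp = sample_groups(data['Agent_Entry'].tolist(), 1000)
#   small = data.iloc[rows]   # then build data_dict from `small` and `grp`
```

If the small run finishes quickly, the number of groups can be increased step by step to see where the run time starts to blow up.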
If we don't already, we should have a way to stream
output from RStan and PyStan. Maybe if we ever get
through this refactor.
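To illustrate what streaming would allow on the consuming side, here is a minimal sketch that reads draws one at a time instead of holding them all in memory. It assumes the draws are written as CSV with '#' comment lines (the layout CmdStan uses for its output files); the function name iter_draws is hypothetical and is not an RStan or PyStan API:

```python
import csv


def iter_draws(lines):
    """Yield one dict per draw from CSV lines, skipping '#' comment lines."""
    # A generator expression keeps the filtering lazy as well.
    rows = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    for row in csv.DictReader(rows):
        yield {name: float(value) for name, value in row.items()}


def running_mean(lines, name):
    """Mean of one column, computed without storing the draws."""
    total, count = 0.0, 0
    for draw in iter_draws(lines):
        total += draw[name]
        count += 1
    return total / count if count else float("nan")
```

Passing an open file object as `lines` keeps memory use flat regardless of how many draws the file contains.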
Looking at their license it doesn't seem like we can't use it---or am I missing something?
Krzysztof
Well, these guys haven't been sued yet: http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html
I find that encouraging.
More seriously, I will look at it more carefully before doing any coding; it's not
something I want to get bitten by. I did look at the license links, but I didn't see
anything specific to the C++ code. Do you have a better
pointer than the generic link? The overall HDF5 license looks fine.
Got any suggestions on what those alternatives might be or links for
criticisms?
Interesting, I read through the comments as well as the post and all in all
it doesn't look too bad (in the sense that I don't know that there's a format
that handles all the same stuff and comes out looking better).
BTW, I'm not totally discounting protocol buffers for storage; it just seems like it's not their main goal. ... unfortunately, "Cap'n Proto" looks like it might be closer to a bare-bones storage format (it writes binary without having a wire format, so you _can_, for example, mmap them from Python).
K
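To make the mmap point concrete, here is a minimal standard-library sketch of the general idea: values are written as raw fixed-width binary, so a reader can map the file and fetch one value by position with no decode pass. This is not Cap'n Proto or its API; the file layout (bare little-endian float64) and the function names are assumptions made for illustration:

```python
import mmap
import os
import struct
import tempfile


def write_doubles(path, values):
    """Write values as raw little-endian float64, with no header or framing."""
    with open(path, "wb") as f:
        f.write(struct.pack("<%dd" % len(values), *values))


def read_double(path, index):
    """Read one value by position through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Each float64 occupies 8 bytes, so the offset is index * 8.
            return struct.unpack_from("<d", mm, index * 8)[0]


demo_path = os.path.join(tempfile.mkdtemp(), "draws.bin")
write_doubles(demo_path, [1.5, -2.0, 3.25])
third = read_double(demo_path, 2)  # touches only the bytes it needs
```

The operating system pages data in on demand, so random access into a file much larger than RAM stays cheap; the trade-off is that a bare layout like this carries no schema or metadata.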