saving model objects

Øystein Sandvik

unread,
Jul 17, 2012, 7:32:32 AM
to hddm-...@googlegroups.com
I am having trouble saving a model for later use. Using the pickle module (http://docs.python.org/library/pickle.html), I am trying to save a model to a file so I can use it to calculate convergence at a later time.
pickle.dump(model, open(filename, 'wb'))
gives the following error:
pickle.PicklingError: Can't pickle <function wrapper at 0x039F90F0>: it's not found as pymc.distributions.wrapper
or
pickle.PicklingError: Can't pickle <class 'kabuki.distributions.Wfpt'>: it's not found as kabuki.distributions.Wfpt

I can of course save the trace and stats (which would be sufficient for calculating convergence), but I would really like to save the complete model (if this is at all possible).
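For reference, a minimal sketch of that fallback, saving just the raw trace arrays rather than the model object (model.mc and the parameter names here are assumptions based on later messages in this thread):

import numpy as np

# pull the raw samples for each parameter out of the underlying pymc
# sampler and store them as plain arrays -- these serialize without trouble
params = ['a', 'v', 't']  # hypothetical parameter names
traces = dict((p, model.mc.trace(p)[:]) for p in params)
np.savez('traces.npz', **traces)

# later: reload the arrays without needing the model object at all
traces = np.load('traces.npz')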
Any experience or suggestions on how to accomplish this will be greatly appreciated.

Thank you,
Øystein Sandvik

Thomas Wiecki

unread,
Jul 17, 2012, 8:08:28 AM
to hddm-...@googlegroups.com
I'm not sure what saving the model would give you compared to saving the trace and then loading that trace into a new model.

Pickle is the wrong approach, as there are many parts of a model object that cannot be pickled. However, pymc does provide a mechanism for saving traces to a database (http://pymc-devs.github.com/pymc/database.html#saving-data-to-disk). HDDM.mcmc() is essentially a wrapper for pymc.MCMC().

HDDM also allows you to load these back into a model.

The code could look like this:
m = hddm.HDDM(data, ...)
m.mcmc(dbname='test.db')
m.sample(5000)
# later point in time
m = hddm.HDDM(data, ...) # reconstruct model
m.load_from_db(dbname='test.db') # load traces
You do have to keep the model creation information around (i.e. what goes into HDDM, like data, depends_on, etc.). This is a little unfortunate. I always wanted to create a mechanism to save a complete model that then gets automatically reconstructed. It wouldn't be too difficult (since the input arguments to HDDM are almost certainly picklable) but I haven't gotten around to it.
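Such a mechanism doesn't exist yet, but as a purely illustrative sketch (none of these helper functions are existing HDDM/kabuki API; the trace-loading method name follows the correction further down this thread), it might look like:

import pickle
import hddm

def save_model_spec(filename, **model_kwargs):
    # the constructor arguments (data, depends_on, bias, ...) are
    # ordinary Python objects, so unlike the model itself they pickle fine
    with open(filename, 'wb') as f:
        pickle.dump(model_kwargs, f)

def load_model(spec_filename, dbname):
    # rebuild the model from its saved arguments, then attach the traces
    with open(spec_filename, 'rb') as f:
        model_kwargs = pickle.load(f)
    m = hddm.HDDM(**model_kwargs)
    m.load_db(dbname)
    return m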

HTH,
Thomas


Guido Biele

unread,
Jul 17, 2012, 8:45:26 AM
to hddm-...@googlegroups.com
Hi Thomas,

Oystein is working with me (Guido).

Here is why we want to save the complete model, or its definition and the traces separately:
We have access to a grid/cluster where it is best to submit different chains as independent jobs (submitted by SLURM to independent machines). If we later want to test the convergence of these chains, we need the traces in order to use the Gelman-Rubin convergence criterion (which Øystein found implemented in the R_hat function of the kabuki module).* The information about the model should then be helpful for knowing which trace is associated with which parameter. Hence, we need to keep track of the model definition and save the traces. It is not a big issue to slightly re-organize our procedure so that the model definition is saved in a variable or just as a string in a text file.

I would have one question though: what about the information on the number of burn-in samples? I would think that within-chain variance should only be computed on samples collected after the burn-in has completed. Is this what your R_hat function does, and does the database that saves the samples also store the number of burn-in samples in a way that the R_hat function takes into account?

cheers - guido

*We could also just save the variances of the parameters and use them, but I think saving the chains won't take too much space, even if we have 100 parameters and 50,000 samples....
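For reference, the Gelman-Rubin statistic compares between-chain to within-chain variance; a minimal numpy sketch of it (independent of kabuki's R_hat, written here just to make the computation concrete) could be:

import numpy as np

def gelman_rubin(chains):
    # chains: array of shape (n_chains, n_samples); pass only
    # post-burn-in samples
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # mean within-chain variance
    var_hat = (n - 1.0) / n * W + B / n    # pooled variance estimate
    return np.sqrt(var_hat / W)            # values near 1 suggest convergence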

Thomas Wiecki

unread,
Jul 17, 2012, 9:18:22 AM
to hddm-...@googlegroups.com
On Tue, Jul 17, 2012 at 8:45 AM, Guido Biele <g.p....@psykologi.uio.no> wrote:
> Hi Thomas,
>
> Oystein is working with me (Guido).
>
> Here is why we want to save the complete model, or its definition and the traces separately:
> We have access to a grid/cluster where it is best to submit different chains as independent jobs (submitted by SLURM to independent machines). If we later want to test the convergence of these chains, we need the traces in order to use the Gelman-Rubin convergence criterion (which Øystein found implemented in the R_hat function of the kabuki module).* The information about the model should then be helpful for knowing which trace is associated with which parameter. Hence, we need to keep track of the model definition and save the traces. It is not a big issue to slightly re-organize our procedure so that the model definition is saved in a variable or just as a string in a text file.

I see. I have also run a large variety of models on a cluster. You might be interested in the code I wrote for that:

https://github.com/hddm-devs/kabuki/blob/develop/kabuki/experiments.py

At the end of the file you can see some example code. Essentially, you define a dictionary of all the models you want to run. run_experiments() will then farm these out and compute them in parallel (using mpi4py_map, which you can find at https://github.com/twiecki/mpi4py_map). run_experiments() runs the models and saves their traces to subdirectories. In a later run you can call analyse_experiments(), which recreates the models, loads the traces, and then computes the output statistics etc. Now, I haven't used this to run the same model multiple times and compute Rhat over them, but it (i) seems to lend itself to that and (ii) is a good idea.

Another detail: to run your code like that you want to invoke Python via mpirun: mpirun -n NUMPROCESSES --machinefile PATH_TO_MACHINEFILE python my_experiments.py

If you do want to go that route, note that this functionality is not in kabuki/hddm 0.2 but in the current development branch of the source repo. You would have to check out the git repo (git clone and then git branch --track ...).
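A purely hypothetical usage sketch of that workflow (the real signatures live in kabuki/experiments.py on the develop branch; the module path, argument names, and dictionary layout below are all assumptions, not verified API):

import hddm
from kabuki import experiments  # develop branch only; module path assumed

data = hddm.load_csv('path_to_my_data')

# map experiment names to HDDM constructor arguments (layout assumed)
models = {'simple': {'data': data, 'bias': True},
          'stim_v': {'data': data, 'depends_on': {'v': 'stim'}}}

# farm the models out in parallel via mpi4py_map; each model's traces
# end up in their own subdirectory, as described above
experiments.run_experiments(models)

# in a later run: recreate the models, load the traces, compute stats
experiments.analyse_experiments(models)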


> I would have one question though: what about the information on the number of burn-in samples? I would think that within-chain variance should only be computed on samples collected after the burn-in has completed. Is this what your R_hat function does, and does the database that saves the samples also store the number of burn-in samples in a way that the R_hat function takes into account?

Burned samples are not saved to the trace, so they don't show up in any stats computed over them.

A note of caution: the R_hat function was written quite some time ago and I haven't really tested it extensively. I think there is also an implementation of it in pymc (http://pymc-devs.github.com/pymc/modelchecking.html), which I would trust more.

In any case, running multiple model runs and assessing convergence via R_hat is something that's currently not well supported in kabuki/HDDM, so if you come up with something that works I'd love to hear about it and maybe integrate it into the next release.
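Putting these pieces together, a multi-chain convergence check might look like the following sketch (file names are placeholders, load_db is the method name established further down this thread, and gelman_rubin is the function sketched above):

import numpy as np
import hddm

data = hddm.load_csv('path_to_my_data')  # same data used for every chain

# assume each cluster job saved its traces to chain_0.db, chain_1.db, ...
chains = []
for i in range(4):
    m = hddm.HDDM(data, bias=True)     # reconstruct the identical model
    m.load_db('chain_%i.db' % i)       # attach that job's traces
    chains.append(m.mc.trace('v')[:])  # post-burn-in samples for 'v'

print(gelman_rubin(np.array(chains)))  # values near 1 suggest convergence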

HTH,
Thomas

Guido Biele

unread,
Jul 17, 2012, 9:35:03 AM
to hddm-...@googlegroups.com
Hi Thomas,

thanks for the feedback! We'll certainly look into the convergence measures supplied with pymc.

As for sending off multiple jobs: for now the most efficient approach for us is to use a simple C-shell script to submit a Python script with the relevant info. (There is also PySlurm; however, I haven't used it so far and we will look at it later.)

We will keep you up to date about the convergence check over instances
of the same model.

cheers - guido

Guido Biele

unread,
Jul 19, 2012, 3:58:48 PM
to hddm-...@googlegroups.com
Hi Thomas,

I tried your suggestion above, but I didn't get the desired result.

Here is what worked and what didn't:

OK:
I can set up a model, add the dbname info (which prints "<pymc.MCMC.MCMC object at 0x3a91e50>"), and sample the model.
PROBLEM:
When I call m.load_from_db(dbname='test.db') after having reconstructed the model, I get the following error message:
File "<stdin>", line 1, in <module>
AttributeError: 'HDDM' object has no attribute 'load_from_db'

I figured out that the traces are in m.mc.traces, but I thought I'd ask whether I implemented your suggestion incorrectly before I recreate something that you have already done.
(I tried to find code in your repo for saving models similar to what is described in the pymc documentation, but couldn't find it. I'm sure it's there, I just don't know Python well enough...)

cheers - guido

Thomas Wiecki

unread,
Jul 19, 2012, 7:33:04 PM
to hddm-...@googlegroups.com
On Thu, Jul 19, 2012 at 3:58 PM, Guido Biele <g.p....@psykologi.uio.no> wrote:
> Hi Thomas,
>
> I tried your suggestion above, but I didn't get the desired result.
>
> Here is what worked and what didn't:
>
> OK:
> I can set up a model, add the dbname info (which prints
> "<pymc.MCMC.MCMC object at 0x3a91e50>"), and sample the model.
> PROBLEM:
> When I call m.load_from_db(dbname='test.db') after having reconstructed the
> model, I get the following error message:
> File "<stdin>", line 1, in <module>
> AttributeError: 'HDDM' object has no attribute 'load_from_db'

Oops, it's actually called load_db() in this version. This uses the sqlite backend by default (so call .mcmc(dbname='foo.db', db='sqlite') before sampling, or supply the loader function from pymc to load_db('foo.db', db_loader=pymc.database.pickle.load), or some such).
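Spelled out as a sketch (file names and sample counts are placeholders; m is a model constructed as above), the two variants might look like:

import pymc

# variant 1: sqlite backend, which load_db expects by default
m.mcmc(dbname='foo.db', db='sqlite')
m.sample(20000)
# ... later, after reconstructing m ...
m.load_db('foo.db')

# variant 2: pickle backend, passing pymc's matching loader explicitly
m.mcmc(dbname='foo.pickle', db='pickle')
m.sample(20000)
# ... later ...
m.load_db('foo.pickle', db_loader=pymc.database.pickle.load)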

Thomas

Guido Biele

unread,
Jul 20, 2012, 10:35:55 AM
to hddm-...@googlegroups.com
Hi Thomas,

That worked with minor modifications. For the benefit of other users, here is a short but complete description of how to save and load traces.

How to save and load traces of an MCMC chain generated by HDDM v0.2 (tested on Linux, 64-bit CentOS 5)

import hddm

#1 load data and create a model
data = hddm.load_csv('path_to_my_data')
model = hddm.HDDM(data, bias=True)  # a very simple model ...
#2 tell the model to save its traces to a file
model.mcmc(dbname='traces.db', db='sqlite')
#3 run the model. The traces will be saved in the file traces.db in the current working directory (alternatively, specify a full path)
model.sample(45000, burn=15000)

## now assume that you start a new python session, after the chain started above has completed

#4 reconstruct your model
data = hddm.load_csv('path_to_my_data')
model = hddm.HDDM(data, bias=True)
#5 add traces from the database
model.load_db('traces.db')  # note that for this to work you have to be in the same working directory you were in when you started the chain above; otherwise pass the full path

# now you can access the traces just as when a chain has just completed
# for example, you can check the number of samples for parameter v with
# len(model.mc.trace('v')[:])

cheers - guido

Thomas Wiecki

unread,
Jul 20, 2012, 10:38:07 AM
to hddm-...@googlegroups.com
Thanks Guido, I'll be sure to include that example in the documentation!

stein...@gmail.com

unread,
Apr 23, 2015, 5:34:50 AM
to hddm-...@googlegroups.com
Hi everyone,

When I follow Guido's instructions I get the error "DatabaseError: file is encrypted or is not a database" (see below for the complete error message).
If I use pickle instead of SQLite, <pymc.MCMC.MCMC at 0x10cdf9a90> is created, but when trying to reload the traces I get the same "DatabaseError: file is encrypted or is not a database" error (see below for the complete error message).
Any pointers…?


---------------------------------- ERROR MESSAGE WHILE USING SQLITE -----------------------------
---------------------------------------------------------------------------
DatabaseError                             Traceback (most recent call last)
<ipython-input-25-d96cf8e228e0> in <module>()
----> 1 m_within_subj.mcmc(dbname='traces.db', db='sqlite')

/Users/benjamin/anaconda/lib/python2.7/site-packages/kabuki/hierarchical.pyc in mcmc(self, assign_step_methods, *args, **kwargs)
    539         """
    540 
--> 541         self.mc = pm.MCMC(self.nodes_db.node.values, *args, **kwargs)
    542 
    543         self.pre_sample()

/Users/benjamin/anaconda/lib/python2.7/site-packages/pymc/MCMC.pyc in __init__(self, input, db, name, calc_deviance, **kwds)
     80             name,
     81             calc_deviance=calc_deviance,
---> 82             **kwds)
     83 
     84         self._sm_assigned = False

/Users/benjamin/anaconda/lib/python2.7/site-packages/pymc/Model.pyc in __init__(self, input, db, name, reinit_model, calc_deviance, verbose, **kwds)
    205         # Specify database backend and save its keywords
    206         self._db_args = kwds
--> 207         self._assign_database_backend(db)
    208 
    209         # Flag for model state

/Users/benjamin/anaconda/lib/python2.7/site-packages/pymc/Model.pyc in _assign_database_backend(self, db)
    572                     self._db_args['dbname'] = self.__name__ + '.' + db
    573 
--> 574                 self.db = module.Database(**self._db_args)
    575             elif db in database.__modules__:
    576                 raise ImportError(

/Users/benjamin/anaconda/lib/python2.7/site-packages/pymc/database/sqlite.pyc in __init__(self, dbname, dbmode)
    187         self.cur = self.DB.cursor()
    188 
--> 189         existing_tables = get_table_list(self.cur)
    190         if existing_tables:
    191             # Get number of existing chains

/Users/benjamin/anaconda/lib/python2.7/site-packages/pymc/database/sqlite.pyc in get_table_list(cursor)
    264         SELECT name FROM sqlite_master
    265         WHERE type='table' AND NOT name='sqlite_sequence'
--> 266         ORDER BY name""")
    267     return [row[0] for row in cursor.fetchall()]
    268 
DatabaseError: file is encrypted or is not a database


---------------------------------- ERROR MESSAGE WHILE TRYING TO RELOAD DATA FROM PICKLE DB -----------------
---------------------------------------------------------------------------
DatabaseError                             Traceback (most recent call last)
<ipython-input-24-3780245ba32c> in <module>()
----> 1 m_within_subj.load_db('traces.db')

/Users/benjamin/anaconda/lib/python2.7/site-packages/kabuki/hierarchical.pyc in load_db(self, dbname, verbose, db)
    797 
    798         # Open database
--> 799         db = db_loader(dbname)
    800 
    801         # Create mcmc instance reading from the opened database

/Users/benjamin/anaconda/lib/python2.7/site-packages/pymc/database/sqlite.pyc in load(dbname)
    236     Return a Database instance.
    237     """
--> 238     db = Database(dbname)
    239 
    240     # Get the name of the objects

/Users/benjamin/anaconda/lib/python2.7/site-packages/pymc/database/sqlite.pyc in __init__(self, dbname, dbmode)
    187         self.cur = self.DB.cursor()
    188 
--> 189         existing_tables = get_table_list(self.cur)
    190         if existing_tables:
    191             # Get number of existing chains

/Users/benjamin/anaconda/lib/python2.7/site-packages/pymc/database/sqlite.pyc in get_table_list(cursor)
    264         SELECT name FROM sqlite_master
    265         WHERE type='table' AND NOT name='sqlite_sequence'
--> 266         ORDER BY name""")
    267     return [row[0] for row in cursor.fetchall()]
    268 
DatabaseError: file is encrypted or is not a database

KILukt Experiment

unread,
Sep 15, 2015, 3:04:25 AM
to hddm-users
Hi Thomas and others,

We are using a more complex model (model = hddm.HDDM(data, depends_on={'v': 'stim'})) and encountered the same error (DatabaseError: file is encrypted or is not a database) when trying to save our traces using sqlite. (However, everything that Guido described works for a simple model (model = hddm.HDDM(data, bias=True)).)

When trying 'pickle', we get one step further, as described in the message to which we are replying, but cannot load the traces again (and you discouraged using pickle here, so I think this was the wrong approach).

Do you have a solution for how we can save the traces of this more complex model and reload them using sqlite?

Any help appreciated! Thank you,
Christina

Thomas Wiecki

unread,
Sep 15, 2015, 7:20:03 AM
to hddm-...@googlegroups.com
Hi Christina,

Under the hood we're just forwarding everything to pymc, so that's where the issues are. Each backend has its own unique challenges, unfortunately. Here are some others you can try: https://pymc-devs.github.io/pymc/database.html#the-hdf5-backend
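For example, the HDF5 backend might be used like this (a sketch: it assumes PyTables is installed, and the db='hdf5' strings follow pymc's backend naming rather than anything verified in this thread):

import hddm

data = hddm.load_csv('path_to_my_data')
model = hddm.HDDM(data, depends_on={'v': 'stim'})
model.mcmc(dbname='traces.h5', db='hdf5')  # pymc's HDF5 backend
model.sample(5000, burn=1000)

# later, in a fresh session: reconstruct the model and reload the traces
model = hddm.HDDM(data, depends_on={'v': 'stim'})
model.load_db('traces.h5', db='hdf5')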

Hope that helps,
Thomas

KILukt Experiment

unread,
Oct 7, 2015, 2:14:47 PM
to hddm-users
Thanks for your response, Thomas. We went back to the original solution suggested in the documentation (using pickle) and it actually worked! Even for the more complex model! Best, Christina