OLS memory issues

dieterv77

unread,

Sep 6, 2011, 8:36:27 PM9/6/11

to pystatsmodels

Hi everyone, i've been running pretty large regressions inside a loop,
and have been noticing some surprising memory usage. I've boiled it
down to the following code:

import scikits.statsmodels.api as sm
import numpy as np

n = 10000
X = np.random.rand(n,123)
beta = np.random.rand(123)

for i in xrange(300):
Y = np.dot(X,beta) + np.random.randn(n)
model = sm.OLS(Y,X)
result = model.fit()

I would expect the memory usage of this code to be relatively constant
after the first regression as executed, but instead i see that it
keeps growing. After about 200-300 iterations, it's using over 1GB of
RAM on my machine, though after the first few regressions it was using
around 200MB. I'm running on 64bit Ubuntu Natty with numpy 1.5.1 and
scipy 0.8.0. I'm seeing this with an up to date build of the github
master, although i also saw it with 0.2 and 0.3. Are my expectations
incorrect?

Thanks for your time, and for all the effort many of you put into this
package.

Dieter

Skipper Seabold

unread,

Sep 6, 2011, 9:35:26 PM9/6/11

to pystat...@googlegroups.com

Hmm, no that would've been my guess too. I have no idea off the top of
my head. I edited it to use the same Y and X in every model, and I see
roughly the same thing. Uses about .75 GB for the first ~150 models,
then stays there for the last ~150, and goes back down to about .5 GB
after garbage collection. I'm afraid I don't know enough about
Python's (anything's) memory management, but I assume that means we
are doing some copying somewhere (?), though I don't have a guess of
where just yet. We don't compute any of the heavy results until you
ask for them and it strikes me as odd that each model and results
instantiation would use several MB. Is it possible there's something
else going on under the hood with Python? Any tools for checking this
kind of thing?

> Thanks for your time, and for all the effort many of you put into this
> package.
>

Thanks for the feedback. Helps us improve things.

Skipper

josef...@gmail.com

unread,

Sep 6, 2011, 9:50:34 PM9/6/11

to pystat...@googlegroups.com

Not our fault, I think this is just standard python garbage collection
that doesn't collect immediately, but periodically depending on
available computer memory.

I watched a few cycles, on my computer it goes up to around 1GB and
then drops down to around 100MB, then goes up again, and this several
times.

if I add into the loop immediate garbage collection (import gc on top)
#del model
#del result
gc.collect()

without the del it stays around 100 to 110 MB, with the del it stays
around 80 to 100 MB.

I'm on Windows, but I don't think this should differ much across OSs.

Josef

Jayron Soares

unread,

Sep 6, 2011, 9:57:29 PM9/6/11

to pystat...@googlegroups.com

Hi guys, how can I test this script, I just press f5 and my idle ide just show me RESTART and don't see anything,

please I'm just new freshman in Python data analysis...I still learning...

Show me the way please

cheers

jayron

2011/9/6 <josef...@gmail.com>

--
" A Vida é arte do Saber...Quem quiser saber tem que Estudar!"

http://bucolick.tumblr.com

http://artecultural.wordpress.com/

josef...@gmail.com

unread,

Sep 6, 2011, 10:14:39 PM9/6/11

to pystat...@googlegroups.com

On Tue, Sep 6, 2011 at 9:57 PM, Jayron Soares <jayron...@gmail.com> wrote:
> Hi guys, how can I test this script, I just press f5 and my idle ide just
> show me RESTART and don't see anything,
> please I'm just new freshman in Python data analysis...I still learning...
> Show me the way please

This script isn't supposed to print anything. You can watch the memory
consumption in taskmanager on Windows, or whatever the Linux
equivalent is.
I wouldn't recommend printing in this example, since the model and
arrays are large.

If you want to start with examples that print the results, then you
could start with the scripts in the scikits/statsmodels/examples
folder available in the statsmodels source.

Looking at some python (and numpy/scipy) tutorials at the same time
will be useful, since statsmodels assumes currently quite a bit of
familiarity with python and numpy.
But I think python, numpy and statsmodels are easy enough, that users
can produce useful results pretty fast.

Cheers,

Josef

dieterv77

unread,

Sep 6, 2011, 10:17:04 PM9/6/11

to pystatsmodels

Python uses reference counted garbage collection, which means that
memory is freed as soon as nothing references it anymore. (Sorry if
i'm stating things that everyone already knows)
As far as i can tell, that means that with each assignment of 'model =
' and 'result =', the previous model and result objects should
get garbage collected, unless there is something else still referring
to these. Is there any chance that there are cyclical references
somewhere in
statsmodels? That's effectively the only time that the reference
counted approach to garbage collection breaks down and where garbage
collection may be delayed until
the cycles are detected.

Thanks
Dieter

On Sep 6, 9:50 pm, josef.p...@gmail.com wrote:
> On Tue, Sep 6, 2011 at 9:35 PM, Skipper Seabold <jsseab...@gmail.com> wrote:

josef...@gmail.com

unread,

Sep 6, 2011, 10:37:03 PM9/6/11

to pystat...@googlegroups.com

On Tue, Sep 6, 2011 at 10:17 PM, dieterv77 <diet...@gmail.com> wrote:
> Python uses reference counted garbage collection, which means that
> memory is freed as soon as nothing references it anymore. (Sorry if
> i'm stating things that everyone already knows)
> As far as i can tell, that means that with each assignment of 'model =
> ' and 'result =', the previous model and result objects should
> get garbage collected, unless there is something else still referring
> to these.

I didn't know this, I never really looked at the details, I only saw
it mentioned quite often what I argued about the delayed collection.

>Is there any chance that there are cyclical references
> somewhere in
> statsmodels?

Yes, currently models and results are both attached to each other,
the result instance is attached to the model instance in the fit
method as _results, and
during the __init__ of the results instance, the model gets attached
as attribute `model`.

>That's effectively the only time that the reference
> counted approach to garbage collection breaks down and where garbage
> collection may be delayed until
> the cycles are detected.

That looks like the explanation then.

Josef

Skipper Seabold

unread,

Sep 6, 2011, 10:59:54 PM9/6/11

to pystat...@googlegroups.com

On Tue, Sep 6, 2011 at 10:37 PM, <josef...@gmail.com> wrote:
> On Tue, Sep 6, 2011 at 10:17 PM, dieterv77 <diet...@gmail.com> wrote:
>> Python uses reference counted garbage collection, which means that
>> memory is freed as soon as nothing references it anymore. (Sorry if
>> i'm stating things that everyone already knows)
>> As far as i can tell, that means that with each assignment of 'model =
>> ' and 'result =', the previous model and result objects should
>> get garbage collected, unless there is something else still referring
>> to these.
>
> I didn't know this, I never really looked at the details, I only saw
> it mentioned quite often what I argued about the delayed collection.
>
>>Is there any chance that there are cyclical references
>> somewhere in
>> statsmodels?
>
> Yes, currently models and results are both attached to each other,
> the result instance is attached to the model instance in the fit
> method as _results, and
> during the __init__ of the results instance, the model gets attached
> as attribute `model`.
>
>
>>That's effectively the only time that the reference
>> counted approach to garbage collection breaks down and where garbage
>> collection may be delayed until
>> the cycles are detected.
>
> That looks like the explanation then.
>

IIUC and google correctly, the garbage collector still handles
circular references though as long as we don't implement a del method.
Could it help if the circular references were weak references? This
also might depend on the version of Python used? It just seems as
though it uses more memory that it could not that it keeps using more
memory.

Skipper

Skipper Seabold

unread,

Sep 6, 2011, 11:01:37 PM9/6/11

to pystat...@googlegroups.com

http://stackoverflow.com/questions/1035489/python-garbage-collection

dieterv77

unread,

Sep 6, 2011, 11:06:46 PM9/6/11

to pystatsmodels

> Yes, currently models and results are both attached to each other,
> the result instance is attached to the model instance in the fit
> method as _results, and
> during the __init__ of the results instance, the model gets attached
> as attribute `model`.

I see, is that something that could be changed? Does the model need
the reference to the results?
I ask because otherwise i will need need to put in some explicit del
statements to get the garbage collection to behave
nicely.

Thanks for the quick responses.

Dieter

josef...@gmail.com

unread,

Sep 6, 2011, 11:44:07 PM9/6/11

to pystat...@googlegroups.com

On Tue, Sep 6, 2011 at 11:06 PM, dieterv77 <diet...@gmail.com> wrote:
>
>> Yes, currently models and results are both attached to each other,
>> the result instance is attached to the model instance in the fit
>> method as _results, and
>> during the __init__ of the results instance, the model gets attached
>> as attribute `model`.
> I see, is that something that could be changed? Does the model need
> the reference to the results?

It might be possible.
I think for the linear_model classes, we only used _results in predict
and that was supposed to be changed.
You could try
del resultsinstance.model._results
and see if it helps or breaks things that you are using. (exceptions
should be obvious)

For some models it might require some rethinking or redesign, but I'm
not sure how much it is used. It might be a problem for estimators
that work in several steps, ARMA, and some sandbox models,.

I worried sometimes about the circular references, but since we have
never seen a clear disadvantage we kept using it because it made life
easier.

For an "official" change in policy and refactoring we would have to
look at the various models and might take some time. But I think that
modelinstance._result was more a convenience than a necessity.

-------------------
A different issue: timing

I was a bit surprised that the 300 repetition loop took so much time.
How the linear algebra is done has a large influence on the time. for
example, I tried

model = sm.OLS(Y,X)
result = model.fit(method="qr")

which reduces time by a third.

trying to switch to scipy.linalg.pinv gives me a MemoryError
>>> X.shape
(10000, 123)
>>> a = scipy.linalg.pinv(X)
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
a = scipy.linalg.pinv(X)
File "C:\Python26\lib\site-packages\scipy\linalg\basic.py", line 488, in pinv
return lstsq(a, b, cond=cond)[0]
File "C:\Python26\lib\site-packages\scipy\linalg\basic.py", line 435, in lstsq
overwrite_b=overwrite_b)
MemoryError

using moment equations is much faster than pinv(X), ratio 0.46 / 1.59
for this X.shape
b = np.linalg.pinv(np.dot(X.T, X)).dot(X.T.dot(Y))

The problem is that this depends on the shape of the exog, and what
results are desired, so it's currently just generic and not optimized
for a specific case.

Another point not resolved/implemented yet: Are you actually
estimating different endog, y, against the same exog, X, or was this
just as a test case?
reusing pinv(X) would save a lot of time

On the other hand, in my timing calling garbage collection increases
the time by 4 %, so a small impact compared to any of the other
adjustments.

We haven't seen much (memory and timing) profiling of statsmodels in
different use cases, so this is interesting.

Josef

Wes McKinney

unread,

Sep 6, 2011, 11:47:12 PM9/6/11

to pystat...@googlegroups.com

I caused a really bad memory leak in pandas not long ago by creating a
circular reference between NumPy arrays.

The main nuance I've noticed about Python internal memory allocation
is the whole business of memory arenas:

http://www.evanjones.ca/memoryallocator/

I am pretty sure this got fixed in Python 2.6. I will take a look at
this particular problem when I can and see if I can figure anything
out. Obviously not good

Skipper Seabold

unread,

Sep 6, 2011, 11:52:50 PM9/6/11

to pystat...@googlegroups.com

On Tue, Sep 6, 2011 at 11:06 PM, dieterv77 <diet...@gmail.com> wrote:
>

Sorry if this is sent twice.

Will you try the fix-circ-refs branch? It looks okay on my end. OLS
only right now.

Skipper

josef...@gmail.com

unread,

Sep 7, 2011, 12:00:50 AM9/7/11

to pystat...@googlegroups.com

I don't see an import _attach_results in linear_model in the changeset

?
Josef

>
> Skipper
>

Skipper Seabold

unread,

Sep 7, 2011, 12:04:56 AM9/7/11

to pystat...@googlegroups.com

Hmm. Fixed the memory problem. Broke everything else...

Skipper

Wes McKinney

unread,

Sep 7, 2011, 12:08:07 AM9/7/11

to pystat...@googlegroups.com

here's a self-contained way to reproduce the problem. Blog article forthcoming:

import numpy as np

class Faucet(object):

def __init__(self, obj):
self.obj = obj
self._water = None

def turn_on(self):
if self._water is None:
water = Water(self)
self._water = water
return self._water

class Water(object):

def __init__(self, faucet):
# well we want to know which faucet we came from!
self.faucet = faucet

def wetten(self):
pass

for i in xrange(50):
reservoir = np.empty((10000, 500), dtype=float)
reservoir.fill(0)
faucet = Faucet(reservoir)
water = faucet.turn_on()

haha, leaky faucet! =)

Skipper Seabold

unread,

Sep 7, 2011, 12:10:47 AM9/7/11

to pystat...@googlegroups.com

Have to instantiate the reference. Anyway, it doesn't seem to work as
I thought it might. I made the reference back to model weak as well.

Skipper

Wes McKinney

unread,

Sep 6, 2011, 11:57:22 PM9/6/11

to pystat...@googlegroups.com

Oho, no wonder it's leaking memory. The circle continues on forever.
Disappointing that the gc can't figure it out.

dieterv77

unread,

Sep 7, 2011, 12:18:18 AM9/7/11

to pystatsmodels

>
> Sorry if this is sent twice.
>
> Will you try the fix-circ-refs branch? It looks okay on my end. OLS
> only right now.

That branch seems to eliminate the memory growth for me, very nice.
Thanks for looking into the weak references.

FWIW, not surprisingly, it also worked to break the cycle explicitly,
for example,
by setting result.model = None

FYI, I ran into this because i was running ~ 10 consecutive robust
regressions, each of which were taking ~ 15 iterations
so in effect i had 150 different WLS regressions running and i was
seeing very strange memory behavior. So i wasn't trying to run a
bunch of regressions
with the same exogenous variables.

thanks again for these quick and helpful responses.

dieter
>
> Skipper

Skipper Seabold

unread,

Sep 7, 2011, 12:20:17 AM9/7/11

to pystat...@googlegroups.com

On Wed, Sep 7, 2011 at 12:18 AM, dieterv77 <diet...@gmail.com> wrote:
>
>>
>> Sorry if this is sent twice.
>>
>> Will you try the fix-circ-refs branch? It looks okay on my end. OLS
>> only right now.
>
> That branch seems to eliminate the memory growth for me, very nice.
> Thanks for looking into the weak references.
>

It does, but it also eliminates everything else because I used the
weakref wrong! It doesn't attach the result instance so it's no good.

Skipper

Skipper Seabold

unread,

Sep 7, 2011, 12:21:57 AM9/7/11

to pystat...@googlegroups.com

The stack overflow link I posted suggests that python should take care
of the circular references as long as the del method isn't defined for
the objects.

And it doesn't look like it grows indefinitely, it just doesn't look
like it's garbage collected at the end of every loop.

Skipper

Skipper Seabold

unread,

Sep 7, 2011, 12:36:45 AM9/7/11

to pystat...@googlegroups.com

Ha, oh no, indeed.

hasattr(result.model._results.model._results.model._results.model..., 'endog')
#True

josef...@gmail.com

unread,

Sep 7, 2011, 10:00:30 AM9/7/11

to pystat...@googlegroups.com

That's what a circle means, you can go around forever.

Does the weak reference work?

One more observation: I was looking at whether we should add a del and
qc.collect in RLM and GLM which use the WLS loop.

with result = sm.RLM(Y,X).fit() in the loop I get a MemoryError
(on my computer with lots of other things open)

adding the del and qc.collect in the loop in the script, memory
consumption has occassional jumps to close to 500MB but stays mostly
below 300MB, but it takes close to 20 Minutes for 100 repetitions in
the loop.

E:\Josef\work-oth2>python try_ols_memory.py

Traceback (most recent call last):

File "try_ols_memory.py", line 14, in <module>
result = sm.RLM(Y,X).fit()
File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\sta
tsmodels\robust\robust_linear_model.py", line 250, in fit
weights=self.weights).fit()
File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\sta
tsmodels\regression\linear_model.py", line 221, in fit
self.pinv_wexog = pinv_wexog = np.linalg.pinv(self.wexog)
File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1545, in pin
v
u, s, vt = svd(a, 0)
File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1323, in svd

u = u.transpose().astype(result_t)
MemoryError

Josef

Wes McKinney

unread,

Sep 7, 2011, 10:06:48 AM9/7/11

to pystat...@googlegroups.com

Someone had this to say to me about it (http://wesmckinney.com/blog/?p=187):

"The collector actually works with cycles, including this case.
(Actually, most objects are managed via reference-counting; the gc is
specifically for cyclic references.) The problem is that it's
triggered by # allocations, not # bytes. In this case, each cats array
is large enough that you need several GB of allocations before the
threshold is triggered. Try adding gc.collect() or gc.get_count() in
the loop (and inspect gc.get_threshold()).
http://docs.python.org/library/gc.html"

Skipper Seabold

unread,

Sep 7, 2011, 10:08:36 AM9/7/11

to pystat...@googlegroups.com

Right, just thought it was funny that I never noticed that.

> Does the weak reference work?
>

Nah, not how I thought it might. I removed the reference to _results.
As you pointed out (probably again), we don't really need it anywhere.
If we shadow predict in the results instance, it can automatically
pass params. I thought about making _results a boolean attribute, but
I just don't see that we need it. Making sure the tests are good
before pushing for review.

> One more observation: I was looking at whether we should add a del and
> qc.collect in RLM and GLM which use the WLS loop.
>
> with result = sm.RLM(Y,X).fit() in the loop I get a MemoryError
> (on my computer with lots of other things open)
>
> adding the del and qc.collect in the loop in the script, memory
> consumption has occassional jumps to close to 500MB but stays mostly
> below 300MB, but it takes close to 20 Minutes for 100 repetitions in
> the loop.

How big are the arrays? I also noticed a time problem on my laptop (no
ATLAS), but not my desktop (with ATLAS).

>
> E:\Josef\work-oth2>python try_ols_memory.py
> Traceback (most recent call last):
> File "try_ols_memory.py", line 14, in <module>
> result = sm.RLM(Y,X).fit()
> File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\sta
> tsmodels\robust\robust_linear_model.py", line 250, in fit
> weights=self.weights).fit()
> File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\sta
> tsmodels\regression\linear_model.py", line 221, in fit
> self.pinv_wexog = pinv_wexog = np.linalg.pinv(self.wexog)
> File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1545, in pin
> v
> u, s, vt = svd(a, 0)
> File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1323, in svd
>
> u = u.transpose().astype(result_t)
> MemoryError
>

Will you try again after I push? Should go away.

Skipper

Skipper Seabold

unread,

Sep 7, 2011, 10:10:36 AM9/7/11

to pystat...@googlegroups.com

Ah, ok that explains why we were seeing memory more or less constant after a
certain number of loops.

Skipper

Skipper Seabold

unread,

Sep 7, 2011, 10:15:46 AM9/7/11

to pystat...@googlegroups.com

Pushed. I'll fix tsa in the pandas-integration branch to avoid having
to do too much fixing of conflicts when that's ready to merge in soon.

Skipper

josef...@gmail.com

unread,

Sep 7, 2011, 10:22:58 AM9/7/11

to pystat...@googlegroups.com

When I ran the memory script on the commandline, I kept seeing the
seesaw pattern the entire time. Running it in IDLE that had already
800MB of memory in use, the memory usage stayed constant.

>
> Skipper
>

josef...@gmail.com

unread,

Sep 7, 2011, 10:41:50 AM9/7/11

to pystat...@googlegroups.com

I think that's the best, at least for the basic models. predict needed
to be fixed independently of this.

>>
>>> One more observation: I was looking at whether we should add a del and
>>> qc.collect in RLM and GLM which use the WLS loop.
>>>
>>> with result = sm.RLM(Y,X).fit() in the loop I get a MemoryError
>>> (on my computer with lots of other things open)
>>>
>>> adding the del and qc.collect in the loop in the script, memory
>>> consumption has occassional jumps to close to 500MB but stays mostly
>>> below 300MB, but it takes close to 20 Minutes for 100 repetitions in
>>> the loop.
>>
>> How big are the arrays? I also noticed a time problem on my laptop (no
>> ATLAS), but not my desktop (with ATLAS).

Not sure, I think the main cuplrit for memory and time of a single
operation is svd in pinv which uses large intermediate arrays.
With WLS we have to keep track of the weighted data, and several
intermediate arrays. The low looked like 70MB.

My notebook could also be slow because of too many open programs
(firefox with 2GB).

>>
>>>
>>> E:\Josef\work-oth2>python try_ols_memory.py
>>> Traceback (most recent call last):
>>> File "try_ols_memory.py", line 14, in <module>
>>> result = sm.RLM(Y,X).fit()
>>> File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\sta
>>> tsmodels\robust\robust_linear_model.py", line 250, in fit
>>> weights=self.weights).fit()
>>> File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\sta
>>> tsmodels\regression\linear_model.py", line 221, in fit
>>> self.pinv_wexog = pinv_wexog = np.linalg.pinv(self.wexog)
>>> File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1545, in pin
>>> v
>>> u, s, vt = svd(a, 0)
>>> File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1323, in svd
>>>
>>> u = u.transpose().astype(result_t)
>>> MemoryError
>>>
>>
>> Will you try again after I push? Should go away.

Will try soon, but might need to change my working directory.

>>
>> Skipper
>>
>
> Pushed. I'll fix tsa in the pandas-integration branch to avoid having
> to do too much fixing of conflicts when that's ready to merge in soon.

The changeset looks good, but hold on a bit pushing it to trunk.
Since it's the first backwards incompatible change I want to look at
other changes that we need for predict first. (**kwds ?)

Josef

>
> Skipper
>

josef...@gmail.com

unread,

Sep 7, 2011, 10:59:14 AM9/7/11

to pystat...@googlegroups.com

Looks good, memory consumption with RLM fit loop stays in the 120 to
150MB range.
All tests pass.

But change in the predict signature will require adjustments by users.

Josef

dieterv77

unread,

Sep 19, 2011, 9:38:13 PM9/19/11

to pystatsmodels

On Sep 7, 10:59 am, josef.p...@gmail.com wrote:
> On Wed, Sep 7, 2011 at 10:41 AM, <josef.p...@gmail.com> wrote:
> > On Wed, Sep 7, 2011 at 10:15 AM, Skipper Seabold <jsseab...@gmail.com> wrote:
> >> On Wed, Sep 7, 2011 at 10:08 AM, Skipper Seabold <jsseab...@gmail.com> wrote:
> >>> On Wed, Sep 7, 2011 at 10:00 AM, <josef.p...@gmail.com> wrote:
> >>>> On Wed, Sep 7, 2011 at 12:36 AM, Skipper Seabold <jsseab...@gmail.com> wrote:
> >>>>> On Wed, Sep 7, 2011 at 12:21 AM, Skipper Seabold <jsseab...@gmail.com> wrote:
> >>>>>> On Tue, Sep 6, 2011 at 11:57 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>>>> On Tue, Sep 6, 2011 at 11:52 PM, Skipper Seabold <jsseab...@gmail.com> wrote:

> >>>>>>>> On Tue, Sep 6, 2011 at 11:06 PM, dieterv77 <dieter...@gmail.com> wrote:
>
> >>>>>>>>>> Yes, currently models and results are both attached to each other,
> >>>>>>>>>> the result instance is attached to the model instance in the fit
> >>>>>>>>>> method as _results, and
> >>>>>>>>>> during the __init__ of the results instance, the model gets attached
> >>>>>>>>>> as attribute `model`.
> >>>>>>>>> I see, is that something that could be changed? Does the model need
> >>>>>>>>> the reference to the results?
> >>>>>>>>> I ask because otherwise i will need need to put in some explicit del
> >>>>>>>>> statements to get the garbage collection to behave
> >>>>>>>>> nicely.
>
> >>>>>>>>> Thanks for the quick responses.
>
> >>>>>>>> Sorry if this is sent twice.
>
> >>>>>>>> Will you try the fix-circ-refs branch? It looks okay on my end. OLS
> >>>>>>>> only right now.
>
> >>>>>>>> Skipper
>

> >>>>>>> Oho, no wonder it's leakingmemory. The circle continues on forever.

> > Not sure, I think the main cuplrit formemoryand time of a single

> > operation is svd in pinv which uses large intermediate arrays.
> > With WLS we have to keep track of the weighted data, and several
> > intermediate arrays. The low looked like 70MB.
>
> > My notebook could also be slow because of too many open programs
> > (firefox with 2GB).
>
> >>>> E:\Josef\work-oth2>python try_ols_memory.py
> >>>> Traceback (most recent call last):
> >>>> File "try_ols_memory.py", line 14, in <module>
> >>>> result = sm.RLM(Y,X).fit()
> >>>> File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\sta
> >>>> tsmodels\robust\robust_linear_model.py", line 250, in fit
> >>>> weights=self.weights).fit()
> >>>> File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\sta
> >>>> tsmodels\regression\linear_model.py", line 221, in fit
> >>>> self.pinv_wexog = pinv_wexog = np.linalg.pinv(self.wexog)
> >>>> File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1545, in pin
> >>>> v
> >>>> u, s, vt = svd(a, 0)
> >>>> File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1323, in svd
>
> >>>> u = u.transpose().astype(result_t)
> >>>> MemoryError
>
> >>> Will you try again after I push? Should go away.
>
> > Will try soon, but might need to change my working directory.
>

> Looks good,memoryconsumption with RLM fit loop stays in the 120 to

> 150MB range.
> All tests pass.
>
> But change in the predict signature will require adjustments by users.
>
> Josef
>
>
>
>
>
>
>
>
>
> >>> Skipper
>
> >> Pushed. I'll fix tsa in the pandas-integration branch to avoid having
> >> to do too much fixing of conflicts when that's ready to merge in soon.
>
> > The changeset looks good, but hold on a bit pushing it to trunk.
> > Since it's the first backwards incompatible change I want to look at
> > other changes that we need for predict first. (**kwds ?)
>

I wanted to check and see if you guys thought that the changes in the
fix circular reference branches
were in good shape to be merged to master?

thanks again
dieter

josef...@gmail.com

unread,

Sep 19, 2011, 10:17:14 PM9/19/11

to pystat...@googlegroups.com

I had less time then expected in the last weeks to work on statsmodels
and haven't thought about it anymore.

I'm starting to give up on the idea to get a quick backwards
compatible "cleanup" release out. Especially since Skipper is working
pretty fast on the pandas integration.

The only question for the circular reference refactoring is what the
signature for the predict method will be. There are no problems with
removing the circular reference (results attached to models) and that
will be merged for sure.

The signature of the predict method changed in a backwards
incompatible way (for some usage patterns), and I think it needs to be
changed further, at least on the level of the "base" models. Except
for this there is no problem merging the branch.

We could merge the branch which would solve the circular reference
problem in "trunk", but the status of predict will remain in flux.
predict needs a serious review across models, which we haven't done so
far.

Josef

>
> thanks again
> dieter
>
>
>

Skipper Seabold

unread,

Sep 20, 2011, 10:20:49 AM9/20/11

to pystat...@googlegroups.com

I don't think there is anything in the pandas integration branch that
isn't backwards compatible.

> The only question for the circular reference refactoring is what the
> signature for the predict method will be. There are no problems with
> removing the circular reference (results attached to models) and that
> will be merged for sure.
>
> The signature of the predict method changed in a backwards
> incompatible way (for some usage patterns), and I think it needs to be
> changed further, at least on the level of the "base" models. Except
> for this there is no problem merging the branch.
>
> We could merge the branch which would solve the circular reference
> problem in "trunk", but the status of predict will remain in flux.
> predict needs a serious review across models, which we haven't done so
> far.
>

I agree. I think it might be worth some short term trouble to get
predict right in the next major release though. I will have a look at
merging in the circular reference branch soon-ish, if we are okay with
"breaking"/fixing predict. It will still need some more attention
though before a release.

Skipper

Wes McKinney

unread,

Sep 20, 2011, 10:23:17 AM9/20/11

to pystat...@googlegroups.com

There definitely shouldn't be! That was the whole point of the wrapper design.

Skipper Seabold

unread,

Sep 20, 2011, 10:25:38 AM9/20/11

to pystat...@googlegroups.com

On Tue, Sep 20, 2011 at 10:23 AM, Wes McKinney <wesm...@gmail.com> wrote:
> On Tue, Sep 20, 2011 at 10:20 AM, Skipper Seabold <jsse...@gmail.com> wrote:

<snip>

>> I don't think there is anything in the pandas integration branch that
>> isn't backwards compatible.
>
> There definitely shouldn't be! That was the whole point of the wrapper design.
>

Oh absolutely, any incompatible changes would be from my fiddling with
the TSA stuff. (I'm thinking mainly of predict in AR).

Skipper

josef...@gmail.com

unread,

Sep 20, 2011, 10:41:04 AM9/20/11

to pystat...@googlegroups.com

On Tue, Sep 20, 2011 at 10:25 AM, Skipper Seabold <jsse...@gmail.com> wrote:
> On Tue, Sep 20, 2011 at 10:23 AM, Wes McKinney <wesm...@gmail.com> wrote:
>> On Tue, Sep 20, 2011 at 10:20 AM, Skipper Seabold <jsse...@gmail.com> wrote:
> <snip>
>>> I don't think there is anything in the pandas integration branch that
>>> isn't backwards compatible.
>>
>> There definitely shouldn't be! That was the whole point of the wrapper design.
>>

That sounds better than I expected.

>
> Oh absolutely, any incompatible changes would be from my fiddling with
> the TSA stuff. (I'm thinking mainly of predict in AR).

To earlier message:
We have to break backwards compatibility with predict, but I would
prefer if we need to do it only once, and leave some room for future
expansion, and that predict in tsa and in basic models are not
completely different.
Go ahead with the merge then, it shouldn't be too far off that I have
to look at predict.

There will be some change in the options and return for summary,
compared to 0.3.