Hmm, no that would've been my guess too. I have no idea off the top of
my head. I edited it to use the same Y and X in every model, and I see
roughly the same thing. It uses about 0.75 GB for the first ~150 models,
then stays there for the last ~150, and goes back down to about 0.5 GB
after garbage collection. I'm afraid I don't know enough about
Python's (anything's) memory management, but I assume that means we
are doing some copying somewhere (?), though I don't have a guess of
where just yet. We don't compute any of the heavy results until you
ask for them and it strikes me as odd that each model and results
instantiation would use several MB. Is it possible there's something
else going on under the hood with Python? Any tools for checking this
kind of thing?
> Thanks for your time, and for all the effort many of you put into this
> package.
>
Thanks for the feedback. Helps us improve things.
Skipper
Not our fault, I think; this is just standard Python garbage
collection, which doesn't collect immediately but periodically,
depending on available memory.
I watched a few cycles; on my computer it goes up to around 1GB, then
drops down to around 100MB, then goes up again, and this repeats
several times.
If I add immediate garbage collection into the loop (with import gc at
the top):

    del model
    del result
    gc.collect()

without the del it stays around 100 to 110 MB; with the del it stays
around 80 to 100 MB.
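A rough reconstruction of the loop I mean, with sm.OLS replaced by a plain numpy least-squares call so the snippet runs on its own (the shapes are made up, much smaller than the real test):

```python
import gc
import numpy as np

rng = np.random.RandomState(0)
X = rng.standard_normal((1000, 5))
Y = X.dot(rng.standard_normal(5))

counts = []
for i in range(20):
    # stand-in for: result = sm.OLS(Y, X).fit()
    result = np.linalg.lstsq(X, Y, rcond=None)[0]
    del result           # drop the reference right away
    counts.append(gc.collect())  # collect immediately instead of periodically
```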
I'm on Windows, but I don't think this should differ much across OSs.
Josef
This script isn't supposed to print anything. You can watch the memory
consumption in taskmanager on Windows, or whatever the Linux
equivalent is.
I wouldn't recommend printing in this example, since the model and
arrays are large.
If you want to start with examples that print the results, then you
could start with the scripts in the scikits/statsmodels/examples
folder available in the statsmodels source.
Looking at some Python (and numpy/scipy) tutorials at the same time
will be useful, since statsmodels currently assumes quite a bit of
familiarity with Python and numpy.
But I think Python, numpy, and statsmodels are easy enough that users
can produce useful results pretty fast.
Cheers,
Josef
I didn't know this; I never really looked at the details. I had only
seen the delayed collection mentioned quite often, which is what I was
arguing from.
>Is there any chance that there are cyclical references
> somewhere in
> statsmodels?
Yes, currently models and results are attached to each other: the
results instance is attached to the model instance as _results in the
fit method, and during the __init__ of the results instance the model
gets attached as the attribute `model`.
>That's effectively the only time that the reference
> counted approach to garbage collection breaks down and where garbage
> collection may be delayed until
> the cycles are detected.
That looks like the explanation then.
Josef
If I understand (and google) correctly, the garbage collector still
handles circular references, though, as long as we don't implement a
__del__ method.
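A tiny illustration of that (a made-up Node class, nothing from statsmodels):

```python
import gc

class Node(object):
    pass

a = Node()
b = Node()
a.partner = b
b.partner = a      # a and b now form a reference cycle

del a, b           # refcounts never drop to zero because of the cycle
n = gc.collect()   # but the cyclic collector still finds and frees them
```

n is the number of unreachable objects the collector found, so it includes the two Node instances.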
Could it help if the circular references were weak references? This
might also depend on the Python version. It just seems as though it
uses more memory than it could, not that it keeps using more memory.
Skipper
It might be possible.
I think for the linear_model classes we only used _results in predict,
and that was supposed to be changed.
You could try

    del resultsinstance.model._results

and see if it helps or breaks things that you are using (exceptions
should be obvious).
For some models it might require some rethinking or redesign, but I'm
not sure how much it is used. It might be a problem for estimators
that work in several steps, ARMA, and some sandbox models.
I worried sometimes about the circular references, but since we have
never seen a clear disadvantage we kept using it because it made life
easier.
For an "official" change in policy and refactoring we would have to
look at the various models, and it might take some time. But I think
that modelinstance._results was more a convenience than a necessity.
-------------------
A different issue: timing
I was a bit surprised that the 300-repetition loop took so much time.
How the linear algebra is done has a large influence on the time. For
example, I tried

    model = sm.OLS(Y, X)
    result = model.fit(method="qr")

which reduces the time by a third.
Trying to switch to scipy.linalg.pinv gives me a MemoryError:

>>> X.shape
(10000, 123)
>>> a = scipy.linalg.pinv(X)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    a = scipy.linalg.pinv(X)
  File "C:\Python26\lib\site-packages\scipy\linalg\basic.py", line 488, in pinv
    return lstsq(a, b, cond=cond)[0]
  File "C:\Python26\lib\site-packages\scipy\linalg\basic.py", line 435, in lstsq
    overwrite_b=overwrite_b)
MemoryError
Using moment equations is much faster than pinv(X), ratio 0.46 / 1.59
for this X.shape:

    b = np.linalg.pinv(np.dot(X.T, X)).dot(X.T.dot(Y))
The problem is that this depends on the shape of the exog, and what
results are desired, so it's currently just generic and not optimized
for a specific case.
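A quick check that the two routes agree when X has full column rank (made-up shapes here, much smaller than the 10000 x 123 case above):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.standard_normal((2000, 50))
Y = X.dot(rng.standard_normal(50)) + rng.standard_normal(2000)

# generic route: pseudo-inverse of the tall n x k design matrix (SVD of X)
b_pinv = np.linalg.pinv(X).dot(Y)

# moment-equation route: only the small k x k matrix X'X gets inverted
b_mom = np.linalg.pinv(X.T.dot(X)).dot(X.T.dot(Y))
```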
Another point, not resolved/implemented yet: are you actually
estimating different endog, y, against the same exog, X, or was this
just a test case? Reusing pinv(X) would save a lot of time.
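A sketch of what reusing pinv(X) could look like for several endog against the same exog (shapes made up):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.standard_normal((500, 10))

pinv_X = np.linalg.pinv(X)                 # expensive SVD, computed once

# several endog vectors against the same exog
ys = [X.dot(rng.standard_normal(10)) for _ in range(5)]
betas = [pinv_X.dot(y) for y in ys]        # each solve is now a cheap matvec
```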
On the other hand, in my timing, calling garbage collection increases
the time by 4%, a small impact compared to any of the other
adjustments.
We haven't seen much (memory and timing) profiling of statsmodels in
different use cases, so this is interesting.
Josef
I caused a really bad memory leak in pandas not long ago by creating a
circular reference between NumPy arrays.
The main nuance I've noticed about Python internal memory allocation
is the whole business of memory arenas:
http://www.evanjones.ca/memoryallocator/
I am pretty sure this got fixed in Python 2.6. I will take a look at
this particular problem when I can and see if I can figure anything
out. Obviously not good.
Sorry if this is sent twice.
Will you try the fix-circ-refs branch? It looks okay on my end. OLS
only right now.
Skipper
I don't see an import of _attach_results in linear_model in the
changeset?
Josef
>
> Skipper
>
Hmm. Fixed the memory problem. Broke everything else...
Skipper
here's a self-contained way to reproduce the problem. Blog article forthcoming:
import numpy as np

class Faucet(object):
    def __init__(self, obj):
        self.obj = obj
        self._water = None

    def turn_on(self):
        if self._water is None:
            water = Water(self)
            self._water = water
        return self._water

class Water(object):
    def __init__(self, faucet):
        # well we want to know which faucet we came from!
        self.faucet = faucet

    def wetten(self):
        pass

for i in xrange(50):
    reservoir = np.empty((10000, 500), dtype=float)
    reservoir.fill(0)
    faucet = Faucet(reservoir)
    water = faucet.turn_on()
haha, leaky faucet! =)
Have to instantiate the reference. Anyway, it doesn't seem to work as
I thought it might. I made the reference back to model weak as well.
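For the record, the weak back-reference idea amounts to roughly this (Model/Results are stand-ins, not the actual statsmodels classes):

```python
import weakref

class Model(object):
    def fit(self):
        results = Results(self)
        self._results = results            # strong ref: model -> results
        return results

class Results(object):
    def __init__(self, model):
        # weak back-reference: does not keep the model alive by itself,
        # so no strong reference cycle is formed
        self._model = weakref.ref(model)

    @property
    def model(self):
        return self._model()               # None once the model is collected

m = Model()
r = m.fit()
```

The catch is exactly what I hit: if the user keeps only the results instance, the model can be collected out from under it.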
Skipper
Oho, no wonder it's leaking memory. The circle continues on forever.
Disappointing that the gc can't figure it out.
It does, but it also eliminates everything else because I used the
weakref wrong! It doesn't attach the result instance so it's no good.
Skipper
The Stack Overflow link I posted suggests that Python should take care
of the circular references as long as the __del__ method isn't defined
for the objects.
And it doesn't look like it grows indefinitely; it just doesn't look
like it's garbage collected at the end of every loop.
Skipper
Ha, oh no, indeed.
hasattr(result.model._results.model._results.model..., 'endog')
# True
That's what a circle means, you can go around forever.
Does the weak reference work?
One more observation: I was looking at whether we should add a del and
gc.collect in RLM and GLM, which use the WLS loop.
With result = sm.RLM(Y,X).fit() in the loop I get a MemoryError
(on my computer with lots of other things open).
Adding the del and gc.collect in the loop in the script, memory
consumption has occasional jumps to close to 500MB but stays mostly
below 300MB, but it takes close to 20 minutes for 100 repetitions in
the loop.
E:\Josef\work-oth2>python try_ols_memory.py
Traceback (most recent call last):
  File "try_ols_memory.py", line 14, in <module>
    result = sm.RLM(Y,X).fit()
  File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\statsmodels\robust\robust_linear_model.py", line 250, in fit
    weights=self.weights).fit()
  File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\statsmodels\regression\linear_model.py", line 221, in fit
    self.pinv_wexog = pinv_wexog = np.linalg.pinv(self.wexog)
  File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1545, in pinv
    u, s, vt = svd(a, 0)
  File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1323, in svd
    u = u.transpose().astype(result_t)
MemoryError
Josef
Someone had this to say to me about it (http://wesmckinney.com/blog/?p=187):
"The collector actually works with cycles, including this case.
(Actually, most objects are managed via reference-counting; the gc is
specifically for cyclic references.) The problem is that it's
triggered by # allocations, not # bytes. In this case, each cats array
is large enough that you need several GB of allocations before the
threshold is triggered. Try adding gc.collect() or gc.get_count() in
the loop (and inspect gc.get_threshold()).
http://docs.python.org/library/gc.html"
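The counters and thresholds the quote mentions can be inspected directly:

```python
import gc

# collection is triggered by per-generation allocation counts, not bytes
thresholds = gc.get_threshold()   # commonly (700, 10, 10) in CPython
counts = gc.get_count()           # allocations minus deallocations so far
```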
Right, just thought it was funny that I never noticed that.
> Does the weak reference work?
>
Nah, not the way I thought it might. I removed the reference to
_results. As you pointed out (probably again), we don't really need it
anywhere. If we shadow predict in the results instance, it can
automatically pass params. I thought about making _results a boolean
attribute, but I just don't see that we need it. I'm making sure the
tests are good before pushing for review.
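A minimal sketch of the shadowed predict described above (the class and method bodies are illustrative, not the actual statsmodels code):

```python
class Model(object):
    def fit(self):
        params = self._estimate()
        return Results(self, params)        # one-way reference only

    def _estimate(self):
        # placeholder for the real estimation
        return [1.0, 2.0]

    def predict(self, params, exog):
        # model-level predict still takes params explicitly
        return [sum(p * x for p, x in zip(params, row)) for row in exog]

class Results(object):
    def __init__(self, model, params):
        self.model = model                  # results -> model, no back-ref
        self.params = params

    def predict(self, exog):
        # shadowed predict: supplies its own params automatically
        return self.model.predict(self.params, exog)

res = Model().fit()
```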
> One more observation: I was looking at whether we should add a del and
> gc.collect in RLM and GLM which use the WLS loop.
>
> with result = sm.RLM(Y,X).fit() in the loop I get a MemoryError
> (on my computer with lots of other things open)
>
> adding the del and gc.collect in the loop in the script, memory
> consumption has occasional jumps to close to 500MB but stays mostly
> below 300MB, but it takes close to 20 minutes for 100 repetitions in
> the loop.
How big are the arrays? I also noticed a time problem on my laptop (no
ATLAS), but not my desktop (with ATLAS).
>
> E:\Josef\work-oth2>python try_ols_memory.py
> Traceback (most recent call last):
>   File "try_ols_memory.py", line 14, in <module>
>     result = sm.RLM(Y,X).fit()
>   File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\statsmodels\robust\robust_linear_model.py", line 250, in fit
>     weights=self.weights).fit()
>   File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\statsmodels\regression\linear_model.py", line 221, in fit
>     self.pinv_wexog = pinv_wexog = np.linalg.pinv(self.wexog)
>   File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1545, in pinv
>     u, s, vt = svd(a, 0)
>   File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1323, in svd
>     u = u.transpose().astype(result_t)
> MemoryError
>
Will you try again after I push? Should go away.
Skipper
Ah, OK, that explains why we were seeing memory stay more or less
constant after a certain number of loops.
Skipper
Pushed. I'll fix tsa in the pandas-integration branch to avoid having
to do too much fixing of conflicts when that's ready to merge in soon.
Skipper
When I ran the memory script on the command line, I kept seeing the
seesaw pattern the entire time. Running it in an IDLE session that
already had 800MB of memory in use, the memory usage stayed constant.
>
> Skipper
>
I think that's the best, at least for the basic models. predict needed
to be fixed independently of this.
>>
>>> One more observation: I was looking at whether we should add a del and
>>> gc.collect in RLM and GLM which use the WLS loop.
>>>
>>> with result = sm.RLM(Y,X).fit() in the loop I get a MemoryError
>>> (on my computer with lots of other things open)
>>>
>>> adding the del and gc.collect in the loop in the script, memory
>>> consumption has occasional jumps to close to 500MB but stays mostly
>>> below 300MB, but it takes close to 20 minutes for 100 repetitions in
>>> the loop.
>>
>> How big are the arrays? I also noticed a time problem on my laptop (no
>> ATLAS), but not my desktop (with ATLAS).
Not sure; I think the main culprit for the memory and time of a single
operation is the svd in pinv, which uses large intermediate arrays.
With WLS we have to keep track of the weighted data and several
intermediate arrays. The low point looked like 70MB.
My notebook could also be slow because of too many open programs
(firefox with 2GB).
>>
>>>
>>> E:\Josef\work-oth2>python try_ols_memory.py
>>> Traceback (most recent call last):
>>>   File "try_ols_memory.py", line 14, in <module>
>>>     result = sm.RLM(Y,X).fit()
>>>   File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\statsmodels\robust\robust_linear_model.py", line 250, in fit
>>>     weights=self.weights).fit()
>>>   File "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-josef\scikits\statsmodels\regression\linear_model.py", line 221, in fit
>>>     self.pinv_wexog = pinv_wexog = np.linalg.pinv(self.wexog)
>>>   File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1545, in pinv
>>>     u, s, vt = svd(a, 0)
>>>   File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1323, in svd
>>>     u = u.transpose().astype(result_t)
>>> MemoryError
>>>
>>
>> Will you try again after I push? Should go away.
Will try soon, but might need to change my working directory.
>>
>> Skipper
>>
>
> Pushed. I'll fix tsa in the pandas-integration branch to avoid having
> to do too much fixing of conflicts when that's ready to merge in soon.
The changeset looks good, but hold off a bit on pushing it to trunk.
Since it's the first backwards-incompatible change, I want to look at
other changes that we need for predict first. (**kwds ?)
Josef
>
> Skipper
>
Looks good: memory consumption with the RLM fit loop stays in the 120
to 150MB range.
All tests pass.
But the change in the predict signature will require adjustments by
users.
Josef
I had less time than expected in the last weeks to work on statsmodels
and haven't thought about it anymore.
I'm starting to give up on the idea of getting a quick backwards
compatible "cleanup" release out, especially since Skipper is working
pretty fast on the pandas integration.
The only question for the circular reference refactoring is what the
signature for the predict method will be. There are no problems with
removing the circular reference (results attached to models) and that
will be merged for sure.
The signature of the predict method changed in a backwards
incompatible way (for some usage patterns), and I think it needs to be
changed further, at least on the level of the "base" models. Except
for this there is no problem merging the branch.
We could merge the branch which would solve the circular reference
problem in "trunk", but the status of predict will remain in flux.
predict needs a serious review across models, which we haven't done so
far.
Josef
>
> thanks again
> dieter
>
>
>
I don't think there is anything in the pandas integration branch that
isn't backwards compatible.
> The only question for the circular reference refactoring is what the
> signature for the predict method will be. There are no problems with
> removing the circular reference (results attached to models) and that
> will be merged for sure.
>
> The signature of the predict method changed in a backwards
> incompatible way (for some usage patterns), and I think it needs to be
> changed further, at least on the level of the "base" models. Except
> for this there is no problem merging the branch.
>
> We could merge the branch which would solve the circular reference
> problem in "trunk", but the status of predict will remain in flux.
> predict needs a serious review across models, which we haven't done so
> far.
>
I agree. I think it might be worth some short-term trouble to get
predict right in the next major release, though. I will have a look at
merging in the circular reference branch soon-ish, if we are okay with
"breaking"/fixing predict. It will still need some more attention
before a release.
Skipper
There definitely shouldn't be! That was the whole point of the wrapper design.
Oh absolutely, any incompatible changes would be from my fiddling with
the TSA stuff. (I'm thinking mainly of predict in AR).
Skipper
That sounds better than I expected.
>
> Oh absolutely, any incompatible changes would be from my fiddling with
> the TSA stuff. (I'm thinking mainly of predict in AR).
Regarding the earlier message:
We have to break backwards compatibility with predict, but I would
prefer to do it only once, leave some room for future expansion, and
keep predict in tsa and in the basic models from being completely
different.
Go ahead with the merge then; it shouldn't be too long before I have
to look at predict.
There will be some changes in the options and return of summary,
compared to 0.3.
Josef
>
> Skipper
>