GLM high memory usage

Han Fang

Jan 19, 2017, 16:54:17
to pystatsmodels

Hi,

I am seeing very high memory usage with smf.glm. The input data has 1,988,834 observations, and the model has 4,826 degrees of freedom, 4,825 of which come from categorical variables (see the summary below). According to memory_profiler, memory usage spikes at the smf.glm call (148,299.7 MiB), but the actual peak memory (maxvmem reported by the system) was 582.401 GB.
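A rough back-of-envelope check (my numbers, not from the post) suggests the spike is consistent with patsy materializing a dense float64 design matrix: 4,826 model df plus an intercept gives roughly 4,827 columns, and IRLS then needs several working copies of arrays this size.

```python
# Back-of-envelope: size of one dense float64 design matrix
# for the model described above (4,827 columns is an assumption:
# 4,826 model df + intercept).
n_obs = 1_988_834
n_cols = 4_827
bytes_per_float = 8  # float64

dense_gib = n_obs * n_cols * bytes_per_float / 2**30
print(f"{dense_gib:.1f} GiB")  # -> 71.5 GiB for a single copy
```

Two such copies already exceed the ~145 GiB increment that memory_profiler attributes to the smf.glm line.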

Question:
Since I have a very sparse design matrix, do you think this could be solved by a sparse version of the GLM function, as was discussed here?
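To illustrate the idea (a sketch, not a statsmodels feature — its GLM currently works on dense exog): a one-level categorical factor encodes to one nonzero per row, so a scipy.sparse representation shrinks the design matrix by orders of magnitude. The sizes here are a small stand-in for the real data's ~4,825 dummy columns.

```python
import numpy as np
from scipy import sparse

# Hypothetical small stand-in for the real data: n observations,
# one categorical factor with k levels.
n, k = 100_000, 500
rng = np.random.default_rng(0)
codes = rng.integers(0, k, size=n)  # category code per observation

# Sparse one-hot encoding: exactly one nonzero per row,
# instead of k float64 entries per row in the dense case.
rows = np.arange(n)
X_sparse = sparse.csr_matrix((np.ones(n), (rows, codes)), shape=(n, k))

dense_bytes = n * k * 8  # what a dense float64 design matrix would take
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)
print(dense_bytes // 2**20, "MiB dense vs",
      sparse_bytes // 2**20, "MiB sparse")
```

With ~4,825 dummy columns instead of 500, the dense/sparse ratio only grows, which is why a sparse-aware GLM (or a solver that accepts scipy.sparse exog) would help here.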

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:         ribosome_count   No. Observations:              1988834
Model:                            GLM   Df Residuals:                  1984007
Model Family:        NegativeBinomial   Df Model:                         4826
Link Function:                    log   Scale:                  0.825952377966
Method:                          IRLS   Log-Likelihood:            -5.7952e+06
Date:                Thu, 19 Jan 2017   Deviance:                   1.1605e+06
Time:                        04:00:48   Pearson chi2:                 1.64e+06
No. Iterations:                    16
Line #    Mem usage    Increment   Line Contents
================================================
    77    327.0 MiB      0.0 MiB       @profile
    78                                 def nbGlm(self):
    79                                     ## define model formula,
    80    327.0 MiB      0.0 MiB           formula = 'ribosome_count ~ C(gene_name) + C(codon) + pair_prob'
    81    327.0 MiB      0.0 MiB           print("[status]\tFormula: " + str(formula), flush=True)
    82
    83                                     ## define model fitting options
    84    327.0 MiB      0.0 MiB           sovler = "IRLS" # "lbfgs"
    85    327.0 MiB      0.0 MiB           tolerence = 1e-4
    86    327.0 MiB      0.0 MiB           numIter = 100
    87    327.0 MiB      0.0 MiB           print("[status]\tSolver: " + sovler, flush=True)
    88    327.0 MiB      0.0 MiB           print("[status]\tConvergence tolerance: " + str(tolerence), flush=True)
    89    327.0 MiB      0.0 MiB           print("[status]\tMaxiter: " + str(numIter), flush=True)
    90
    91                                     ## model fitting NegativeBinomial GLM
    92    327.0 MiB      0.0 MiB           print("[status]\tModel: smf.glm(formula, self.df, family=sm.families.NegativeBinomial(), offset=self.df['log_TPM']", flush=True)
    93 148299.7 MiB 147972.6 MiB           mod = smf.glm(formula, self.df, family=sm.families.NegativeBinomial(), offset=self.df['log_TPM'])
    94 149961.2 MiB   1661.5 MiB           res = mod.fit(method=sovler, tol=tolerence, maxiter=numIter)
    95
    96                                     ## print model output
    97 149971.1 MiB     10.0 MiB           print (res.summary())