Hi,
I am seeing very high memory usage when using smf.glm. The input data has 1,988,834 observations, with 4,826 model degrees of freedom, 4,825 of which come from categorical variables (see the summary below). According to memory_profiler, memory usage spikes at the smf.glm call (148,299.7 MiB), but the actual peak memory (maxvmem) reported by the system was 582.401 GB.
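For context, a rough back-of-the-envelope estimate (my own arithmetic, not from the profiler) suggests a single dense copy of the design matrix is already enormous, and patsy/statsmodels build intermediate copies on top of that:

```python
# Rough footprint of the dense design matrix patsy builds:
# one float64 column per dummy level, plus intercept and pair_prob.
rows = 1_988_834          # No. Observations
cols = 4_826 + 1          # Df Model + intercept
gib = rows * cols * 8 / 2**30   # float64 = 8 bytes
print(f"{gib:.1f} GiB")   # ~71.5 GiB for one dense copy
```

The ~148 GiB increment at the smf.glm line is consistent with roughly two dense copies of a matrix this size being alive at once.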
Question:
Since the design matrix is very sparse, do you think this could be solved by a sparse version of the GLM function, as was discussed here?
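For what it's worth, the dummy-encoded part of the design matrix can be built sparsely with scipy; as far as I can tell statsmodels' GLM does not currently accept a sparse exog, so this is only a sketch (using a made-up toy frame, not my real data) of what the design-matrix side could look like:

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Toy frame mimicking the structure of self.df (hypothetical values).
df = pd.DataFrame({
    'gene_name': ['g1', 'g2', 'g1', 'g3'],
    'codon':     ['AAA', 'AAG', 'AAA', 'GGG'],
    'pair_prob': [0.1, 0.2, 0.3, 0.4],
})

def sparse_dummies(s):
    """Treatment-coded dummy columns (first level dropped) as sparse CSR."""
    cats = s.astype('category')
    codes = cats.cat.codes.to_numpy()
    n, k = len(s), len(cats.cat.categories)
    mask = codes > 0                       # rows not in the reference level
    rows = np.arange(n)[mask]
    cols = codes[mask] - 1
    data = np.ones(rows.size)
    return sp.csr_matrix((data, (rows, cols)), shape=(n, k - 1))

# Sparse analogue of 'ribosome_count ~ C(gene_name) + C(codon) + pair_prob'.
X = sp.hstack([
    sp.csr_matrix(np.ones((len(df), 1))),          # intercept
    sparse_dummies(df['gene_name']),
    sparse_dummies(df['codon']),
    sp.csr_matrix(df[['pair_prob']].to_numpy()),
], format='csr')
print(X.shape, X.nnz)
```

With ~4,800 mostly-dummy columns, a CSR matrix like this stores only a handful of nonzeros per row instead of 4,827 float64 values, so the memory savings would be on the order of 1000x for the categorical block.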
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:         ribosome_count   No. Observations:           1988834
Model:                            GLM   Df Residuals:               1984007
Model Family:        NegativeBinomial   Df Model:                      4826
Link Function:                    log   Scale:               0.825952377966
Method:                          IRLS   Log-Likelihood:         -5.7952e+06
Date:                Thu, 19 Jan 2017   Deviance:                1.1605e+06
Time:                        04:00:48   Pearson chi2:              1.64e+06
No. Iterations:                    16
Line #      Mem usage     Increment   Line Contents
================================================
    77      327.0 MiB       0.0 MiB   @profile
    78                                def nbGlm(self):
    79                                    ## define the model formula
    80      327.0 MiB       0.0 MiB       formula = 'ribosome_count ~ C(gene_name) + C(codon) + pair_prob'
    81      327.0 MiB       0.0 MiB       print("[status]\tFormula: " + str(formula), flush=True)
    82
    83                                    ## define model fitting options
    84      327.0 MiB       0.0 MiB       solver = "IRLS" # "lbfgs"
    85      327.0 MiB       0.0 MiB       tolerance = 1e-4
    86      327.0 MiB       0.0 MiB       numIter = 100
    87      327.0 MiB       0.0 MiB       print("[status]\tSolver: " + solver, flush=True)
    88      327.0 MiB       0.0 MiB       print("[status]\tConvergence tolerance: " + str(tolerance), flush=True)
    89      327.0 MiB       0.0 MiB       print("[status]\tMaxiter: " + str(numIter), flush=True)
    90
    91                                    ## fit the NegativeBinomial GLM
    92      327.0 MiB       0.0 MiB       print("[status]\tModel: smf.glm(formula, self.df, family=sm.families.NegativeBinomial(), offset=self.df['log_TPM'])", flush=True)
    93   148299.7 MiB  147972.6 MiB       mod = smf.glm(formula, self.df, family=sm.families.NegativeBinomial(), offset=self.df['log_TPM'])
    94   149961.2 MiB    1661.5 MiB       res = mod.fit(method=solver, tol=tolerance, maxiter=numIter)
    95
    96                                    ## print model output
    97   149971.1 MiB      10.0 MiB       print(res.summary())