would return an error, because the size of the result from df.groupby.apply would differ from the size of the group selected by df.loc. It would also depend on the columns and index of df, which the df.groupby.apply return values would need to match.
Here are a couple of points related to generating models in bulk that might help:
Getting two values out of the same apply is relatively simple with the right data structures. I typically use this syntax to get new columns "in bulk" after applying a function and getting multiple values assembled into a list:
def fn(a):
    return [a + 1, a + 2]

# result_type='expand' is a DataFrame.apply option, so apply over rows
df[['new1', 'new2']] = df.apply(lambda row: fn(row['a']), axis=1, result_type='expand')
I frequently use this method to get the lower CI, mean, and upper CI from a prediction. It also works for "auto expanding" model.params, for example. This requires that the relevant model be available on a row-by-row basis, or accessible via a dataframe (or similar) that is visible from the namespace in which the apply runs.
On the issue of actually training the models in bulk, one for each firm for example, I can attest that a groupby.apply(fn) works well, though I choose to parse out various data points into a Series that is automagically inserted into columns in a new dataframe, for example:
from pandas import Series
from scipy.stats import pearsonr, probplot
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tools.eval_measures import rmse
import statsmodels.api as sm

def get_group_fit(group):
    train, test = train_test_split(group, test_size=0.1)
    try:
        rlm_model = sm.RLM(train[<endog>], train[<exog>]).fit()
    except ValueError:
        return None
    rlm_y_test_results = rlm_model.predict(test[<exog>])
    rlm_test_rsquared = r2_score(test[<endog>], rlm_y_test_results)
    rlm_test_rmse = rmse(test[<endog>], rlm_y_test_results)
    # probplot returns ((osm, osr), (slope, intercept, r)); r**2 measures
    # how close the residuals are to normally distributed
    _, (slope, intercept, r) = probplot(rlm_model.wresid)
    rlm_normal_rsquared = r ** 2
    yresid_corr, ycorr_p = pearsonr(rlm_model.fittedvalues, rlm_model.wresid)
    dw = durbin_watson(rlm_model.wresid)
    return Series({'test rsquared': rlm_test_rsquared,
                   'test rmse': rlm_test_rmse,
                   'resid dist rsquared': rlm_normal_rsquared,
                   'resid-yfit corr': yresid_corr,
                   'durbin-watson': dw,
                   'model type': 'rlm',
                   'model instance': rlm_model})
which can be applied simply through df.groupby(<cols>).apply(get_group_fit). The input dataframe contains your actual data. The resulting dataframe has the grouping cols (firm name, for example) as the index and columns labeled according to the returned Series. This even works on Spark DataFrames with one additional wrapper.
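A minimal runnable sketch of that pattern (the firm names, toy data, and fit_group below are invented stand-ins for the real model-fitting function):

```python
import numpy as np
import pandas as pd

# Toy stand-in for get_group_fit: fit a least-squares line per firm and
# return a few diagnostics as a Series (the keys become columns)
def fit_group(group):
    slope, intercept = np.polyfit(group['x'], group['y'], 1)
    resid = group['y'] - (slope * group['x'] + intercept)
    return pd.Series({'slope': slope,
                      'intercept': intercept,
                      'resid std': resid.std()})

rng = np.random.default_rng(1)
df = pd.DataFrame({'firm': np.repeat(['A', 'B'], 20),
                   'x': rng.normal(size=40)})
df['y'] = np.where(df['firm'] == 'A', 1.5, -0.5) * df['x'] \
          + rng.normal(scale=0.1, size=40)

# One row per firm; firm names become the index, Series keys the columns
fits = df.groupby('firm').apply(fit_group)
```

Each group's fit lands in its own row, so per-firm diagnostics can be filtered and sorted like any other dataframe.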