How to do a predict on the top 10,000 models of an exhaustive feature selection


Arthur Hatt

Oct 15, 2019, 3:22:27 AM10/15/19
to mlxtend
Hello - many thanks for making this package available! I have run an exhaustive feature selection using linear regression, which has gone through 2 million different feature combinations. My X_train, y_train, and X_test are all dataframes. FYI, I have limited min_features and max_features to 2, so in fact the 2 million linear regressions are all of the form Y ~ X1 + X2.

My next step is to take the top 10,000 feature combinations and do a predict on each. I have code below (taken from your user guide) that is supposed to do this, but I think I may have too many models? It worked for a smaller number of models...

(i) Is there a better / more efficient way of doing this? I see that there is a built-in way of doing a predict on the best model, but what about on the top 10,000 models? Or is there another way for me to sort the results by avg_score and select the top 10,000?
(ii) My code below throws an error on the line metric_dict = efs.get_metric_dict(), which has to do with deepcopy (being called on efs.subsets_) and leads to a memory error. But efs.subsets_ is otherwise accessible.

I have done as follows:

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd

lr = LinearRegression(fit_intercept=True, normalize=True)
efs = EFS(lr, min_features=2, max_features=2,
          scoring="neg_mean_squared_error", print_progress=True, cv=5)
efs.fit(X_train, y_train)

metric_dict = efs.get_metric_dict()
df = pd.DataFrame.from_dict(metric_dict).T
df.sort_values('avg_score', inplace=True, ascending=False)

# dtype=object so each element can hold an array of test-set predictions
predictions = pd.Series(index=range(10000), name="predict", dtype=object)

for i in range(10000):
    idx = metric_dict[df.index[i]]["feature_idx"]
    X_train_i = X_train.iloc[:, np.array(idx)]
    X_test_i = X_test.iloc[:, np.array(idx)]
    lr.fit(X_train_i, y_train)
    predictions[i] = lr.predict(X_test_i)

Error message:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\ahatt\AppData\Local\Continuum\anaconda3\envs\CondaEnv\lib\site-packages\mlxtend\feature_selection\exhaustive_feature_selector.py", line 393, in get_metric_dict
    fdict = deepcopy(self.subsets_)
  File "C:\Users\ahatt\AppData\Local\Continuum\anaconda3\envs\CondaEnv\lib\copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "C:\Users\ahatt\AppData\Local\Continuum\anaconda3\envs\CondaEnv\lib\copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "C:\Users\ahatt\AppData\Local\Continuum\anaconda3\envs\CondaEnv\lib\copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "C:\Users\ahatt\AppData\Local\Continuum\anaconda3\envs\CondaEnv\lib\copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "C:\Users\ahatt\AppData\Local\Continuum\anaconda3\envs\CondaEnv\lib\copy.py", line 184, in deepcopy
    memo[d] = y
MemoryError
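One lighter-weight way to tackle question (i) would be to pull the top-n keys straight out of efs.subsets_ with heapq.nlargest, without building and sorting a full 2-million-row DataFrame. A minimal sketch, assuming each subsets_ entry carries 'avg_score' and 'feature_idx' keys; the subsets dict below is a small hypothetical stand-in for efs.subsets_:

```python
import heapq

# Hypothetical stand-in for efs.subsets_: keys map to dicts holding
# 'feature_idx' and 'avg_score', as in mlxtend's EFS results.
subsets = {
    0: {'feature_idx': (0, 1), 'avg_score': -4.0},
    1: {'feature_idx': (0, 2), 'avg_score': -1.5},
    2: {'feature_idx': (1, 2), 'avg_score': -2.5},
}

# Top-n keys by avg_score; nlargest avoids materializing a fully
# sorted copy of all entries.
n = 2
top_keys = heapq.nlargest(n, subsets, key=lambda k: subsets[k]['avg_score'])
top_feature_sets = [subsets[k]['feature_idx'] for k in top_keys]
```

The same call works on the real efs.subsets_ once fitting has finished, since it only reads the dict, never copies it.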


Sebastian Raschka

Oct 15, 2019, 8:56:04 AM10/15/19
to Arthur Hatt, mlxtend
Hi Arthur,

wow, this is a lot of models ... Just curious, how long does it take to run? Sorry to hear that you eventually ran into the memory error. I have actually never seen that before and don't know whether it is

a) your system running out of memory (it could also be a Windows-specific thing; on Linux/macOS, the OS would swap when you run out of main memory), or

b) something to do with Python's size limit for dictionaries plus the deepcopy call itself.

In any case, the first thing I would probably do after fitting is to dump the content of efs.subsets_ to a json or yaml file. In case of another crash, you can at least read from the file for the analysis so that you don't have to rerun everything.
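That dump might look like the following sketch, assuming the subsets_ entries only contain 'feature_idx', 'cv_scores', and 'avg_score' (dump_subsets is a hypothetical helper name; tuples and NumPy scalars are converted to plain lists/floats so the dict is JSON-serializable):

```python
import json

def dump_subsets(subsets, path):
    # Convert tuples and NumPy scalars to plain lists/floats so the
    # dict survives a round trip through JSON.
    serializable = {
        str(k): {
            'feature_idx': list(v['feature_idx']),
            'cv_scores': [float(s) for s in v['cv_scores']],
            'avg_score': float(v['avg_score']),
        }
        for k, v in subsets.items()
    }
    with open(path, 'w') as f:
        json.dump(serializable, f)
```

After a crash, json.load gives the same structure back (with string keys) for ranking and analysis without re-fitting anything.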

Other than that, the get_metric_dict code is pretty simple, you could try to run it directly:

fdict = deepcopy(self.subsets_)
for k in fdict:
    std_dev = np.std(self.subsets_[k]['cv_scores'])
    bound, std_err = self._calc_confidence(
        self.subsets_[k]['cv_scores'],
        confidence=confidence_interval)
    fdict[k]['ci_bound'] = bound
    fdict[k]['std_dev'] = std_dev
    fdict[k]['std_err'] = std_err

I.e., in the code above, you can remove the deepcopy call and modify self.subsets_ in place:

for k in self.subsets_:
    std_dev = np.std(self.subsets_[k]['cv_scores'])
    bound, std_err = self._calc_confidence(
        self.subsets_[k]['cv_scores'],
        confidence=confidence_interval)
    self.subsets_[k]['ci_bound'] = bound
    self.subsets_[k]['std_dev'] = std_dev
    self.subsets_[k]['std_err'] = std_err

I would then save the contents to yaml, for example:

import yaml

with open('subsets.yml', 'w') as outfile:
    yaml.dump(efs.subsets_, outfile, default_flow_style=False)

Best,
Sebastian

Arthur Hatt

Oct 15, 2019, 9:41:56 AM10/15/19
to mlxtend
Sebastian

First of all, I have just come across mlxtend after a while spent trying to do some of this manually (I am still relatively new) and just want to say the functionality here is amazing!! And thanks for your prompt response!

In fact, since my post I was looking into the code, and indeed get_metric_dict seemed straightforward-ish, so I had a go at creating a new version of it that returns the top N models. I thought this would be useful for my situation, where you run many models and want to reduce the results to just the top N, to avoid memory issues when putting them in a dataframe, sorting, etc. It works on my initial tests and I now need to try it on the full set, which takes less than 12 hours, so I can let you know shortly whether it indeed works. Hopefully useful. Do let me know if anything is obviously wrong?

def get_metric_dict_topn(self, topn, confidence_interval=0.95):
    """Return metric dictionary for the top-n feature subsets

    Parameters
    ----------
    topn : int
        Number of top-ranked feature subsets (by avg_score) to return.
    confidence_interval : float (default: 0.95)
        A positive float between 0.0 and 1.0 to compute the confidence
        interval bounds of the CV score averages.

    Returns
    ----------
    Dictionary with items where each dictionary value is a list
    with the number of iterations (number of feature subsets) as
    its length. The dictionary keys corresponding to these lists
    are as follows:
        'feature_idx': tuple of the indices of the feature subset
        'cv_scores': list with individual CV scores
        'avg_score': average of the CV scores
        'std_dev': standard deviation of the CV score average
        'std_err': standard error of the CV score average
        'ci_bound': confidence interval bound of the CV score average

    """
    self._check_fitted()

    # AH overwritten function - start
    # Ranks by avg_score and takes the top n
    avg_score = np.zeros(len(self.subsets_))
    for c in self.subsets_:
        avg_score[c] = self.subsets_[c]['avg_score']
    rank = len(avg_score) - scipy.stats.rankdata(avg_score).astype(int)
    topn_idx = np.where(rank < topn)[0]
    fdict = dict((i, self.subsets_[i]) for i in topn_idx)
    # AH overwritten function - end

    for k in fdict:
        std_dev = np.std(self.subsets_[k]['cv_scores'])
        bound, std_err = self._calc_confidence(
            self.subsets_[k]['cv_scores'],
            confidence=confidence_interval)
        fdict[k]['ci_bound'] = bound
        fdict[k]['std_dev'] = std_dev
        fdict[k]['std_err'] = std_err
    return fdict
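The ranking arithmetic in the function above can be sanity-checked in isolation. A small sketch with a hypothetical subsets dict keyed from 0 (as EFS produces): scipy.stats.rankdata ranks ascending, so subtracting the ranks from the length maps the highest avg_score to rank 0.

```python
import numpy as np
import scipy.stats

# Hypothetical subsets_-style dict keyed 0..n-1.
subsets = {
    0: {'avg_score': -3.0},
    1: {'avg_score': -1.0},  # best (highest) score
    2: {'avg_score': -2.0},
}

topn = 2
avg_score = np.zeros(len(subsets))
for c in subsets:
    avg_score[c] = subsets[c]['avg_score']

# rankdata gives 1 for the lowest score; len - rank gives 0 for the best.
rank = len(avg_score) - scipy.stats.rankdata(avg_score).astype(int)
topn_idx = np.where(rank < topn)[0]
```

With these scores, topn_idx picks out the entries with the two highest avg_score values, which is what the method relies on.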

Sebastian Raschka

Oct 15, 2019, 11:53:48 AM10/15/19
to Arthur Hatt, mlxtend
Hi Arthur,

from a quick glance at the code, I can't see anything obviously wrong here. But take this with a grain of salt :).

In general, I think a topn parameter would be a nice addition to the current method in the main package (defaulting to topn=None) for both the SequentialFeatureSelector and ExhaustiveFeatureSelector.

So, if you ever have time for a PR, it would be very welcome!

Best,
Sebastian


Arthur Hatt

Oct 16, 2019, 7:27:53 AM10/16/19
to mlxtend
Hi Sebastian

More than happy to look into a PR. I have not done this before, but I notice you have instructions on your site. I just need to finish my current project and then I will look into it. One thing I think I noticed is that .subsets_ for EFS starts at key 0, whereas SFS starts at 1. Is this correct and intended? It matters for when I rank and select the top n, to make sure it will work consistently.
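If the key start really does differ between the two selectors, a key-agnostic variant would sidestep the issue entirely: sort the actual dict keys by avg_score instead of indexing a zero-based array. A sketch (topn_keys is a hypothetical helper, not part of mlxtend):

```python
def topn_keys(subsets, topn):
    # Works whether subsets_ keys start at 0 (EFS), 1 (SFS),
    # or are non-contiguous: only the keys present are ranked.
    return sorted(subsets, key=lambda k: subsets[k]['avg_score'],
                  reverse=True)[:topn]
```

A full sort is O(m log m) over all m subsets, which is still cheap next to fitting the models themselves.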

Thanks

Arthur

Sebastian Raschka

Oct 20, 2019, 8:57:14 PM10/20/19
to Arthur Hatt, mlxtend
Hi Arthur,

I just realized that I completely forgot to follow up on this ... Let me open an issue about this on GitHub for further discussion: https://github.com/rasbt/mlxtend/issues/610

Thanks for noting the inconsistency as well; this should ideally be fixed, too. I don't have a strong preference for one over the other. I would say starting at 0 should be preferred because this is what a typical Python user would expect. On the other hand, SFS is much more popular and widely used, and changing its default behavior may be confusing. Let's continue the discussion on the GitHub issue.

Thanks,
Sebastian
