Is it possible for an imbalanced dataset to benefit from the results of the same dataset after balancing?

Muneera

Feb 10, 2023, 5:14:59 AM
to python-weka-wrapper
Hello everyone,

I am new to data science & machine learning.

I am using the Weka platform to work on a classification problem with an imbalanced dataset. I want to apply the following:

1) a cross-validation technique to the imbalanced dataset.

2) a feature selection method on the same dataset, but after balancing it using an oversampling technique.


Is it correct to do the above using the following scenario?

Assume that I had an imbalanced dataset with 5 features: a, b, c, d, and e. I balanced the dataset using an oversampling technique, then applied a feature selection method to the entire balanced dataset and got three selected features: a, b, and c. After that, I went back to the original imbalanced dataset and removed features d and e. Then I completed my procedures on the imbalanced dataset with features a, b, and c (using FilteredClassifier and MultiFilter to apply cross-validation + oversampling + a classifier).
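
In python-weka-wrapper3 terms, this is roughly what I mean by the first part (just a sketch: the dataset path, the supervised Resample settings, and the CfsSubsetEval/BestFirst pair are placeholders for my actual setup):

import weka.core.jvm as jvm
from weka.core.converters import load_any_file
from weka.attribute_selection import ASEvaluation, ASSearch, AttributeSelection
from weka.filters import Filter

jvm.start()

# load the original, imbalanced dataset (placeholder path)
data = load_any_file("imbalanced.arff")
data.class_is_last()

# oversample: supervised Resample, biased towards a uniform class distribution
oversample = Filter(classname="weka.filters.supervised.instance.Resample",
                    options=["-B", "1.0", "-Z", "100.0"])
oversample.inputformat(data)
balanced = oversample.filter(data)

# feature selection on the balanced copy (evaluator/search are just examples)
attsel = AttributeSelection()
attsel.evaluator(ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval"))
attsel.search(ASSearch(classname="weka.attributeSelection.BestFirst"))
attsel.select_attributes(balanced)
print("selected attribute indices (0-based):", attsel.selected_attributes)

# back on the ORIGINAL data: keep only a, b, c plus the class (1-based indices)
keep = Filter(classname="weka.filters.unsupervised.attribute.Remove",
              options=["-V", "-R", "1,2,3,last"])
keep.inputformat(data)
reduced = keep.filter(data)

jvm.stop()

After this, 'reduced' would be the original imbalanced dataset restricted to a, b, c, and the class attribute.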

Peter Reutemann

Feb 12, 2023, 6:48:16 PM
to python-we...@googlegroups.com
> I am new to data science & machine learning.
>
> I am using the Weka platform to work on a classification problem with an imbalanced dataset. I want to apply the following:
>
> 1) a cross-validation technique to the imbalanced dataset.

NB: k-fold cross-validation will generate k models (each one is used to
collect statistics and then thrown away again), not a single one.
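
For illustration, in python-weka-wrapper3 (J48 and the dataset path are arbitrary stand-ins):

import weka.core.jvm as jvm
from weka.core.classes import Random
from weka.core.converters import load_any_file
from weka.classifiers import Classifier, Evaluation

jvm.start()

data = load_any_file("imbalanced.arff")  # placeholder path
data.class_is_last()

cls = Classifier(classname="weka.classifiers.trees.J48")
evl = Evaluation(data)
# builds and discards 10 internal models; only the pooled statistics remain
evl.crossvalidate_model(cls, data, 10, Random(1))
print(evl.summary())
print(evl.matrix())  # confusion matrix, worth checking with imbalanced data

# 'cls' itself was never trained; build the final model on the full dataset
cls.build_classifier(data)

jvm.stop()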

> 2) a feature selection method on the same dataset, but after balancing it using an oversampling technique.
>
>
> Is it correct to do the above using the following scenario?
>
> Assume that I had an imbalanced dataset with 5 features: a, b, c, d, and e. I balanced the dataset using an oversampling technique, then applied a feature selection method to the entire balanced dataset and got three selected features: a, b, and c. After that, I went back to the original imbalanced dataset and removed features d and e. Then I completed my procedures on the imbalanced dataset with features a, b, and c (using FilteredClassifier and MultiFilter to apply cross-validation + oversampling + a classifier).

You'd have to experiment to see whether it has any impact on your model's
performance: compare the results of attribute selection with oversampling
and without it.
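
A minimal sketch of that comparison, assuming the supervised Resample filter as the oversampler and J48 as a stand-in classifier (swap in whatever you actually use):

import weka.core.jvm as jvm
from weka.core.classes import Random
from weka.core.converters import load_any_file
from weka.classifiers import Classifier, Evaluation, FilteredClassifier
from weka.filters import Filter, MultiFilter

jvm.start()

data = load_any_file("imbalanced_abc.arff")  # reduced to a, b, c; placeholder
data.class_is_last()

def cv_auc(classifier, data):
    """10-fold CV, returns the weighted area under the ROC curve."""
    evl = Evaluation(data)
    evl.crossvalidate_model(classifier, data, 10, Random(1))
    return evl.weighted_area_under_roc

# baseline: plain classifier, no oversampling
base = Classifier(classname="weka.classifiers.trees.J48")

# oversampling is applied inside each training fold only, never to the test fold
multi = MultiFilter()
multi.filters = [Filter(classname="weka.filters.supervised.instance.Resample",
                        options=["-B", "1.0", "-Z", "100.0"])]
fc = FilteredClassifier()
fc.filter = multi
fc.classifier = Classifier(classname="weka.classifiers.trees.J48")

print("without oversampling:", cv_auc(base, data))
print("with oversampling:   ", cv_auc(fc, data))

jvm.stop()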

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, Hamilton, NZ
Mobile +64 22 190 2375
https://www.cs.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/