The topic of imbalanced classification has attracted wide attention from researchers during the last several years [1,2,3,4,5]. It occurs when the classes represented in a problem show a skewed distribution, i.e., there is a minority (or positive) class and a majority (or negative) one. This situation may be due to the rarity of occurrence of a given concept, or to restrictions during the gathering of data for a particular class. In this sense, class imbalance is ubiquitous and prevalent in several applications, such as microarray research [6], medical diagnosis [7], oil-bearing reservoir recognition [8], or intrusion detection systems [9].
The issue of class imbalance can be further confounded by different misclassification costs, i.e., false negatives can be more costly than false positives, as in some of the problems presented above, such as medical diagnosis. In these cases, the problem is that a bias towards the majority class arises during the learning process, which seeks both accuracy and generalization. A direct consequence is that minority classes cannot be well modeled, and the final performance decays.
To successfully address the task of imbalanced classification, a number of different solutions have been proposed, which mainly fall into three categories [1,2,3,4]. The first is the family of pre-processing techniques that aim to rebalance the training data [10]. The second comprises algorithmic approaches that alter the learning mechanism to take the skewed class distribution into account [11]. The third consists of cost-sensitive learning approaches that assign a different cost to the misclassification of each class [12, 13].
The emergence of Big Data brings new problems and challenges to the class imbalance problem. Big Data poses the challenges of Volume, Velocity, and Variety [14, 15]. In addition, other attributes have been linked to Big Data, such as Veracity or Value, among others [16]. The scalability issue must be properly addressed to develop new solutions, or adapt existing ones, for Big Data case studies [17, 18]. Spark [19] has emerged as a popular choice for implementing large-scale Machine Learning applications on Big Data.
In this paper, we focus on learning from imbalanced data problems in the context of Big Data, especially when faced with the challenge of Volume. We will analyze the strengths and weaknesses of various MapReduce [20]-based implementations of algorithms that address imbalanced data. We have implemented a Spark library for undersampling and oversampling based on MapReduce. This experimental study will show that some of the inner characteristics of imbalanced data are accentuated in this framework. In particular, we will stress the influence of the lack of data and of the small disjuncts that result from the data partitioning in this type of process. Finally, we will enumerate several guidelines that can allow researchers to develop high-quality solutions in this area of research. Specifically, we will take into account how data resampling techniques should be designed for imbalanced Big Data classification.
The task of classification in imbalanced domains arises when the elements of a dataset are unevenly distributed among the classes [3, 5]. As a result, the majority class(es) overwhelm the data mining algorithms, skewing their performance towards them. Most algorithms simply compute accuracy as the percentage of correctly classified observations. However, in the case of skewed distributions, this measure is highly deceiving, since minority classes have minimal effect on overall accuracy. Consequently, good performance on the majority class can mask poor performance on the minority class. For example, if a given dataset has a class distribution of 98:2, as is fairly common in various real-world scenarios (a typical example is medical diagnosis [21, 22]), then one can be 98% accurate simply by predicting every example as the majority class. This prediction performance is misleading, however, as the minority class (the class of interest) is missed entirely.
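To make this concrete, the following minimal Python sketch (the data sizes are illustrative, chosen to match the 98:2 example above) shows how a trivial majority-class predictor attains 98% accuracy while never detecting the minority class:

```python
# A minimal sketch of the accuracy paradox on a 98:2 class distribution.
y_true = [0] * 98 + [1] * 2   # 0 = majority class, 1 = minority class
y_pred = [0] * 100            # trivial model: always predict the majority

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall on the minority class: fraction of true positives recovered.
minority_recall = sum(
    p == 1 for t, p in zip(y_true, y_pred) if t == 1
) / sum(t == 1 for t in y_true)

print(accuracy)         # 0.98 -- looks excellent
print(minority_recall)  # 0.0  -- the class of interest is never detected
```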
Specific methods must be applied so that traditional classifiers are able to deal with the imbalance between classes. Three methodologies are traditionally followed to cope with this problem [3, 13]: data-level solutions that rebalance the training set [10], algorithmic-level solutions that adapt the learning stage towards the minority classes [11], and cost-sensitive solutions that assign different costs with respect to the class distribution [12]. It is also possible to combine several classifiers into ensembles [26], by modifying or adapting the ensemble learning algorithm itself and embedding any of the techniques described above, namely data-level methods or algorithmic approaches based on cost-sensitive learning [27].
Among these methodologies, data-level solutions are more versatile, since their use is independent of the classifier selected. Three possible schemes can be applied here: undersampling of the majority class examples, oversampling of the minority class examples, and hybrid techniques [10].
The simplest approach, random undersampling, removes instances from the majority class, usually until the class distribution is completely balanced. However, this may imply discarding significant examples from the training data. On the other hand, random oversampling makes exact copies of existing minority instances. The drawback here is that this method can increase the likelihood of overfitting [10], as it tends to strengthen all minority clusters regardless of their actual contribution to the problem itself.
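As an illustration, here is a minimal Python sketch of both random resampling schemes; the function names and the fixed seed are our own choices, not taken from any particular library:

```python
import random

def random_undersample(majority, minority, seed=42):
    """Drop majority examples at random until both classes are balanced."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

def random_oversample(majority, minority, seed=42):
    """Replicate minority examples at random until both classes are balanced."""
    rng = random.Random(seed)
    copies = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + copies

majority = [(x, 0) for x in range(98)]  # 98 majority examples, label 0
minority = [(x, 1) for x in range(2)]   # 2 minority examples, label 1
print(len(random_undersample(majority, minority)))  # 4 examples, 2 per class
print(len(random_oversample(majority, minority)))   # 196 examples, 98 per class
```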
More sophisticated methods have been proposed based on the generation of synthetic samples. The basis of this type of procedure is the Synthetic Minority Oversampling Technique (SMOTE) [28]. Its core idea is to form new minority class examples by interpolating between minority class examples that lie close together. On the one hand, SMOTE expands the clusters of minority data, thus strengthening the borderline areas between classes. On the other hand, it can lead to over-generalization. To balance this trade-off between over-generalization and increasing the minority class representation, various adaptive sampling methods have been proposed in recent years [29,30,31].
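The interpolation at the heart of SMOTE can be sketched in a few lines of Python. This is a simplified version of the procedure in [28]; the brute-force neighbor search and the parameter defaults are simplifications of our own:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    X_min: (n, d) array of minority-class examples. For each synthetic
    point, pick a minority example x, pick one of its k nearest minority
    neighbors z, and return x + u * (z - x) for a uniform u in [0, 1].
    """
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        x = X_min[rng.integers(len(X_min))]
        # Brute-force k nearest neighbors within the minority class.
        dists = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # skip x itself
        z = X_min[rng.choice(neighbors)]
        synthetic.append(x + rng.uniform() * (z - x))
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(20, 2))  # toy minority data
print(smote_sample(X_min).shape)  # (10, 2)
```

Because each synthetic point lies on a segment between two minority examples, the generated data fill in the minority clusters rather than merely duplicating existing points.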
Although these techniques are mainly designed to rebalance the training data prior to the learning stage, there are additional data characteristics that can degrade performance in this scenario. In particular, in both [3] and [32] the authors emphasize several factors that, in combination with the uneven distribution, impose harder learning constraints. Among them, they stress the overlap between classes [33, 34], the presence of small disjuncts and noisy data [35, 36], and the lack of data for training [37, 38].
To provide robust and scalable solutions, new research paradigms and development tools have been made available. These frameworks have been designed to ease both the storage and the processing of Big Data problems [14].
The MapReduce execution environment [20] is the most common framework used in this scenario. Since the original implementation is a proprietary tool, its open-source counterpart, known as Hadoop, has traditionally been used in academic research [39]. It has been designed to allow distributed computations in a way that is transparent to the programmer, while also providing a fault-tolerant execution scheme. To take advantage of this scheme, any algorithm must be divided into two main stages: Map and Reduce. The first is devoted to splitting the data for processing, whereas the second collects and aggregates the results.
Additionally, the MapReduce model is defined with respect to an essential data structure: the \(\langle key, value \rangle\) pair. The processed data, as well as the intermediate and final results, are all expressed in terms of such pairs. To summarize the procedure, Fig. 1 illustrates a typical MapReduce program with its Map and Reduce steps. The terms \(k_i:v_j\) refer to the key and value pairs that are computed within each Map process. Then, values are grouped by key, i.e., \(k_i:v_j,\ldots , v_h\), and fed to the same Reduce process. Finally, the values are aggregated with any function within the Reduce process to obtain the final result of the algorithm.
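A minimal, framework-free Python sketch of this key/value flow (a hypothetical toy job that counts examples per class label) may help fix the idea:

```python
from collections import defaultdict

# Toy input: (feature, class label) records spread across two data splits.
splits = [[("a", 0), ("b", 1)], [("c", 0), ("d", 0)]]

# Map: emit one (key, value) pair per record -- here, (label, 1).
mapped = [(label, 1) for split in splits for _, label in split]

# Shuffle: group the values by key, i.e., k_i : v_j, ..., v_h.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate the values of each key (here, by summing).
result = {key: sum(values) for key, values in groups.items()}
print(result)  # {0: 3, 1: 1} -- the class distribution of the dataset
```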
From this initial model, more recent alternative frameworks have arisen to provide more efficient computation for iterative processes, which are the basis of most Machine Learning algorithms. Among them, Apache Spark [19, 40] is clearly emerging as the most commonly embraced platform for implementing Machine Learning solutions that scale to Big Data.
In this section, we discuss the current state of the art on the topic of imbalanced classification for Big Data. This includes initial approaches for addressing the problem that make use of the methods and solutions mentioned in the previous section. In Table 2 we show the list of selected proposals, organized in a taxonomy according to the type of methodology applied to handle the imbalanced data distribution, or whether they constitute an application paper.
Among the different solutions to address imbalanced classification in Big Data, data pre-processing is possibly the one that has attracted the most attention from researchers. Accordingly, we find several approaches that aim to adapt the standard undersampling and oversampling techniques directly to the MapReduce framework. In this sense, random undersampling, random oversampling, and SMOTE are the most widely used algorithms, being applied within each Map process in pursuit of scalability, as in the sketch below. Furthermore, some ad hoc approaches based on undersampling and oversampling have also been developed, including evolutionary undersampling, a rough set-based SMOTE, and an ensemble algorithm. Finally, we will point out a first approach for the multi-class case study.
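As a rough illustration of this map-side scheme, the following PySpark sketch applies random oversampling independently within each data partition. This is our own simplified sketch, not the library released with this paper; the binary labels and the balancing heuristic are assumptions:

```python
from pyspark import SparkContext
import random

def oversample_partition(records):
    """Randomly oversample the minority class within a single partition."""
    data = list(records)
    pos = [r for r in data if r[1] == 1]  # assumed minority label: 1
    neg = [r for r in data if r[1] == 0]  # assumed majority label: 0
    if pos and len(pos) < len(neg):
        data.extend(random.choice(pos) for _ in range(len(neg) - len(pos)))
    return iter(data)

sc = SparkContext(appName="MapSideROS")
rdd = sc.parallelize(
    [(i, 0) for i in range(980)] + [(i, 1) for i in range(20)], numSlices=4
)
balanced = rdd.mapPartitions(oversample_partition)
# Per-class counts after map-side balancing.
print(balanced.map(lambda r: (r[1], 1)).reduceByKey(lambda a, b: a + b).collect())
```

Note that, because each partition is balanced in isolation, a given Map task may see few or no minority examples; this is precisely the lack-of-data and small-disjuncts effect that, as discussed above, is accentuated by data partitioning.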