1. INTRODUCTION
Classification learning is an important research direction in machine learning. In real production and inspection settings, however, datasets are often imbalanced. A classifier built on an imbalanced dataset biases its predictions toward the majority class, while the minority class samples are frequently the objects of greatest interest, so reducing or eliminating this imbalance is of practical significance. In current research, the main data-level approaches to the imbalance problem are oversampling the minority class and undersampling the majority class. The most widely used oversampling method is the SMOTE algorithm proposed by Chawla et al.1, which effectively oversamples the minority class to balance the samples, but its neighbor selection is somewhat blind. Han et al.2 proposed Borderline-SMOTE, an improved SMOTE variant that synthesizes new samples only from minority class samples on the class boundary, alleviating the high sample repetition that SMOTE can produce; however, boundary samples can be difficult to identify. Bao et al.3 proposed two new SMOTE variants, CP-SMOTE and IO-SMOTE. CP-SMOTE generates new samples by clustering to obtain center points and linearly combining minority class samples with those centers; IO-SMOTE divides samples into internal and external samples so that more internal samples are used when generating new ones. Both algorithms keep synthetic samples away from the classification boundary and achieve better classification performance. On the undersampling side, He et al.4 proposed a clustering-based weighted boundary point ensemble undersampling algorithm, which improves both execution efficiency and classification accuracy.
However, within certain class-ratio ranges it loses a large amount of the original data's distribution information. Zhou et al.5 proposed a distance-weighted undersampling algorithm based on adaptive k-means clustering: majority class samples are clustered with k-means, outliers are eliminated, and the data are sorted and sampled in order, effectively mitigating the impact of imbalanced data on classification accuracy; the algorithm is, however, of limited use for multi-class problems. Wang et al.6 proposed CUS-WRF, a clustering-undersampling weighted random forest algorithm that combines clustering-based undersampling on the data side with a weighted random forest on the algorithm side to obtain better classification results, though its time complexity and boundary sample recognition still require further study. Cui et al.7 proposed an adaptive undersampling method based on density peak clustering that accounts for overlapping regions, noise, inter-class and intra-class imbalance, and sample sparsity, improving classification accuracy for the minority class; this method is, however, unsuitable for multi-class imbalance problems. This study proposes SMOTECU (Synthetic Minority Over-sampling Technique with Cluster Undersampling), an algorithm combining undersampling and oversampling. First, ClusterCentroids undersampling is applied to the majority class with the average class size as the target, reducing the number of majority samples while retaining their feature information. Then, SMOTE oversampling is applied to the minority class; because fewer synthetic minority samples are required, the model's computational complexity and noise interference are reduced, and a balanced dataset is finally obtained.
2. ALGORITHM INTRODUCTION
2.1 ClusterCentroids algorithm introduction
The ClusterCentroids algorithm is an undersampling method that condenses the majority class by partitioning it into K clusters with the k-means++ algorithm and replacing the samples with the K cluster centroids, thereby shrinking the majority class to K samples. ClusterCentroids reduces the number of samples very efficiently, but when the data imbalance rate is high the number of cluster centers becomes too small, critical information is easily lost, and overfitting can result.
2.2 SMOTE algorithm introduction
SMOTE is a common sampling method that addresses sample imbalance by creating synthetic minority class samples: it interpolates between a minority sample and its nearest neighbors to increase the minority count and balance the class sizes. When the imbalance ratio is large, SMOTE may synthesize too much data, resulting in high computational cost, low information gain, and noise amplification, because the newly generated data can overlap heavily with the existing data.
2.3 SMOTECU algorithm
To overcome the limitations of these two algorithms, this study combines them in the SMOTECU algorithm: first, ClusterCentroids undersamples the majority class, replacing the original data with the cluster centroids obtained after clustering, which reduces the sample count while preserving the feature information of the sample set; then, SMOTE oversamples the minority class by synthesizing new samples between neighboring ones, increasing its count. Sample quantity balance is thus achieved, as shown in Figure 1. Algorithm steps:
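A minimal numpy sketch of this two-step procedure for the two-class case (labels 0 = majority, 1 = minority). The function and parameter names are ours, not from the paper, and plain Lloyd k-means iterations stand in for k-means++; in practice the `imbalanced-learn` implementations of ClusterCentroids and SMOTE could be chained instead.

```python
import numpy as np

def smotecu(X, y, k_neighbors=5, n_iter=20, seed=0):
    """Balance both classes to the average class size (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_maj, X_min = X[y == 0], X[y == 1]
    target = (len(X_maj) + len(X_min)) // 2           # average class size
    rng = np.random.default_rng(seed)

    # Step 1 -- ClusterCentroids: replace the majority class with `target`
    # k-means centroids.
    centroids = X_maj[rng.choice(len(X_maj), target, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X_maj[:, None] - centroids[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(target):
            members = X_maj[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)

    # Step 2 -- SMOTE: synthesize target - |minority| points by interpolating
    # between a random minority sample and one of its k nearest neighbors.
    n_new = target - len(X_min)
    d = np.linalg.norm(X_min[:, None] - X_min[None], axis=2)
    nn = d.argsort(axis=1)[:, 1:k_neighbors + 1]      # drop self (column 0)
    base = rng.integers(0, len(X_min), n_new)
    nbr = nn[base, rng.integers(0, k_neighbors, n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[nbr] - X_min[base])

    X_bal = np.vstack([centroids, X_min, synthetic])
    y_bal = np.array([0] * target + [1] * target)
    return X_bal, y_bal
```

Both classes end up with exactly `target` samples, so the returned dataset is balanced.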
The algorithm combines the ideas of ClusterCentroids undersampling and SMOTE oversampling, exploiting the strengths of both to efficiently decrease or increase the sample counts. It uses the average of the majority and minority class sizes as the target count, avoiding the generation or elimination of too many samples. Compared with ClusterCentroids undersampling alone, SMOTECU sets more cluster centers, retaining more features and reducing the risk of overfitting. Compared with SMOTE oversampling alone, SMOTECU reduces the number of samples the minority class must synthesize, shortens the model's computation time, and avoids overly dense minority class sample points, thereby reducing the risk of generating meaningless or noisy data.
3. EXPERIMENTAL RESULTS AND ANALYSIS
3.1 Dataset introduction
To verify the effectiveness of the SMOTECU algorithm, this study collected 16 standard datasets with imbalanced samples. Table 1 lists the information and the imbalance rate (ubrate) of these datasets.
Table 1. Imbalanced dataset information
3.2 Performance comparison of different algorithms
For imbalanced datasets, classification results tend to be biased toward the majority class, so relying on accuracy alone to evaluate classification performance is one-sided and cannot properly measure the generalization ability of the classification model. In this study, two standard metrics for classification problems, AUC and F1-score, are used to evaluate classifier performance. The AUC8 and F1-score values are computed from the confusion matrix of the classification results; the closer the values are to 1, the better the classification performance.
Table 2. Confusion matrix
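From the four confusion-matrix entries (TP, FP, FN, TN), the two metrics can be computed as below. This is a sketch with our own function names; note that the AUC here is the single-threshold variant obtainable from one confusion matrix, which equals (TPR + TNR) / 2.

```python
def f1_from_cm(tp, fp, fn, tn):
    """F1-score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc_from_cm(tp, fp, fn, tn):
    """Single-threshold AUC: area under the two-segment ROC curve
    through the point (FPR, TPR), i.e. (TPR + TNR) / 2."""
    tpr = tp / (tp + fn)          # true positive rate (recall)
    tnr = tn / (tn + fp)          # true negative rate
    return (tpr + tnr) / 2
```

Both values lie in [0, 1], and values closer to 1 indicate better classification performance, matching the evaluation rule stated above.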
This study then uses Random Forest, an RBF neural network (RBFNN), and an RBF-kernel support vector machine (RBFSVM) to compare the classification performance on the 16 datasets processed by the SMOTE, ClusterCentroids, and SMOTECU algorithms. The data are split into training and test sets at a 7:3 ratio, the experiment is repeated ten times, and the average F1-score and AUC values are reported in Table 3.
Table 3. Comparison of F1-score and AUC values for partial datasets
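The evaluation protocol (ten random 7:3 splits, scores averaged) can be sketched as follows. The classifier is passed in as a callable so any of the three models can be plugged in; the names and the accuracy placeholder metric are our illustrative choices, not the paper's code.

```python
import numpy as np

def repeated_holdout(X, y, fit_predict, n_repeats=10, test_frac=0.3, seed=0):
    """Average a score over repeated random train/test splits (7:3 by default)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(n * test_frac))
    scores = []
    for _ in range(n_repeats):
        idx = rng.permutation(n)                      # fresh random split
        test, train = idx[:n_test], idx[n_test:]
        y_pred = fit_predict(X[train], y[train], X[test])
        scores.append((y_pred == y[test]).mean())     # swap in F1 / AUC here
    return float(np.mean(scores))
```

In the paper's setting, `fit_predict` would wrap Random Forest, RBFNN, or RBFSVM, and the appended score would be the F1-score or AUC rather than plain accuracy.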
According to the test results in Table 3, Random Forest achieves the best classification performance, followed by the RBF Neural Network, with RBFSVM performing worst. Regarding runtime, the RBF Neural Network takes much longer than Random Forest and RBFSVM. On the car_eval_34 and scene datasets, the SMOTECU-based models outperform those based on SMOTE and ClusterCentroids; on the remaining datasets, the SMOTECU-based results are slightly worse than SMOTE's. With the random forest classifier, SMOTECU reduces computational complexity while maintaining high classification performance compared with SMOTE oversampling, and it effectively avoids the overfitting seen with ClusterCentroids (100% accuracy with under 2 seconds of runtime). Moreover, the SMOTECU-based RBF Neural Network model significantly reduces runtime while maintaining high accuracy.
3.3 Analysis of dataset feature space
To further investigate the applicability of the SMOTECU algorithm, two dimensionality reduction algorithms, t-SNE9 and UMAP10, are used to reduce the high-dimensional feature space of the 16 datasets to two dimensions, and the classification performance is then compared. Dimensionality reduction plots for some datasets are shown in Figure 2. Comparing the reduced feature spaces of the tested datasets shows that SMOTECU is good at handling datasets whose sample points are more dispersed in the t-SNE map and whose minority and majority classes form separate, highly distinguishable clusters in the UMAP map.
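The 2-D projection used for this inspection can be sketched with scikit-learn's t-SNE; the function name is ours. The commented line shows the analogous UMAP call, which requires the separate `umap-learn` package.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(X, seed=0):
    """Project a high-dimensional feature space to 2-D with t-SNE
    for visual inspection of class separability."""
    tsne = TSNE(n_components=2, init="pca", random_state=seed,
                perplexity=min(30, len(X) - 1))   # perplexity must be < n_samples
    return tsne.fit_transform(X)
    # UMAP alternative (needs the umap-learn package):
    # import umap; return umap.UMAP(n_components=2, random_state=seed).fit_transform(X)
```

Plotting the returned coordinates colored by class label reproduces the kind of dispersion/clustering comparison described above.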
In contrast, the classification performance based on SMOTECU is significantly worse than that on the SMOTE-oversampled data for datasets whose sample points are linearly distributed in the t-SNE map and whose differently labeled sample points are mixed and poorly separated in the UMAP map.
4. CONCLUSION
This study addresses the problem of imbalanced data classification and proposes the SMOTECU algorithm from the data perspective. The algorithm combines oversampling and undersampling to balance the data. First, the majority class samples are clustered with the k-means++ method and replaced by the cluster centroids, reducing the sample count while retaining the main features of the data. Then, SMOTE oversampling is applied to the minority samples; generating fewer sample points mitigates the excessive sample generation and noise amplification of traditional SMOTE oversampling. By adjusting the minority and majority class sizes simultaneously, the algorithm makes the dataset structure more reasonable and effectively reduces the risk of overfitting. Tests on standard datasets show good classification performance on highly discriminable data with the random forest model, as well as runtime savings for the RBF neural network. However, the advantages of the SMOTECU algorithm are insignificant on imbalanced datasets with low discriminability, and further research is needed in this regard.
REFERENCES
1. Chawla, N. V., Bowyer, K. W., Hall, L. O., et al., "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, 16, 321–357 (2002). https://doi.org/10.1613/jair.953
2. Han, H., Wang, W. Y., Mao, B. H., "Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning," in Advances in Intelligent Computing: International Conference on Intelligent Computing (ICIC), Proceedings, Part I, 878–887 (2005).
3. Bao, Y., Yang, S. B., "Two Novel SMOTE Methods for Solving Imbalanced Classification Problems," IEEE Access, 11, 5816–5823 (2023). https://doi.org/10.1109/ACCESS.2023.3236794
4. He, Y., Leng, X., Wan, J., "Unbalanced data weighted boundary point integration undersampling method," Journal of Xidian University, 48(4), 176 (2021).
5. Zhou, Q., Yao, Z., Sun, B., "Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering," Data Analysis and Knowledge Discovery, 6(5), 127–136 (2022).
6. Wang, L., Liu, Y., Liu, Z., et al., "Clustering under-sampling weighted random forest algorithm for processing unbalanced data," Application Research of Computers, 38(5), 1398–1402 (2021).
7. Cui, C., Cao, F., Liang, J., "Adaptive Undersampling Based on Density Peak Clustering," Pattern Recognition and Artificial Intelligence, 33(9), 811–819 (2020).
8. Shen, Z., Hua, X., Jinhai, C., "Resampling Algorithms for Imbalanced Data," Journal of Small and Microcomputer Systems, 1–8 (2023).
9. Wei, V., Ivkin, N., Braverman, V., et al., "Sketch and Scale: Geo-distributed tSNE and UMAP," in IEEE International Conference on Big Data, 996–1003 (2020).
10. Sainburg, T., McInnes, L., Gentner, T. Q., "Parametric UMAP Embeddings for Representation and Semisupervised Learning," Neural Computation, 33(11), 2881–2907 (2021).