Improvement in the classification of data with class imbalance through a redistribution of classes by k-means

Keywords: Class imbalance, Statistical significance tests, Machine learning

Abstract

In machine learning, several challenges degrade the performance of classification algorithms, among them the curse of dimensionality and class imbalance. The curse of dimensionality arises when the number of features (p) in a dataset is much larger than the number of available samples (n). Class imbalance occurs when one or more classes in a dataset are represented by significantly fewer samples than the other classes. This lowers the performance of a classifier because its predictions become biased towards the majority class. Microarray data are widely used to analyze and understand gene expression on a global level. They provide information about the expression of thousands of genes simultaneously and can be used to classify different conditions or diseases. Such data exhibit both the curse of dimensionality and class imbalance.

This work presents a method that divides the majority class of a microarray dataset into two or more classes by means of the k-means clustering algorithm. Classification is then performed with a variety of state-of-the-art classification algorithms. Measured by balanced accuracy under 5-fold cross-validation, the reported results show that the proposed method outperforms the same algorithms applied to the original class distribution. A Mann-Whitney statistical test determined that the proposed method obtains significantly better results than the original algorithms.
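The idea of redistributing the majority class can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the data are a synthetic one-dimensional toy set rather than microarray data, a nearest-centroid rule stands in for the state-of-the-art classifiers, a single train/test split replaces the 5-fold cross-validation, the value k = 2 is chosen by hand, and the Mann-Whitney test is not reproduced. All function names (`kmeans`, `split_majority`, `fit_centroids`, `predict`, `balanced_accuracy`, `make_data`) are hypothetical.

```python
import random
from collections import Counter, defaultdict

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centers = random.Random(seed).sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the previous center if a cluster empties
                centers[c] = mean(members)
    return assign

def split_majority(X, y, k):
    """Relabel the majority class as k cluster-based sub-classes."""
    majority = Counter(y).most_common(1)[0][0]
    idx = [i for i, label in enumerate(y) if label == majority]
    clusters = kmeans([X[i] for i in idx], k)
    new_y = [(label, 0) for label in y]  # other classes keep a single sub-class
    for i, c in zip(idx, clusters):
        new_y[i] = (majority, c)
    return new_y

def fit_centroids(X, y):
    """Stand-in classifier (assumption): one centroid per class label."""
    groups = defaultdict(list)
    for p, label in zip(X, y):
        groups[label].append(p)
    return {label: mean(pts) for label, pts in groups.items()}

def predict(centroids, X):
    """Assign each point the label of its nearest centroid."""
    return [min(centroids, key=lambda c: dist2(p, centroids[c])) for p in X]

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for cls in sorted(set(y_true)):
        hits = [p == t for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(hits) / len(hits))
    return sum(recalls) / len(recalls)

def make_data(shift):
    """Toy data (assumption): a bimodal majority class around a small minority class."""
    left = [(-3 + 0.1 * i + shift,) for i in range(-5, 6)]
    right = [(3 + 0.1 * i + shift,) for i in range(-5, 6)]
    minority = [(0.5 + 0.1 * i + shift,) for i in range(-2, 3)]
    return left + right + minority, ["majority"] * 22 + ["minority"] * 5

X_train, y_train = make_data(0.0)
X_test, y_test = make_data(0.05)

# Baseline: train on the original, imbalanced labels.
baseline_pred = predict(fit_centroids(X_train, y_train), X_test)
baseline = balanced_accuracy(y_test, baseline_pred)

# Redistribution: split the majority class, train on sub-classes, map back.
sub_labels = split_majority(X_train, y_train, k=2)
sub_pred = predict(fit_centroids(X_train, sub_labels), X_test)
proposed = balanced_accuracy(y_test, [label for label, _ in sub_pred])

print("baseline balanced accuracy:", baseline)
print("after majority-class split:", proposed)
```

In this toy setting the single majority centroid falls between the two majority modes, close to the minority class, so half of the majority samples are misclassified. Once the majority class is split, each mode gets its own centroid and predictions are mapped back to the original label before scoring. Whether a comparable gain appears on real microarray data depends on the classifier and on the choice of k.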

Published
2024-07-05
How to Cite
Alarcón Paredes, A., Alegre-Ventura, R. J., & Alonso Silverio, G. A. (2024). Improvement in the classification of data with class imbalance through a redistribution of classes by k-means. XIKUA Boletín Científico De La Escuela Superior De Tlahuelilpan, 12(Especial), 111-116. https://doi.org/10.29057/xikua.v12iEspecial.12768