Improving the classification of class-imbalanced data through class redistribution with k-means

Antonio Alarcón Paredes; Roberto Jhoshua Alegre-Ventura; Gustavo Adolfo Alonso Silverio

doi:10.29057/xikua.v12iEspecial.12768

Antonio Alarcón Paredes, Centro de Investigación en Computación, Instituto Politécnico Nacional, http://orcid.org/0000-0002-9785-1252
Roberto Jhoshua Alegre-Ventura, Universidad Nacional Autónoma de México, https://orcid.org/0009-0004-3038-8402
Gustavo Adolfo Alonso Silverio, Universidad Autónoma de Guerrero, http://orcid.org/0000-0002-2699-140X

Keywords: Class imbalance, Statistical significance tests, Machine learning

Abstract

In machine learning, several challenges degrade the performance of classification algorithms, among them the curse of dimensionality and class imbalance. The curse of dimensionality arises when the number of features (p) in a dataset grows large relative to the number of available samples (n). Class imbalance occurs when one or more classes in a dataset are represented by significantly fewer samples than the others, which reduces classifier performance by biasing predictions towards the majority class. Microarray data are widely used to analyze and understand gene expression on a global level: they provide information about the expression of thousands of genes simultaneously and can be used to classify different conditions or diseases. Such data exhibit both the curse of dimensionality and class imbalance.

This work presents a method that divides the majority class of microarray datasets into two or more classes by means of the k-means clustering algorithm. Classification is performed using a variety of state-of-the-art classification algorithms. Measured by balanced accuracy under 5-fold cross-validation, the proposed method exceeds the classification performance of the original algorithms. A Mann-Whitney statistical test determined that the proposed method obtains significantly better results than the original algorithms.
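
The idea can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the synthetic dataset from `make_classification` stands in for microarray data, logistic regression stands in for the classifiers, labels are taken to be integers, the helper names `split_majority` and `cv_scores` are hypothetical, k-means is fitted on the training folds only, and the Mann-Whitney test is applied to the per-fold scores.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold


def split_majority(X, y, n_subclasses=2, seed=0):
    """Relabel the majority class into k-means sub-classes (hypothetical helper).

    Assumes integer class labels. Returns the new label vector and a dict
    that maps every new label back to its original class.
    """
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    mask = y == majority
    km = KMeans(n_clusters=n_subclasses, n_init=10, random_state=seed)
    clusters = km.fit_predict(X[mask])
    # New labels start above the largest existing label to avoid collisions.
    offset = classes.max() + 1
    y_new = y.copy()
    y_new[mask] = offset + clusters
    back = {c: c for c in classes if c != majority}
    back.update({offset + k: majority for k in range(n_subclasses)})
    return y_new, back


def cv_scores(model, X, y, redistribute, n_splits=5, seed=0):
    """Balanced accuracy per fold, always scored on the original labels."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train, test in folds.split(X, y):
        clf = clone(model)
        if redistribute:
            # Assumption: clustering uses the training fold only.
            y_train, back = split_majority(X[train], y[train], seed=seed)
            clf.fit(X[train], y_train)
            pred = np.array([back[p] for p in clf.predict(X[test])])
        else:
            clf.fit(X[train], y[train])
            pred = clf.predict(X[test])
        scores.append(balanced_accuracy_score(y[test], pred))
    return np.array(scores)


# Synthetic stand-in for microarray data: many features, few samples,
# and roughly an 85/15 class split.
X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           weights=[0.85, 0.15], random_state=0)
model = LogisticRegression(max_iter=1000)

base = cv_scores(model, X, y, redistribute=False)
new = cv_scores(model, X, y, redistribute=True)

# One-sided test: are the scores with redistribution larger?
stat, p_value = mannwhitneyu(new, base, alternative="greater")
print("baseline:", base.mean(), "redistributed:", new.mean(), "p:", p_value)
```

Predictions for any of the k-means sub-classes are mapped back to the original majority label before scoring, so both variants are evaluated on the same two-class problem. Results on this synthetic data say nothing about the results reported in the paper.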
