Combining Resampling Methods, Multi Criteria Decision-Making and Clustering Analysis for Diabetes Detection in Imbalanced Data

Ali Ommi; Abbas Foroozanfar

doi:10.22105/ahse.v3i1.53

Authors

Ali Ommi * Department of Industrial Engineering, Sharif University of Technology, Tehran, Iran.
Abbas Foroozanfar Department of Industrial Engineering, Sharif University of Technology, Tehran, Iran.

Abstract

This study investigates the challenges of classifying imbalanced datasets, particularly in medical applications such as diabetes detection. It evaluates the impact of resampling techniques, covering both oversampling and undersampling, on the performance of classification models. By combining four oversampling and four undersampling approaches with multicriteria decision-making for criterion weighting and ranking, the study proposes an integrated, practical framework for selecting optimal resampling strategies for diabetes detection using the large Behavioral Risk Factor Surveillance System (BRFSS) dataset. Two machine learning algorithms, XGBoost and the Support Vector Machine (SVM), were trained and their performance assessed under each resampling regime. Results show that, on average, sensitivity improved across all resampling methods, with a mean increase of 87.32%; the improvement was most pronounced for XGBoost. The F1-score likewise showed substantial gains across all methods, with SVM accounting for a relatively larger share of them. Although AUC changed little, the findings indicate a clear improvement in the models' ability to detect the minority class (individuals with diabetes). To identify the best resampling approaches, a multicriteria decision-making (MCDM) procedure was applied, using the Analytic Hierarchy Process (AHP) for criterion weighting and the MAIRCA (Multi-Attributive Ideal-Real Comparative Analysis) method for ranking the classification and resampling methods. In addition to the multicriteria ranking, an unsupervised clustering analysis based on the K-means algorithm was conducted on the resampling-classifier combinations to further explore similarities and differences in their overall performance profiles.
The optimal number of clusters was determined using the silhouette coefficient, leading to a partition that revealed distinct groups of methods characterized by different trade-offs among accuracy, precision, sensitivity, F1-score, AUC, and computational cost. The clustering results were consistent with the MAIRCA ranking: high-ranked alternatives formed well-separated, high-quality clusters, while poorly performing methods were grouped into low-performance clusters.
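
The MAIRCA ranking step mentioned in the abstract can be illustrated with a minimal Python sketch. It follows the standard textbook formulation of MAIRCA (equal prior preference per alternative, min-max scaling of each criterion, ranking by total gap between theoretical and real ratings) and is not taken from the authors' implementation. The function name `mairca_rank`, the alternative labels, the criterion weights, and all numeric values are hypothetical and chosen only for illustration; in the study the weights would come from AHP.

```python
def mairca_rank(matrix, weights, benefit):
    """Rank alternatives with the standard MAIRCA formulation.

    matrix  : list of rows, one per alternative, one value per criterion
    weights : criterion weights summing to 1 (e.g. obtained from AHP)
    benefit : True if larger is better for that criterion, False for a cost
    Returns (gaps, order): total gap per alternative and indices best-first.
    """
    m = len(matrix)
    n = len(weights)
    p = 1.0 / m  # every alternative gets the same prior preference
    gaps = []
    for row in matrix:
        total = 0.0
        for j in range(n):
            column = [r[j] for r in matrix]
            lo, hi = min(column), max(column)
            theoretical = p * weights[j]  # theoretical rating t_p
            if hi == lo:
                share = 1.0  # criterion does not discriminate: no gap
            elif benefit[j]:
                share = (row[j] - lo) / (hi - lo)
            else:
                share = (hi - row[j]) / (hi - lo)
            real = theoretical * share  # real rating t_r
            total += theoretical - real  # gap g = t_p - t_r
        gaps.append(total)
    # A smaller total gap means the alternative is closer to the ideal.
    order = sorted(range(m), key=lambda i: gaps[i])
    return gaps, order


# Hypothetical example: three resampling-classifier combinations scored on
# sensitivity, F1-score (benefit criteria) and runtime in seconds (cost).
labels = ["A1", "A2", "A3"]
matrix = [
    [0.80, 0.60, 12.0],
    [0.70, 0.65, 5.0],
    [0.50, 0.40, 20.0],
]
weights = [0.5, 0.3, 0.2]  # illustrative weights, not from the study
benefit = [True, True, False]

gaps, order = mairca_rank(matrix, weights, benefit)
for i in order:
    print(labels[i], round(gaps[i], 4))
```

Because each gap is bounded by the alternative's theoretical rating, an alternative that is worst on every criterion receives a total gap of exactly 1/m, which gives a quick sanity check on any implementation.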

Keywords:

Imbalanced data, Oversampling, Undersampling, Machine learning, Classification, Multicriteria decision-making (MCDM), Clustering, K-means

