Combining Resampling Methods, Multi Criteria Decision-Making and Clustering Analysis for Diabetes Detection in Imbalanced Data

Authors

  • Ali Ommi *, Department of Industrial Engineering, Sharif University of Technology, Tehran, Iran.
  • Abbas Foroozanfar, Department of Industrial Engineering, Sharif University of Technology, Tehran, Iran.

https://doi.org/10.22105/ahse.v3i1.53

Abstract

This study investigates the challenges of classifying imbalanced datasets, particularly in medical applications such as diabetes detection. It evaluates the impact of various resampling techniques, including both oversampling and undersampling methods, on the performance of classification models. By combining four oversampling and four undersampling approaches with multicriteria decision-making (MCDM) for criterion weighting and ranking, the study proposes an integrated, practical framework for selecting resampling strategies for diabetes detection on the large Behavioral Risk Factor Surveillance System (BRFSS) dataset. Two machine learning algorithms, XGBoost and the Support Vector Machine (SVM), were trained and their performance assessed under the different resampling regimes. Results show that, on average, sensitivity improved across all resampling methods, with a mean increase of 87.32%; the improvement was most pronounced for XGBoost. The F1-score likewise showed substantial gains across all methods, with SVM contributing a relatively larger share of the improvement. Although AUC changed little, the findings indicate a clear enhancement in the models' ability to detect the minority class (individuals with diabetes). To identify the best resampling approaches, an MCDM procedure was applied, using the Analytic Hierarchy Process (AHP) for criterion weighting and MAIRCA for ranking and prioritizing the classification and resampling methods. In addition to the multicriteria ranking, an unsupervised clustering analysis based on the K-means algorithm was conducted on the resampling–classifier combinations to further explore similarities and differences in their overall performance profiles. The optimal number of clusters was determined using the silhouette coefficient, leading to a partition that revealed distinct groups of methods characterized by different trade-offs among accuracy, precision, sensitivity, F1-score, AUC, and computational cost. The clustering results were consistent with the MAIRCA ranking: high-ranked alternatives formed well-separated, high-quality clusters, while poorly performing methods were grouped into low-performance clusters.
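
The AHP weighting and MAIRCA ranking steps named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The pairwise judgments, the three criteria, the alternative names, and all scores are invented for the example (the study itself uses six criteria). The row geometric-mean approximation for the AHP weights and the equal preference 1/m over alternatives are assumptions as well.

```python
import numpy as np

def ahp_weights(pairwise):
    """Approximate AHP priority weights with row geometric means.

    Also returns the consistency index CI = (lambda_max - n) / (n - 1).
    """
    a = np.asarray(pairwise, dtype=float)
    n = a.shape[0]
    geo = np.prod(a, axis=1) ** (1.0 / n)
    w = geo / geo.sum()
    lam_max = np.mean((a @ w) / w)  # estimate of the principal eigenvalue
    ci = (lam_max - n) / (n - 1)
    return w, ci

def mairca(matrix, weights, is_benefit):
    """Return the total MAIRCA gap per alternative (lower is better).

    Assumes every criterion takes at least two distinct values.
    """
    x = np.asarray(matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    m = x.shape[0]
    best = np.where(is_benefit, x.max(axis=0), x.min(axis=0))
    worst = np.where(is_benefit, x.min(axis=0), x.max(axis=0))
    t_p = np.tile(w / m, (m, 1))              # theoretical ponder matrix
    t_r = t_p * (x - worst) / (best - worst)  # real ponder matrix
    return (t_p - t_r).sum(axis=1)            # row sums of the gap matrix

# Hypothetical pairwise comparisons: sensitivity vs. F1-score vs. run time.
pairwise = [[1.0, 2.0, 6.0],
            [1 / 2, 1.0, 3.0],
            [1 / 6, 1 / 3, 1.0]]
weights, ci = ahp_weights(pairwise)

# Hypothetical scores for three resampling-classifier combinations.
names = ["alt_1", "alt_2", "alt_3"]
scores = [[0.80, 0.50, 30.0],   # sensitivity, F1-score, run time (s)
          [0.60, 0.40, 10.0],
          [0.70, 0.45, 20.0]]
is_benefit = [True, True, False]  # run time is a cost criterion

q = mairca(scores, weights, is_benefit)
ranking = [names[i] for i in np.argsort(q)]
print("weights:", np.round(weights, 3), "CI:", round(float(ci), 6))
print("gaps:", np.round(q, 4), "ranking:", ranking)
```

With these made-up numbers, the first alternative has the smallest total gap and ranks first. In a full AHP analysis the consistency index would be divided by a random index to obtain a consistency ratio before the weights are accepted.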

Keywords:

Imbalanced data, Oversampling, Undersampling, Machine learning, Classification, Multicriteria decision-making (MCDM), Clustering, K-means

Published

2026-03-22

How to Cite

Ommi, A., & Foroozanfar, A. (2026). Combining Resampling Methods, Multi Criteria Decision-Making and Clustering Analysis for Diabetes Detection in Imbalanced Data. Annals of Healthcare Systems Engineering, 3(1), 13-31. https://doi.org/10.22105/ahse.v3i1.53
