PREDICTING DIABETES MELLITUS USING CATBOOST CLASSIFIER AND SHAPLEY ADDITIVE EXPLANATION (SHAP) APPROACH
Abstract
Diabetes mellitus, a metabolic disease characterized by hyperglycemia, can be dangerous if it is not managed properly. Early detection of symptoms can reduce the impact of delayed treatment. This study aims to detect patients with diabetes mellitus early using a machine learning approach applied to data from MIT’s GOSSIS (Global Open Source Severity of Illness Score). Using Shapley Additive Explanation (SHAP), which ranks features by their contribution to the classification, this study shows that 14 features contribute significantly to the CatBoost classifier. The feature ‘d1_glucose_max’, the patient's highest serum or plasma glucose concentration during the first 24 hours of their unit stay, has the highest impact on classifying diabetes mellitus patients, followed by age and the APACHE glucose measurement. A classifier using the selected features achieves a validation AUC of 86.86%.
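To illustrate the idea behind the SHAP-based feature ranking and the AUC metric mentioned in the abstract, here is a minimal sketch in plain Python. It does not reproduce the study: the scoring function `toy_score`, its weights, the four data rows, and the labels are all hypothetical stand-ins for the CatBoost model and the GOSSIS data, and the Shapley values are computed exactly against a single mean baseline rather than with the SHAP library. Only the feature name `d1_glucose_max` comes from the abstract; `age` and `glucose_apache` are assumed column names for the other two features it mentions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to one baseline point."""
    n = len(x)

    def value(subset):
        # Features in `subset` come from x; all others come from the baseline.
        mixed = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def rank_features(predict, rows, baseline, names):
    """Rank features by mean absolute Shapley value over all rows."""
    totals = [0.0] * len(names)
    for row in rows:
        for i, phi in enumerate(shapley_values(predict, row, baseline)):
            totals[i] += abs(phi)
    means = [t / len(rows) for t in totals]
    return sorted(zip(names, means), key=lambda pair: pair[1], reverse=True)


def auc(labels, scores):
    """AUC as the chance a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def toy_score(row):
    # Hypothetical linear scorer with made-up weights, NOT the paper's model.
    glucose_max, age, glucose_apache = row
    return 0.02 * glucose_max + 0.03 * age + 0.005 * glucose_apache


# Hypothetical illustrative rows and labels, NOT taken from the GOSSIS data.
NAMES = ["d1_glucose_max", "age", "glucose_apache"]
ROWS = [[250, 70, 220], [110, 35, 100], [180, 60, 170], [95, 50, 90]]
LABELS = [1, 0, 1, 0]
BASELINE = [sum(col) / len(ROWS) for col in zip(*ROWS)]

ranking = rank_features(toy_score, ROWS, BASELINE, NAMES)
toy_auc = auc(LABELS, [toy_score(r) for r in ROWS])

for name, mean_abs in ranking:
    print(f"{name}: mean |SHAP| = {mean_abs:.4f}")
print(f"AUC on the toy rows: {toy_auc:.2f}")
```

Ranking by mean absolute Shapley value is the same summary that SHAP bar plots report, which is why it is used here. The exact enumeration grows exponentially with the number of features, so in practice a tree-specific explainer would be used with a gradient-boosted model instead of this brute-force loop.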
Authors who publish with this Journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work’s authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal’s published version of the work (e.g. acknowledgement of its initial publication in this journal).
- Authors are permitted and encouraged to post their work online (e.g. in institutional repositories or on their websites) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published works.