PRE-PROCESSING DATA ON MULTICLASS CLASSIFICATION OF ANEMIA AND IRON DEFICIENCY WITH THE XGBOOST METHOD
Abstract
Anemia and iron deficiency are health problems in Indonesia and globally. Multiclass classification often suffers from data problems such as missing values, too many variables, and imbalanced classes. This study pre-processes the data using MissForest imputation, Boruta feature selection, and SMOTE to help improve the performance of the classification model in predicting a particular class. After pre-processing, classification models are built with the XGBoost algorithm. The results show that pre-processing the data improved the model's performance in multiclass classification of anemia and iron deficiency among women in Indonesia, reaching an accuracy of 0.815 and an AUC of 0.9693.
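To illustrate the class-balancing step named in the abstract, the following is a minimal sketch of the core SMOTE idea: a synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. It is not the authors' implementation, which the abstract does not describe. The function name `smote` and the parameters `n_new`, `k`, and `seed` are hypothetical. The sketch assumes purely numeric features and, as a simplification, picks each base sample at random rather than iterating over every minority sample.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated from minority-class rows.

    Hypothetical helper for illustration only; numeric features assumed.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other rows
    synthetic = []
    for _ in range(n_new):
        # Pick a base minority sample at random (simplification, see lead-in).
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest neighbours.
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two numeric features (made-up values).
minority_rows = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_rows = smote(minority_rows, n_new=6, k=2)
print(len(new_rows))
```

Because every synthetic row is an interpolation between two existing minority rows, it always lies within the range of the observed minority values. In practice, oversampling of this kind is usually applied to the training split only, so that the evaluation data stay untouched.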
Copyright (c) 2023 Fathu Nurrahman, Hari Wijayanto, Aji Hamim Wigena, Nunung Nurjanah
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this Journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work’s authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal’s published version of the work (e.g. acknowledgement of its initial publication in this journal).
- Authors are permitted and encouraged to post their work online (e.g. in institutional repositories or on their websites) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published works.