PRE-PROCESSING DATA ON MULTICLASS CLASSIFICATION OF ANEMIA AND IRON DEFICIENCY WITH THE XGBOOST METHOD

  • Fathu Nurrahman, Department of Statistics, Faculty of Mathematics and Natural Sciences, IPB University, Indonesia
  • Hari Wijayanto, Department of Statistics, Faculty of Mathematics and Natural Sciences, IPB University, Indonesia
  • Aji Hamim Wigena, Department of Statistics, Faculty of Mathematics and Natural Sciences, IPB University, Indonesia
  • Nunung Nurjanah, National Research and Innovation Agency (BRIN), Indonesia
Keywords: Anemia, MissForest, Boruta, SMOTE, XGBoost, Multiclass Classification

Abstract

Anemia and iron deficiency are health problems in Indonesia and globally. Multiclass classification is often hampered by data problems such as missing values, too many variables, and imbalanced classes. In this study, the data are pre-processed using MissForest imputation, Boruta feature selection, and SMOTE to help improve the performance of the classification model in predicting a particular class. After pre-processing, classification modeling is carried out using the XGBoost algorithm. The results show that pre-processing the data can improve the performance of the model in multiclass classification of anemia and iron deficiency in women in Indonesia, with an accuracy of 0.815 and an AUC of 0.9693.
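
Of the pre-processing steps named in the abstract, SMOTE is the easiest to illustrate compactly: it creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a minimal sketch in plain Python, not the authors' implementation; the function name `smote`, its parameters, and the toy data are assumptions made only for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points exist
    if k < 1:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority-class neighbours of the base point (Euclidean distance)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy minority class with two features (not data from the study)
minority_class = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_points = smote(minority_class, n_synthetic=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by that class. In a multiclass setting such as the one described here, the same routine would be applied to each under-represented class separately.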

Published
2023-06-11
How to Cite
[1] F. Nurrahman, H. Wijayanto, A. Wigena, and N. Nurjanah, “PRE-PROCESSING DATA ON MULTICLASS CLASSIFICATION OF ANEMIA AND IRON DEFICIENCY WITH THE XGBOOST METHOD”, BAREKENG: J. Math. & App., vol. 17, no. 2, pp. 0767-0774, Jun. 2023.