PRE-PROCESSING DATA ON MULTICLASS CLASSIFICATION OF ANEMIA AND IRON DEFICIENCY WITH THE XGBOOST METHOD

ABSTRACT


INTRODUCTION
Amid today's rapid technological development, data are important and valuable to every sector. From data, both information and knowledge can be obtained. Big data, characterized by a large number of observations, varied data forms, high velocity, and many variables, has become a trend in the current era. This is why classical analysis is often unable to classify big data properly [1]. According to [2], machine learning is a collection of computational methods that are useful for making and improving predictions of objects or observations into a particular class or group.
Machine-learning techniques are commonly used for classification analysis. Classification analysis is divided into two types based on the number of classes, namely binary and multiclass classification. The problems that often occur in classification analysis are missing data, a large number of independent variables, imbalanced data, and limitations of the learning algorithms themselves. Therefore, methods in classification analysis continue to be developed to overcome these problems. Missing data are handled with imputation techniques, which are applied in the pre-processing step of classification modeling [3].
The MissForest imputation method can work on mixed data simultaneously and is non-parametric, so it does not depend on particular distributional assumptions. According to [4], MissForest outperforms other imputation methods for mixed data, such as MICE, MissPALasso, and KNN, because it is the only method with consistently lower imputation errors.
According to [5], including irrelevant variables in the model will cause the model to produce lower accuracy values. Boruta is a wrapper feature-selection method that identifies important variables by adding shadow variables, using a processing algorithm similar to random forest classification [6]. In several studies, models using Boruta outperformed those that did not, with higher accuracy, faster computation, and easier interpretation [7], [8].
Data imbalance also often occurs in health data. It can hinder the classification method's ability to generalize, that is, to perform well on new data that the model has never been trained on [9]. The Synthetic Minority Oversampling Technique (SMOTE) is a data-level approach to handling imbalanced data, which is a modification of the oversampling approach.
[10] conducted preliminary research on SMOTE and concluded that the SMOTE approach could improve the accuracy of classifiers for minority classes.
One widely used supervised learning approach is the classification tree. The advantage of a classification tree is that it does not depend on assumptions such as normally distributed data or the absence of multicollinearity between predictors. According to [11], ensemble trees are used to overcome the instability and high variance of a single tree. Among the many ensemble-tree classification algorithms is XGBoost (Extreme Gradient Boosting). XGBoost applies the boosting concept: models are built sequentially, with each new model learning from the errors of the previous one, and all models are combined for prediction.
In a study conducted by [12], when predicting the spinal cord infiltration model in patients with malignant lymphoma, logistic regression and XGBoost were used. XGBoost is a better model with an AUC value of 0.844 compared to logistic regression. In making predictions using machine learning for fetal risk analysis using cardiotocography data, it was found that the XGBoost algorithm provides a high precision value of 96% compared to other algorithms [13].
Anemia is the most common nutritional problem in both developed and developing countries and remains a major human health challenge. Cases of anemia are much more common in women than in men. In 2019, anemia affected around a third of women of childbearing age worldwide (29.9%), including 36.5% of pregnant women and 29.6% of non-pregnant women. According to the results of the 2018 Basic Health Research (RISKESDAS), the incidence of anemia in pregnant women in Indonesia itself is quite high: the prevalence of anemia in pregnant women increased from 37.1% in 2013 to 48.9% in 2018. Based on hemoglobin and ferritin parameters, anemia is classified into three groups, namely anemia, iron deficiency (ID), and iron deficiency anemia (IDA). Cases of anemia in pregnant and non-pregnant women in developing countries are generally suspected to be due to iron deficiency, especially during pregnancy, when the need for iron increases significantly [14].
Based on the problems above, this study analyzes the classification of anemia and iron deficiency in women in Indonesia aged 10-45 years using the XGBoost algorithm, evaluating the resulting accuracy and applying data-handling steps: MissForest for missing data, Boruta for selecting influential variables, and SMOTE for handling imbalanced data.

Anemia
According to [15], anemia is a condition in which the number of red blood cells (and consequently the oxygen-carrying capacity) is insufficient to meet the physiological needs of the body. These physiological needs vary with a person's age, gender, geographic location of residence, and, for women, stage of pregnancy. Anemia is diagnosed by laboratory examination of hemoglobin (Hb) levels in the blood together with serum ferritin levels; serum ferritin indicates the body's iron stores.
According to the Minister of Health Regulation Number 37 of 2012 concerning the Implementation of a Public Health Center Laboratory, a person is said to have anemia when the blood hemoglobin level is less than 12 g/dL. Based on the hemoglobin and ferritin parameters, the classification of anemia in pregnant and non-pregnant women can be seen in Table 1. In general, anemia is caused by inadequate production or quality of red blood cells and by blood loss, both acute and chronic. The symptoms most often found in people with anemia are the "5L" symptoms (weak, tired, lethargic, fatigued, and inattentive).

MissForest
According to [4], MissForest imputation is a technique for dealing with missing data with an iterative imputation scheme and also utilizes a random forest algorithm that is built from observed data. MissForest can work on categorical and numerical data, data that has a non-linear relationship between variables, data that has interactions between variables, and high-dimensional data. In addition, MissForest does not need any prior data-related information.
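The core of the MissForest scheme described above can be illustrated with a minimal sketch: for each incomplete column, a model is fit on the rows where that column is observed and its predictions replace the missing entries. MissForest proper uses a random forest per variable and cycles over all incomplete columns until the imputations stabilise; in this hypothetical sketch an ordinary least-squares line stands in for the forest, and all names and toy data are illustrative only.

```python
def regress_impute(x, y, y_missing):
    """One MissForest-style update for a single incomplete column y:
    fit a model of y on a complete column x using only the rows where
    y is observed, then predict the missing y values.
    (A least-squares line stands in for MissForest's random forest.)"""
    xs = [xi for xi, m in zip(x, y_missing) if not m]
    ys = [yi for yi, m in zip(y, y_missing) if not m]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    intercept = my - slope * mx
    # keep observed values, replace missing ones with predictions
    return [yi if not m else intercept + slope * xi
            for yi, xi, m in zip(y, x, y_missing)]

# Toy column with one missing entry (y = 2x on the observed rows).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 0.0]      # the trailing 0.0 is a placeholder
mask = [False, False, False, False, True]
imputed = regress_impute(x, y, mask)
```

Because the observed rows satisfy y = 2x exactly, the missing fifth value is imputed as 10.0 while the observed values are left untouched.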

Boruta
Boruta is an algorithm based on random forests, although it can also be used with other tree algorithms, that estimates feature importance without requiring parameter values to be specified. The method improves accuracy, stability, and runtime and helps avoid overfitting. The Boruta algorithm consists of the following steps:
1. Extend the data by making a copy of every initial variable and shuffling each copy ("shadow" variables) to remove its correlation with the response.
2. Run a random forest classification on the extended data to obtain Z-scores of variable importance.
3. Determine the maximum Z-score among the shadow variables and mark each original variable with a better score.
4. Perform a two-tailed equivalence test for each variable whose importance is still undetermined.
5. Permanently discard variables whose importance is significantly lower than the maximum shadow Z-score.
6. Remove all shadow variables and repeat the algorithm until an importance decision has been obtained for every variable.
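The shadow-variable idea in the steps above can be sketched in a few lines. This is not the Boruta implementation: a plain absolute correlation with the target stands in for the random-forest importance Z-scores, and the toy data and names are assumptions for illustration only.

```python
import random

def corr(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def boruta_step(features, target, rng):
    """One Boruta-style iteration: build a shuffled 'shadow' copy of
    every feature (destroying its relation to the target), score all
    columns, and keep only the original features that beat the best
    shadow score. (|correlation| stands in for RF importance here.)"""
    shadow_scores = []
    for col in features.values():
        shadow = col[:]
        rng.shuffle(shadow)                    # break relation to target
        shadow_scores.append(abs(corr(shadow, target)))
    threshold = max(shadow_scores)             # best shadow importance
    return {name for name, col in features.items()
            if abs(corr(col, target)) > threshold}

rng = random.Random(0)
signal = [rng.gauss(0, 1) for _ in range(200)]
noise = [rng.gauss(0, 1) for _ in range(200)]
target = [s + rng.gauss(0, 0.1) for s in signal]   # depends on signal only
kept = boruta_step({"signal": signal, "noise": noise}, target, rng)
```

The informative `signal` column scores far above any shuffled shadow and is retained, which is the mechanism Boruta iterates until every variable is decided.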

Synthetic Minority Oversampling Technique (SMOTE)
When one data class contains many more objects than the others, the data are imbalanced. The class with more objects is called the major class, and the other classes are called the minor classes. This imbalance causes the model to tend to misclassify observations into the major class and ignore the minor classes, thereby affecting estimation and accuracy [10]. The solution to this problem is to change the class distribution into a more balanced sample using SMOTE (Synthetic Minority Oversampling Technique). This method handles imbalanced data by increasing the amount of minor-class data until it is comparable to the major class, generating artificial (synthetic) data using k-nearest neighbors. Synthetic values are generated differently for numerical and categorical features: numerical values are interpolated based on Euclidean distance to the nearest neighbors, while categorical values simply take the mode of the neighbors [17].
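The numerical interpolation step of SMOTE can be sketched as follows: a synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The point coordinates below are hypothetical, and neighbour search is assumed to have already been done.

```python
import random

def smote_sample(x, neighbor, rng):
    """Create one synthetic minority example on the segment between a
    minority point and one of its k nearest minority neighbours:
    x_new = x + gap * (neighbor - x), with gap ~ Uniform(0, 1)."""
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
x = [1.0, 2.0]          # a minority-class point
neighbor = [3.0, 4.0]   # one of its nearest minority neighbours
synthetic = smote_sample(x, neighbor, rng)
```

Every coordinate of the synthetic point lies between the original point and its neighbour, and the same `gap` is used for all coordinates, so the new sample sits on the connecting segment rather than in an arbitrary box around it.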

Extreme Gradient Boosting (XGBoost)
[18] developed extreme gradient boosting as an ensemble technique extending gradient boosting. This method combines a set of weak learners into a more accurate model while improving performance and speed, running up to 10 times faster than other gradient-boosting methods. The XGBoost algorithm adds safeguards against overfitting and speeds up the computation process. Overfitting is prevented by adding a penalty component when optimizing the loss function value. In principle, XGBoost builds trees sequentially, choosing node output values that minimize the objective. The equation is as follows.
$$\mathrm{obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
where $l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right)$ is a loss function measuring the prediction error at boosting round $t$ and $\Omega(f_t)$ is a regularization term that keeps the model from overfitting.
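The sequential, error-correcting idea behind the objective above can be illustrated with a minimal boosting sketch on squared-error loss. This is not XGBoost itself (which uses a second-order approximation of the loss plus the regularizer $\Omega$); it only shows the additive training scheme XGBoost builds on, with depth-1 trees (stumps) as weak learners and hypothetical toy data.

```python
def fit_stump(x, residuals):
    """Fit a depth-1 regression tree (stump) to the residuals: pick
    the split threshold that minimises squared error on 1-D input."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda v, t=t, lm=lm, rm=rm: lm if v <= t else rm

def boost(x, y, rounds=20, eta=0.3):
    """Boosted ensemble: each new stump is fit to the residual errors
    of the ensemble built so far, then added with learning rate eta."""
    trees = []
    pred = [0.0] * len(y)
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + eta * tree(xi) for pi, xi in zip(pred, x)]
    return lambda v: sum(eta * tree(v) for tree in trees)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 1.0, 1.0, 5.0, 5.0]   # a step function
model = boost(x, y)
```

After 20 rounds the ensemble recovers the step function closely, because every new stump corrects what the previous ones got wrong, which is exactly the "learn from the previous model's mistakes" behaviour described above.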

Classification Model Evaluation
The confusion matrix is a performance measurement for classification. Ideally, a method classifies all data correctly, but in practice a system's performance cannot be 100% accurate [19]. The layout of the confusion matrix is shown in Table 2. The performance of a classification model is measured by calculating accuracy, sensitivity (recall), and specificity. The AUC (Area Under the Curve) measures how well the model differentiates between classes; its value lies between 0 and 1, and the higher the AUC, the better the model is at predicting the anemia class.
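For a multiclass problem such as this one, accuracy and the per-class (one-vs-rest) sensitivity and specificity follow directly from the confusion matrix counts. A small sketch, using a hypothetical 3-class matrix (the counts are made up for illustration):

```python
def one_vs_rest_metrics(cm, cls):
    """Per-class sensitivity and specificity from a multiclass
    confusion matrix cm, where cm[i][j] counts actual class i
    predicted as class j (one-vs-rest reduction for class cls)."""
    n = len(cm)
    tp = cm[cls][cls]
    fn = sum(cm[cls][j] for j in range(n) if j != cls)
    fp = sum(cm[i][cls] for i in range(n) if i != cls)
    tn = sum(cm[i][j] for i in range(n) for j in range(n)
             if i != cls and j != cls)
    sensitivity = tp / (tp + fn)   # recall: TP / (TP + FN)
    specificity = tn / (tn + fp)   # TN / (TN + FP)
    return sensitivity, specificity

def accuracy(cm):
    """Overall accuracy: diagonal counts over all counts."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

# Hypothetical 3-class confusion matrix (rows = actual, cols = predicted).
cm = [[50, 5, 5],
      [4, 40, 6],
      [1, 2, 37]]
sens0, spec0 = one_vs_rest_metrics(cm, 0)
```

Here accuracy is (50+40+37)/150 ≈ 0.847, and for class 0 sensitivity is 50/60 ≈ 0.833 with specificity 85/90 ≈ 0.944; the same reduction yields the per-category values reported later in Table 6.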

Data
The data used are secondary data obtained from research by the Health Research and Development Agency, namely RISKESDAS 2013 and the 2016 national profile study of the nutritional status of iron (Fe) and vitamin A (VA) in Indonesia, with anemia status as the response variable. The data cover pregnant and non-pregnant women aged 10-45 years. The response variable and the indicators used in this study are presented in Table 3.

RESULTS AND DISCUSSION
From a total of 11,327 observations across the 21 available variables, Table 4 shows that there are missing data in the variables ISPA, Diarrhea, Pneumonia, Malaria, Hepatitis, Diabetes, Weight, Height, and Nutritional Status. Missing data can degrade the performance of the classification model. Because the missingness appears to occur at random and does not depend on the data values themselves, we assume the missing data are MCAR (missing completely at random); therefore, the empty entries are imputed using the MissForest imputation method. In MissForest, each tree is built from a bootstrap sample of the observed data. Each bootstrap sample randomly leaves out about one-third of the observations; the observations left out of a given tree are referred to as Out of Bag (OOB) [20] and are not included in building that tree. MissForest performance can therefore be measured on the OOB observations, which serve as test data. Imputation performance was measured by NRMSE for numeric data and by PFC for categorical data. Table 5 shows average imputation errors of 0.0756 (NRMSE) and 0.0446 (PFC), with an average time over 10 iterations of 194.3 s; since values close to 0 indicate good imputation performance, MissForest performed well. In a multiclass classification analysis, it is also important to examine the distribution of the target classes and to explore each variable before modeling, as shown in Figure 1.
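The two imputation error measures used above can be computed as follows. This sketch uses one common definition of NRMSE (squared error over the imputed entries, normalised by the variance of the true column); the toy vectors and masks are hypothetical.

```python
def nrmse(x_true, x_imp, mask):
    """Normalised RMSE over the imputed numeric entries (mask=True
    where the value was missing): sqrt(mean((true - imp)^2) / var),
    with the variance taken over the full true column here."""
    sq = [(t - i) ** 2 for t, i, m in zip(x_true, x_imp, mask) if m]
    mean_t = sum(x_true) / len(x_true)
    var_t = sum((t - mean_t) ** 2 for t in x_true) / len(x_true)
    return (sum(sq) / len(sq) / var_t) ** 0.5

def pfc(c_true, c_imp, mask):
    """Proportion of falsely classified entries among the imputed
    categorical values (0 = perfect, 1 = all wrong)."""
    pairs = [(t, i) for t, i, m in zip(c_true, c_imp, mask) if m]
    return sum(t != i for t, i in pairs) / len(pairs)

num_err = nrmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 3.0, 4.0],
                [False, True, False, False])
cat_err = pfc(["a", "b", "a"], ["a", "a", "a"], [True, True, True])
```

Both measures are 0 for perfect imputation and grow with the error, which is why the reported averages of 0.0756 (NRMSE) and 0.0446 (PFC) indicate good performance.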

Figure 1. EDA for selected variables: (a) anemia and iron deficiency categories, (b) region, (c) WUS, (d) boxplots of age by category
Based on the descriptive statistics in Figure 1, of the 11,327 respondents observed in the study, 9.40% were in the IDA category, 14.95% in the ID category, 11.93% in the Anemia category, and 63.72% in the Normal category. For the area variable, 45.98% of respondents came from rural areas and 54.02% from urban areas, and the WUS variable shows that 96.87% were non-pregnant women and 3.17% were pregnant women. Furthermore, the boxplots of age per category show that the average age of respondents with Anemia tends to be higher than that of respondents with ID and IDA, while the average age of Normal respondents is almost the same as that of respondents with Anemia.
In classification modeling, including irrelevant variables in the model will cause the model to produce a lower accuracy value. Feature selection was therefore performed using Boruta, which retains the variables that potentially affect anemia and iron deficiency, improving model accuracy and reducing modeling time. Of the 21 existing variables, Boruta identified 11 potential variables, namely Status, Age, WUS, History of Pregnancy, Number of Pregnancies, Gestational Age, Weight, Height, Nutritional Status, TBI Status, and CRP Status. These 11 variables are used in the classification modeling stage.
Before carrying out classification modeling, we know that the target class (response variable) is imbalanced. This will trigger classification errors, because the classifier will tend to assign observations to the majority class. Imbalanced data handling was therefore carried out using the SMOTE method, which rebalances the class sizes. Initially, the data for each category were IDA (1065), Anemia (1351), ID (1693), and Normal (7218). After SMOTE, all categories had roughly proportional numbers of objects: 6177 in the IDA category, 6448 in the Anemia category, 8964 in the ID category, and 7544 in the Normal category.
In the classification modeling stage using the XGBoost algorithm, the best hyperparameters need to be determined for each model (hyperparameter tuning). The XGBoost model requires tuning of nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, and subsample. Figure 2 shows the best hyperparameter values, compared on the basis of each configuration's accuracy. It can be seen that as the nrounds hyperparameter increases, the model's performance improves, and higher max_depth values also yield better models. The best hyperparameter values, giving the highest accuracy, are nrounds=200, max_depth=15, eta=0.05, gamma=0.01, colsample_bytree=0.75, min_child_weight=0, and subsample=0.5, so the XGBoost model with these hyperparameters is used. The model is then evaluated on accuracy, AUC, sensitivity, and specificity, comparing the data with all pre-processing applied against the data without the Boruta and SMOTE steps. The comparison results in Table 6 show that handling the multiclass data at the pre-processing stage increases the model's performance. The sensitivity and specificity values in each category tend to be proportional, i.e., not too different from one another, which indicates that using SMOTE to address the class imbalance was appropriate. The model's predictive quality is also reflected in the high AUC value of 0.9693.
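A tuning search of the kind described above can be sketched as an exhaustive evaluation over a grid of candidate values. The hyperparameter names follow the paper, but the grid values and the scoring function below are hypothetical stand-ins: a real search would train XGBoost with each configuration and average cross-validated accuracy.

```python
from itertools import product

# Hypothetical grid over a subset of the tuned hyperparameters.
grid = {
    "nrounds": [100, 200],
    "max_depth": [6, 15],
    "eta": [0.3, 0.05],
}

def cv_accuracy(params):
    """Stand-in for cross-validated accuracy. This toy score merely
    prefers more rounds, deeper trees, and a smaller learning rate,
    mirroring the trend reported in the text; it does NOT train a
    model."""
    return (params["nrounds"] / 200 * 0.4
            + params["max_depth"] / 15 * 0.4
            + 0.05 / params["eta"] * 0.2)

# Evaluate every combination and keep the best-scoring configuration.
best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=cv_accuracy,
)
```

With this toy score the search selects nrounds=200, max_depth=15, eta=0.05, matching the direction of the paper's tuning result; in practice the grid would also cover gamma, colsample_bytree, min_child_weight, and subsample.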

CONCLUSIONS
In this multiclass classification case of anemia and iron deficiency among women in Indonesia aged 10-45 years, many problems were found at the data preparation stage. Data handling was therefore carried out: MissForest to overcome missing data, Boruta for variable selection, and SMOTE to balance the data. Classification analysis was then performed using the XGBoost algorithm. The specificity and sensitivity values for the Anemia, ID, IDA, and Normal categories are (0.8025, 0.8312, 0.8575, 0.9514) and (0.9639, 0.9266, 0.9782, 0.9432), respectively; these values tend to be proportional, indicating that the use of SMOTE was appropriate. The accuracy and AUC values were 0.8615 and 0.9693, respectively, which indicates that the model performs well in predicting cases of anemia and iron deficiency. Such predictions can estimate a person's anemia and iron deficiency category in Indonesia and can help the government evaluate performance and policies when making decisions. However, the scope of this study is limited to predicting the categories of anemia and iron deficiency; future research could consider other nutritional indicators and additional variables. Despite their superior performance, MissForest, Boruta, and SMOTE are computationally expensive. Further research could therefore explore alternative data-handling methods that reduce computation time without significantly reducing accuracy, adjusted to the size and complexity of the dataset.