INTEGRATION OF SVM AND SMOTE-NC FOR CLASSIFICATION OF HEART FAILURE PATIENTS

Abstract: SMOTE (Synthetic Minority Over-sampling Technique) and SMOTE-NC (SMOTE for Nominal and Continuous features) are variations of the original SMOTE algorithm designed to handle imbalanced datasets. The primary difference lies in SMOTE-NC's ability to generate synthetic minority-class examples for data that mix continuous and nominal features. We employed a dataset of continuous and nominal features from heart failure patients, in which the distribution of patient status (deceased or alive) was imbalanced. To address this, we balanced the data using SMOTE-NC before conducting the classification analysis with SVM. The combination of SVM and SMOTE-NC gave better results than SVM alone, as seen from the higher accuracy and F1 score. The F1 score is less sensitive to class imbalance than accuracy: when there is a significant imbalance in the number of instances between classes, the F1 score can be a more informative metric for evaluating a classifier's performance, especially when the minority class is of interest.


INTRODUCTION
Along with the development of technology, the amount of information available on the internet has grown significantly and is distributed globally. Nevertheless, if this information is not used, it becomes nothing more than a useless data collection. Data mining techniques allow people to find and interpret patterns in this information that can support decision making. One of the data mining techniques used to predict a decision is classification [1]. Classification has several families of algorithms, based on fuzzy logic, Bayesian classification, decision trees, support vector machines, artificial neural networks, and k-nearest neighbors. The Support Vector Machine (SVM) has better accuracy than the k-nearest neighbor, decision tree, and linear regression algorithms [2]. Experimental findings indicate that the SVM classifier achieves the highest accuracy and sensitivity after training and testing with the proposed method. SVM finds the optimal hyperplane separating two classes, which provides good generalization capability. Most of the labeled data is usually used to find the optimal hyperplane; however, large-scale experimental data increases the complexity of the computational process [3].
According to the World Health Organization, an estimated 17.9 million people died from heart failure in 2019, representing 32% of all global deaths [4]. Many factors can cause a person to develop heart failure, such as age and high blood pressure. We used a heart failure dataset from the Faisalabad Institute of Cardiology and the Allied Hospital in Faisalabad (Punjab, Pakistan) with 13 attributes, such as the patient's age and diabetes status.
Based on the heart failure dataset, only a small proportion of patients died compared to those who survived. This condition indicates that the death event is imbalanced between the dead (minority) and survived (majority) classes. Imbalanced datasets are those in which the number of instances in one class is much greater than in another [5]. Insights from such datasets can influence decision results, so the practical ramifications of uneven training data are considerable. Unbalanced data distributions are widespread in real-world contexts, especially when target classes lack a uniform distribution [6]. This problem is prevalent in healthcare applications, where class imbalance is a significant obstacle [6].
The most popular way of dealing with data imbalance is resampling. One popular resampling method is the Synthetic Minority Oversampling Technique (SMOTE). SMOTE works by synthesizing new minority-class data based on the k-nearest neighbors until the minority class is equivalent to the majority class [7]. The author applies this family of resampling methods to the imbalanced classes in this research data. However, because the data are both nominal and continuous, this study handles class imbalance using the SMOTE-NC algorithm, following its description by Gök and Olgun [8].
This study applies the integration of the SMOTE-NC oversampling method with SVM classification to predict the death event of a heart failure patient, i.e., whether the patient belongs to the dead or the survived class. Imbalanced datasets analyzed with SVM can introduce bias during learning, as the model may focus more on the prevalent class. Balancing mitigates this bias, allowing the model to learn the underlying patterns of each class more effectively. In medical diagnosis applications, the different costs associated with false positives and false negatives under data imbalance underscore the need to integrate classification algorithms with effective data balancing techniques. This integration is essential for achieving superior and more reliable diagnostic results.

Support Vector Machine
Classification is the process of finding a model or function that can explain and differentiate data classes, so that the model can be used to estimate the class of an object whose status is unknown. The learning process requires a learning algorithm, such as the Support Vector Machine (SVM), Naïve Bayes, K-Nearest Neighbors (K-NN), Decision Tree, or Artificial Neural Network (ANN). SVM is one of the classification methods in data mining and can be used for both classification and regression. SVM was first proposed by Vladimir Vapnik and is grounded in statistical learning theory and structural risk minimization [9]. SVMs are commonly employed in healthcare challenges such as disease prediction [10]. SVM has been extended in survival analysis to handle censored data, making it suitable for predicting survival and discharge-time likelihood in COVID-19 cases [10]. Various machine learning algorithms, including SVM, are utilized in the healthcare industry to diagnose disorders [11]. SVM is used as a classifier in a diagnosis model for outpatient clinicians employing healthcare data analytics and neural networks [12]. SVM is also utilized in clinical decision support systems to predict heart disease [13]. Research has also addressed the computational difficulty of solving nonlinear SVMs on large-scale data with uneven class sizes [14].
SVM was originally a linear method, but it has since been developed to work on non-linear problems. SVM handles non-linear problems by using a kernel to map the data into a high-dimensional space. In this space, a separator, often called a hyperplane, is sought. The hyperplane maximizes the distance, or margin, between the data classes. The best hyperplane between the two classes is found by measuring the margin and locating its maximum. This search for the best separating hyperplane is the core of the SVM method.
Consider the n-dimensional training dataset $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $m$ denotes the number of samples, $x_i$ is a sample in the input space $X$, and $y_i \in \{-1, 1\}$ is the label in the output space $Y$. The support vector machine's core principle is to discover an ideal hyperplane that maximizes the margin on both sides of the hyperplane. The separating hyperplane of a standard linear support vector classifier (SVC) can be defined as [15]:

$$w \cdot x + b = 0,$$

where $w = \{w_1, w_2, \ldots, w_n\}$ is a weight vector, $n$ is the number of attributes, and $b$ is a scalar, often called the bias. If the two classes are not completely separable, a hyperplane can still be determined that maximizes the margin while minimizing a quantity proportional to the misclassification errors, by adding positive slack variables $\xi_i$, so that the optimization problem can be described as follows:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y_i(w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0,$$

where $C$ is a parameter chosen by the user that controls the trade-off between the margin and the misclassification errors.
The problem of maximizing the margin is turned into a convex quadratic programming problem using the dual approach of convex optimization. To deal with non-linear problems, SVMs use non-linear kernels to map the low-dimensional feature space into a high-dimensional space; the kernel function is defined as $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, where $\phi$ is the mapping [16]. Table 1 shows some typical kernel functions [16]. The kernel trick enables SVM to perform computations in the original feature space without explicitly mapping data into the higher-dimensional space.
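As a small illustration of the kernel trick, the Gaussian RBF kernel used later in this study can be computed directly from pairwise distances in the original feature space; the function name, gamma value, and toy data below are illustrative, not taken from the study.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.1):
    """Gaussian RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2).

    This corresponds implicitly to an inner product <phi(x), phi(z)> in an
    infinite-dimensional feature space, so phi is never computed explicitly.
    """
    # Squared Euclidean distances between every pair of rows.
    sq = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-gamma * np.maximum(sq, 0.0))

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
K = rbf_kernel(X, X, gamma=0.5)
# An RBF kernel matrix is symmetric with ones on the diagonal, since
# ||x - x||^2 = 0 implies K(x, x) = exp(0) = 1.
print(np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))  # True True
```

Because every entry of $K$ depends only on distances in the original space, the SVM dual problem can be solved without ever forming the high-dimensional mapping.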

Synthetic Minority Oversampling Technique-Nominal Continuous (SMOTE-NC)
For datasets with mixed nominal and continuous features, Chawla proposed the Synthetic Minority Oversampling Technique for Nominal and Continuous features (SMOTE-NC). The SMOTE-NC algorithm is described below [8].
1. Median computation: Compute the median of the standard deviations of all continuous features of the minority class. If the nominal attributes differ between a sample and its potential nearest neighbors, this median is included in the Euclidean distance calculation. The median thus penalizes nominal feature differences by an amount comparable to typical differences in continuous feature values.
2. Nearest neighbor computation: Compute the Euclidean distance in the continuous feature space between the feature vector under consideration (a minority class sample) and the other minority class samples to identify its nearest neighbors. For each nominal feature that differs between the feature vector under consideration and a potential neighbor, include the previously computed median of the standard deviations in the Euclidean distance calculation.
3. Populate the synthetic sample: The continuous features of the new synthetic minority class sample are created using the same interpolation as standard SMOTE, as described earlier. Each nominal feature is given the value occurring in the majority of the k-nearest neighbors.

Materials
The data used in this study were the medical records of 299 heart failure patients, collected at the Faisalabad Institute of Cardiology and the Allied Hospital in Faisalabad (Punjab, Pakistan), with 13 attributes [17]. This study aimed to classify death events: survived patients (death event = 0) and dead patients (death event = 1). Before the classification, the imbalance between the classes of the death event variable is handled using SMOTE-NC.

Descriptive Statistics
In this study, we used 13 attributes: seven continuous and six nominal. The aim is to classify death events based on the attributes that influence them. Figure 1 shows the death event categories, survived and dead, from a total of 299 records: 203 patients survived and 96 patients died. The next stage is the division of the dataset into training and testing data with a proportion of 75%:25%. When there is class imbalance, the classification accuracy for the survived (majority) class will be high, while the accuracy for the dead (minority) class will tend to be low. The classifier will treat the minority class, in this case the dead status, as noise or outliers when forming the classification function. In addition, the dataset consists of a mixture of nominal and continuous attributes. Therefore, it is necessary to handle the imbalanced data using SMOTE-NC. The results after oversampling follow.

Figure 2. Training Data Balancing Results
Based on Figure 2, after balancing the data using SMOTE-NC, the distribution of the dead (minority) class is balanced with the survived (majority) class. SMOTE-NC extends the concept of SMOTE to handle nominal features by generating synthetic samples not only in the continuous feature space but also in the nominal feature space.
For nominal features, SMOTE-NC randomly selects a feature value from the nearest neighbors to create synthetic instances, preserving the distribution of the original nominal feature values. By generating synthetic samples for the minority class, SMOTE-NC helps balance the class distribution in the dataset.

SVM Analysis
After splitting the dataset, the next step is classification using SVM. The training data of 224 records are used to build the SVM model, while the testing data of 75 records are used for model evaluation. This study compares the classification results on the original data with those on the balanced data produced by SMOTE-NC.
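A 75%:25% split of 299 records yielding 224 training and 75 testing records can be sketched with scikit-learn as below; the placeholder features and the `random_state` are illustrative, and the study does not specify whether the split was stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the 299-patient dataset: 203 survived (0), 96 died (1).
y = np.array([0] * 203 + [1] * 96)
X = np.arange(299).reshape(-1, 1)  # placeholder features

# stratify=y keeps the dead/survived ratio similar in both partitions,
# so the minority class is represented in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(len(y_tr), len(y_te))  # 224 75
```

Note that only the training partition should be passed to SMOTE-NC; oversampling before the split would leak synthetic copies of test-set neighbors into training.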

a) SVM for Original Data
In SVM modeling, the C (cost) and gamma values are determined first. In this study, we tried 35 combinations of these parameters; the best C and gamma values are those with the lowest error rates. Figure 3 below compares the 35 combinations of C and gamma values. C, a hyperparameter in SVM, is a regularization parameter that controls the trade-off between a smooth decision boundary and correctly classifying training points. A low C emphasizes a more generalized decision boundary, allowing some misclassification but preventing overfitting. Conversely, a large C imposes a stricter requirement for correct classification, potentially leading to a decision boundary that fits the training data closely but generalizes poorly to new data.
On the other hand, gamma comes into play when employing the Gaussian Radial Basis Function (RBF) kernel in SVM. Setting gamma is a crucial step before training the model. This parameter determines the extent of curvature of the decision boundary. A smaller gamma results in a broader, more gradual decision boundary, while a larger gamma leads to a more tightly fitted decision boundary that can capture intricate patterns in the training data.
In essence, selecting appropriate values for C and gamma involves finding a balance that matches the complexity of the dataset. Table 3 shows the best parameters obtained from tuning: cost C = 10, gamma = 0.01, and the RBF kernel function. The RBF kernel allows non-linear decision boundaries to be represented in the original feature space by implicitly mapping the input data into a higher-dimensional space. This capacity is essential when working with intricate patterns that are difficult to represent in a lower-dimensional space. The model is then built on the training data using these parameters and used to make predictions on the testing data. Table 5 shows that 43 heart failure patients in the survived category were correctly predicted as survived, and 17 patients in the dead category were correctly predicted as dead. The classification model has an accuracy of 80.00%, meaning that the model classifies the dead and survived categories correctly 80.00% of the time. The sensitivity of 68.00% means that the model correctly predicts the dead (minority) class 68.00% of the time, and the specificity of 86.00% means that it correctly predicts the survived (majority) class 86.00% of the time. The precision of 70.83% is the proportion of patients predicted as dead who actually died. The F1 score, which combines precision and sensitivity, is 69.39%. In addition to the F1 score, the AUC can be used to assess the model's performance; in this study an AUC of 0.77 was obtained, meaning that the model's discrimination between the dead and survived categories can be considered quite good.
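The reported metrics are mutually consistent and can be reproduced from the confusion-matrix counts. In the check below, "dead" is the positive class; TP and TN follow from the correctly classified counts, while FP and FN are deduced from the stated precision (17/24) and sensitivity (17/25) rather than given directly in the text.

```python
# Confusion-matrix counts for the 75-patient test set ("dead" = positive class).
TP, TN, FP, FN = 17, 43, 7, 8  # FP and FN deduced from precision and sensitivity

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # recall on the dead (minority) class
specificity = TN / (TN + FP)   # recall on the survived (majority) class
precision = TP / (TP + FP)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.2%} sensitivity={sensitivity:.2%} "
      f"specificity={specificity:.2%} precision={precision:.2%} f1={f1:.2%}")
# accuracy=80.00% sensitivity=68.00% specificity=86.00% precision=70.83% f1=69.39%
```

The F1 score is the harmonic mean of precision and sensitivity, which is why it stays low whenever either quantity for the minority class is low, regardless of how well the majority class is classified.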

b) SVM and SMOTE-NC
In principle, SMOTE is an oversampling algorithm that creates synthetic data in the minority class based on the k-nearest neighbors. Since the dataset consists of nominal and continuous data, the synthetic data are obtained by calculating the closest distances to the observations to be duplicated, as in SMOTE-NC. The number of synthetic samples is chosen so that the ratio between the majority class and the minority class plus its synthetic samples approaches balance. The classification using SVM with SMOTE-NC then uses the same combination of C and gamma values as the previous method.
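The combined workflow, fitting an RBF SVM with the tuned parameters on the balanced training data, can be sketched as follows. The toy data here merely stand in for the SMOTE-NC output (in the study, the 224 training records after oversampling the dead class); only the parameter values C = 10 and gamma = 0.01 come from the study itself.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Toy stand-in for the SMOTE-NC-balanced training data: two equally sized
# classes (survived = 0, dead = 1 after oversampling), three features each.
n = 100
X_bal = np.vstack([
    rng.normal(0.0, 1.0, size=(n, 3)),   # class 0: survived
    rng.normal(2.0, 1.0, size=(n, 3)),   # class 1: dead (after oversampling)
])
y_bal = np.array([0] * n + [1] * n)

# RBF SVM with the parameter values selected by tuning in the study.
clf = SVC(kernel="rbf", C=10, gamma=0.01)
clf.fit(X_bal, y_bal)
print(clf.score(X_bal, y_bal))  # training accuracy on the balanced data
```

Because both classes now contribute equally many support-vector candidates, the fitted boundary is no longer pulled toward the former majority class, which is the mechanism behind the improved sensitivity and F1 score reported for SVM with SMOTE-NC.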

Figure 1. Total Death Events by Patient Status
Figure 1 shows that, of the total 299 heart failure patients, 203 patients survived and 96 patients died.
