PERFORMANCE ANALYSIS OF MODIFIED-ODBOT AND SMOTE FOR TREE-BASED CLASSIFICATION OF IMBALANCED HUMAN DEVELOPMENT INDEX DATA
Abstract
Classification of Human Development Index (HDI) data presents significant challenges due to severe class imbalance: low-development regions are substantially underrepresented. This imbalance degrades classification performance because machine learning models tend to be biased toward the majority classes, making accurate identification of minority classes difficult. This study proposes a modified ODBOT that replaces Euclidean distance with Mahalanobis distance within the oversampling mechanism (Mahalanobis-based ODBOT) and compares its performance with Euclidean-based ODBOT, with and without Principal Component Analysis (PCA), as well as with the conventional SMOTE technique. Four tree-based classifiers were used: Random Forest, Double Random Forest, XGBoost, and LightGBM. The HDI dataset from the Central Statistics Agency (Badan Pusat Statistik), consisting of 514 observations and four features with an imbalance ratio (IR) of 19.0, was divided into training and testing sets (80:20 ratio) with 30 repetitions and evaluated using F1-Measure (F1-M), Geometric Mean (G-M), Area Under the Curve (AUC), and computation time. The results show that Mahalanobis-based ODBOT achieved the highest AUC across all classification models and the highest G-M in three of the four models, but required significantly longer computation time (2545.66 seconds). In contrast, Euclidean-based ODBOT with PCA improved F1-M while reducing computation time (7.21 seconds) compared to the original ODBOT (68.23 seconds), and SMOTE consistently improved G-M and AUC across all experiments. These findings suggest that oversampling techniques should be selected according to practical application needs: Mahalanobis-based ODBOT is recommended when improving prediction performance is the priority, whereas Euclidean-based ODBOT with PCA or SMOTE is preferable for real-world implementations that require faster execution and lower computational cost.
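The core modification described above, ranking oversampling neighbours by Mahalanobis rather than Euclidean distance, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function `mahalanobis_oversample` and its parameters are hypothetical, and a SMOTE-style linear interpolation between a minority point and one of its neighbours is assumed.

```python
import numpy as np

def mahalanobis_oversample(X_min, n_new, k=5, seed=0):
    """Illustrative sketch: generate n_new synthetic minority samples by
    interpolating between each minority point and one of its k nearest
    neighbours, with neighbours ranked by Mahalanobis distance (which
    accounts for feature covariance) instead of Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Inverse covariance of the minority class (pseudo-inverse for stability).
    VI = np.linalg.pinv(np.cov(X_min, rowvar=False))
    diffs = X_min[:, None, :] - X_min[None, :, :]       # pairwise differences
    d2 = np.einsum('ijk,kl,ijl->ij', diffs, VI, diffs)  # squared Mahalanobis
    np.fill_diagonal(d2, np.inf)                        # exclude self-matches
    nn = np.argsort(d2, axis=1)[:, :k]                  # k nearest neighbours
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = nn[i, rng.integers(k)]
        lam = rng.random()                              # interpolation weight
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synth)
```

Because each synthetic point is a convex combination of two real minority points, the generated samples stay inside the minority region, while the covariance-aware distance favours neighbours that lie along the class's natural correlation structure.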
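The evaluation metrics used in the study can be computed from a multiclass confusion matrix. A minimal sketch follows; the helper name `gmean_f1` is our own, and it assumes G-M is taken as the geometric mean of per-class recalls and F1-M as the macro-averaged F1, with rows of `cm` holding the true classes.

```python
import numpy as np

def gmean_f1(cm):
    """Sketch: multiclass G-Mean (geometric mean of per-class recalls)
    and macro F1-Measure from a confusion matrix (rows = true classes)."""
    cm = np.asarray(cm, dtype=float)
    recall = np.diag(cm) / cm.sum(axis=1)
    precision = np.diag(cm) / np.maximum(cm.sum(axis=0), 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return recall.prod() ** (1 / len(recall)), f1.mean()

# Hypothetical two-class example: 8/10 and 9/10 correct.
g, f = gmean_f1([[8, 2], [1, 9]])
```

Unlike plain accuracy, G-M collapses to zero if any single class is never recognised, which is why it (together with AUC) is preferred for imbalanced problems such as this HDI classification task.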
Copyright (c) 2026 Yunna Mentari Indah, Anwar Fitrianto, Indahwati Indahwati

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.






