COMPARATIVE STUDY OF LIGHTGBM, CATBOOST, AND RANDOM FOREST IN MODELING PUBLIC COMPLAINTS CLASSIFICATION

  • Oktaviyani Daswati Department of Statistics and Data Science, School of Data Science, Mathematics, and Informatics, IPB University, Indonesia https://orcid.org/0009-0002-9780-1279
  • Hari Wijayanto Department of Statistics and Data Science, School of Data Science, Mathematics, and Informatics, IPB University, Indonesia https://orcid.org/0000-0002-7507-2602
  • Farit Mochamad Afendi Department of Statistics and Data Science, School of Data Science, Mathematics, and Informatics, IPB University, Indonesia https://orcid.org/0009-0006-9172-9455
Keywords: CatBoost, Classification, LightGBM, Public complaints, Random Forest

Abstract

Public complaint data on maladministration in Indonesia contain high-cardinality categorical variables and imbalanced category distributions, posing significant challenges for conventional machine learning algorithms. To address this, this study evaluates and compares the performance of three widely used classification algorithms (LightGBM, CatBoost, and Random Forest) on actual public complaint data that had not previously been analysed with machine learning methods. Hyperparameter tuning was applied to obtain optimal configurations and ensure robust performance. The analysis was based on 30 repeated simulations, with accuracy and sensitivity as the primary metrics. ANOVA followed by Tukey's HSD test was used to determine whether model performance differed at the 95% confidence level. The results show that LightGBM performed best, with an accuracy of 74.50% and a sensitivity of 76.70%, followed by CatBoost with an accuracy of 74.12% and a sensitivity of 75.54%, while Random Forest lagged far behind. The statistical tests confirmed significant performance differences among the three models. This study is not without limitations: only three classification algorithms were evaluated, encoding strategies were not systematically compared, and the hyperparameter search space was restricted, so broader model exploration may yield better performance. Nonetheless, the study is original and valuable as the first empirical application of machine learning to Indonesian public complaint data on maladministration, demonstrating how algorithm selection directly affects predictive outcomes when handling complex categorical structures. The findings offer practical insights for government agencies, highlighting how data-driven models can support policy design, strengthen transparency, and improve the quality of public services.
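The statistical comparison described in the abstract (per-model accuracy scores from 30 repeated simulations, tested with one-way ANOVA and then Tukey's HSD at the 95% confidence level) can be sketched as follows. The accuracy scores here are illustrative values generated to loosely match the means reported in the abstract; they are not the study's actual results, and the Random Forest mean is an assumed placeholder.

```python
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

rng = np.random.default_rng(42)

# Hypothetical accuracy scores over 30 repeated simulations per model.
# Means for LightGBM and CatBoost follow the abstract; the Random Forest
# mean and all standard deviations are illustrative assumptions.
scores = {
    "LightGBM":     rng.normal(0.7450, 0.005, 30),
    "CatBoost":     rng.normal(0.7412, 0.005, 30),
    "RandomForest": rng.normal(0.7000, 0.005, 30),
}

# One-way ANOVA: do the mean accuracies of the three models differ?
f_stat, p_anova = f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3g}")

# Tukey HSD post-hoc test: which pairs of models differ at the 95% level?
res = tukey_hsd(*scores.values())
names = list(scores)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        verdict = "significant" if res.pvalue[i, j] < 0.05 else "not significant"
        print(f"{names[i]} vs {names[j]}: p = {res.pvalue[i, j]:.3g} ({verdict})")
```

With clearly separated group means, as in the illustration above, ANOVA rejects equality of means and Tukey's HSD identifies which pairwise differences drive that result, which mirrors the two-stage testing procedure used in the study.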

References

S. C. and S. R. Balasundaram, “DATA ANALYSIS IN CONTEXT-BASED STATISTICAL MODELING IN PREDICTIVE ANALYTICS,” pp. 96–114, 2021, doi: https://doi.org/10.4018/978-1-7998-3053-5.ch006

X. Wang, X. Y. Lou, S. Y. Hu, and S. C. He, “EVALUATION OF SAFE DRIVING BEHAVIOR OF TRANSPORT VEHICLES BASED ON K-SVM-XGBOOST,” in 2020 3rd International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), IEEE, pp. 84–92, Apr. 2020, doi: https://doi.org/10.1109/AEMCSE50948.2020.00026

G. Ke et al., “LIGHTGBM: A HIGHLY EFFICIENT GRADIENT BOOSTING DECISION TREE,” Adv Neural Inf Process Syst, vol. 30, pp. 3146–3154, 2017.

L. Prokhorenkova, G. Gusev, A. Vorobev, A. Dorogush, and A. Gulin, “CATBOOST: UNBIASED BOOSTING WITH CATEGORICAL FEATURES,” Adv Neural Inf Process Syst, vol. 31, pp. 6638–6648, 2018.

S. M. Intani, B. I. Nasution, M. E. Aminanto, Y. Nugraha, N. Muchtar, and J. I. Kanggrawan, “AUTOMATING PUBLIC COMPLAINT CLASSIFICATION THROUGH JAKLAPOR CHANNEL: A CASE STUDY OF JAKARTA, INDONESIA,” in 2022 IEEE International Smart Cities Conference (ISC2), IEEE, pp. 1–6, Sep. 2022, doi: https://doi.org/10.1109/ISC255366.2022.9922346

E. D. Madyatmadja, C. P. M. Sianipar, C. Wijaya, and D. J. M. Sembiring, “CLASSIFYING CROWDSOURCED CITIZEN COMPLAINTS THROUGH DATA MINING: ACCURACY TESTING OF K-NEAREST NEIGHBORS, RANDOM FOREST, SUPPORT VECTOR MACHINE, AND ADABOOST,” Informatics, vol. 10, no. 4, p. 84, Nov. 2023, doi: https://doi.org/10.3390/informatics10040084

W. Liang, S. Luo, G. Zhao, and H. Wu, “PREDICTING HARD ROCK PILLAR STABILITY USING GBDT, XGBOOST, AND LIGHTGBM ALGORITHMS,” Mathematics, vol. 8, no. 5, p. 765, May 2020, doi: https://doi.org/10.3390/math8050765

J. T. Hancock and T. M. Khoshgoftaar, “CATBOOST FOR BIG DATA: AN INTERDISCIPLINARY REVIEW,” J Big Data, vol. 7, no. 1, p. 94, Dec. 2020, doi: https://doi.org/10.1186/s40537-020-00369-8

D. Setiawan, H. Wijayanto, and L. O. A. Rahman, “BAGGING AND RANDOM FOREST CLASSIFICATION METHODS FOR UNBALANCED DATA SCHOOL DROPOUT CASES IN LAMPUNG PROVINCE,” p. 020026, 2022, doi: https://doi.org/10.1063/5.0109130

A. Pratiwi, K. A. Notodiputro, and H. Wijayanto, “PEMODELAN LOYALITAS KONSUMEN SUSU PERTUMBUHAN DALAM MENGIKUTI PROGRAM REWARDS MENGGUNAKAN METODE RANDOM FOREST DAN NEURAL NETWORK” [Modeling growing-up milk consumer loyalty in a rewards program using Random Forest and Neural Network methods], Xplore: Journal of Statistics, vol. 2, no. 2, pp. 41–48, Aug. 2018, doi: https://doi.org/10.29244/xplore.v2i2.104

F. Izzati, M. Masjkur, and F. M. Afendi, “COMPARISON OF CHI-SQUARE AUTOMATIC INTERACTION DETECTOR (CHAID) AND RANDOM FOREST METHODS IN THE CLASSIFICATION OF HOUSEHOLD POVERTY STATUS IN CENTRAL JAVA,” Indonesian Journal of Statistics and Its Applications, vol. 8, no. 1, pp. 1–13, Jun. 2024, doi: https://doi.org/10.29244/ijsa.v8i1p1-13

T.-H. Lee, A. Ullah, and R. Wang, “BOOTSTRAP AGGREGATING AND RANDOM FOREST,” 2020, pp. 389–429, doi: https://doi.org/10.1007/978-3-030-31150-6_13

C. Bentéjac, A. Csörgő, and G. Martínez-Muñoz, “A COMPARATIVE ANALYSIS OF GRADIENT BOOSTING ALGORITHMS,” Artif Intell Rev, vol. 54, no. 3, pp. 1937–1967, Mar. 2021, doi: https://doi.org/10.1007/s10462-020-09896-5

D. Zhang and Y. Gong, “THE COMPARISON OF LIGHTGBM AND XGBOOST COUPLING FACTOR ANALYSIS AND PREDIAGNOSIS OF ACUTE LIVER FAILURE,” IEEE Access, vol. 8, pp. 220990–221003, 2020, doi: https://doi.org/10.1109/ACCESS.2020.3042848

A. V. Dorogush, V. Ershov, and A. Gulin, “CATBOOST: GRADIENT BOOSTING WITH CATEGORICAL FEATURES SUPPORT,” ArXiv, vol. abs/1810.11363, 2018.

H. A. Salman, A. Kalakech, and A. Steiti, “RANDOM FOREST ALGORITHM OVERVIEW,” Babylonian Journal of Machine Learning, vol. 2024, pp. 69–79, Jun. 2024, doi: https://doi.org/10.58496/BJML/2024/007

M. Heydarian, T. E. Doyle, and R. Samavi, “MLCM: MULTI-LABEL CONFUSION MATRIX,” IEEE Access, vol. 10, pp. 19083–19095, 2022, doi: https://doi.org/10.1109/ACCESS.2022.3151048

T. Zhu, “ANALYSIS ON THE APPLICABILITY OF THE RANDOM FOREST,” J Phys Conf Ser, vol. 1607, no. 1, p. 012123, Aug. 2020, doi: https://doi.org/10.1088/1742-6596/1607/1/012123

M. N. Wright and I. R. König, “SPLITTING ON CATEGORICAL PREDICTORS IN RANDOM FORESTS,” PeerJ, vol. 7, p. e6339, Feb. 2019, doi: https://doi.org/10.7717/peerj.6339

G. Biau, B. Cadre, and L. Rouvière, “ACCELERATED GRADIENT BOOSTING,” Mach Learn, vol. 108, no. 6, pp. 971–992, Jun. 2019, doi: https://doi.org/10.1007/s10994-019-05787-1

Published
2026-04-08
How to Cite
[1]
O. Daswati, H. Wijayanto, and F. M. Afendi, “COMPARATIVE STUDY OF LIGHTGBM, CATBOOST, AND RANDOM FOREST IN MODELING PUBLIC COMPLAINTS CLASSIFICATION”, BAREKENG: J. Math. & App., vol. 20, no. 3, pp. 2535-2548, Apr. 2026.