THE EFFECT OF SAMPLE SIZE ON THE STABILITY OF XGBOOST MODEL PERFORMANCE IN PREDICTING STUDENT STUDY PERIOD

  • Muhammad Lintang Damar Sakti Department of Mathematics Education, Faculty of Science and Mathematics, Universitas Negeri Yogyakarta, Indonesia https://orcid.org/0009-0005-1226-581X
  • Jailani Jailani Department of Mathematics Education, Faculty of Science and Mathematics, Universitas Negeri Yogyakarta, Indonesia https://orcid.org/0000-0002-8505-2074
  • Heri Retnawati Department of Mathematics Education, Faculty of Science and Mathematics, Universitas Negeri Yogyakarta, Indonesia https://orcid.org/0000-0002-1792-5873
  • Kana Hidayati Department of Mathematics Education, Faculty of Science and Mathematics, Universitas Negeri Yogyakarta, Indonesia https://orcid.org/0000-0002-9226-8500
  • Nur Hadi Waryanto Department of Mathematics Education, Faculty of Science and Mathematics, Universitas Negeri Yogyakarta, Indonesia https://orcid.org/0000-0002-0632-313X
  • Zulfa Safina Ibrahim Department of Mathematics Education, Faculty of Science and Mathematics, Universitas Negeri Yogyakarta, Indonesia https://orcid.org/0009-0004-0539-4771
  • Asma’ Khoirunnisa Department of Mathematics Education, Faculty of Science and Mathematics, Universitas Negeri Yogyakarta, Indonesia https://orcid.org/0000-0002-7609-9735
  • Firdaus Amruzain Satiranandi Wibowo Department of Mathematics Education, Faculty of Science and Mathematics, Universitas Negeri Yogyakarta, Indonesia https://orcid.org/0000-0002-4143-2622
  • Miftah Okta Berlian Department of Mathematics Education, Faculty of Science and Mathematics, Universitas Negeri Yogyakarta, Indonesia https://orcid.org/0009-0002-2399-6528
  • Angella Ananta Batubara Department of Mathematics Education, Faculty of Science and Mathematics, Universitas Negeri Yogyakarta, Indonesia https://orcid.org/0009-0003-5530-9830
Keywords: Bootstrap, Sample Size, Stability, Study Period, XGBoost

Abstract

Student success can be defined by the length of study required to graduate from college. Machine learning can be used to predict student success from the factors thought to influence it, but achieving optimal model performance requires attention to sample size. This study aims to determine the effect of student sample size on the stability of model performance in predicting student success. The research is quantitative. The data consist of records of 19,061 students at a university in Yogyakarta from 2014 to 2019. The target variable is the student study period in months, while the predictor variables are the college entrance pathway, GPA from semester 1 to semester 6, and family socioeconomic conditions based on the father’s and mother’s income. The study uses an XGBoost model with the best hyperparameters combined with a bootstrap approach. Bootstrapping was performed on the original data at twenty different sample sizes: 250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3250, 3500, 3750, 4000, 4250, 4500, 4750, and 5000, with ten bootstrap replications at each size. Model performance was evaluated using the Root Mean Square Error (RMSE). The best XGBoost model, obtained with a training-testing data split of 90% and 10%, achieved the smallest RMSE of 8.318, using the best hyperparameters n_estimators = 75, max_depth = 8, min_child_weight = 5, eta = 0.07, gamma = 0.2, subsample = 0.8, and colsample_bylevel = 1. With these optimal hyperparameters, the XGBoost model demonstrates peak performance stability at a sample size of 1750 students, as evidenced by consistent RMSE values across the ten bootstrap replications, indicating that this data quantity provides the best balance between prediction accuracy and stability for estimating study duration.
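To make the bootstrap stability procedure concrete, the sketch below shows one way it could be implemented in Python with the scikit-learn wrapper of XGBoost, using the hyperparameters reported above (eta corresponds to learning_rate in the wrapper). The file name, column names, and the way the 90%/10% split is applied within each bootstrap sample are illustrative assumptions, not the authors’ actual code.

```python
# Minimal sketch of the bootstrap stability check described in the abstract.
# Assumption: the data sit in "students.csv" with a "study_period_months" target
# column; all file and column names here are hypothetical placeholders.
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("students.csv")                 # hypothetical file name
X = df.drop(columns=["study_period_months"])
y = df["study_period_months"]

# Reported best hyperparameters (eta == learning_rate in the sklearn wrapper)
params = dict(
    n_estimators=75, max_depth=8, min_child_weight=5,
    learning_rate=0.07, gamma=0.2, subsample=0.8, colsample_bylevel=1,
)

sample_sizes = range(250, 5001, 250)             # 250, 500, ..., 5000
n_replications = 10
rng = np.random.default_rng(42)

results = {}
for n in sample_sizes:
    rmses = []
    for _ in range(n_replications):
        # Bootstrap: draw n rows with replacement from the original data
        idx = rng.choice(len(df), size=n, replace=True)
        Xb, yb = X.iloc[idx], y.iloc[idx]
        # 90% / 10% train-test split of the bootstrap sample (assumed scheme)
        cut = int(0.9 * n)
        model = XGBRegressor(**params)
        model.fit(Xb.iloc[:cut], yb.iloc[:cut])
        pred = model.predict(Xb.iloc[cut:])
        rmses.append(mean_squared_error(yb.iloc[cut:], pred) ** 0.5)
    # Stability is read off the spread of RMSE across replications
    results[n] = (np.mean(rmses), np.std(rmses))

for n, (mean_rmse, sd_rmse) in results.items():
    print(f"n={n}: mean RMSE={mean_rmse:.3f}, SD={sd_rmse:.3f}")
```

Under this reading, the sample size whose ten replications yield the smallest spread of RMSE values (here, 1750 students) is taken as the point of peak performance stability.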

 



Published
2025-09-01
How to Cite
[1]
M. L. Damar Sakti, “THE EFFECT OF SAMPLE SIZE ON THE STABILITY OF XGBOOST MODEL PERFORMANCE IN PREDICTING STUDENT STUDY PERIOD”, BAREKENG: J. Math. & App., vol. 19, no. 4, pp. 2679-2692, Sep. 2025.