IMPROVING ACCURACY OF PREDICTION INTERVALS OF HOUSEHOLD INCOME USING QUANTILE REGRESSION FOREST AND SELECTION OF EXPLANATORY VARIABLES
Abstract
Quantile regression forest (QRF) is a non-parametric method that estimates the conditional distribution function of a response using the random forest algorithm and constructs conditional-quantile prediction intervals. However, when the explanatory variables (covariates) are highly correlated, the performance of QRF deteriorates, yielding prediction intervals with low accuracy for the outcome variable. This paper investigates the selection of explanatory variables for QRF under several scenarios, namely the full model, forward selection, LASSO, ridge regression, and random forest, in order to improve the accuracy of household income prediction. The data were obtained from the 2021 National Labour Force Survey. The results indicate that random forest outperforms the other variable-selection methods in terms of RMSE. With respect to the criterion of an average coverage just above the 95% target and the statistical test results, the RF-QRF and Forward-QRF methods outperform QRF, LASSO-QRF, and Ridge-QRF for constructing prediction intervals.
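The RF-QRF pipeline summarized above (random-forest variable selection followed by QRF prediction intervals) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the top-4 feature cutoff is arbitrary, and the quantile step pools training responses from matching leaves across trees, a simplified approximation of Meinshausen's weighted QRF estimator.

```python
# Hypothetical sketch of an RF-QRF pipeline: random-forest importance for
# variable selection, then a simplified quantile regression forest that
# pools training responses from matching leaves to form 95% prediction
# intervals (an approximation of Meinshausen's weighted QRF estimator).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 600, 8
X = rng.normal(size=(n, p))                      # 8 covariates, only 2 informative
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=n)
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

# Step 1: variable selection via random-forest importance (the "RF" step).
selector = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
keep = np.argsort(selector.feature_importances_)[::-1][:4]   # keep top-4 covariates

# Step 2: fit a forest on the selected covariates only.
qrf = RandomForestRegressor(n_estimators=200, min_samples_leaf=5, random_state=0)
qrf.fit(X_tr[:, keep], y_tr)

# Step 3: conditional quantiles -- pool the training responses that share a
# leaf with each test point, across all trees, and take empirical quantiles.
leaf_tr = qrf.apply(X_tr[:, keep])               # (n_train, n_trees) leaf indices
leaf_te = qrf.apply(X_te[:, keep])
lo, hi = np.empty(len(X_te)), np.empty(len(X_te))
for i in range(len(X_te)):
    pooled = np.concatenate(
        [y_tr[leaf_tr[:, t] == leaf_te[i, t]] for t in range(qrf.n_estimators)]
    )
    lo[i], hi[i] = np.quantile(pooled, [0.025, 0.975])

# Empirical coverage of the 95% prediction intervals on the test set.
coverage = np.mean((y_te >= lo) & (y_te <= hi))
```

Coverage computed this way should sit near (often slightly above) the nominal 95% level, which mirrors the paper's criterion of an average coverage just above the 95% target.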
Copyright (c) 2023 Asrirawan Asrirawan, Khairil Anwar Notodiputro, Bagus Sartono
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.