IMPROVING ACCURACY OF PREDICTION INTERVALS OF HOUSEHOLD INCOME USING QUANTILE REGRESSION FOREST AND SELECTION OF EXPLANATORY VARIABLES
Abstract
Quantile regression forest (QRF) is a non-parametric method that estimates the conditional distribution function of a response using the random forest algorithm and constructs conditional-quantile prediction intervals. However, when the explanatory variables (covariates) are highly correlated, the performance of QRF deteriorates, yielding prediction intervals with low accuracy for the outcome variable. This paper investigates the selection of explanatory variables for QRF under several scenarios, namely the full model, forward selection, LASSO, ridge regression, and random forest, in order to improve the accuracy of household income prediction. The data were obtained from the 2021 National Labour Force Survey. The results indicate that random forest outperforms the other variable-selection methods in terms of RMSE. With respect to the criterion of an average coverage just above the 95% target and the statistical test results, the RF-QRF and Forward-QRF methods outperform QRF, LASSO-QRF, and Ridge-QRF for constructing prediction intervals.
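The RF-QRF pipeline summarized above (random-forest variable selection followed by QRF prediction intervals) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the top-4 feature cutoff is arbitrary, and the quantile step pools training responses from matching leaves across trees, a simplified approximation of Meinshausen's weighted QRF estimator.

```python
# Hypothetical sketch of an RF-QRF pipeline: random-forest importance for
# variable selection, then a simplified quantile regression forest that
# pools training responses from matching leaves to form 95% prediction
# intervals (an approximation of Meinshausen's weighted QRF estimator).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 600, 8
X = rng.normal(size=(n, p))                      # 8 covariates, only 2 informative
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=n)
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

# Step 1: variable selection via random-forest importance (the "RF" step).
selector = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
keep = np.argsort(selector.feature_importances_)[::-1][:4]   # keep top-4 covariates

# Step 2: fit a forest on the selected covariates only.
qrf = RandomForestRegressor(n_estimators=200, min_samples_leaf=5, random_state=0)
qrf.fit(X_tr[:, keep], y_tr)

# Step 3: conditional quantiles -- pool the training responses that share a
# leaf with each test point, across all trees, and take empirical quantiles.
leaf_tr = qrf.apply(X_tr[:, keep])               # (n_train, n_trees) leaf indices
leaf_te = qrf.apply(X_te[:, keep])
lo, hi = np.empty(len(X_te)), np.empty(len(X_te))
for i in range(len(X_te)):
    pooled = np.concatenate(
        [y_tr[leaf_tr[:, t] == leaf_te[i, t]] for t in range(qrf.n_estimators)]
    )
    lo[i], hi[i] = np.quantile(pooled, [0.025, 0.975])

# Empirical coverage of the 95% prediction intervals on the test set.
coverage = np.mean((y_te >= lo) & (y_te <= hi))
```

Coverage computed this way should sit near (often slightly above) the nominal 95% level, which mirrors the paper's criterion of an average coverage just above the 95% target.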
Copyright (c) 2023 Asrirawan Asrirawan, Khairil Anwar Notodiputro, Bagus Sartono
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.