IMPROVING ACCURACY OF PREDICTION INTERVALS OF HOUSEHOLD INCOME USING QUANTILE REGRESSION FOREST AND SELECTION OF EXPLANATORY VARIABLES

ABSTRACT


INTRODUCTION
In recent years, ensemble-based prediction algorithms have been widely developed to analyze high-dimensional data. The introduction of quantile regression by Koenker [1] and Koenker and Hallock [2] enables a flexible investigation of covariate effects on the tails of the conditional distribution, which mean regression cannot capture. Furthermore, when the normality assumption on the outcome distribution is not met, estimating the variance and constructing prediction intervals for mean regression predictions becomes challenging, which calls for a more flexible quantile-based approach to the distribution [3], [4]. However, if the true conditional quantile function (TCQ) is not a linear combination of the covariates, that is, if the relationship between the predictor variables and the TCQ is non-linear, quantile regression predictions become less reliable [5], [6]. The Quantile Regression Forest (QRF) model was developed to address this shortcoming. Meinshausen [7] introduced the QRF model, a non-parametric ensemble approach based on decision trees. Using the random forest framework, this method estimates the conditional distribution function and provides prediction intervals for the outcome variable. The random forest algorithm builds this distribution function by combining a large number of decision trees. Furthermore, because it employs decision trees as its underlying model, random forest can capture non-linear relationships between the predictor variables and the response.
The estimation of quantiles in quantile regression forests depends on the predictor variables. However, when predictor variables are highly correlated, the quantile estimates may become uncertain. This is because the quantile regression forest model allows interactions between predictor variables [7], [8]. Consequently, the accuracy of the prediction intervals decreases. To address this, variable selection techniques can be used [9]-[11]. These approaches remove uninformative (irrelevant) or highly correlated predictor variables, improving model interpretation and making quantile estimation more accurate, as demonstrated by Meinshausen [7] and Youngjae and Chang [12]. Meinshausen [7] examined the quality of predictions and prediction intervals by simulating different predictor variables. A loss function was used to assess prediction accuracy, and irrelevant variables were found to increase the loss for each conditional quantile, implying that prediction accuracy diminishes. Also, Nguyen et al. [13] demonstrated that, with high-dimensional data, the performance of random forest may decrease due to increasingly complex interactions between predictor variables, making quantile regression forest predictions less accurate. The results may also be biased by the decision tree's random selection of variables, as uninformative variables are more likely to be chosen [14].
[19]) to improve the accuracy of prediction intervals for household income quantiles in Bogor. The variables chosen in the previous step will be used in the quantile regression forest (QRF) models, giving the QRF, RF-QRF, Forward-QRF, LASSO-QRF, and Ridge-QRF models. The average RMSE and coverage are reported to compare the models at different quantiles. Preliminary checks, such as normality and outlier detection, are presented to verify that the household income data are appropriate for modeling with quantile regression forests. They also determine whether the predictor variables are related to one another. The choice of covariates is based on previous research. According to Pramika [20] and Putri [21], education, age, occupation, and family size all influence household earnings. Participation in training or courses, working hours, job search length, internet access availability, health insurance (BPJS), and the pre-work (PRAKERJA) card are the other predictor variables in the current research. National Labour Force Survey data were used.

Data Description
The outcome variable in this study, total household income in Bogor Regency, West Java, was obtained from the 2021 National Labor Force Survey [22]. West Java had 2,985 household heads, or roughly 5.13% of total respondents. Only 1,565 households (52.4%) were employed during the previous week, while the others were unemployed. However, only 1,498 working households were selected for quantile regression forest modeling. The predictor variables were family size, age, highest education level, training type (course/training), employment status, duration of job search, working hours, availability of internet services, availability of health insurance (BPJS), and ownership of the PRAKERJA card. Table 1 contains the details.

Random Forest
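Random forest generates hundreds or even thousands of decision trees that act as independent regression functions, and the final output of RF regression is the average of all decision tree outputs. RF is an extension of Classification and Regression Trees (CART), introduced by Breiman et al. [23]. Given $X \in \mathbb{R}^m$ as an input vector with $m$ features, $Y \in \mathbb{R}$ as a scalar output, and $S_n$ as training data with a total of $n$ observations, the training set can be represented as in Equation (1).

$$S_n = \{(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\} \tag{1}$$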

The RF algorithm splits the input data at each node, optimizing the parameters of the splitting function to fit the set $S_n$. The decision tree must first determine the optimal split among all variables. The splitting process starts at the root and proceeds to each node, which applies the splitting function to the new input $X$. This process continues until a terminal (leaf) node is reached.
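Typically, tree growth stops when the maximum number of levels is reached or when a node reaches a specified number of observations [24], [25]. Let $T(\theta)$ be a built tree and $\theta$ the random parameter vector that determines it. We also have a weight vector $w_i(x, \theta)$, which is a positive constant if observation $X_i$ falls in the same tree leaf $\ell(x, \theta)$ as $x$, and 0 otherwise. Equation (2) may be used to calculate the weight $w_i(x, \theta)$ of a single tree, where $R_{\ell(x, \theta)}$ is the region of the predictor space covered by leaf $\ell(x, \theta)$, and Equation (3) averages these weights over the trees.

$$w_i(x, \theta) = \frac{\mathbf{1}\{X_i \in R_{\ell(x, \theta)}\}}{\#\{j : X_j \in R_{\ell(x, \theta)}\}} \tag{2}$$

$$w_i(x) = \frac{1}{k} \sum_{t=1}^{k} w_i(x, \theta_t) \tag{3}$$

The number of trees is denoted by $k$. Thus, the prediction of $Y$ can be written as Equation (4).

$$\hat{\mu}(x) = \sum_{i=1}^{n} w_i(x)\, Y_i \tag{4}$$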
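As a minimal sketch of Equations (2) to (4), the weights can be computed directly from leaf memberships. The leaf ids and responses below are made-up toy values, and the function names are hypothetical; a real forest would supply the leaf each observation falls into.

```python
from collections import Counter


def tree_weights(train_leaves, x_leaf):
    """Equation (2): weights of one tree for a query point x.

    train_leaves[i] is the leaf id of training observation i in this tree;
    x_leaf is the leaf id that x falls into.
    """
    leaf_size = Counter(train_leaves)[x_leaf]
    if leaf_size == 0:
        raise ValueError("the leaf reached by x holds no training observation")
    return [1.0 / leaf_size if leaf == x_leaf else 0.0 for leaf in train_leaves]


def forest_weights(train_leaves_per_tree, x_leaf_per_tree):
    """Equation (3): average the per-tree weights over all k trees."""
    k = len(train_leaves_per_tree)
    n = len(train_leaves_per_tree[0])
    weights = [0.0] * n
    for train_leaves, x_leaf in zip(train_leaves_per_tree, x_leaf_per_tree):
        for i, w in enumerate(tree_weights(train_leaves, x_leaf)):
            weights[i] += w / k
    return weights


def forest_mean(weights, y):
    """Equation (4): weighted average of the training responses."""
    return sum(w * yi for w, yi in zip(weights, y))


# Toy example (assumed values): 5 training observations, k = 2 trees.
train_leaves_per_tree = [[0, 0, 1, 1, 1], [0, 1, 1, 2, 2]]
x_leaf_per_tree = [1, 1]  # leaf reached by x in each tree
y = [10.0, 20.0, 30.0, 40.0, 50.0]

w = forest_weights(train_leaves_per_tree, x_leaf_per_tree)
prediction = forest_mean(w, y)
```

In this toy case the first tree averages the responses 30, 40, and 50, the second averages 20 and 30, and the weighted sum in Equation (4) reproduces the mean of the two tree outputs.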

Quantile Regression Forest
Quantile regression forest is a generalization of random forest that is robust, non-linear, and non-parametric in estimating conditional quantiles [3]. Consider the $\tau$-th quantile of $Y$ with $0 < \tau < 1$. As in Equation (5), the conditional distribution function $F(y \mid X = x)$ for $X = x$ is the probability that $Y$ is less than or equal to $y \in \mathbb{R}$.

$$F(y \mid X = x) = P(Y \le y \mid X = x) \tag{5}$$

Quantiles are constructed using this distribution function. In general, the QRF equation can be represented as in Equation (6).

$$q_\tau(x) = \inf\{y : F(y \mid X = x) \ge \tau\} \tag{6}$$

The weighted distribution of the response variable is used to estimate the conditional distribution function, as shown in Equation (7).

$$\hat{F}(y \mid X = x) = \sum_{i=1}^{n} w_i(x)\, \mathbf{1}\{Y_i \le y\} \tag{7}$$
The weight $w_i(x)$ is given in Equation (3). In addition, quantile regression forests may deliver results that are more robust to outliers than other regression approaches [3], [26]. This is because QRF reports the median or other quantiles of the weighted response distribution rather than its mean, and quantiles are less affected by extreme values. However, outliers can still affect QRF in some circumstances. Outliers that lie too far from the majority of the data points might interfere with the construction of tree nodes in QRF, resulting in inaccurate predictions. In general, the QRF algorithm proceeds as follows:
1. As in random forests, grow $k$ trees $T(\theta_t)$, $t = 1, \dots, k$. For each leaf of each tree, record all observations in the leaf, not just their average.
2. For a given $X = x$, drop $x$ down all trees.
3. For each tree, compute the weight $w_i(x, \theta_t)$ of observation $i \in \{1, \dots, n\}$ as in (2).
4. Calculate the weight $w_i(x)$ for every observation $i \in \{1, \dots, n\}$ as an average over $w_i(x, \theta_t)$, $t = 1, \dots, k$, using (3).
5. Using the weights from Step 4, compute the distribution function estimate as in (7) for every $y \in \mathbb{R}$.
6. Estimates $\hat{q}_\tau(x)$ of the conditional quantiles $q_\tau(x)$ are obtained by plugging $\hat{F}(y \mid X = x)$ instead of $F(y \mid X = x)$ into (6).
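The last two steps of the algorithm can be sketched as follows, assuming the forest weights are already available. The weights and responses are hypothetical toy values and the function names are illustrative only.

```python
def conditional_cdf(y_train, weights, y):
    """Weighted empirical distribution function, as in Equation (7)."""
    return sum(w for yi, w in zip(y_train, weights) if yi <= y)


def conditional_quantile(y_train, weights, tau):
    """Smallest observed response whose weighted CDF reaches tau.

    This plugs the estimate (7) into the quantile definition (6).
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    pairs = sorted(zip(y_train, weights))
    cumulative = 0.0
    for yi, w in pairs:
        cumulative += w
        if cumulative >= tau - 1e-12:  # small tolerance for float sums
            return yi
    return pairs[-1][0]


# Toy example (assumed values): responses and forest weights for one x.
y_train = [10.0, 20.0, 30.0, 40.0, 50.0]
weights = [0.0, 0.25, 0.40, 0.20, 0.15]

median = conditional_quantile(y_train, weights, 0.5)
upper = conditional_quantile(y_train, weights, 0.9)
```

Because the estimated distribution function is a step function that jumps only at observed responses, the infimum in the quantile definition is always attained at one of the training responses, which is why a single pass over the sorted responses is enough.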

Prediction Intervals
Prediction intervals are constructed from the conditional quantiles of the household income response predicted by QRF. A prediction interval gives a range of values for an actual observation at a chosen level of confidence. In particular, Equation (8) constructs the $(1 - \alpha)$ prediction interval $I(x)$ from a lower and an upper conditional quantile.

$$I(x) = \left[\hat{q}_{\alpha/2}(x),\; \hat{q}_{1-\alpha/2}(x)\right] \tag{8}$$

For example, the 95% prediction interval for the response $Y$ is calculated by Equation (9).

$$I(x) = \left[\hat{q}_{0.025}(x),\; \hat{q}_{0.975}(x)\right] \tag{9}$$

This means that, for a given value $x$, the household income is likely to fall inside the interval. The length of the prediction interval varies with $x$. The coverage value is used to compare the reliability of the prediction intervals for the household income response. The coverage value is the percentage of sample points that fall inside the prediction interval.
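The coverage value described above can be illustrated with a short sketch. The interval bounds and observations are hypothetical numbers in arbitrary units, not results from this study, and the helper names are assumptions.

```python
def coverage(y_true, lower, upper):
    """Share of observations that fall inside their prediction interval."""
    if not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("inputs must have the same length")
    inside = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return inside / len(y_true)


def mean_interval_width(lower, upper):
    """Average length of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Toy example (assumed values): four test observations with their intervals.
y_true = [3.0, 5.5, 7.0, 12.0]
lower = [2.0, 4.0, 6.0, 8.0]
upper = [4.0, 6.0, 9.0, 11.0]

cov = coverage(y_true, lower, upper)
width = mean_interval_width(lower, upper)
```

Reporting the average width next to the coverage helps when comparing methods, since a method can reach high coverage simply by producing very wide intervals.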

Evaluation Metrics
The root mean square error (RMSE) is used to evaluate how close the QRF predictions are to the observed values. RMSE is comparable to the mean absolute error (MAE), except that it gives more weight to larger errors than MAE does [27], [28]. A large gap between MAE and RMSE indicates that the individual errors vary considerably in size. RMSE is defined in Equation (10), where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{10}$$
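A small sketch can make the relationship between RMSE and MAE concrete. The two sets of predictions are invented so that both have the same MAE but differently spread errors.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error, as in Equation (10)."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Toy example (assumed values).
y_true = [2.0, 4.0, 6.0, 8.0]
pred_even = [3.0, 5.0, 7.0, 9.0]     # every error equals 1
pred_uneven = [2.0, 4.0, 6.0, 12.0]  # one error of 4, the rest 0

even_scores = (rmse(y_true, pred_even), mae(y_true, pred_even))
uneven_scores = (rmse(y_true, pred_uneven), mae(y_true, pred_uneven))
```

Both prediction sets have an MAE of 1, but the RMSE rises from 1 to 2 when the same total error is concentrated in a single observation, which is the sensitivity to large errors noted above.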
We also use the coverage level to measure the accuracy of the prediction intervals. The coverage probability for such intervals is commonly chosen by convention or informed judgment. The wider the prediction interval, the greater the coverage probability, and vice versa [29].

Step Analysis
This study's analytical steps were as follows:
1. Explore the response and predictor variables as follows:

RESULTS AND DISCUSSION
Data exploration and the checking of assumptions, including the TCQ and outlier detection for the distribution of the response data, are presented first to determine whether the quantile regression forest method can be used to predict household income and improve the accuracy of its prediction intervals. According to the normal Q-Q plot in Figure 1 (a), the normality assumption for the household income data is not met. The red dots on the plot do not follow the diagonal line but instead form a distinct pattern, suggesting that the household income response does not follow a normal distribution. Meanwhile, the relationship between the true conditional quantile (TCQ) and the predictor variables is non-linear, as illustrated in Figure 1 (b), which plots the number of trees against the error at quantile τ = 0.5. The link between the number of trees and the error is one way to determine whether the relationship between the TCQ and the predictor variables is linear. Figure 1 (b) depicts this pattern for the predictor variable KP at the median quantile. This variable was selected because it has the highest correlation value among the predictors. However, the plots for the other variables show a similar pattern, even though their correlation values are low. According to Figure 1 (b), the relationship between the TCQ and the predictor variable KP is non-linear, since it does not form a straight-line pattern and the error values show a trend. Furthermore, the large number of trees reflects the complexity of the non-linear relationship in the TCQ.
Next, boxplots, as illustrated in Figure 2, are used to check that there are no outliers that lie too far out. As previously stated, the quantile regression forest approach is resistant to outliers, but prediction accuracy suffers if the outliers are too extreme. The figure shows that the few outlier points (in red) are still within a reasonable range and will not have a substantial impact on the quantile regression forest's performance.

Figure 2. Categorical and continuous explanatory variables
After checking the distribution, the TCQ, and the outliers of the household income data, the next stage is to investigate the correlation between predictor variables. Doing so has the potential to improve the prediction performance of the quantile regression forest approach. Figure 3 depicts the degree of association among the mixed-scale predictor variables and between these variables and household income. To allow correlations to be computed between any types of mixed variables, we use the semi-parametric latent Gaussian copula approach proposed by [30]. The stronger the negative correlation between variables, the more blue-black the hue, and the stronger the positive correlation, the more yellowish-green the color. The results show that the correlations between variables are smaller than |r| = 0.6. With r = -0.508, training participation shows a relatively strong negative correlation with household income. Some pairs, for example age and highest education level (r = -0.492), household income and highest education level (r = 0.427), and household income and employment status (r = 0.407), have correlation values greater than |r| = 0.4. This indicates a fairly substantial correlation among the predictor variables. This also holds for the relationship between the predictors and the response.
As a result, variable selection is needed, starting with a simulation study of the number of variables using several selection methods: forward selection, the full model, LASSO, ridge, and random forest. Figure 4 displays the simulation results for the household income data using these selection approaches with p = (10, 8, 5, 2) predictor variables and ten repetitions. The RMSE value is used to evaluate the approaches. In general, the variable selection results show that the number and combination of variables used affect whether the RMSE decreases or increases. Figure 4 further indicates that the random forest approach has a lower average RMSE than the other methods for every value of p and every repetition.

Figure 3. Correlation Of Mixed-Scaled Types Of Explanatory And Outcome Variables
Meanwhile, the average RMSE values of the LASSO and ridge regression models are nearly the same, but they are still very high compared with the average RMSE of the full and forward models. The variable selection results and the variables used for quantile estimation in the quantile regression forest approach are shown in Table 2. The variable combinations for each method are the best combinations based on AIC (full and forward), RMSE (LASSO and ridge), and variable importance (random forest). The median quantile is frequently used in quantile regression because it carries the most weight in generating the predicted value and yields a more comprehensive model. Therefore, it is believed to give more useful information about household income.
The following step is to create prediction intervals and assess each method's performance using the average coverage value from ten repetitions with a target coverage of 95%. These average coverage values at the median quantile are presented in Figure 7. The average coverage of all approaches is fairly close to the target of 95%. However, QRF, LASSO-QRF, and Ridge-QRF have average coverage values below the 95% target, whilst the others (Forward-QRF and RF-QRF) have values above 95%. Furthermore, QRF has the lowest average coverage value. This implies that the prediction intervals for the household income response are less reliable when all predictor variables are included than when the variables obtained from selection are used. As a result, RF-QRF and Forward-QRF, which have average coverage values greater than 95%, may be used to create prediction intervals for household income.
In the prediction interval plots, the green dots represent observations that fall within the 95% prediction interval, while the red dots represent observations outside the interval. The coverage values at quantile τ = 0.005 are about 97%-99% (above the target of 95%), indicating that the prediction intervals are conservative. This suggests that the probability that the interval contains the actual household income is fairly high. This is also reflected in the comparatively wide prediction intervals, which encompass more actual values. However, the performance of the prediction intervals diminishes from quantile τ = 0.005 to quantile τ = 0.5. Only RF-QRF and Forward-QRF are above the 95% target, while the rest are below it. This can also be seen in the considerably narrower prediction intervals at the median quantile compared with the previous quantile. When using the median quantile, RF-QRF is therefore recommended since it gives accurate results.
A statistical comparison of the mean RMSE and coverage values of the methods is shown in Table 3. The t-test results for the mean RMSE values indicate a significant difference between methods at the 95% confidence level, with the RF-QRF method superior to the others. Meanwhile, the t-test for the mean coverage values shows that significant differences exist only between QRF and RF-QRF and between QRF and Forward-QRF. This suggests that the RF-QRF and Forward-QRF approaches outperform QRF in creating prediction intervals for household income responses.
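The paired comparison between two methods can be sketched as below. The coverage values are invented for illustration and shortened to four repetitions for readability; the function name is hypothetical, and the p-value step is left out because it needs a t distribution table or a statistics library.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return the paired t statistic and its degrees of freedom."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length, at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    t_stat = statistics.mean(diffs) / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


# Toy example (assumed values): coverage of two methods on the same splits.
coverage_a = [0.94, 0.95, 0.93, 0.96]
coverage_b = [0.96, 0.96, 0.95, 0.97]

t_stat, df = paired_t_statistic(coverage_a, coverage_b)
```

The test is paired because both methods are evaluated on the same train and test splits in each repetition, so the differences within a repetition are the quantity of interest. The resulting statistic would then be compared with the t distribution with `df` degrees of freedom.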

CONCLUSIONS
The results of explanatory variable selection affect the predicted values and prediction intervals of the quantile regression forest for the household income response. Based on the simulated predictor variable selection procedure, the random forest technique has the lowest RMSE value. When creating prediction intervals with a target coverage of 95%, the RF-QRF and Forward-QRF models achieve average coverage above the target. This indicates that, compared with the other approaches, these methods produce more reliable predictions of household income.


   a. Calculate correlations between mixed-scale variables based on [30]
   b. Plot the TCQ against the predictor variables
   c. Create a boxplot to detect outliers
   d. Create a Q-Q plot to see the distribution of the response variable
2. Simulate variable selection using the full, forward, LASSO, ridge, and random forest methods for the number of variables p = 10, 8, 5, and 2 as follows:
   a. Divide the data into training and test sets with a ratio of 80:20
   b. Select the best variables for p = 10, 8, 5, and 2 for each method by looking at RMSE (full, forward, LASSO, ridge) and variable importance measures (random forest)
   c. Calculate the RMSE value of each variable combination
   d. Repeat step b ten times
   e. Calculate the average RMSE
   f. Compare the average RMSE values through plots
3. From step 2, determine the best combination of variables to be used in the quantile regression forest method for each of the forward, LASSO, ridge, and random forest methods.
4. Predict and construct quantile regression forest prediction intervals with the following steps:
   a. Split the data into training and test sets with a ratio of 80:20
   b. Estimate the conditional quantile predictions $\hat{q}_\tau(x)$ using the QRF, RF-QRF, Forward-QRF, LASSO-QRF, and Ridge-QRF methods, respectively
   c. Repeat step b ten times
   d. Calculate the average RMSE
   e. Plot the mean RMSE values for all quantiles
   f. Create an RMSE boxplot for the median quantile
5. Make prediction intervals
6. Calculate coverage values
7. Create a boxplot for the average coverage value
8. Perform statistical tests on the mean RMSE and coverage based on steps (4) and (5) using a paired t-test.
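The repeated 80:20 evaluation used in the steps above can be sketched as follows. This is only an illustration of the protocol: the responses are synthetic, the training-mean predictor is a placeholder standing in for any of the fitted models, and the function names and the fixed seed are assumptions.

```python
import math
import random


def split_indices(n, test_fraction=0.2, rng=None):
    """Shuffle 0..n-1 and return (train_indices, test_indices)."""
    if rng is None:
        rng = random.Random()
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_holdout_rmse(y, fit_predict, repeats=10, test_fraction=0.2, seed=0):
    """Average test RMSE over repeated random train/test splits.

    fit_predict(train_idx, test_idx) must return one prediction per test index.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train_idx, test_idx = split_indices(len(y), test_fraction, rng)
        preds = fit_predict(train_idx, test_idx)
        squared = [(y[i] - p) ** 2 for i, p in zip(test_idx, preds)]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return sum(scores) / len(scores), scores


# Synthetic responses (assumed values) and a placeholder model.
y = [float(v) for v in range(1, 51)]


def mean_predictor(train_idx, test_idx):
    """Placeholder model: predict the training mean for every test point."""
    train_mean = sum(y[i] for i in train_idx) / len(train_idx)
    return [train_mean] * len(test_idx)


avg_rmse, per_run = repeated_holdout_rmse(y, mean_predictor)
```

Fixing the seed makes the ten splits reproducible, which also allows different methods to be evaluated on identical splits before a paired comparison.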

Figure 5. The average of RMSE for quantiles τ = 0.005; 0.025; 0.05; 0.5; 0.95; 0.995
Figure 5 shows that the QRF method has lower RMSE values than the others at the various quantile points considered, followed by the RF-QRF method. However, the RF-QRF method has the lowest RMSE value at the median quantile point τ = 0.5. The performance of RF-QRF improves as the quantile approaches the center. Figure 6 demonstrates the comparison of RMSE values between models at the median quantile.


Figure 7. Boxplot Of The Average Of Coverage Quantile Regression Forest

Figure 8 and Figure 9 provide plots of the prediction intervals.