SIMULATION STUDY OF HIERARCHICAL BAYESIAN APPROACH FOR SMALL AREA ESTIMATION WITH MEASUREMENT ERROR

ABSTRACT


INTRODUCTION
Small area estimation is an estimation method that can overcome the sample size problem.Rao and Molina define a small area to denote any domain that cannot accurately produce a direct estimate [1].Additional information is needed or known as auxiliary variables, to obtain an adequate level of precision in indirect estimation with small area estimation.SAE can increase the effectiveness of sample size by borrowing the strength of neighboring areas and information from the auxiliary variables that have a strong relationship with observational variables [2].Generally, additional information in estimating small areas is obtained from census calculations and administrative records [3].Census may completely enumerate all units of its target population, so it is assumed that the auxiliary variables used do not contain measurement errors.However, the census has several drawbacks, such as a long time and, high costs, lack sufficient detail about the characteristic of interest [4], so surveys are activities that are often carried out compared to censuses.Survey activities use more cost-effectiveness and faster implementation time.Still, when additional information is obtained from the results of survey calculations, it is assumed that the auxiliary variables will contain measurement errors.Measurement error arises when a recorded measurement value is not exactly the same as the actual value [5].The existence of measurement errors in the model causes parameter estimates to be biased and inconsistent, and conclusions may be drawn erroneously [6].Therefore, the estimation of parameters in small area estimation with auxiliary variables containing errors will also be a biased estimator, and the mean squared error of the predictor may be enhanced [7].So, we need a small area estimation method to accommodate auxiliary variables with measurement errors.
The development of a small area estimation model with auxiliary variables that contain measurement errors, in general, has been discussed by Hariyanto et al. [8] and Tanur [9].In the unit-level small area estimation, Ghosh et al. [10] first developed a small area estimation with an auxiliary variable containing the error.Development is carried out by estimating a small area at the unit level with structural measurement errors in the auxiliary variables used.The Bayesian approach to estimating small areas at the area level by considering the auxiliary variables containing measurement errors was carried out by Arima et al. [23].Method development from the Ybarra-Lohr SaeME method proposes an alternative estimator resulting from the Hierarchical Bayes measurement error model.Estimation in the simulation study is carried out with several scenarios that condition that there are unequal measurement errors in the generated areas.The simulations compared several SAE methods, including direct estimation, EBLUP-FH, EBLUP Bayesian, Ybarra-Lohr SaeME, and Hierarchical Bayesian SaeME.The simulation study results show that estimating with Hierarchical Bayes is more stable and has a smaller MSE value than the Ybarra-lohr SaeME estimation and others' estimation [23].Arima et al. [23] only used one random effect variance and one auxiliary variable that contained errors in their simulation study.Based on this, in this study, an adaptation of the simulation conducted by Arima et al. [23] will be carried out by adding scenarios to the variance of random effect areas and scenarios to the auxiliary variables used.There are two simulations carried out in this study.These simulations were carried out by applying three small area estimation methods when the auxiliary variables have measurement errors, namely the SAE EBLUP-FH method, the Ybarra-Lohr SaeME method, and the Hierarchical Bayesian SaeME method.This study compares the four methods in some scenarios when the variance of random effect has different value.The first simulation was conducted to see each area's different measurement error conditions.The second simulation is carried out by creating scenarios on the auxiliary variables.The aim is to see the conditions when the auxiliary variables used are two auxiliary variables containing measurement errors and the conditions when some auxiliary variables used include measurement errors, and some do not.From these simulations can be used for the future research in applying empirical data with the method that can give the accurate result in estimating small area with auxiliary variables that containing measurement error.

Fay-Herriot Model
Small area estimation is done to estimate parameters indirectly in a relatively small area in a pilot survey [24].The availability of additional information and determining a good and suitable model is important in obtaining indirect estimates, especially in small areas [1].In the area level model, the auxiliary variables   = ( 1 , …   )  available to a small level, and the observed variable is assumed to be a function of the average response variable   = ( ̅  ) for (.).Where   is a known positive value constant and  = (  , … ,   )  is the regression coefficient,   is the small random effect (often assumed to be normal),   is the sampling error,   ~(0,   2 ), so that the area level model (a model of linear mixed form or known as the Fay-Herriot model) can be written in Equation (1): BLUP estimator (best linear unbiased prediction) to   be formulated in Equation ( 2): In the BLUP estimator, the value   2 is assumed to be known.However, the random effect variance (  2 ) is unknown in practice, so it must be estimated first.The estimate   2 will then be replaced by  ̂ 2 , the BLUP estimator.Then a new EBLUP estimator (empirical best linear unbiased prediction) will be obtained in Equation (3).Which is the weighted average (with weight  ̂) of the direct estimator (  ̂) and the synthesis model (    ̂).

Ybarra-Lohr SaeME Model
Ybarra and Lohr [13] developed this small area estimation because the survey has errors due to sampling and errors not due to sampling, so the additional information or auxiliary variables used contain errors.Therefore, this development is known as Small Area Estimation with Measurement Error (SaeME) or small Area estimation with auxiliary variables that contain errors.In small area estimation,   is the value of the auxiliary variable area .If all components   is known, we use Equation (4): where   and   are independent,   is the sampling error,   ~(0,   2 ).However, when   , there are auxiliary variables that are measured with errors, then the model with measurements that contain errors is as follows [13] in Equation ( 5): Estimator  ̂ replaces the value for   with   ( ̂,   ) =   + (  −  ̂)  .It is assumed that the estimator for]   is available for all areas i.We can input or estimate multiple components   is not measurable.In this model, it is assumed that   and   are independent.Ybarra and Lohr SaeME formula can be written in Equation ( 6): model is denoted as with   =   2 +     /(  2 +      +   2 ).Suppose  1 , … ,   is a set of finite weights bounded from 0. The estimated regression parameters are defined as follows in Equation ( 7): ) −1 ∑    ̂   =1 (7) with   = 1/(  2 +      +   2 ) for  = 1, … , .

Hierarchical Bayesian SaeME Model
Arima et al. [23] examined a Hierarchical Bayesian model with measurement errors.The results show that the uncertainty of measuring the posterior variance of the Bayesian estimator is more stable than the EBLUP MSE proposed by Ybarra-Lohr [13].In the Hierarchical Bayesian approach, the unknown model parameters (including the variance components) are treated as random components, each with a certain prior distribution.The posterior distribution for the parameter of interest is obtained based on the entire prior distribution.The prior that used in this study is non informative prior.So it requires the generation of sample data from each , , , , dan   2 given parameter and the remaining data.Bayesian procedure implementation is used with the Markov chain Monte Carlo technique, especially the Gibbs sampler.In obtaining posterior distribution results, the trace plot is used to analyze MCMC convergence [25].Trace plot is a plot of interaction against the resulting values.If there is a certain pattern then the Markov chain has not reached convergence.it is important to check algorithm convergence.Hierarchical Bayesian SaeME model can be witten in multi-stage model as follows in Equation ( 8 )

Data Analysis Procedure
The simulation data were analyzed using four estimation methods, including the direct estimation method.This SAE EBLUP-FH method assumes that the auxiliary variables containing errors are the true values, the Ybarra-Lohr SaeME method, and Hierarchical Bayesian SaeME.There were two simulations carried out in this study.Both simulations were carried out with scenarios on a variance of random effects (σ u 1 2 = 2 dan σ u 2 2 = 4) and scenarios on the measurement error (   {0, } where = 2 and 4).The first simulation is carried out with several k scenarios, k ϵ{0,20,50,80,100} where k is the percentage of small areas where the auxiliary variables measured contain errors    {0, }, for example, k = 80% and   = 2 means that 80% of the small areas with auxiliary variables measured contained errors   = 2.The percentage of k was used to see the small area with different measurement error conditions.The greater the percentage of k, the greater the percentage of small area with auxiliary variables measured with error.The rest has a value of 0. In comparison, the second simulation has two scenarios on the auxiliary variables used.There are two auxiliary variables used in the second simulation.The first scenario of the second simulation is that one auxiliary variable contains an error, and one auxiliary variable does not contain an error.The second scenario of this second simulation is that the two auxiliary variables used contain errors Both simulations were carried out with 100 iterations to get an optimal result.Data processing was carried out using the R program.The following are the stages of the simulation study carried out: 1) Determine the number of small areas, namely m= 20.
4) Auxiliary variables that contain errors will later be generated with area error data of the auxiliary variables (v i ) from a normal distribution with zero expectation values and three scenarios of measurement error (c i ), which are determined as constants c i ϵ {0, d} where d= 2 and 4.
(iii) From each iteration will be drawn Y i = α + βx i + u i and y i = Y i + e i (iv) Generating auxiliary variables containing errors from the equation x ̂i = x i + v i .
(v) Several scenarios will be carried out ϵ{0,20,50,80,100} with  as a small percentage area with the measured auxiliary variables containing errors c i 6) Generating the second simulation data as follows: (i) Define value α = 1 and value β with two scenarios β 1 = 3 and β 2 = 5 (ii) Generating two auxiliary variables namely x 1 i ~N(5, σ = 3) and x 2 i ~N(3, σ = 5) (iii) From each iteration will be drawn Y i = α + β 1 x 1 i + β 2 x 2 i + u i and y i = Y i + e i (iv) Generating auxiliary variables that contain errors from the equation, namely x ̂1i = x 1 i + v i , and x ̂2i = x 2 i + v i (v) Modeling is built with two scenarios: a.One auxiliary variable contains an error, and one variable does not have an error b.Both auxiliary variables have errors 7) Calculation of parameter estimators from both simulations is done by: (i) Direct estimation method y i (ii) SAE EBLUP-FH method assumes x ̂i as the true value.
(iii) The Ybarra-Lohr SaeME method (iv) Hierarchical Bayesian SaeME Method.The Bayesian procedure is implemented with a monte carlo markov chain simulation.
8) The simulation results were evaluated by calculating the average bias and the average empirical mean squared error (EMSE) of 100 replicates for all scenarios.Then the results of all estimation methods are compared.The better method is the one with a smaller average bias (AB) and average EMSE (AEMSE) in Equation ( 9) and Equation (10) respectively.

Simulation Study 1
From the results of processing with the R program, the average bias and average empirical mean squared error (EMSE) of several scenarios were obtained.The results of the four methods are then compared, the direct estimation method (  ), the EBLUP-FH method, which assumes no errors in the auxiliary variables (  ̂ ), the Ybarra-Lohr SaeME method ( ̂ ), and the Hierarchical Bayesian SaeME method with the auxiliary variables containing errors ( ̂ ).In obtaining good posterior distribution results for the Hierarchical Bayesian SaeME method, it is necessary to fulfill the convergence of the MCMC algorithm.The parameter estimation process for first simulation is carried out by generating sample data in 10,000 iterations with a burn-in of 5,000 and a thin of 20.Based on the MCMC algorithm, convergence occurs when the resulting Markov chain distribution approaches the posterior interest distribution.To evaluate the convergence is done by looking at the resulting Trace plot.From several scenarios, the resulting trace plot has a random pattern, and the plot is relatively stable at a specific value.So convergence has been achieved.1 above shows the average bias when  = 20, the random effect area   2 = 2, and   = 2.The average bias is shown by separating the areas with   = 2 and   = 0. From these results it can be seen that the fourth method produces results that are not much different.However, in general it appears that direct estimation produces an average bias that is close to 0, especially when the percentage of area containing measurement error is  = 20 and  = 30 with c_i = (2, 3, or 4).However, when  the values are 0 and 100, estimating with Hierarchical Bayesian SaeME (  ̂ ) produces average bias, which is generally smaller than other methods.Table 2 shows the average bias when  = 20 and the random effect area   2 = 4 and   = 2. Generally, when the variance of random effect is greater,   2 = 4, the average bias is greater than when the variance of random effect is   2 = 2.The result of average bias from four methods also close to 0. However, when  is large, Hierarchical Bayesian estimation with auxiliary variables containing errors ( ̂ ) produces an average bias generally smaller than other methods.Based on the calculation of the Empirical Mean Squared Error (EMSE), the result is that in a scenario with a random effect area   2 = 2 and   = 2 shown in Figure 1.The average EMSE is shown by separating the areas with   = 2 and   = 0.At  =0, the estimation results  ̂ provide the lowest average EMSE among the other methods.When   = 0,  =20, 50, and 80,  ̂ provide the average EMSE lower than   and  ̂ .From two graphics,   provides the average EMSE, bigger than three other methods, except when   = 0 and  =20, 50, and 80.For   = 2, When  =20 and  =50, the estimation  ̂ gives the smallest average EMSE among other estimation methods.It means that  ̂ provides the smallest average EMSE when there is a measurement error in the auxiliary variable with a small k.In contrast, the smallest average EMSE is obtained by  ̂ when  values are 80 and 100.It can be said that the greater the percentage  and the value of   , the estimator  ̂ is, the best estimator.

Simulation Study 2
In simulation study 2, two scenarios are carried out in the parameter estimation model.The first scenario is carried out with one auxiliary variable( ̂1 ) containing the measurement error and one auxiliary variable ( 2 ) which does not include the measurement error.The second scenario has two auxiliary variables, each containing measurement errors ( ̂1 and  ̂2 ).The parameter estimation process with the Hierarchical Bayesian SaeME method is carried out by generating MCMC samples by generating sample data in 100,000 iterations with a burn-in of 5,000 and a thin of 20.The estimation is evaluated by comparing the average bias and the average EMSE.
The average bias results with random effect area   2 = 2 and measurement error   = 2, 3, and 4 are shown in Table 3.It shows that the average bias produces in both scenarios 1 and 2 for each   is the same for the three methods which   ,  ̂ , and  ̂ .Therefore, the average bias resulting from three methods, whether the auxiliary variable contains only one variable that containing measurement error or both containing measurement error, is not different.The results are different from the average bias of  ̂ .The greater the number of auxiliary variables that contain measurement error, the greater the average bias.When in scenario 1, only one auxiliary variable contains measurement error, the average bias of  ̂ is smaller than other estimation methods.When the model contains auxiliary variables that both include errors, the resulting average bias is greater than the model where only one of the auxiliary variables has errors.In this simulation, the average bias of  ̂ is not greater than  ̂ .It relates to Ybarra and Lohr [13] in certain conditions,  ̂ is ignored for mean squared error calculation, and the reported mean squared error of  ̂ will be too small, giving a misleading notion of precision.Table 4 shows the average bias when the random effect area   2 = 4.In general, the lowest bias average is obtained by the estimation method  ̂ .Three estimation method which are   ,  ̂ , and  ̂ give the result that is not different from Table 3.In general, the lowest average bias obtained by  ̂ estimation method is in scenario 1, only one of the auxiliary variables contains measurement error.From the two average bias tables, the four estimation methods produce an average bias that is close to 0. The estimation method is then evaluated by looking at the average EMSE output of the four estimation methods (Figure 3).The estimation   provides the average EMSE, which is lowest than other methods.When the random effect area   2 = 2 and   = 2, 3, and 4, it can be seen that the larger   , the greater the average EMSE value.The average EMSE resulting from the four methods is not too different.However, in general, the average EMSE of the two scenarios with the smallest value is produced by the method  ̂ compared to method  ̂ and  ̂ .When the random effect area   2 = 4 and   = 2, 3, and 4, the average EMSE is shown in Figure 4.The average EMSE of the two scenarios with the smallest value is generally produced by the method  ̂ compared to method  ̂ and  ̂ .Not much different as at   2 = 2, the bigger it is,   , the bigger the average EMSE value.Although the resulting numbers are close enough, it is quite visible that the value increases as it increases   .However, it is different when compared, so the average EMSE value at   2 = 4 is smaller than at   2 = 2 in scenario 2. Based on the evaluation of the model estimation, both the average bias and the average EMSE, it can be said that the estimation of  ̂ gives the best results compared to the estimate of   ,  ̂, and  ̂ .

CONCLUSIONS
The first simulation study was conducted to see each area's different measurement error conditions.When the percentage of areas containing measurement errors is zero, which means there are no measurement errors, the estimation with SAE EBLUP-FH is better.This can be seen from the smaller average EMSE.When the percentage of small areas containing measurement error is 20 and 50, the estimation with Ybarra-Lohr SaeME produces smaller EMSE averages.When the percentage of small areas containing measurement errors is 80 to 100, the mean bias and EMSE results with Hierarchical Bayesian SaeME are smaller.The second simulation study was carried out by creating scenarios on the auxiliary variables.When the auxiliary variables that contain errors increase, the average bias value and the resulting average EMSE from Hierarchical Bayesian SaeME method are also greater as the measurement error on the auxiliary variables increases.In general, for estimating small area estimation with measurement error, the Hierarchical Bayesian SaeME estimator was the best for the one and two simulation studies because the average bias and average EMSE were lower than the direct estimates, SAE EBLUP-FH and Ybarra-Lohr SaeME when auxiliary variable is measured with error.For future research, the Hierarchical Bayesian SaeME method can be used to obtain small area estimation for empirical data to obtain better accuracy results which the auxiliary variables containing measurement errors such as from survey data.

Figure 1 . 2 .Figure 2 .
Figure 1.Average EMSE with Random Effect Area    = 2, when (a)   = , (b)   =  The comparison average EMSE at  = 20,   = 2, and the random effect area   2 = 4 is shown in Figure 2. At times  =20, 50, and 80,   = 0, the smallest average EMSE was produced by estimating  ̂ .For  =20 and   = 2, the estimation  ̂ gives the smallest average EMSE among other estimation methods.And when  =50, 80, and 100,   = 2, the estimation  ̂ gives the smallest average EMSE among other estimation methods.The average EMSE estimation   is the largest among other methods for almost all  and   .Figures1 and 2show a similar pattern, but it can be seen that the four estimation method give a bigger average EMSE when   2 = 4 than   2 = 2.It means the bigger   2 , the bigger the average EMSE.From the bias average and EMSE average of the four methods, it can be said that  ̂ it gives the best results among the other methods

Figure 4 .
Figure 4.The Average EMSE with Random Effect Area    =4, (a) Scenario 1, (b) Scenario 2 applies SUSENAS data to estimate the average length of schooling by sub-district in Kampar Regency.The results show that the Ybarra-Lohr SaeME estimation model can predict a smaller MSE value than direct estimation.Aziz and Ubaidillah [19] used two SAE models SAE EBLUP Fay-Herriot model with auxiliary variables Podes data and SAE with Error Measurement with auxiliary variable Twitter data.Estimation results using the SAE method are better than direct estimates.Auxiliary variables that contain errors are sources from Big data which as Twitter.Hariyanto et al. [20] estimate parameters and develop empirical bates in small area estimation with measurement error in t distributed covariate variable.Tanur and Kurnia [21] conducted a study that developed an alternative small-area estimation for the autoregressive model with auxiliary variables containing measurement errors.Novkaniza et al. [22] estimate non-symmetrical count data in SAE for the Poisson-lognormal model with measurement error in covariate.

Table 1 . Average Bias with Random Effect Area
= 2 and   =

Table 2 . Average Bias With Random Effect Area
= 4 and   =

Table 4 .
Average Bias With Random Effect Area    = 4