COMPARISON OF ROBUST ESTIMATION ON MULTIPLE REGRESSION MODEL

ABSTRACT


INTRODUCTION
Commonly, regression analysis involves two kinds of variables: the dependent variable (Y) and the independent variable (X). The relationship between the dependent and independent variables is represented through a mathematical model called the regression equation. A regression equation that states a linear relationship between one dependent variable (Y) and one independent variable (X) is called simple linear regression, while a regression equation that models a linear relationship between one dependent variable and more than one independent variable is known as multiple linear regression [1]. Regression analysis aims to estimate the regression coefficients in the regression model [2]. The estimation method is chosen to suit the characteristics of the data to be analyzed.
The characteristics of data are diverse; one important case is data containing outliers [3], [4]. A problem that frequently arises in regression analysis is that one or more data points lie far from the general data pattern; such points are called outliers. They usually arise from errors in the measurement system, data-entry errors, measuring-instrument errors, or an unusual event [5], [6]. In some cases, however, an outlier can carry information that no other data point provides [7]-[11]. In practice, data with outliers require special handling or methods [12], [13].
Ordinary Least Squares (OLS) is a parameter estimation method in regression modeling. This method performs well when the data contain no outliers [14]-[18]. However, when its assumptions are violated because of outliers, OLS is not strong enough to model the regression [19]. Thus, another approach is needed in the estimation process. One model that remains reliable in the presence of outlier data is the robust regression model [20], [21]. According to Chen (2002), robust regression is an important method for analyzing data contaminated by outliers [22]. Estimation methods in the robust regression model include the M, S, and MM estimators. These methods are selected because they have a high breakdown point (the maximum proportion of outlier data that a model can tolerate) [23].
The main contribution of this paper is to examine the robustness of robust regression methods in the presence of outliers. This matters because outliers occur randomly and under various conditions. The paper begins by checking for outliers in the data used in the simulation [24]. Regression models were then fitted and compared, using both the OLS method and the robust regression models, on the data with outliers and on the data from which the outliers had been removed. Furthermore, the behavior of the robust methods on the data with outliers was observed. The model criteria were the stability of the estimated parameters and the standard errors for the data with and without outliers.

RESEARCH METHODS

Multiple Linear Regression Model
Multiple linear regression is a statistical method for testing the effect of independent variables on a dependent variable. The general form of the multiple linear regression equation is as follows:

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε (1)

where Y is the dependent variable, X1, ..., Xp are the independent variables, β0, β1, ..., βp are the regression coefficients, and ε is the error term.

OLS Method
OLS is a regression method that minimizes the sum of squared errors. It estimates the regression coefficients β by minimizing the error, giving the closed-form estimator

β̂ = (X'X)⁻¹X'y (2)

where β̂ is the vector of estimated parameters of size (p + 1) × 1, X is the matrix of predictor variables of size n × (p + 1), and y is the vector of response observations of size n × 1.
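As a small illustration of Equation (2), the following Python sketch fits a multiple regression with two predictors. All numbers here are fabricated for illustration; they are not the paper's data set. The response is generated exactly as y = 1 + 2X1 + 3X2, so OLS should recover the coefficients (1, 2, 3).

```python
import numpy as np

# Hypothetical data: n = 6 observations, p = 2 predictors.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 6.0],
                  [6.0, 5.0]])
y = 1 + 2 * X_raw[:, 0] + 3 * X_raw[:, 1]

# Prepend a column of ones so the first coefficient is the intercept.
X = np.column_stack([np.ones(len(y)), X_raw])

# beta_hat = (X'X)^{-1} X'y; lstsq solves the same normal equations
# with better numerical stability than an explicit matrix inverse.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Using a least-squares solver rather than inverting X'X directly is the standard numerical practice, but the result is the same estimator as Equation (2).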

M-Estimation
M-estimation is a robust regression estimator that minimizes an objective function of the residuals, as follows [27]:

β̂_M = arg min_β Σᵢ ρ(eᵢ/σ̂) (3)

where β̂_M is the estimated beta from the M-estimation, ρ is an objective function that weights the residuals, and eᵢ is the i-th residual. The objective function used here is the Huber function,

ρ(u) = u²/2 for |u| ≤ c, and ρ(u) = c|u| − c²/2 for |u| > c,

where c is a tuning constant (commonly 1.345) and the scale is σ̂ = MAD/0.6745, with MAD (Median Absolute Deviation) the median of the absolute residuals [14], [28], [29].
The steps of the robust regression model with M-estimation according to [14] are as follows:
1. Determine the residuals eᵢ.
2. Determine the Median Absolute Deviation, the scale σ̂, and the weights.
3. Compare the initial residual values with the final residual values.
4. Determine the new estimated value by adding the initial estimate to the new residual.
5. Perform the regression analysis again with the new estimated values as the dependent variable.
6. Repeat until the n-th iteration, when convergence is reached.
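The iterative steps above can be sketched as an iteratively reweighted least-squares (IRLS) loop with Huber weights. The data, the tuning constant c = 1.345, and the OLS starting point are illustrative assumptions, not the paper's actual data or exact algorithm.

```python
import numpy as np

def m_estimate(X, y, c=1.345, max_iter=100, tol=1e-8):
    """Sketch of M-estimation via IRLS with Huber weights.
    X is assumed to already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS start
    for _ in range(max_iter):
        e = y - X @ beta
        # Robust scale: sigma_hat = MAD / 0.6745.
        mad = np.median(np.abs(e - np.median(e)))
        sigma = max(mad / 0.6745, 1e-12)
        u = e / sigma
        # Huber weights: 1 inside [-c, c], c/|u| outside.
        w = np.minimum(1.0, c / np.maximum(np.abs(u), 1e-12))
        # Weighted least-squares step: solve (X'WX) beta = X'Wy.
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Fabricated data: y = 2 + 3x plus small noise, with one gross outlier.
rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 2 + 3 * x + 0.1 * rng.standard_normal(20)
y[0] += 50.0
X = np.column_stack([np.ones(20), x])
beta_m = m_estimate(X, y)
```

Because the Huber weights shrink the influence of the gross outlier, the fitted coefficients stay close to the true values (2, 3) even though the outlier would badly distort a plain OLS fit.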

MM-Estimation
The MM-estimation combines two stages: first, an initial estimator and residual scale are obtained with the S-estimation; then, the regression parameters are estimated using the M-estimation [30].
The MM-estimation is defined as

β̂_MM = arg min_β Σᵢ ρ(eᵢ/σ̂_S)

where β̂_MM is the estimated beta from the MM-estimation and σ̂_S is the scale estimate from the initial S-estimation.
The steps of the robust regression model with MM-estimation according to [30], [31] are as follows:
1. Determine the initial estimator β̂ and the residuals.
2. Use the residuals from the first step to determine the scale σ̂ and calculate the initial weights.
3. Use the residuals from the first step and the scale σ̂ from the second step in the initial Weighted Least Squares iteration to estimate the regression coefficients.
4. Calculate new weights using the scale estimate from the initial Weighted Least Squares iteration.
5. Repeat these steps until the n-th iteration, when Σᵢ |eᵢ| converges.
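The MM steps above can be illustrated with the following simplified Python sketch. Two deliberate simplifications (assumptions, not the paper's procedure): the initial fit is OLS rather than a true S-estimate, and the robust scale is a MAD-based estimate held fixed during the M step, mirroring the fact that MM re-estimates only the coefficients at that stage. The data and the bisquare constant c = 4.685 are likewise illustrative.

```python
import numpy as np

def mm_estimate(X, y, c=4.685, max_iter=100, tol=1e-8):
    """Simplified MM-estimation sketch: a fixed robust scale from the
    initial residuals, then IRLS with Tukey bisquare weights.  A full
    MM-estimator would take the initial fit and scale from an
    S-estimation; here the initial fit is OLS for brevity."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    # Robust scale, held fixed across iterations.
    sigma = max(np.median(np.abs(e - np.median(e))) / 0.6745, 1e-12)
    for _ in range(max_iter):
        u = (y - X @ beta) / sigma
        # Tukey bisquare weights: zero beyond c, so gross outliers are
        # excluded from the weighted least-squares step entirely.
        w = np.where(np.abs(u) <= c, (1 - (u / c) ** 2) ** 2, 0.0)
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Fabricated data: y = 2 + 3x plus small noise, with one gross outlier.
rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 2 + 3 * x + 0.1 * rng.standard_normal(20)
y[0] += 50.0
X = np.column_stack([np.ones(20), x])
beta_mm = mm_estimate(X, y)
```

The redescending bisquare weight is what distinguishes this step from the Huber-based M-estimation: points far enough from the fit receive weight exactly zero rather than a small positive weight.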

S-Estimation
The robust regression model with S-estimation can reach a breakdown point of up to 50%, meaning that up to half of the observations can be outliers without destroying the influence of the remaining observations [14]. The S-estimation is defined as β̂_S = arg min_β σ̂(e₁, e₂, …, e_n), i.e., the coefficient vector that minimizes a robust estimate of the scale of the residuals. Its weighting objective function minimizes a combination of the squared residuals and the absolute residuals.
The steps of the robust regression model with S-estimation according to [14] are as follows:
1. Determine the residuals eᵢ.
2. Determine the scale estimate σ̂.
3. Determine uᵢ = eᵢ/σ̂.
4. Determine the weighting function.
5. Determine β̂ by Weighted Least Squares with these weights.
6. Repeat these steps until the n-th iteration, when β̂ converges.
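A minimal sketch of these S-estimation steps is given below. It is an illustration under stated assumptions rather than the paper's exact algorithm: the scale σ is updated by a standard fixed-point rule so that the mean bisquare ρ of the scaled residuals equals b = c²/12, with c = 1.547, a common choice for a 50% breakdown point; the starting values and data are fabricated.

```python
import numpy as np

def s_estimate(X, y, c=1.547, max_iter=200, tol=1e-8):
    """Sketch of S-estimation via IRWLS: unlike M-estimation, the
    robust scale sigma is re-solved at every step, not held fixed."""
    def rho(u):
        # Tukey bisquare rho, bounded at c^2 / 6.
        au = np.minimum(np.abs(u) / c, 1.0)
        return (c ** 2 / 6) * (1 - (1 - au ** 2) ** 3)

    b = c ** 2 / 12
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS start
    e = y - X @ beta
    sigma = max(np.median(np.abs(e - np.median(e))) / 0.6745, 1e-12)
    for _ in range(max_iter):
        e = y - X @ beta
        # Fixed-point update of the M-scale estimate.
        sigma = max(sigma * np.sqrt(np.mean(rho(e / sigma)) / b), 1e-12)
        u = e / sigma
        # Bisquare weights derived from rho.
        w = np.where(np.abs(u) <= c, (1 - (u / c) ** 2) ** 2, 0.0)
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Fabricated data: y = 2 + 3x plus small noise, with one gross outlier.
rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 2 + 3 * x + 0.1 * rng.standard_normal(20)
y[0] += 50.0
X = np.column_stack([np.ones(20), x])
beta_s = s_estimate(X, y)
```

The key design difference from the M-estimation sketch is step 2 of the loop: the scale is part of the optimization itself, which is what gives the S-estimator its high breakdown point.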
The data obtained are used for the simulation after first checking for outliers using data plots, boxplots, quartile values, and the Mahalanobis distance. For univariate data, the criterion is the interquartile-range rule: values outside the interval Q1 − (3/2)·IQR < x < Q3 + (3/2)·IQR are flagged as outliers.

Mahalanobis Distance
Measurement of the squared distance to detect outliers in multivariate data can use the following formula [32]:

dᵢ² = (xᵢ − x̄)' S⁻¹ (xᵢ − x̄)

where xᵢ is the i-th observation vector, x̄ is the mean vector, S is the sample variance-covariance matrix, and χ²_{p, 1−α} is the outlier limit with a probability of 1 − α [33].
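The squared-distance formula above can be computed directly with numpy, as in the following sketch. The data are fabricated, and the hardcoded critical value 5.991 is χ²_{2, 0.95} (p = 2 variables, α = 0.05); for a different number of variables or significance level, the appropriate chi-square quantile would be needed instead.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the sample
    mean, using the sample variance-covariance matrix."""
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # d_i^2 = (x_i - mu)' S^{-1} (x_i - mu), computed row by row.
    return np.einsum('ij,jk,ik->i', diff, S_inv, diff)

# Fabricated 2-variable data with one far point appended at the end.
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((30, 2)), [[10.0, 10.0]]])
d2 = mahalanobis_sq(X)
# chi-square critical value for p = 2, alpha = 0.05.
flagged = d2 > 5.991
```

Note that the sample covariance itself is inflated by the outlier; robust variants of the distance exist for heavily contaminated data, but the classical version shown here matches the formula in the text.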
Data outside this limit are called outliers. The next stage is to remove the outliers from the data. After the outliers have been identified and removed, the regression parameters are estimated using OLS and the M, MM, and S estimations.

Figure 2. Boxplot Data
The results of the boxplot showed that there were outliers, notably in the student learning outcomes (Y) data. Then, the data with outliers were checked and validated using the criterion Q1 − (3/2)·IQR < data < Q3 + (3/2)·IQR.
The results are shown in Table 2. Table 2 confirms that there were data values below 15.00; all values of the learning-outcomes variable below 15.00 were detected as outliers. A closer look shows that the outliers occurred for the 7th, 14th, and 19th respondents. The causes of outliers can vary, such as recording errors, measuring-instrument errors, and so on. The data were then analyzed using the OLS method, M-estimation, MM-estimation, and S-estimation to assess the accuracy and robustness of each method against the outliers.
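The quartile-based check above amounts to computing the IQR fences; a minimal sketch with fabricated learning-outcome scores (not the study's actual data) is:

```python
import numpy as np

# Hypothetical scores; 12 plays the role of a value below the fence.
scores = np.array([60, 62, 65, 68, 70, 72, 75, 78, 80, 12])

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
# Fences at Q1 - (3/2)*IQR and Q3 + (3/2)*IQR, matching the text.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]
```

For these fabricated values the fences are 45.5 and 91.5, so only the score 12 is flagged, analogous to the values below 15.00 flagged in Table 2.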

Detection of outliers with Mahalanobis Distance
In the Mahalanobis distance calculation, the mean vector and the variance-covariance matrix were calculated first, and the Mahalanobis distance of each observation was then obtained. As a result, the 3rd, 4th, 5th, 7th, 8th, 17th, and 18th observations were detected as outliers.

Ordinary Least Square (OLS) Method
The OLS method is the most commonly used in regression. In this discussion, multiple regression analysis with the OLS method was applied to two data sets: the data with outliers and the data without outliers. The analysis results are presented in Table 4 below. Based on Table 4, there were clear differences in the analysis results between the data with and without outliers. The intercept for the data with outliers was -4.9277, very different from the value of 34.8248 for the data without outliers, which leads to contradictory interpretations. The data with outliers also had a considerably higher standard error (21.0053) than the data without outliers (9.1318). The mean squared error (MSE) likewise differed substantially between the two data sets, indicating that the Ordinary Least Squares (OLS) method is not robust for data with outliers.

Robust Regression Model with M-Estimation
This section presents the application of a robust regression model with M-estimation using bisquare weighting, to test whether the M-estimation is robust to the outlier data. The results of the analysis are presented in Table 5 below. The robust regression model with M-estimation showed no substantial difference between the data with and without outliers; the M-estimation reached convergence at the 7th iteration. The intercept for the data with outliers, 33.5227, remained stable even after removing the outliers, at 33.1399, and the standard errors and other statistics were close in value. However, the MSE of the M-estimation differed considerably between the two data sets: 546.5 with outliers versus 62.87 without.

Robust Regression Model with MM-Estimation
Next, the robust regression method with MM-estimation was applied. On the data with outliers, seven iterations were required to reach convergence; the results are presented in Table 6 below. The robust regression model with MM-estimation showed no substantial difference between the data with and without outliers. On the data without outliers, the MM-estimation reached convergence at the 8th iteration. The intercept for the data with outliers, 34.2759, remained stable even after the outliers were removed, at 33.9391, and the standard errors and other statistics were comparable. However, the MSE again differed considerably between the two data sets: 546.5 with outliers versus 61.9 without.

Robust Regression Model with S-Estimation
For the S-estimation, convergence was achieved via Iteratively Reweighted Least Squares (IRWLS). The results obtained are presented in Table 7 below.

Discussion
This section compares the regression models obtained with OLS and the M, MM, and S estimations and identifies which model is robust to the presence of outliers. Table 8 presents a summary of each method.

With the OLS method, there was a significant difference between the data with and without outliers. The differences in the intercept and standard error were substantial, indicating that multiple linear regression with the OLS method is weak against outlier data. This is in line with the research in [10], [34]-[36].

The robust regression model with M-estimation was quite reliable in handling the outlier data. The intercepts for the data with and without outliers, 33.5227 and 33.1399, were not significantly different. The standard errors were 8.9422 with outliers and 9.3720 without, a much smaller gap than for the OLS method. This result is in line with [37].

The regression model with MM-estimation was also reliable in handling the outliers, and it produced the smallest standard error of all the models tested. Thus, the MM-model was the best estimation candidate in this case study. The reliability of the MM method against outliers has also been confirmed by the research in [38], [39].

Meanwhile, the robust regression model with S-estimation was likewise fairly reliable against outliers, although it produced a higher standard error than the M-estimation and MM-estimation for both data sets. This result differs slightly from the case studies in [23], [40], which found the S-estimation better than the MM-estimation.
On the data without outliers, the MM-estimation remained the best method, with an MSE of 61.9, smaller than that of the other models. Therefore, the regression equations with the best estimation were Ŷ = 33.0059 + 0.9251X1 + 1.2083X2 for the data with outliers and Ŷ = 33.9391 + 0.9619X1 + 1.1975X2 for the data without outliers.

CONCLUSIONS
In regression modeling, it is necessary first to know the characteristics of the data to be used; understanding them makes it easier to choose the most suitable method. For data without outliers, the OLS method is quite reliable for modeling the regression, but for data with outliers it is not. In that case, a robust regression model with the M, MM, or S estimation is needed; these estimations are reliable for modeling data with outliers. In this study, the regression model with MM-estimation was better than the M and S estimations, on both the data with and without outliers. An open issue for future research is incorporating fuzzy approaches into the robust regression models with the M, MM, S, or other estimations.