OUTLIER DETECTION ON HIGH DIMENSIONAL DATA USING MINIMUM VECTOR VARIANCE (MVV)

High-dimensional data arise in practice when the number of variables p is larger than the number of observations n. A problem that often occurs as data dimensions are added is that the data points increasingly resemble outliers. Outliers are observations that do not follow the pattern of the data distribution and lie far from the center of the data. Outliers need to be detected because they can distort the results of an analysis. One of the methods used to detect outliers is the Mahalanobis distance. To obtain a robust Mahalanobis distance, the Minimum Vector Variance (MVV) method is used. This study compares the MVV method with the classical Mahalanobis distance method in detecting outliers in non-invasive blood glucose level data, for both p>n and n>p. The test results show that the MVV method is better for n>p. MVV is more effective than the classical method in identifying the minimum data group and outlier data points.


INTRODUCTION
With the rapid development of technology, data now play a crucial role in many fields of knowledge and their applications. This development has enlarged databases both in the number of observations and in the number of dimensions. In real cases, high-dimensional data occur when p is greater than n, where p is the number of variables and n is the number of observations [1]. In theory, increasing the number of variables gives a more accurate classification result. In practice, however, with a limited number of observations and a large number of variables, the analysis becomes highly complex when addressing the problems contained in the data [2].
A problem that often occurs as data dimensions are added is that the data points increasingly resemble outliers. Outliers are observations in a data set that do not follow the pattern of the data distribution and lie far from the center of the data. Outliers can make the results of data analysis inaccurate, for example by distorting statistical tests based on the mean and covariance parameters [3]. Therefore, outlier detection is needed, especially in large data sets.
Outlier detection is useful in applications such as network intrusion detection, credit card fraud detection, activity monitoring, financial applications, analysis of election irregularities, bad weather prediction, geographic information systems, and other data fields [4]. Outliers are usually detected with a proximity-based concept, that is, based on the relationship of an observation to the rest of the data. In high-dimensional data, the data density decreases, so the estimated proximity between data points becomes less accurate [5].
One of the methods used to identify outliers in multivariate data is the Mahalanobis distance, which calculates the distance of each observation to the center of the data set [6]. However, the Mahalanobis distance is a classical estimator that relies on basic assumptions such as normality and linearity. Therefore, a method that is robust against outliers is needed [3]. One alternative for obtaining a robust Mahalanobis distance in multivariate outlier detection, the Minimum Vector Variance (MVV) method, has been discussed in [3], [7], [8], and [9]. However, the multivariate data used in those discussions are not high-dimensional data with p greater than n. The MVV method utilizes the total variance in finding the minimum covariance matrix so that a Mahalanobis distance that is robust against outliers can be obtained. In this study, the outlier detection method is applied to multivariate, high-dimensional data.

RESEARCH METHODS
The data used in this study are primary data from a study by a non-invasive biomarking team at Bogor Agricultural University on the development and clinical trial of a prototype non-invasive blood glucose monitoring device. Data were collected on July 13-20, 2019, using a non-invasive blood glucose measurement tool design. The design captures the intensity of light passed through the finger and was applied to a total of 74 respondents from Kebon Pedes Village, Tanah Sereal District, Bogor City. The blood glucose level data are generated from the intensity residual points of each modulation and time domain. Therefore, before outliers are detected in these data, a summary process is first carried out to obtain the variables that define the blood glucose level.

Data Summary Process
The data are summarized by calculating trapezoid areas, which captures comprehensive information by using time-domain intervals set to fit the observation points. Figure 1 illustrates the data summary process. The non-invasive measurement of blood glucose levels produces initial data in the form of intensity residues for each time domain, designed with a lighting level (modulation) of 0-90. Aurelia's research [10] states that area summarization can estimate blood glucose levels with better performance than standard deviation summaries. This is because area summarization makes good use of all the information in the data and uses area limits based on time-domain intervals that have been set to suit all observations. Aurelia also discusses the use of the 50-90 modulation in the 2017 and 2019 blood glucose level data, which shows significant residual values. Other modulations tend to be constant, so the modulation used in this study is the 50-90 modulation, which is the 26-30 period.
Each residual intensity value in one modulation, as shown in Figure 1, is connected by a straight line in the direction of its time domain. The area under each peak formed is then calculated using the trapezoid area formula. The length of the time-domain interval is defined as the height ($t_i$), and the residual intensity value is defined as a parallel side ($y_i$). The trapezoid areas within one modulation are then added up to form the peak area of that modulation. One independent variable is the sum of the area values of one modulation.
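The peak-area calculation described above can be sketched in Python as follows. This is a minimal sketch, assuming the residual intensities of one peak are sampled at known time points; the function name `peak_area` and the sample values are hypothetical and are not taken from the study.

```python
def peak_area(t, y):
    """Sum of trapezoid areas under the points (t[i], y[i]).

    Each trapezoid has height t[i+1] - t[i] (the time-domain interval)
    and parallel sides y[i] and y[i+1] (the residual intensities).
    """
    if len(t) != len(y):
        raise ValueError("t and y must have the same length")
    area = 0.0
    for i in range(len(t) - 1):
        height = t[i + 1] - t[i]
        area += 0.5 * (y[i] + y[i + 1]) * height
    return area


# Hypothetical residual intensities for one peak, sampled at unit intervals.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.0, 2.0, 5.0, 2.0, 0.0]
area = peak_area(t, y)
print(area)
```

Summing such areas over the peaks of a modulation would give one summarized value per modulation, in line with the description above.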
There were five modulations and five independent variables in one replicate. One independent variable is the sum of the area values of one peak in one modulation. There are two peaks in one modulation, so five replications give 50 independent variables (5 modulations × 2 peaks × 5 replications). To obtain high-dimensional data with p greater than n, subsets of 30 or 40 observations are taken. The data are standardized before further analysis.
The Mahalanobis distance is calculated as the distance of each observation to the center of all the data. The Mahalanobis distance is more practical than the Euclidean distance because its calculation takes the correlation between variables into account [11]. Multivariate high-dimensional data are very susceptible to correlation between variables [12]. In this study, outliers in the 2019 blood glucose level data are identified using classical and robust methods.
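The standardization step can be illustrated with a short sketch. The text does not say which standardization is used, so a column-wise z-score is assumed here; the function name `standardize` and the simulated matrix are hypothetical.

```python
import numpy as np


def standardize(X):
    """Center each column to mean 0 and scale it to unit sample standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by zero
    return (X - mean) / std


# Hypothetical data with n = 30 observations and p = 50 variables (p > n).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(30, 50))
Z = standardize(X)
print(Z.shape)
```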
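For reference, the squared Mahalanobis distance in its standard textbook form is given below. The symbols $\bar{x}$ (sample mean vector) and $S$ (sample covariance matrix) are notation assumed here and are not taken from the original text.

```latex
d_i^2 = (x_i - \bar{x})^{\top} S^{-1} (x_i - \bar{x}), \qquad i = 1, \dots, n
```

Replacing $S$ with the identity matrix gives the squared Euclidean distance, which ignores the correlation between variables. The form above requires $S$ to be invertible.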

Classic Mahalanobis Distance Detection
The steps for detecting outliers using the classical Mahalanobis distance method are as follows [13]:
5. Evaluate the Mahalanobis distance against the chi-square cut-off value ($\chi^2$), i.e., if $d_i^2 > \chi^2_{p,(1-\alpha)}$, then the i-th observation point is identified as an outlier.
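A classical Mahalanobis distance check of this kind can be sketched as follows. This is a minimal sketch based on the standard definition of the distance, not necessarily the exact steps in [13]. The function names, the simulated data, and the use of a pseudo-inverse (so that the code still runs when the covariance matrix is singular) are assumptions. The cut-off 67.50481 is the chi-square value for p = 50 and α = 0.05 reported in the results.

```python
import numpy as np

# Chi-square cut-off for p = 50 and alpha = 0.05, as reported in the text.
CUTOFF_P50 = 67.50481


def classical_mahalanobis_sq(X):
    """Squared Mahalanobis distance of every row of X to the sample mean."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Assumption: a pseudo-inverse is used so the sketch also runs when the
    # covariance matrix is singular (for example when p > n).
    cov_inv = np.linalg.pinv(cov)
    diff = X - center
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def flag_outliers(d2, cutoff):
    """Boolean mask that is True where the squared distance exceeds the cut-off."""
    return np.asarray(d2) > cutoff


# Hypothetical data: 74 observations, 50 variables, one gross outlier in row 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(74, 50))
X[0] += 25.0
d2 = classical_mahalanobis_sq(X)
print(np.flatnonzero(flag_outliers(d2, CUTOFF_P50)))
```

Because the mean and covariance are computed from all observations, including the outliers themselves, the classical distances of extreme points are pulled toward the rest of the data, which is the motivation for a robust alternative.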

Outlier Detection with Mahalanobis Minimum Vector Variance
Generally, there are two measures for dealing with multivariate problems related to covariance: the total variance (TV) and the generalized variance (GV). The GV is usually called the covariance determinant (CD) and is defined as $|\Sigma|$. The TV is defined as $\mathrm{Tr}(\Sigma)$ and is usually used in dimension reduction problems such as principal component analysis. The CD can be used in almost all multivariate problems. However, in its application to principal component analysis, the CD has limitations if its value is close to zero or equal to zero. Therefore, a new concept for solving this problem was introduced, known as the vector variance (VV) and defined as $\mathrm{Tr}(\Sigma^2)$ [3].
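The three covariance summaries can be compared on a small example. The sketch below uses a hypothetical 2 × 2 singular covariance matrix to show that the covariance determinant collapses to zero while the vector variance stays positive; the function name `covariance_summaries` is an assumption for illustration.

```python
import numpy as np


def covariance_summaries(S):
    """Return (total variance, generalized variance, vector variance) of S."""
    S = np.asarray(S, dtype=float)
    tv = np.trace(S)        # Tr(S)
    gv = np.linalg.det(S)   # |S|
    vv = np.trace(S @ S)    # Tr(S^2), the sum of squared entries of a symmetric S
    return tv, gv, vv


# Hypothetical singular covariance matrix: the two variables are perfectly correlated.
S = np.array([[4.0, 2.0],
              [2.0, 1.0]])
tv, gv, vv = covariance_summaries(S)
print(tv, gv, vv)
```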
The effectiveness of VV computation led to the development of VV as a robust estimator obtained by minimizing the vector variance of the data. The MVV criterion for labeling outliers and its application to principal components were first introduced by Herwindiati. The MVV estimator is based on the set of h data points that gives the minimum vector variance value among all possible sets containing h data. The MVV estimates of the location parameter and the covariance matrix are given by the formula in [14].
The algorithm for detecting outliers with the Minimum Vector Variance (MVV) method is as follows:
3. Sort the squared distances in increasing order, $d^2_{\pi(1)} \le d^2_{\pi(2)} \le \dots \le d^2_{\pi(n)}$. This order gives a permutation of the observation index.
4. Form a new set $H_{new}$ consisting of the h observations with index π(1), π(2), …, π(h).
Finally, if the robust Mahalanobis distance of an observation exceeds the chi-square cut-off value, then the observation point is identified as an outlier.
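An iteration of this kind can be sketched as follows. Only part of the algorithm survives in the text, so this is a sketch under stated assumptions rather than the authors' procedure: the random starting subset, the stopping rule (stop when Tr(S²) no longer decreases), the choice h = ⌊(n + p + 1)/2⌋, the pseudo-inverse, and the name `mvv_estimate` are all assumptions. The example uses hypothetical low-dimensional data with p = 3, for which the 95% chi-square cut-off is 7.815.

```python
import numpy as np


def mvv_estimate(X, h, max_iter=100, seed=0):
    """Search for an h-subset with small vector variance Tr(S^2).

    Returns the subset mean, the subset covariance, and the squared
    robust distances of all observations to that mean.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    subset = rng.choice(n, size=h, replace=False)  # assumed random start
    best = None
    for _ in range(max_iter):
        center = X[subset].mean(axis=0)
        cov = np.atleast_2d(np.cov(X[subset], rowvar=False))
        vv = np.trace(cov @ cov)
        if best is not None and vv >= best[0]:
            break  # vector variance no longer decreases
        # Assumption: pseudo-inverse, so the sketch also runs when cov is singular.
        diff = X - center
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.pinv(cov), diff)
        best = (vv, center, cov, d2)
        # Sorting the distances gives the permutation pi; keep pi(1), ..., pi(h).
        subset = np.argsort(d2)[:h]
    _, center, cov, d2 = best
    return center, cov, d2


# Hypothetical data: 60 observations, 3 variables, rows 0-5 shifted far away.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
X[:6] += 10.0
h = (X.shape[0] + X.shape[1] + 1) // 2  # assumed subset size
center, cov, d2 = mvv_estimate(X, h)
flags = d2 > 7.815  # chi-square cut-off for p = 3, alpha = 0.05
print(np.flatnonzero(flags))
```

Unlike the classical distance, the center and covariance here come from the h most central observations only, so extreme points do not inflate the estimates used to judge them.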

Outlier Detection Using the Classical Mahalanobis Distance Method with p>n
The results of outlier detection using the classical Mahalanobis distance method on blood glucose levels are shown in Table 1. Table 1 shows the data for non-invasive blood glucose levels. The red dots in Figure 2 indicate outlier observations, while the black dots are normal observations. The horizontal axis is the i-th observation, and the vertical axis Md is the Mahalanobis distance. The blue horizontal line is the cut-off value $\chi^2_{(50;\,0.95)}$, which is 67.50481. The detected outliers tend to lie near the blue horizontal line, and the scattered position of the dots makes it difficult to identify the group of outliers.

Outlier Detection Using the Mahalanobis MVV Distance Method with p>n
The results of outlier detection using the Mahalanobis MVV distance method on blood glucose level data are shown in Table 2. For the 2019 data with n = 30 and n = 40 observations, the Mahalanobis distances obtained with the robust MVV method range from 41.52 to 1896.15. The cut-off value $\chi^2_{(50;\,0.95)}$ for the data is 67.50481. Observations are said to be outliers if they exceed the chi-square cut-off, so based on the robust MVV Mahalanobis distances, four observations are classified as outliers for n = 30 and six for n = 40. The number of outliers detected by the Mahalanobis MVV distance method is smaller than that detected by the classical Mahalanobis distance method. Observations classified as outliers based on the Mahalanobis MVV distance can be seen in the scatter plot in Figure 3. The outliers detected in Figure 3 are grouped more clearly in position than those from the classical method. The robust Mahalanobis distance obtained can identify the minimum data group and indicate data points that are extreme outliers.

Mahalanobis Distance Outlier Detection with n>p
This section covers the detection of outliers in multivariate data where the number of observations is greater than the number of variables. The data used are the blood glucose level data with all 74 observations. The detection results are summarized in the following table. For all observations of the 2019 blood glucose level data, the Mahalanobis distances obtained with the robust MVV method range from 27.35 to 1896.15. The cut-off value $\chi^2_{(50;\,0.95)}$ for the data is 67.50481. Observations are said to be outliers if they exceed this cut-off. Based on the classical Mahalanobis distance, two observations were classified as outliers, whereas the robust MVV method identified 12 outliers. The number of outliers detected by the Mahalanobis MVV distance method is greater than that detected by the classical Mahalanobis distance method. This is the opposite of the p>n case, where the number of observations identified as outliers by the MVV method is smaller than by the classical method. Observations classified as outliers based on the Mahalanobis MVV distance can be seen in the scatter plot in Figure 4. The vertical axis labeled x in Figure 4 is the Mahalanobis distance. Some outlier observations are detected by the MVV method but not by the classical method. Finding the points that indicate outliers facilitates further statistical analysis of the blood glucose level data. The outliers detected by the MVV method are grouped more clearly in position than those from the classical method on n>p data. The outliers obtained from the Mahalanobis distance can identify the minimum data group and data points that are extreme outliers.

CONCLUSION
Based on the study's results, it can be concluded that the Mahalanobis MVV distance is better suited to detecting outliers in the 2019 non-invasive blood glucose level data with n>p. The classical Mahalanobis distance is limited in identifying extreme outliers in these data compared to the MVV method, which is more robust against outliers. For high-dimensional data where p>n, the MVV method is more effective than the classical Mahalanobis distance method in identifying the minimum data group and data points that are extreme outliers. However, some outliers in the high-dimensional data are identified by the classical Mahalanobis distance but not by the Mahalanobis MVV distance.