COMPARISON OF K-MEANS AND GAUSSIAN MIXTURE MODEL IN PROFILING AREAS BY POVERTY INDICATORS

ABSTRACT


INTRODUCTION
Various indicators influence poverty in different areas. Appropriate measurement of poverty helps in determining the number of poor people, their distribution, and the conditions of poverty. The method BPS uses to calculate the poor population is the basic needs approach. Under this approach, poverty is seen as an inability from an economic standpoint, so poverty status is measured against the poverty line. The basic needs approach uses the Head Count Index (HCI) indicator. Apart from the headcount index (P0), other indicators are used to measure poverty levels, namely the poverty gap index (P1) and the distributionally sensitive index (P2) formulated by Foster-Greer-Thorbecke [1]. This method is the basis for calculating the percentage of poor people in all districts and cities.
When viewed by island, the number of poor people on the island of Java in 2020 was 14.05 million. This number shows that over half of Indonesia's poor population lives on Java. The increase in the number of poor people in Java occurred because Java recorded many more Covid-19 cases than the other islands in Indonesia [2]. Central Java is one of the provinces in Java that has been most affected by Covid-19. In 2020 its percentage of poor people was 11.41%, an increase from 10.80% in 2019. According to the Socio-Demographic Survey on the impact of Covid-19, almost 50% of respondents in the low-income group (<1.8 million rupiah) said they had experienced a decline in income. This decline increases poverty because more and more people have an average expenditure below the poverty line. This condition will undoubtedly be a major challenge for the Central Java Provincial government in overcoming the rising poverty rate.
Poverty that persists in a region over the long term hampers national development. The government needs an overview of the poverty of each district/city in Central Java in order to adopt poverty alleviation policies. To support the successful implementation of development programs to reduce poverty in Central Java Province, a study is needed to classify districts/cities in Central Java with nearly identical, or homogeneous, poverty characteristics. One method that can be used to extract such descriptive information on poverty is clustering. Clustering aims to group data with the same characteristics into one group. With this grouping, the actual distribution of the data can be seen and a solution to the problem can be sought. Two clustering methods are K-Means and the Gaussian Mixture Model. K-Means is a non-hierarchical cluster analysis that seeks to divide data with the same characteristics into one cluster. The K-Means algorithm works by minimizing the sum of squared distances between each data point and its cluster center (centroid-based). Meanwhile, the Gaussian Mixture Model is a method that assumes that each Gaussian distribution represents a cluster. Each Gaussian is represented by a combination of mean and variance.
Several previous studies on poverty, as in [3] - [5], used K-Means and Average Linkage to map the characteristics of each group formed based on the value of each poverty indicator. Meanwhile, research that compares the performance of K-Means and GMM can be seen in [6]. The results of this study indicate that the GMM algorithm is superior to the K-Means algorithm based on the accuracy and speed of computation.
Based on the results of previous studies, researchers want to use the K-Means and GMM algorithms for grouping poverty data. The use of the GMM algorithm is relatively new for poverty indicator data. This study aims to classify poverty based on districts/cities in Central Java Province in 2020 using the K-Means algorithm and the Gaussian Mixture Model (GMM). Furthermore, profiling of the cluster results was carried out to map poverty in Central Java.

Cluster Assumption
There are two assumptions that must be fulfilled in cluster analysis [7], which are as follows.

1) Representative of the Sample
A representative sample is a sample with the same characteristics as the population. Using a representative sample gives the best results and reflects the conditions of the existing population. If the research uses population data, the representativeness assumption can be considered fulfilled [7].
Another way to check whether a sample is representative is the Kaiser-Meyer-Olkin (KMO) test. The KMO test is conducted to see whether the sample represents the existing population so that the clustering process can be carried out correctly. It measures sampling adequacy for each indicator. The KMO statistic takes values from 0 to 1; if the KMO value is more than 0.5, the sample can be said to represent the population [8]. The following equation describes the KMO test [7]:
$$KMO = \frac{\sum_{i=1}^{p}\sum_{j \ne i} r_{ij}^{2}}{\sum_{i=1}^{p}\sum_{j \ne i} r_{ij}^{2} + \sum_{i=1}^{p}\sum_{j \ne i} a_{ij}^{2}}$$
where: $p$ is the number of variables, $r_{ij}$ is the correlation between variables $i$ and $j$, and $a_{ij}$ is the partial correlation between variables $i$ and $j$.
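As an illustration, the KMO statistic can be computed directly from the simple and partial correlation matrices. The sketch below is a minimal Python/NumPy version on simulated data (the function name and data are ours, not from the study); it uses the fact that partial correlations can be obtained from the inverse of the correlation matrix.

```python
import numpy as np

def kmo(X):
    """Kaiser-Meyer-Olkin measure of sampling adequacy for an (n, p) data matrix."""
    R = np.corrcoef(X, rowvar=False)      # simple correlations r_ij
    P = np.linalg.inv(R)                  # precision matrix
    d = np.sqrt(np.diag(P))
    A = -P / np.outer(d, d)               # partial correlations a_ij
    np.fill_diagonal(A, 0.0)              # exclude i = j terms
    np.fill_diagonal(R, 0.0)
    r2, a2 = (R ** 2).sum(), (A ** 2).sum()
    return r2 / (r2 + a2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] += 0.8 * X[:, 0]                  # induce some shared variance
print(round(kmo(X), 3))                   # a value in (0, 1); > 0.5 is adequate
```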

2) Impact of Multicollinearity
The assumption in cluster analysis is that there is no multicollinearity between variables. One way to detect multicollinearity is the Variance Inflation Factor (VIF):
$$VIF_j = \frac{1}{1 - R_j^{2}}$$
where $R_j^{2}$ is the coefficient of determination obtained by regressing variable $j$ on the other variables. If the VIF value exceeds 10, it can be concluded that there is multicollinearity among the variables [9].
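A minimal Python/NumPy sketch of the VIF check (the data are simulated, not the study's): for standardized variables, $VIF_j$ equals the $j$-th diagonal element of the inverse correlation matrix.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the
    inverse correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

rng = np.random.default_rng(1)
x0 = rng.normal(size=200)
x1 = x0 + 0.05 * rng.normal(size=200)   # nearly collinear with x0
x2 = rng.normal(size=200)
X = np.column_stack([x0, x1, x2])
print(vif(X))   # first two VIFs far exceed 10 -> multicollinearity
```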

Principal Component Analysis (PCA)
1) Create an $n \times p$ matrix $X$ containing the data from the $p$ variables that have been standardized.
2) Form the correlation matrix $R$ from $X$, namely $R = \frac{1}{n-1} X'X$. Principal component reduction begins by finding the eigenvalues $\lambda$ from the equation $|R - \lambda I| = 0$. The number of principal components selected is based on the eigenvalues; components with $\lambda > 1$ are retained [11].
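The steps above can be sketched in Python/NumPy on simulated data (illustrative only), applying the $\lambda > 1$ selection rule from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=150)   # one correlated pair

Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize
R = np.corrcoef(Z, rowvar=False)                  # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)              # solve |R - lambda*I| = 0
order = np.argsort(eigvals)[::-1]                 # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

keep = eigvals > 1                                # Kaiser criterion
scores = Z @ eigvecs[:, keep]                     # principal component scores
print(eigvals.round(3), "-> keep", keep.sum(), "component(s)")
```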

Determination of the Optimum Number of Clusters
There are several approaches to determining the optimum number of clusters: the connectivity index, Dunn index, and silhouette index. The formula for each of these indices is as follows [12].

1) Connectivity Index
$$Conn(C) = \sum_{i=1}^{N} \sum_{j=1}^{L} x_{i,\, nn_{i(j)}}$$
where $nn_{i(j)}$ is the $j$-th nearest neighbor of observation $i$; $x_{i,\, nn_{i(j)}}$ equals 0 if $i$ and $nn_{i(j)}$ are in the same cluster and $1/j$ otherwise; and $L$ is a parameter that determines the number of neighbors contributing to the connectivity measure.

2) Dunn Index
The Dunn index is the ratio of the smallest distance between observations in different clusters to the largest within-cluster distance:
$$D(C) = \frac{\min_{k \ne k'} d(C_k, C_{k'})}{\max_{m} \operatorname{diam}(C_m)}$$
where $k$, $k'$, and $m$ are cluster indices, $d(C_k, C_{k'})$ measures the distance between clusters $C_k$ and $C_{k'}$, and $\operatorname{diam}(C_m)$ measures the diameter (largest within-cluster distance) of cluster $C_m$.

3) Silhouette Index
The silhouette index is used to measure confidence in the assignment of each observation in the clustering process. The clustering result is said to be good if the index value is close to 1, and poor if it is close to -1.
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
where $a(i)$ is the average distance between observation $i$ and the other data in the same cluster, and $b(i)$ is the smallest average distance between $i$ and the data in any other cluster.
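As a quick illustration of the silhouette criterion (connectivity and Dunn have no ready-made scikit-learn equivalent and are omitted here), the snippet below clusters simulated, well-separated data, for which the index should approach 1. The data and library choice are ours, not the study's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated blobs: silhouette should be close to 1.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(round(silhouette_score(X, labels), 3))
```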

K-Means Algorithm
The K-Means algorithm was first proposed by MacQueen (1967) [13] and developed by Hartigan and Wong in 1979 [14]; it aims to divide $n$ data points in $p$ dimensions into $k$ clusters. The clustering steps using the K-Means algorithm are as follows [15].

1) Determine the number of clusters;
2) Randomly allocate the initial cluster centroids;
3) Compute the distance from each data point to each centroid using the Euclidean distance: $d(x_i, c_j) = \sqrt{\sum_{l=1}^{p}(x_{il} - c_{jl})^{2}}$;
4) Allocate each data point to the nearest centroid;
5) Calculate the new centroid as the average of the data in each cluster: $c_j = \frac{1}{n_j}\sum_{x_i \in C_j} x_i$;
6) If any data point still changes cluster, return to step 3.
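The steps above can be sketched as a minimal NumPy implementation (illustrative only; the names and simulated data are ours):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means: random initial centroids, Euclidean
    assignment, mean update, repeat until centroids are stable."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]       # init
    for _ in range(max_iter):
        # distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                   # assign
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):                             # converged
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(6, 0.5, (40, 2))])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels))   # cluster sizes
```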

Gaussian Mixture Model Algorithm
McLachlan and Basford (1989) provided an approach that pays attention to the data distribution, namely model-based analysis [16]. Model-based clustering is a clustering algorithm that uses statistical modeling to analyze the resulting groups. It assumes that the data are generated by a mixture of probability distributions, with each component representing a different cluster. If the model is a mixture of Gaussian components, it is called the Gaussian Mixture Model. The Gaussian Mixture Model assumes that each Gaussian distribution represents a cluster, and each Gaussian is represented by a combination of mean and variance. The purpose of grouping with the Gaussian Mixture Model is to find the model parameters (the mean and covariance matrix of each distribution, and the weights) so that the resulting model best fits the data.
Fraley and Raftery (2003) identified several models used to group data with various geometric properties, obtained through Gaussian components with different parameters [17], as seen in Table 1. The geometric characteristics of the distribution (orientation, volume, and shape) can vary between groups or be constrained to be the same across groups. The variance matrix can likewise be equal across components or vary. In one dimension, only two models are available: E for equal variance and V for different variances. For more than one dimension, the geometric characteristics identify the model. For example, in the EVI model, the volume of all clusters is the same (E), the shape of the clusters varies (V), and the orientation is the identity (I); clusters under the EVI model have a diagonal covariance matrix with orientation parallel to the coordinate axes. Figure 1 illustrates the shapes of these models.
To get the best results, the thing that must be done is to maximize the possibilities of the data from the GMM model. This can be achieved using the expectation maximization (EM) algorithm. In each iteration using the EM algorithm, there are two stages: the expectation stage (E-Step) and the maximization stage (M-Step).
The clustering steps using the EM algorithm in the Gaussian Mixture Model are as follows.

1) Randomly initialize $\mu_k$, $\Sigma_k$, and $\pi_k$ for all clusters, where $\mu_k$ is the mean, $\Sigma_k$ is the covariance, $\pi_k$ is the mixing coefficient, and $k$ indexes the Gaussian components of the mixture, each of which corresponds to a cluster.
2) E-Step: Evaluate the responsibilities and the log-likelihood using the current parameters $\mu_k$, $\Sigma_k$, and $\pi_k$. If cluster $k$ is represented by a Gaussian distribution $\mathcal{N}(x \mid \mu_k, \Sigma_k)$, the probability that observation $x_n$ belongs to cluster $k$ is calculated from:
$$\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
Then calculate the log-likelihood:
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
3) M-Step: Re-estimate $\mu_k$, $\Sigma_k$, and $\pi_k$ using the responsibilities:
$$N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n,$$
$$\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k^{new})(x_n - \mu_k^{new})^{T}, \qquad \pi_k^{new} = \frac{N_k}{N}$$
4) Repeat steps 2 and 3 until the convergence criteria are met, i.e., until the changes in the means and variances between successive iterations fall below specified thresholds. Cluster members are then assigned using Maximum a Posteriori (MAP) classification: observation $x_n$ is assigned to the cluster $k$ with the largest $\gamma(z_{nk})$.
The best model in the Gaussian Mixture Model (GMM) method is commonly selected using the Bayesian Information Criterion (BIC). Fraley and Raftery (1998) approached mixture model selection through an approximation to the Bayes factor, with systematic selection of both the model parameterization and the number of groups [19]. In this formulation, the greater the BIC value, the stronger the evidence for the model and number of clusters. The BIC is obtained from:
$$2 \ln p(X \mid M_k) \approx 2 \ln p(X \mid \hat{\theta}_k, M_k) - V_k \ln(N) = \mathrm{BIC}$$
where: $p(X \mid M_k)$ is the integrated likelihood of model $M_k$, $p(X \mid \hat{\theta}_k, M_k)$ is the maximized mixture likelihood for the model, and $V_k$ is the number of independent parameters estimated in the model.
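As an illustration of the responsibilities, MAP assignment, and BIC, the sketch below fits a two-component GMM with scikit-learn on simulated data (our choice of library and data). Note that scikit-learn defines BIC as $-2\ln L + V_k \ln N$ and minimizes it, the opposite sign convention to the mclust-style BIC maximized above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(4, 0.5, (60, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM under the hood
resp = gmm.predict_proba(X)     # responsibilities gamma(z_nk); rows sum to 1
labels = resp.argmax(axis=1)    # MAP classification
print(gmm.weights_.round(2), round(gmm.bic(X), 1))
```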

Materials
The data source in this study is the official website of the Central Bureau of Statistics (BPS) of Central Java [20]. The data are secondary data from BPS, which follow the concepts in the Handbook on Poverty and Inequality published by the World Bank. The poor population is calculated using the BPS basic needs approach.
This study uses poverty indicators which consist of 4 variables, namely poverty line (GK), percentage of poor population (P0), poverty depth index (P1), and poverty severity index (P2). Since each variable has a different unit, the data is standardized using the Z-score method before carrying out the cluster analysis.
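A minimal sketch of the Z-score standardization step (the numbers are illustrative placeholders, not the actual BPS figures):

```python
import numpy as np

# Z-score standardization: (x - mean) / sd per variable, so that variables
# on different scales (e.g. GK in rupiah, P0 in percent) contribute
# comparably to the distance calculations.
X = np.array([[350_000.0, 11.4, 1.8, 0.4],    # columns: GK, P0, P1, P2
              [410_000.0,  9.7, 1.2, 0.3],    # (hypothetical values)
              [298_000.0, 14.2, 2.4, 0.6]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))  # ~0 mean, unit sd
```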

Cluster Assumptions
Based on the results of the KMO test, a KMO value of 0.53 was obtained, which exceeds the 0.5 threshold. This means the sample represents the population, so the analysis can proceed to the next stage. Next, multicollinearity was checked using the VIF of each variable. Based on the VIF values in Table 2, some variables have a VIF of more than 10, indicating multicollinearity among the variables in the data used. Therefore, PCA is needed before cluster analysis to overcome this condition.

Overcoming Multicollinearity with PCA
PCA reduces the original variables to a smaller number of new variables, the principal components (PC), each of which is a linear combination of the original variables. The number of selected principal components is based on the eigenvalues $\lambda$ obtained from Equation (3); the resulting eigenvalues are 2.820, 1.041, 0.132, and 0.007. Since eigenvalues greater than one occur only for the first two components, two components (PC1 and PC2) are retained. The cluster assumptions were then retested on the PCA results, and all assumptions were fulfilled.

Determination of the Optimum Number of Clusters
Several approaches can be used to find the best number of clusters. This study used the connectivity, Dunn, and silhouette indices, calculated based on Equations (4)-(6). The best number of clusters is determined by the smallest connectivity index, the largest Dunn index, and the silhouette index closest to 1. Table 3 shows the index values obtained from the district/city poverty indicators in Central Java in 2020. Comparing the candidate numbers of clusters against these criteria for both algorithms, almost all indices indicate that the best number of clusters is three.
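This index-based selection can be illustrated with the silhouette criterion alone: fit K-Means for a range of k and keep the k with the largest silhouette value. The sketch below uses simulated three-group data (our construction), which the procedure should recover.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0, 4, 8)])  # 3 groups

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)   # k with silhouette closest to 1
print(best_k)
```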

K-Means Clustering
The K-Means algorithm starts by choosing $k$, the number of clusters to form. Then $k$ initial values are set as temporary cluster centers, or centroids. Here $k$ is 3, and the initial centroid values were selected as shown in Table 4 below. After selecting the initial centroids randomly, the Euclidean distance is calculated with Equation (7); after obtaining the cluster results of the first iteration, the calculation proceeds to the second iteration with new centroids derived from the average value of the data in each cluster. In this study, five iterations were carried out until no more data points changed clusters. Based on these results, the K-Means algorithm places ten districts/cities in Cluster 1, 17 in Cluster 2, and eight in Cluster 3, as shown in Figure 2 above.

GMM Clustering
In data clustering using the GMM, nine models were identified to group data with various geometric properties, which can be seen in Table 1. The best model can be determined based on the BIC. This study uses R with the help of the MCLUST package, which provides nine models with several components from 1 to 9. The best model is EII, with an optimal number of components of 3, as shown in Figure 3 below.
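The MCLUST-style search over model parameterizations and component counts can be approximated in scikit-learn, whose covariance_type options (spherical, diag, tied, full) are loosely analogous to the mclust model families (EII is a spherical, equal-volume model). The sketch below keeps the (type, k) pair with the lowest scikit-learn BIC (simulated data, illustrative only; scikit-learn minimizes its BIC, while mclust maximizes its version).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])

# Grid over covariance families and component counts; keep lowest BIC.
best = min(
    ((cov, k, GaussianMixture(n_components=k, covariance_type=cov,
                              random_state=0).fit(X).bic(X))
     for cov in ("spherical", "diag", "tied", "full")
     for k in range(1, 5)),
    key=lambda t: t[2],
)
print(best[:2])   # (covariance family, number of components)
```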

Figure 3. BIC value of GMM result
The GMM in R is fitted using the Expectation-Maximization (EM) algorithm. The final results are the mixing proportions, mean vectors, and covariance matrices. We also obtain the cluster membership probabilities for each observation: Cluster 1 covers 10 districts/cities, Cluster 2 covers 19, and Cluster 3 covers 6, as shown in Figure 4 below. In Figure 4, members of Cluster 1 are indicated by blue dots, Cluster 2 by red, and Cluster 3 by green.
After carrying out the clustering analysis, the K-Means and Gaussian Mixture Model results are compared to determine which gives the better clustering. This study uses three indices (connectivity, Dunn, and silhouette), as shown in Table 3 above. With three clusters, all indices show that the GMM algorithm produces better cluster analysis results than the K-Means algorithm.

Cluster Outcome Profiling
After performing cluster analysis with the K-Means and GMM algorithms, the best results were obtained with GMM. Cluster profiling was then carried out on these results by examining the average of each cluster. Based on Table 6, the characteristics of each cluster differ. For each variable, the cluster averages are categorized from smallest to largest as low, medium, and high; green indicates low, yellow indicates medium, and red indicates high. In general, the districts/cities in Cluster 1 have a moderate poverty line and a high percentage of poor people, poverty depth index, and poverty severity index. The distribution of Cluster 1 can be seen on the map in Figure 5, shown in blue. Districts in Cluster 2 have a low poverty line and poverty severity index and a moderate percentage of poor people and poverty depth index; the distribution of Cluster 2 is shown in red. Districts in Cluster 3 have a high poverty line, a low percentage of poor people and poverty depth index, and a moderate poverty severity index; the distribution of Cluster 3 is shown in green.
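The profiling step, computing each cluster's average per indicator, amounts to a grouped mean. A pandas sketch with hypothetical values (not the actual Table 6 figures):

```python
import pandas as pd

# Illustrative district-level values only; cluster labels and figures
# are made up to show the computation, not taken from the study.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3],
    "GK":  [380_000, 390_000, 330_000, 340_000, 450_000, 460_000],
    "P0":  [14.1, 13.8, 10.9, 11.2, 7.8, 8.1],
})
profile = df.groupby("cluster").mean()   # per-cluster averages, as in Table 6
print(profile)
```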

CONCLUSIONS
The fact that a GMM provides an estimate of the probability that each data point belongs to each cluster is one of its key advantages. Compared with the hard cluster assignments that most other clustering algorithms produce, this offers much more contextual information. Another advantage of GMM over methods such as K-Means clustering is that it does not presuppose that all clusters are spherical; clusters with different shapes can be accommodated.
Based on the comparison between K-Means and GMM, all clustering indices (connectivity, Dunn, and silhouette) show that GMM gives the best clustering results, with three clusters.