K-MEANS CLUSTER COUNT OPTIMIZATION WITH SILHOUETTE INDEX VALIDATION AND DAVIES BOULDIN INDEX (CASE STUDY: COVERAGE OF PREGNANT WOMEN, CHILDBIRTH, AND POSTPARTUM HEALTH SERVICES IN INDONESIA IN 2020)

ABSTRACT


INTRODUCTION
The Sustainable Development Goals (SDGs) are a global action plan agreed upon by world leaders, including Indonesia, to end poverty, reduce inequality, and protect the environment. One of the SDG targets in the health sector is to improve public health, as indicated by a decrease in the Maternal Mortality Rate (MMR) [3]. In Indonesia, MMR continues to increase every year, and one of the contributing factors is the declining percentage coverage of health services for pregnant women, childbirth, and postpartum care across Indonesian provinces [11]. One way to address this decline is to determine in advance which provinces need priority attention by grouping the 34 provinces of Indonesia. This study aims to obtain the best provincial grouping so that the right provinces can be prioritized. A suitable method for grouping provinces is cluster analysis, applied here to data on health services for pregnant women, childbirth, and postpartum care in Indonesia in 2020.
Cluster analysis is suitable because it can show which provinces fall into high or low clusters of health services for pregnant women, childbirth, and postpartum care by identifying the characteristics of the 34 provinces. Cluster analysis is divided into hierarchical and non-hierarchical methods, and K-Means is one of the non-hierarchical methods. Hierarchical methods (agglomerative and divisive) are less efficient and take longer to compute than K-Means when grouping large amounts of data [5], so this study uses the K-Means method.
K-Means is a simple non-hierarchical method that groups n objects into k clusters with similar characteristics and can be applied to numerical data. Its disadvantage is sensitivity to the initial choice of the k clusters: because the initialization is generally done at random, different runs can produce different groupings, and the results are not always accurate [8]. A suitable number of clusters k can be determined using cluster validation, in this study the Silhouette Index and the Davies Bouldin Index. Both indices can identify the optimal number of clusters with stable and consistent results. The candidate numbers of clusters in this study were K = 2, 3, and 4, and the optimal value was taken to be the one with the highest Silhouette Index and the lowest Davies Bouldin Index.
Objects in K-Means are grouped by similarity, so the distance measure plays an important role because it determines the degree of similarity between data points. Two common measures are the Euclidean and Manhattan distances. The Euclidean distance is used very often, but previous research reports that the Manhattan distance performs better than the Euclidean distance in clustering [17]. Therefore, this study uses both distances and compares their results.

RESEARCH METHODS
Maternal health services are health efforts concerned with the care of pregnant women, women in childbirth, and breastfeeding mothers [12]. The health services for pregnant women implemented in Indonesia are antenatal visits, which are pregnancy checks with health workers; provision of iron (blood-supplement) tablets to prevent anemia; classes for pregnant women held at local government clinics; and supplementary feeding for pregnant women with chronic energy deficiency, aimed at overcoming malnutrition. Childbirth services refer to deliveries assisted by trained health workers and carried out in healthcare facilities. Health services for postpartum mothers include the provision of vitamin A supplements as early prevention of vitamin A deficiency.
Cluster analysis groups n objects based on p variables so that objects with relatively similar characteristics fall in the same cluster and the diversity within a cluster is smaller than the diversity between clusters [9]. Cluster analysis can be used on ordinal, interval, and ratio data scales. It serves as a data summarizer that groups objects by the similarity of the characteristics under study, which means it is not used to relate or distinguish samples or other variables. Two assumptions are checked before conducting cluster analysis: that the sample represents the population, and that there is no multicollinearity.

a. Sample Representing Population
Whether the sample represents the population is tested with the Kaiser Meyer Olkin (KMO) statistic, formulated in Equation 1:

$$\mathrm{KMO} = \frac{\sum_{j=1}^{p} \sum_{l \neq j} r_{jl}^{2}}{\sum_{j=1}^{p} \sum_{l \neq j} r_{jl}^{2} + \sum_{j=1}^{p} \sum_{l \neq j} a_{jl}^{2}} \qquad (1)$$

where j = 1, 2, 3, ..., p and l = 1, 2, 3, ..., p, for $j \neq l$; $r_{jl}$ is the Pearson correlation coefficient between variables j and l; and $a_{jl}$ is the partial correlation coefficient between variables j and l, keeping variable m constant.

Test Criteria
A sample is considered representative of the population if the KMO value obtained is greater than 0.5.
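The KMO criterion can be illustrated with a minimal Python sketch (the study itself used R). It assumes the standard overall KMO form, in which the partial correlations are taken from the inverse of the correlation matrix, that is, controlling for all remaining variables. The function names `invert` and `kmo` and the correlation matrix are hypothetical and are not taken from the study's data.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix on the right.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def kmo(corr):
    """Overall KMO computed from a p x p Pearson correlation matrix."""
    p = len(corr)
    prec = invert(corr)
    sum_r2 = 0.0  # sum of squared Pearson correlations, j != l
    sum_a2 = 0.0  # sum of squared partial correlations, j != l
    for j in range(p):
        for l in range(p):
            if j != l:
                # Partial correlation of j and l given the remaining variables.
                a_jl = -prec[j][l] / (prec[j][j] * prec[l][l]) ** 0.5
                sum_r2 += corr[j][l] ** 2
                sum_a2 += a_jl ** 2
    return sum_r2 / (sum_r2 + sum_a2)


# Hypothetical correlation matrix for three variables (illustration only).
example_corr = [[1.0, 0.5, 0.5],
                [0.5, 1.0, 0.5],
                [0.5, 0.5, 1.0]]
kmo_value = kmo(example_corr)
is_representative = kmo_value > 0.5  # test criterion from the text
```

For real data the correlation matrix would first be estimated from the observed variables; the sketch starts from the matrix to stay short.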

b. Multicollinearity Test
Multicollinearity is the presence of a strong relationship or correlation between variables. One way to detect it is to calculate the Variance Inflation Factor (VIF), formulated in Equation 2 [4]:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}} \qquad (2)$$

where $R_j^{2}$ is the coefficient of determination obtained when variable j is treated as the dependent variable and the remaining variables as independent variables. If the VIF value is less than 10, there is no multicollinearity.
Cluster analysis groups similar objects into the same cluster, so a distance measure is needed to quantify how similar the objects are. This study uses the Euclidean and Manhattan distances.
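As a small worked illustration of the VIF rule, the Python sketch below converts a coefficient of determination into a VIF value and flags variables at or above the threshold of 10. The function name `vif` and the R-squared values are hypothetical; they are not results from this study.

```python
def vif(r_squared):
    """VIF of one variable, given the R^2 from regressing it on the other variables."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# Hypothetical R^2 values for three variables (illustration only).
example_r2 = {"X1": 0.50, "X2": 0.80, "X3": 0.95}
vif_values = {name: vif(r2) for name, r2 in example_r2.items()}

# Variables whose VIF is 10 or more would indicate multicollinearity.
flagged = [name for name, value in vif_values.items() if value >= 10]
```

In this toy case only the third variable is flagged, because an R-squared of 0.95 corresponds to a VIF of about 20.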

a. Euclidean Distance
The Euclidean distance is the square root of the sum of squared differences between an object and a centroid over all variables. It is calculated with Equation 3 [16]:

$$d(x_i, c_k) = \sqrt{\sum_{j=1}^{p} \left( x_{ij} - c_{kj} \right)^{2}} \qquad (3)$$

where $d(x_i, c_k)$ is the Euclidean distance between the i-th object and the k-th cluster center (centroid), k = 1, 2, ..., K; $x_{ij}$ is the value of the i-th object on the j-th variable; $c_{kj}$ is the value of the k-th centroid on the j-th variable; p is the number of observed variables; and K is the number of clusters.

b. Manhattan Distance
The Manhattan distance is the sum of the absolute differences between an object and a centroid over all variables. It is expressed in Equation 4 [18]:

$$d(x_i, c_k) = \sum_{j=1}^{p} \left| x_{ij} - c_{kj} \right| \qquad (4)$$

where $d(x_i, c_k)$ is the Manhattan distance between the i-th object and the k-th cluster center (centroid), k = 1, 2, ..., K; $x_{ij}$ is the value of the i-th object on the j-th variable; $c_{kj}$ is the value of the k-th centroid on the j-th variable; p is the number of observed variables; and K is the number of clusters.
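The two distance measures can be compared on a single point with the short Python sketch below. The function names and the example object and centroid are hypothetical and serve only to show how the two measures differ.

```python
import math


def euclidean(x, c):
    """Square root of the summed squared differences over all variables."""
    return math.sqrt(sum((xj - cj) ** 2 for xj, cj in zip(x, c)))


def manhattan(x, c):
    """Sum of the absolute differences over all variables."""
    return sum(abs(xj - cj) for xj, cj in zip(x, c))


# Hypothetical object and centroid with p = 2 variables.
obj = [3.0, 4.0]
centroid = [0.0, 0.0]
d_euc = euclidean(obj, centroid)
d_man = manhattan(obj, centroid)
```

Here the Euclidean distance is 5 and the Manhattan distance is 7, which shows that the same pair of points can be ranked differently relative to other pairs depending on the measure.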
K-Means is a non-hierarchical clustering method that seeks to partition data into one or more clusters so that data with the same characteristics are grouped into the same cluster and data with different characteristics are grouped into other clusters. The steps of K-Means are [6]:
a. Determine the number of clusters K to be formed;
b. Randomly determine the initial cluster centers (centroids);
c. Calculate the distance of each object to each centroid;
d. Assign each object to the closest centroid; an object becomes a member of the k-th cluster if its distance to the k-th centroid is the smallest compared with its distance to the other centroids;
e. Determine the new centroids by calculating the average of the objects in each cluster with Equation (5):

$$c_{kj} = \frac{1}{n_k} \sum_{i=1}^{n_k} x_{ij} \qquad (5)$$

where $c_{kj}$ is the centroid of the k-th cluster on the j-th variable; $n_k$ is the number of objects in the k-th cluster; and $x_{ij}$ is the value of the i-th object in that cluster on the j-th variable;
f. Repeat steps c through e until no object changes cluster membership.
After the data have been clustered with K-Means, the resulting clusters need to be validated. Validation evaluates the clusters formed by assigning them a validity value. This study uses two validation measures to determine the optimal number of clusters in K-Means, namely the Silhouette Index and the Davies Bouldin Index.
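Steps c through f above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the R procedure used in the study: the initial centroids of step b are passed in by the caller rather than drawn at random so that the run is reproducible, an empty cluster keeps its previous centroid, and the function names and toy data are hypothetical.

```python
def manhattan(x, c):
    """Sum of the absolute differences over all variables."""
    return sum(abs(xj - cj) for xj, cj in zip(x, c))


def k_means(data, centroids, distance, max_iter=100):
    """Run steps c to f from caller-supplied initial centroids (step b)."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Steps c and d: assign every object to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: distance(x, centroids[k]))
            for x in data
        ]
        # Step f: stop when no object changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step e: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data with two well-separated groups.
toy = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
       [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
labels, centroids = k_means(toy, centroids=[toy[0], toy[3]], distance=manhattan)
```

Passing the distance as a function makes it straightforward to rerun the same procedure with a different measure and compare the groupings.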

a. Silhouette Index Validation
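The Silhouette coefficient is formulated in Equation (6):

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad SC = \frac{1}{n} \sum_{i=1}^{n} s(i) \qquad (6)$$

where $a(i)$ is the average distance from object i to the other objects in its own group and $b(i)$ is the smallest average distance from object i to the objects of any other group. The best grouping is the one with the maximum SC, which means minimizing the distance within the group ($a(i)$) while maximizing the distance between groups ($b(i)$). The greater the value of the silhouette coefficient, the better the quality of the grouping [13].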
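The silhouette coefficient can be illustrated with the Python sketch below, which follows the standard definition based on within-group and nearest-other-group average distances. It assumes at least two clusters and assigns a score of 0 to objects in singleton clusters, which is a common convention rather than something stated in the source. The function names and toy data are hypothetical.

```python
def manhattan(x, y):
    """Sum of the absolute differences over all variables."""
    return sum(abs(xj - yj) for xj, yj in zip(x, y))


def silhouette_coefficient(data, labels, distance):
    """Mean silhouette value over all objects (requires at least two clusters)."""
    n = len(data)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # assumption: singleton clusters score 0
            continue
        # a(i): average distance to the other members of the same cluster.
        a_i = sum(distance(data[i], data[j]) for j in own) / len(own)
        # b(i): smallest average distance to the members of another cluster.
        b_i = min(
            sum(distance(data[i], data[j]) for j in range(n) if labels[j] == k)
            / labels.count(k)
            for k in clusters if k != labels[i]
        )
        denom = max(a_i, b_i)
        scores.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return sum(scores) / n


# Hypothetical one-variable toy data with two clear groups.
toy = [[0.0], [1.0], [10.0], [11.0]]
toy_labels = [0, 0, 1, 1]
sc = silhouette_coefficient(toy, toy_labels, manhattan)
```

For this toy grouping the coefficient is close to 0.9, reflecting small distances within groups and large distances between them.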

b. Davies Bouldin Index Validation
The Davies Bouldin Index is formulated in Equation 7 [14]:

$$DBI = \frac{1}{K} \sum_{k=1}^{K} \max_{v \neq k} R_{k,v} \qquad (7)$$

with

$$R_{k,v} = \frac{S_k + S_v}{d_{k,v}}$$

and
$S_k$: average distance of the objects in the k-th cluster to the k-th cluster centroid
$S_v$: average distance of the objects in the v-th cluster to the v-th cluster centroid
$d_{k,v}$: distance between the k-th cluster centroid and the v-th cluster centroid
The smaller the Davies Bouldin Index (DBI) value obtained (the index is non-negative, DBI ≥ 0), the better the clusters obtained.
The last step in grouping provinces in Indonesia based on health services for pregnant women, childbirth, and postpartum care is to interpret or profile the optimal clusters. Cluster profiling looks at the average value of each variable over the members of each cluster, from which the characteristics of each cluster are obtained [15]. Profiling is the interpretation stage: it describes the nature of each cluster and explains how the clusters differ on the relevant variables [10].
The data used in this study are secondary data obtained from the 2020 Indonesian Health Profile. They consist of the coverage of health services for pregnant women, childbirth, and postpartum care in the 34 provinces of Indonesia in 2020. The research variables are the percentage of pregnant women receiving four antenatal visits (K4) (X1), the percentage of pregnant women given iron (blood-supplement) tablets (X2), the percentage of local government clinics holding classes for pregnant women (X3), the percentage of pregnant women with Chronic Energy Deficiency (CED) receiving supplementary feeding (X4), the percentage of childbirth services assisted by trained health workers (X5), and the percentage of postpartum mothers who received vitamin A (X6).
Data processing in this study was carried out using R software. The stages of data analysis are:
a. Inputting data on health services for pregnant women, childbirth, and postpartum care;
b. Testing the assumption that the sample represents the population with Kaiser Meyer Olkin (KMO);
c. Testing the multicollinearity assumption with the Variance Inflation Factor (VIF); if multicollinearity occurs in any variable, principal component analysis is carried out and the resulting principal component scores are used as input to the subsequent analysis in place of the original variable values;
d. Clustering with K-Means using the Euclidean and Manhattan distances at K = 2, 3, and 4, then selecting as optimal the number of clusters with the highest Silhouette coefficient and the lowest Davies Bouldin Index;
e. Analyzing the optimal clustering result, then profiling and interpreting the regional characteristics of each cluster formed from the best grouping.
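The Davies Bouldin Index described in this section, built from the average distance of objects to their own centroid and the distance between centroids, can be sketched in Python as follows. The function names and the toy data are hypothetical, centroids are taken as cluster means, and the sketch assumes at least two clusters with distinct centroids.

```python
def manhattan(x, y):
    """Sum of the absolute differences over all variables."""
    return sum(abs(xj - yj) for xj, yj in zip(x, y))


def davies_bouldin(data, labels, distance):
    """Average, over clusters, of the worst scatter-to-separation ratio."""
    clusters = sorted(set(labels))
    centroids = {}
    scatter = {}
    for k in clusters:
        members = [x for x, lab in zip(data, labels) if lab == k]
        # Centroid: mean of the members on every variable.
        centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        # Scatter: average distance of the members to their own centroid.
        scatter[k] = sum(distance(x, centroids[k]) for x in members) / len(members)
    worst = []
    for k in clusters:
        # Largest ratio of combined scatter to centroid separation.
        worst.append(max(
            (scatter[k] + scatter[v]) / distance(centroids[k], centroids[v])
            for v in clusters if v != k
        ))
    return sum(worst) / len(clusters)


# Hypothetical one-variable toy data with two clear groups.
toy = [[0.0], [1.0], [10.0], [11.0]]
toy_labels = [0, 0, 1, 1]
dbi = davies_bouldin(toy, toy_labels, manhattan)
```

In this toy case each cluster has a scatter of 0.5 and the centroids are 10 apart, so the index is 0.1; compact, well-separated clusters give values near zero.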

RESULTS AND DISCUSSION
The test results of cluster analysis assumptions based on R software processing are:

a. Sample Representing Population
In this study, the KMO test was not carried out because the data cover the entire population, namely health services for pregnant women, childbirth, and postpartum care in all 34 provinces of Indonesia. The data therefore represent the population, and the analysis can proceed.

b. Multicollinearity Test
The multicollinearity test carried out in R showed that the VIF values of all variables used in the study were less than 10, which indicates that the variables do not exhibit multicollinearity.
With the cluster analysis assumptions met, further processing was carried out using the K-Means method. The results of the final iteration of the K-Means method are given in Table 1. Based on Table 1, the grouping of the 34 objects for K=2 with the Euclidean distance gives 29 objects in Cluster 1 and 5 objects in Cluster 2, while the Manhattan distance gives 32 objects in Cluster 1 and 2 objects in Cluster 2. For K=3 with the Euclidean distance, Cluster 1 has 16 objects, Cluster 2 has 2 objects, and Cluster 3 has 16 objects, whereas with the Manhattan distance Cluster 1 has 28 objects, Cluster 2 has 2 objects, and Cluster 3 has 4 objects. For K=4 with the Euclidean distance, Cluster 1 has 15 objects, Cluster 2 has 12 objects, Cluster 3 has 5 objects, and Cluster 4 has 2 objects. With the Manhattan distance, Cluster 1 has 2 objects, Cluster 2 has 3 objects, Cluster 3 has 1 object, and Cluster 4 has 28 objects.
The K-Means results obtained with the Euclidean and Manhattan distances for K=2, 3, and 4 were then evaluated with the Silhouette Index and the Davies Bouldin Index to determine the optimal number of clusters. The greater the silhouette coefficient, the better the quality of the grouping, and the smaller the Davies Bouldin Index (DBI, non-negative), the better the clusters. The validation results are described in Table 2. Based on Table 2, in the grouping of the 34 provinces of Indonesia using the K-Means method, the highest Silhouette Index value was 0.658685 and the lowest Davies Bouldin Index value was 0.3561214. Both indices point to the same configuration, so it is concluded that the optimal number of clusters is K=2 using the Manhattan distance.
After the optimal number of clusters is known, the last step in grouping provinces in Indonesia based on health services for pregnant women, childbirth, and postpartum care is to interpret or profile the optimal clusters. With K=2 and the Manhattan distance, Cluster 1 consists of 32 provinces and Cluster 2 consists of 2 provinces. The members of each cluster are listed in Table 3.
The profiling stage examines the characteristics of each cluster so that its tendency can be seen. The characteristics of the clusters formed by K-Means can be represented by the average, over the members of each cluster, of each variable used in the study. These averages for the health services of pregnant women, childbirth, and postpartum care are given in Table 4. Table 4 shows that the highest cluster averages are in Cluster 1, which means that the provinces in Cluster 1 have better health services for pregnant women, childbirth, and postpartum care than those in Cluster 2. Cluster 2 has lower cluster averages than Cluster 1, which means that its members are provinces with low quality of health services for pregnant women, childbirth, and postpartum care in Indonesia. On this basis, West Papua and Papua are provinces that require more attention from the government of Indonesia because they have low health services for pregnant women, childbirth, and postpartum care, especially in four antenatal visits (K4), provision of iron (blood-supplement) tablets, local government clinics holding classes for pregnant women, childbirth services, and provision of vitamin A supplements to postpartum mothers.
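The cluster averages used for profiling can be computed as in the Python sketch below. The function name `cluster_profile`, the two variable names, and the percentages are hypothetical illustrations; they are not values from the Indonesian Health Profile or from Table 4.

```python
def cluster_profile(data, labels, names):
    """Average of every variable over the members of each cluster."""
    profile = {}
    for k in sorted(set(labels)):
        members = [row for row, lab in zip(data, labels) if lab == k]
        profile[k] = {
            name: sum(col) / len(members)
            for name, col in zip(names, zip(*members))
        }
    return profile


# Hypothetical coverage percentages for three provinces and two variables.
toy = [[90.0, 80.0], [80.0, 70.0], [40.0, 30.0]]
toy_labels = [1, 1, 2]
profile = cluster_profile(toy, toy_labels, ["X1", "X2"])
```

Comparing the per-variable averages across clusters is what allows one cluster to be read as the high-coverage group and the other as the low-coverage group.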

CONCLUSIONS
The conclusions obtained from the analysis and discussion are:
1. The results of the K-Means grouping of the 34 provinces are:
a. For K=2 with the Euclidean distance, Cluster 1 has 29 provinces and Cluster 2 has 5 provinces, while with the Manhattan distance Cluster 1 has 32 provinces and Cluster 2 has 2 provinces;
b. For K=3 with the Euclidean distance, Cluster 1 has 16 provinces, Cluster 2 has 2 provinces, and Cluster 3 has 16 provinces, whereas with the Manhattan distance Cluster 1 has 28 provinces, Cluster 2 has 2 provinces, and Cluster 3 has 4 provinces;
c. For K=4 with the Euclidean distance, Cluster 1 has 15 provinces, Cluster 2 has 12 provinces, Cluster 3 has 5 provinces, and Cluster 4 has 2 provinces, while with the Manhattan distance Cluster 1 has 2 provinces, Cluster 2 has 3 provinces, Cluster 3 has 1 province, and Cluster 4 has 28 provinces.
2. The grouping of the 34 provinces of Indonesia using the K-Means method gives an optimal number of clusters of K=2 with the Manhattan distance. This is shown by the validation results: the Silhouette Index of 0.658685 is the highest value and the Davies Bouldin Index of 0.3561214 is the lowest value. Cluster 1 consists of 32 provinces and Cluster 2 consists of 2 provinces. The grouping also shows that the distance measure used affects the cluster results obtained.
3. The profiling results show that the highest cluster averages are in Cluster 1, which means that the members of Cluster 1 are provinces with high quality of maternal health services. Cluster 2 has lower cluster averages than Cluster 1, which means that its members are provinces with low maternal health services.
It is hoped that the government of Indonesia will pay more attention to the provinces in Cluster 2, namely West Papua and Papua, which have low average health services for pregnant women, childbirth, and postpartum care compared with Cluster 1, so that these provinces can improve the quality of maternal health services and reduce maternal mortality in the coming years.