IMPLEMENTATION OF THE DBSCAN METHOD FOR CLUSTER MAPPING OF EARTHQUAKE SPREAD LOCATION

ABSTRACT


INTRODUCTION
Geographically, most of Indonesia lies in disaster-prone areas [1]. Natural disasters in Indonesia are largely a consequence of its location at the meeting point of three tectonic plates: the Indo-Australian plate, which moves north; the Eurasian plate, which moves south; and the Pacific plate, which moves from east to west. Indonesia is also traversed by two of the world's active mountain belts, the Circum-Pacific belt and the Mediterranean belt. Astronomically, West Java Province is located between 5°50′ and 7°50′ South Latitude and between 104°48′ and 108°48′ East Longitude.
West Java lies on the Circum-Pacific and Mediterranean belts, which makes it an unstable region marked by many active volcanoes and frequent earthquakes. Earthquakes have distinctive characteristics: they are unavoidable, they occur suddenly, and their time, epicenter location, and strength cannot be predicted accurately by anyone, including earthquake experts [2].
According to the Indonesian Central Statistics Agency (BPS), 10519 earthquakes occurred in Indonesia in 2021 [3], a sharp rise of 2151 compared to the previous year. In addition, according to the Center for Volcanology and Geological Disaster Mitigation, Indonesia has 127 active volcanoes [4]. Indonesia is the country with the largest number of active volcanoes in the world and ranks first in the world for the number of fatalities. Java is the island with the most active volcanoes in Indonesia, with 34.
Of the 34 active volcanoes on Java, 15 are in West Java Province. According to the National Disaster Management Agency, a total of 5402 natural disasters occurred in Indonesia in 2021 [5]. Of these, West Java Province recorded the highest number of any province, with 1358 events. West Java is therefore a province with many volcanoes and a high number of natural disasters.
Because earthquakes are frequent in West Java Province, one disaster mitigation effort is to determine the earthquake clusters in the area. A suitable clustering method for spatial data is Density-Based Spatial Clustering of Applications with Noise (DBSCAN). DBSCAN belongs to the category of density-based clustering, in which clusters are formed based on the density of distances between objects in the dataset. DBSCAN has the advantage of being able to detect noise [6]. Harini et al. [7] applied DBSCAN to disaster data by mapping earthquakes in Bali and Nusa Tenggara. Anwar et al. [8] applied the method to map forest fire risk and identify the areas at risk.
Several studies have addressed earthquake clustering. Reviantika [9] used the K-Means algorithm to analyze earthquakes on the island of Java and obtained two clusters as the best result, while Wahyu [10] used the same method and object and obtained 4 clusters. Kurmiati [11] used the K-Medoid algorithm to classify earthquake areas in Indonesia, resulting in 3 cluster areas. Febriani [12] conducted a cluster analysis of earthquakes in Indonesia using the Kohonen Self-Organizing Maps (SOMs) algorithm, resulting in 3 clusters. Sari [13] clustered earthquakes in Indonesia using K-Affinity Propagation (K-AP), which resulted in 4 clusters. Meanwhile, Arista [14] used the Fuzzy C-Means algorithm to group earthquake events into three groups. However, the clustering algorithms above do not use a spatial concept (in this case, area density). In contrast, DBSCAN is a cluster analysis that works from the point of view of spatial density. The density calculation uses locations in the Universal Transverse Mercator (UTM) coordinate system, which makes it easier to identify which areas are frequently subject to earthquakes and how dense those areas are.
This study aims to cluster earthquake data in West Java Province based on density. The data used are earthquake data for West Java Province in 2021, taken from the BMKG online data website at dataonline.bmkg.go.id. From these data, the latitude and longitude variables are used for the analysis with the DBSCAN algorithm [15]. This research is expected to serve as a form of disaster mitigation that helps minimize losses due to earthquakes in West Java Province.

Data Sources and Research Stages
The population in this study is all earthquake events that occurred in 2021. The sample is the data on the locations of earthquakes in West Java Province in 2021, taken from the BMKG online data website at dataonline.bmkg.go.id. Table 1 illustrates the research data. The latitude and longitude variables were used in the DBSCAN algorithm, while the depth and magnitude variables were used to examine the earthquake characteristics of each group. The stages of the research are described in Figure 1. The research began by entering the data on the locations of earthquakes in West Java Province in 2021. The second stage consisted of geoprocessing, georeferencing, and digitization to produce maps of the West Java Province area. The third stage was data preprocessing, in which duplicate data and data outside the territory of West Java Province were removed. The fourth stage was a descriptive analysis to obtain an overview of the earthquake data in West Java Province in 2021 and to detect the pattern of data distribution using nearest-neighbor analysis.
The results of the fourth stage were used to decide whether the data distribution pattern is clustered. If the pattern is clustered, the analysis continues with the DBSCAN algorithm; if it is not, another method is used. The fifth stage was cluster analysis using the DBSCAN algorithm, followed by evaluation of the cluster results with the silhouette coefficient. The next stage was deeper data exploration in three ways: (1) clustering based on the highest silhouette value, (2) clustering by lowering the MinPts value, and (3) clustering based on the supremum value of the silhouette coefficient. The last step was to choose the best cluster by comparing the cluster results at the optimum silhouette with the results from data exploration, and to visualize the best cluster results.

Nearest-Neighbor Analysis
Nearest-neighbor analysis is a method designed to measure patterns in a set of points in two or three dimensions. The method calculates the average distance between every point and its nearest neighbor [16]. The nearest-neighbor index is the ratio of the observed distance to the expected distance. The distance used is the Euclidean distance presented in Equation (1). The nearest-neighbor index is denoted by $R$ and is calculated as in Equation (2) [17].
\[ d(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \tag{1} \]
\[ R = \frac{\bar{D}_O}{\bar{D}_E} \tag{2} \]
$\bar{D}_O$ is the observed mean nearest-neighbor distance, that is, the average observed distance between each point and its nearest neighbor. $\bar{D}_E$ is the expected mean nearest-neighbor distance, that is, the expected average distance for the given points in a random pattern. The calculation of $\bar{D}_O$ and $\bar{D}_E$ is presented in Equation (3).
\[ \bar{D}_O = \frac{\sum_{i=1}^{n} d_i}{n}, \qquad \bar{D}_E = \frac{0.5}{\sqrt{n/A}} \tag{3} \]
$d_i$ is the distance between point $i$ and its nearest neighbor, $n$ is the number of points, and $A$ is the area of the minimum rectangle around all points, or a given area value.
$SE$ is the standard error of the mean nearest-neighbor distance and $z$ is the test statistic, with the formulas written in Equations (4) and (5).
\[ SE = \frac{0.26136}{\sqrt{n^2/A}} \tag{4} \]
\[ z = \frac{\bar{D}_O - \bar{D}_E}{SE} \tag{5} \]
The nearest-neighbor index takes values from 0 to 2.15. A value of 0 indicates a fully clustered pattern, a value of 2.15 indicates a completely dispersed pattern, and a value of 1 indicates a random pattern. The sampling distribution of the test statistic is the normal distribution; in other words, $z$ is a standard normal deviate. In the hypothesis test on $z$, the null hypothesis ($H_0$) is that the data are randomly distributed and the alternative hypothesis ($H_1$) is that the data are clustered. The decision compares the absolute value of $z$ with the critical value $z_{\alpha/2}$ of the standard normal distribution: if $|z| > z_{\alpha/2}$, then $H_0$ is rejected [18].
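The index and test statistic described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the commonly used constants 0.5 and 0.26136 for the expected distance and the standard error, takes the study area as a user-supplied value, and uses hypothetical toy coordinates rather than the BMKG data.

```python
import math

def nearest_neighbor_index(points, area):
    """Return (R, z) for 2-D points; `area` is in squared coordinate units."""
    n = len(points)
    # Observed distance from each point to its nearest neighbor (Euclidean).
    nearest = []
    for i, (xi, yi) in enumerate(points):
        d_min = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        nearest.append(d_min)
    d_obs = sum(nearest) / n
    # Expected mean nearest-neighbor distance for a random pattern.
    d_exp = 0.5 / math.sqrt(n / area)
    # Standard error of the mean nearest-neighbor distance.
    se = 0.26136 / math.sqrt(n * n / area)
    return d_obs / d_exp, (d_obs - d_exp) / se

# Hypothetical toy data: four points on the corners of a unit square,
# with an assumed study area of 4 square units.
toy_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
R, z = nearest_neighbor_index(toy_points, area=4.0)
print(R, z)
```

For this regular toy layout the index comes out above 1, which points toward dispersion; an index below 1 with a large $|z|$ would point toward clustering.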

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN is an algorithm in which clusters are formed based on the density of distances between objects in the dataset. DBSCAN has the advantage of being able to detect noise [6]. A data point is categorized as noise if the number of points within its Eps radius, including the point itself, is less than MinPts and none of its neighbors is a core point [19]. DBSCAN requires 2 parameters, minimum points (MinPts) and epsilon (Eps), whose values are determined by the researchers. MinPts is the minimum number of items in a cluster, and Eps is the distance (radius) between items that forms the basis of the neighborhood of an item. The neighborhood within radius ε of a data object is called the ε-neighborhood of that object [20]. An object is a core object if its ε-neighborhood contains at least MinPts objects. Core objects are the pillars of dense regions [17].
There are 2 types of points in a cluster: core points, which lie in the interior, and border points, which lie at the edges. The neighborhood of a border point contains far fewer items than the neighborhood of a core point [21]. A border point may belong to more than 1 cluster. Given a set of objects $D$, all core objects with respect to the parameters ε and MinPts can be identified. The clustering task is then reduced to using the core objects and their neighborhoods to form dense regions, where the dense regions are the clusters [17].
DBSCAN uses several terms: (1) directly density-reachable: for a core object $q$ and an object $p$, $p$ is directly density-reachable from $q$ (with respect to ε and MinPts) if $p$ is in the ε-neighborhood of $q$; (2) density-reachable: $p$ is density-reachable from $q$ (with respect to ε and MinPts in $D$) if there is a chain of objects $p_1, \ldots, p_n$ such that $p_1 = q$, $p_n = p$, and $p_{i+1}$ is directly density-reachable from $p_i$ with respect to ε and MinPts, for $1 \le i < n$, $p_i \in D$; (3) density-connected: two objects $p_1, p_2 \in D$ are density-connected (with respect to ε and MinPts) if there is an object $q \in D$ such that both $p_1$ and $p_2$ are density-reachable from $q$ with respect to ε and MinPts. Figure 2 is an example of density-reachability and density-connectivity for a given ε, represented by the circle radius, assuming MinPts=3. Points $m$, $p$, $o$, and $r$ are core objects because each has an ε-neighborhood that contains at least three points. Object $q$ is directly density-reachable from $m$. Object $m$ is directly density-reachable from $p$ and vice versa. Object $q$ is (indirectly) density-reachable from $p$ because $q$ is directly density-reachable from $m$ and $m$ is directly density-reachable from $p$. However, $p$ is not density-reachable from $q$ because $q$ is not a core object. Similarly, $r$ and $s$ are density-reachable from $o$, and $o$ is density-reachable from $r$. Therefore, $o$, $r$, and $s$ are all density-connected.
DBSCAN finds clusters by examining the ε-neighborhood (Eps-neighborhood) of each point in the database. If the ε-neighborhood of a point $p$ contains at least MinPts points, a new cluster with $p$ as a core object is created. DBSCAN then iteratively collects the objects that are directly density-reachable from the core objects, which may involve merging several density-reachable clusters [20]. The sequence of the DBSCAN algorithm is as follows: (1) choose an initial point $p$ at random, (2) using Eps and MinPts, retrieve all points that are density-reachable from $p$, (3) if $p$ is a core point, a cluster is formed, (4) if $p$ is a border point, no points are density-reachable from $p$, and DBSCAN visits the next point in the database, (5) continue until all points have been processed, (6) the result obtained does not depend on the order in which the points are processed.
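The steps above can be sketched as a small pure-Python routine. This is an illustrative sketch rather than the code used in the study: the function names `region_query` and `dbscan` are hypothetical, the neighborhood of a point is taken to include the point itself, points are visited in index order instead of at random, and the toy coordinates are made up.

```python
import math

NOISE = -1  # label used for points that end up in no cluster

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if math.hypot(xi - xj, yi - yj) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)          # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # may be relabeled as a border point later
            continue
        cluster_id += 1                    # point i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue                   # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

# Hypothetical toy data: two tight groups and one isolated point.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11),
              (50, 50)]
labels = dbscan(toy_points, eps=1.5, min_pts=3)
print(labels)
```

With these toy settings the two groups form two clusters and the isolated point is labeled as noise. In this sketch a border point that is reachable from two clusters is kept in whichever cluster reaches it first.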

Model Evaluation
The quality of a cluster is obtained from the value of the silhouette coefficient [22]. The distance between data in the same cluster is called the intra-cluster distance, while the distance between data in one cluster and data in another cluster is called the inter-cluster distance [23]. The greater the difference between the intra-cluster and inter-cluster distances, the better the resulting silhouette coefficient. For example, suppose the data are clustered into Cluster A, Cluster B, and Cluster C. The silhouette coefficient calculation begins by computing $a(i)$ (intra-cluster), the average distance between data point $i$ and all other data in the same cluster, here assumed to be Cluster A [24]. The formula for $a(i)$ is written in Equation (6), where $|A|$ is the amount of data in Cluster A.
\[ a(i) = \frac{1}{|A| - 1} \sum_{j \in A,\, j \neq i} d(i,j) \tag{6} \]
Next, calculate $b(i)$ (inter-cluster), which is the minimum, over all other clusters, of the average distance between data point $i$ and the data in that cluster. Take a cluster other than A, for example Cluster C. The average distance between data point $i$ and all data in Cluster C, $d(i,C)$, is written in Equation (7), where $|C|$ is the amount of data in Cluster C. After calculating $d(i,C)$ for all clusters $C \neq A$, the minimum is chosen as the value of $b(i)$ with Equation (8).
\[ d(i,C) = \frac{1}{|C|} \sum_{j \in C} d(i,j) \tag{7} \]
\[ b(i) = \min_{C \neq A} d(i,C) \tag{8} \]

If Cluster B attains the minimum distance, then $d(i,B) = b(i)$; Cluster B is called the neighbor of data point $i$ and is the second-best cluster for it after Cluster A. Once $a(i)$ and $b(i)$ are known, the next step is to calculate the silhouette value of object $i$ with Equation (9).
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{9} \]
The value of $s(i)$ lies between -1 and 1 and is interpreted as follows: (1) if $s(i) \approx 1$, the data point is classified properly (in A); (2) if $s(i) \approx 0$, the data point lies midway between the two clusters (A and B); (3) if $s(i) \approx -1$, the data point is classified poorly (it is closer to Cluster B than to A) [25]. The average of $s(i)$ over all objects in a cluster is called the average silhouette width of that cluster. The average of $s(i)$ for $i = 1, 2, \ldots, n$ is called the average silhouette width for the entire data set. The maximum value of the average silhouette width for the entire data set is called the silhouette coefficient.
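The silhouette calculation described in this section can be illustrated with a short pure-Python sketch. It is a sketch under stated assumptions, not the evaluation code of the study: the function name `silhouette_values` is hypothetical, at least two clusters are assumed, noise points are assumed to have been removed beforehand, singleton clusters are given a value of 0 by convention, and the toy data are made up.

```python
import math

def silhouette_values(points, labels):
    """Return the silhouette value s(i) for every point (needs >= 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    def dist(i, j):
        return math.hypot(points[i][0] - points[j][0],
                          points[i][1] - points[j][1])

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # assumed convention for a one-point cluster
            continue
        # a(i): average distance to the other members of the same cluster.
        a = sum(dist(i, j) for j in own) / len(own)
        # b(i): smallest average distance to the members of any other cluster.
        b = min(
            sum(dist(i, j) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values

# Hypothetical toy data: two well-separated pairs of points.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
toy_labels = [0, 0, 1, 1]
values = silhouette_values(toy_points, toy_labels)
average_width = sum(values) / len(values)
print(values, average_width)
```

Because the two toy clusters are far apart relative to their size, every value is close to 1, which corresponds to the "classified properly" case.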

Data Overview
In 2021, earthquakes in West Java Province occurred at 110 locations. Before the data are grouped, the distribution pattern of the data is examined. Nearest-neighbor analysis is used to decide whether the earthquake points are dispersed or form clusters. The analysis gives a nearest-neighbor index of 0.716320, which indicates a clustered pattern. In addition, the absolute value of the test statistic, $|z| = 5.691877$, exceeds the critical value $z_{\alpha/2} = 1.96$. Based on this, the location data for earthquakes in West Java Province in 2021 have a clustered distribution pattern.
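The two reported values can be cross-checked against each other. Assuming the commonly used forms of the expected distance, $0.5/\sqrt{n/A}$, and of the standard error, $0.26136/\sqrt{n^2/A}$, the area $A$ cancels and the test statistic depends only on the index $R$ and the number of points $n$:

\[
z = \frac{\bar{D}_O - \bar{D}_E}{SE} = (R - 1)\,\frac{\bar{D}_E}{SE} = (R - 1)\,\frac{0.5\,\sqrt{n}}{0.26136}
\]

\[
z \approx (0.716320 - 1) \times \frac{0.5 \times \sqrt{110}}{0.26136} \approx -0.28368 \times 20.0644 \approx -5.69
\]

The negative sign reflects an observed mean distance smaller than the expected one, and the magnitude agrees with the reported $|z|$ of about 5.69.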

Clustering using DBSCAN Algorithm
In this study, clustering was carried out over a range of Eps and MinPts values. MinPts ranged from 3 to 6, while Eps ranged from 9000 to 23000. From the combinations of these two parameters, the best cluster was selected based on the silhouette coefficient. The results for the parameter combinations are presented in Table 2. Based on Table 2, the maximum silhouette coefficient is 0.859, at MinPts=5 and Eps=9000, with 3 clusters formed and 71 noise points. A silhouette coefficient of 0.859 means that the cluster has a strong structure [24]. The silhouette coefficient plot for each MinPts is shown in Figure 3. Figure 3 shows that the silhouette coefficient is attained at MinPts = 5 and Eps = 9000. In addition, the average silhouette width for the entire data set decreases gradually as Eps increases to 23000, for all MinPts values.

Data Exploration
In addition to clustering based on the highest silhouette value, data exploration was carried out as explained in the research stages. The exploration of the earthquake clustering data is aimed at disaster mitigation. Disaster mitigation seeks to minimize the losses caused by disaster events, both material and moral [1]. In this case, mitigation takes the form of creating clusters of earthquake-prone areas. The researchers explored this by lowering the MinPts value so that more clusters are formed and, consequently, more earthquake-prone areas are identified.

Clustering by lowering the MinPts value
In this section, the MinPts value was lowered from its value at the optimum silhouette to 4 and 3. The number of clusters for the two MinPts values is visualized in Figure 4 (a). From Figure 4 (a), MinPts=3 reaches its optimum at a silhouette coefficient of 0.756, with 8 clusters, 57 points in clusters, and 53 noise points. MinPts=4 reaches its optimum at a silhouette coefficient of 0.808, with 4 clusters, 43 points in clusters, and 67 noise points.

Clustering based on the smallest (supremum) value of the silhouette coefficient
Clustering was also carried out at the supremum value, that is, the smallest silhouette coefficient at which a cluster is still said to be good (≈0.71). The results are visualized in Figure 4.

Selection of the Best Cluster Results
The best cluster is selected by comparing the cluster results at the optimum silhouette coefficient with the cluster results from data exploration. The comparison is shown in Table 3. From Table 3, the optimum silhouette obtained is 0.859 (Eps=9000, MinPts=5) with 3 clusters. The largest number of clusters is found in the 3rd data exploration result (Eps=10000, MinPts=3, and silhouette coefficient = 0.713), with a total of 12 clusters. In addition, the results in which the points in clusters outnumber the noise points are the 1st data exploration result (Eps=9000, MinPts=3, and silhouette coefficient = 0.756) and the 3rd data exploration result (Eps=10000, MinPts=3, and silhouette coefficient = 0.713).
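The comparison above amounts to a filter-and-rank step, which the following Python sketch illustrates using only figures reported in the text. The dictionary keys and candidate names are hypothetical, the Eps value for the MinPts=4 result is not stated in the text and is left as `None`, and the 39 clustered points for the optimum silhouette result are derived as 110 locations minus 71 noise points.

```python
# Candidate results as reported in the text; eps=None marks a value not stated there.
candidates = [
    {"name": "optimum silhouette", "eps": 9000, "min_pts": 5,
     "silhouette": 0.859, "clusters": 3, "clustered": 39, "noise": 71},
    {"name": "MinPts=3 exploration", "eps": 9000, "min_pts": 3,
     "silhouette": 0.756, "clusters": 8, "clustered": 57, "noise": 53},
    {"name": "MinPts=4 exploration", "eps": None, "min_pts": 4,
     "silhouette": 0.808, "clusters": 4, "clustered": 43, "noise": 67},
    {"name": "supremum exploration", "eps": 10000, "min_pts": 3,
     "silhouette": 0.713, "clusters": 12, "clustered": 70, "noise": 40},
]

# Criterion 1: keep results whose clustered points outnumber the noise points.
eligible = [c for c in candidates if c["clustered"] > c["noise"]]

# Criterion 2: among those, prefer the result with the most clusters.
best = max(eligible, key=lambda c: c["clusters"])
print(best["name"], best["clusters"])
```

Ranking by the number of clusters rather than by the silhouette coefficient is the design choice that makes the result differ from the optimum silhouette result.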
From these results, the researchers chose the best cluster result based on the number of clusters obtained, namely the result with the highest number of clusters. The reason is that the more clusters are formed, the more earthquake-prone areas are detected, and the hope is that this can inform the public about which areas are prone to earthquakes. In addition, the best cluster result is taken from the results in which the points in clusters outnumber the noise points. It is therefore determined that the best cluster result is the 3rd data exploration result (Eps=10000, MinPts=3, and silhouette coefficient=0.713), with 12 clusters, 70 points in clusters, and 40 noise points. The visualization of the cluster results can be seen in Figure 5.

Table 4.
Cluster | Area | Number of points
… | … | …
3 | Sukabumi Regency | 3
4 | Sukabumi Regency | 3
5 | South Sea of West Java (South of Cianjur Regency) | 5
6 | Karawang Regency | 3
7 | Bandung and Garut Regencies | 24
8 | South Sea of West Java (South of Cianjur Regency) | 6
9 | South Sea of West Java (South of Sukabumi Regency) | 4
10 | Cianjur and Bandung Regencies | 4
11 | South Sea of West Java (South of Sukabumi Regency) | 3
12 | South Sea of West Java (South of Garut Regency) | 3

The characteristics of earthquake depth and strength at the silhouette coefficient supremum for each cluster are shown in Table 5. From Table 5, the shallowest average earthquake depth is in Cluster 7 (Bandung and Garut Regencies), while the highest average earthquake strength is in Cluster 9 (the southern sea area of West Java, south of Sukabumi Regency). In addition, in most clusters the standard deviation is smaller than the average, meaning that the depth and strength of the earthquakes in each cluster vary little.
The characteristics of earthquake depth and strength for the noise data at the silhouette coefficient supremum are shown in Table 6. From Table 6, the noise data have widely varying earthquake depths, because the standard deviation is greater than the average. The strength of the earthquakes varies less, because the standard deviation is smaller than the average.
The researchers also compared the best cluster results with the 2020 earthquake hazard map of West Java Province from the Regional Disaster Management Agency (BPBD) of West Java Province, shown in Figure 6. In Figure 6, the earthquake hazard index is shown by color: the closer the color is to green, the lower the hazard index; the closer to yellow, the more moderate the index; and the closer to red, the higher the index. Compared with the 2020 BPBD hazard map, the best cluster results tend to agree: there are earthquake-prone areas in Sukabumi, Cianjur, Bandung, and Garut Regencies and in parts of southern West Java, where the earthquake index according to the BPBD is quite high. There are also earthquake-prone areas in Karawang and Garut Regencies, where the earthquake index according to the BPBD of West Java Province tends to be moderate. The best cluster results obtained by the researchers are therefore consistent with the 2020 earthquake hazard map from the BPBD of West Java Province. Overall, West Java is an unstable region characterized by many volcanoes because it lies on the Circum-Pacific and Mediterranean belts, so earthquakes (volcanic earthquakes) often occur in the mainland area of West Java Province.
In addition, the South Sea area of West Java lies close to the meeting line of the Eurasian and Indo-Australian plates, so earthquakes (tectonic earthquakes) are also common there. Possible disaster mitigation measures therefore include outreach and raising awareness in the community. Basic disaster training for officials and the community is also necessary. The government can supervise the implementation of regulations on spatial planning, building permits (IMB), and other rules related to disaster prevention that aim to minimize the impact of disasters. It can also conduct research and study the characteristics of disasters, as well as carry out disaster risk analysis [1].

CONCLUSIONS
Earthquakes in West Java Province have a clustered pattern of data distribution. Using DBSCAN, 12 clusters were formed. The evaluation with the silhouette coefficient gives a value of 0.713, which means that the cluster has a strong structure. Each cluster has its own earthquake characteristics. These results can serve as a form of disaster mitigation to minimize losses due to earthquakes in West Java Province and as information for the public about areas that are prone to earthquakes.

ACKNOWLEDGMENT
The author thanks the Statistics Department, Faculty of Mathematics and Natural Science, Universitas Islam Indonesia, which assisted in the research process.