DATA MINING STUDY FOR GROUPING ELEMENTARY SCHOOLS IN BOJONEGORO REGENCY BASED ON CAPACITY AND EDUCATIONAL FACILITIES

ABSTRACT


INTRODUCTION
Any discussion of the quality of education must consider the learning process in the classroom. Learning at school involves two critical parties, namely teachers and students: teachers have the task of teaching, while students have the task of learning. Good environmental conditions can have a positive impact on the teaching-learning process, and an attractive environment fosters a conducive learning atmosphere in the school. The classroom's physical environment also influences the development of elementary school students. Environmental elements around students, including class size, room temperature, classroom environment, and student comfort, influence students' cognitive development [1]. A good education produces a quality generation and is key to advancing the country's development. According to UNESCO, education in Indonesia ranks 10th out of 14 developing countries, and 21% of schools in urban areas, 37% in rural areas, and 66% in remote areas still lack teachers [2]. Rural and remote areas are therefore likely the largest contributors to the low quality of education. This is due to the shortage of reliable educators willing to live in remote areas, the uneven distribution of teachers, and the uneven quality of teachers.
According to Law No. 2 of 1989 on the national education system, elementary education is organized to develop attitudes and abilities. Elementary education lasts nine years, consisting of a six-year program in elementary school and a three-year program in junior high school. It aims to provide students with the essential skills to develop their lives as individuals, members of society, citizens, and members of humanity, and to prepare them to continue their education [3].
Improving a country's education quality must begin with improving the quality of education in remote areas, because good-quality education must be spread evenly across every region. Inequitable access to education is common in remote provinces, where education subsidies have not yet been distributed evenly and thoroughly [4].
Teachers are essential components of the teaching and learning process and take part in forming capable human resources for development. According to experts, professional teachers are all people who have the authority and responsibility for educating students, both individually and in groups, at school or outside school [1].
Students are members of society who seek to develop their potential through the learning process available at specific paths, levels, and types of education [5]. Students are a component of education that cannot be left out because, without students, the learning process could not take place. Students occupy a central position in the teaching and learning process as the parties who have goals and want to achieve them optimally.
Education in Bojonegoro Regency needs innovation in facilities that are matched to educational capacity in order to improve its quality. Developing capacity and facilities supports education so that learning is carried out more effectively and efficiently, and adequate capacity and facilities make it easier to provide comfort for teaching staff and students. A wide distribution of facilities is expected to improve the current quality of education. To address this problem, elementary schools in Bojonegoro Regency need to be grouped based on the completeness of their capacity and educational facilities. The grouping is done statistically with clustering methods. With the resulting clusters, policymakers can set more efficient policies for improving elementary education in Bojonegoro Regency.
The Regional Education Balance Sheet (NPD) for 2021 presents data on elementary schools in Bojonegoro Regency: 723 schools with 72,103 students and 19 dropouts; 5,041 teachers, of whom 2,581 are civil servants and 2,460 are non-civil servants; 4,585 classrooms; and 4,853 classrooms, of which 1,456 are in good condition and 3,397 are lightly damaged [6].
In the Bojonegoro sub-district, some elementary schools have five or fewer new students. For example, Banjarejo 3 Elementary School has only two new students, Klangon 1 Elementary School has only three, and Ledok Kulon 3 Elementary School has only five. The small number of new students is attributed to the shortage of school-age children in the neighborhood and the proximity between elementary schools. The schools concerned have asked the Education Office for a policy on the equal distribution of students so that all schools can receive students according to their classroom capacity and number of teachers [7]. Research by Winarti et al. (2021) on teacher professionalism and the quality of education at Mojodelik 1 State Elementary School, using a qualitative approach, shows that teachers are generally professional in mastering the material, concept structure, and scientific mindset [9]. So far, educational research in Bojonegoro Regency remains limited, which makes it an important research area.
This research aims to compare clustering methods to find the best method for clustering elementary schools in Bojonegoro Regency based on educational capacity and facilities. The clustering methods compared are K-Means, K-Medoids, and Random Clustering.
Clustering is appropriate for dividing elementary schools in Bojonegoro Regency into groups based on educational capacity and facilities. Clustering is the process of grouping data based on the characteristics of the data. It explains the relationship between members by placing similar members in one group and separating them from other groups according to their respective data characteristics [10].
Priambodo and Prasetyo (2018) used K-Means to map the districts/cities of Banten Province, by level of education, into areas with a shortage, sufficiency, and excess of teachers [2]. These results can serve as input for the Banten Provincial Education Office on teacher equity. Pradanyana and Permana (2018) concluded that the number of clusters and the amount of data used affect the quality of the clusters formed by the K-Means and K-Nearest Neighbors methods [11]. Rangan et al. (2018) reported an accuracy of 93.33% for the Support Vector Machine (SVM) method alone and 99.33% for K-Means clustering combined with SVM [12]. Nugraha and Hairani (2018) applied K-Means clustering and displayed the results as a map so that the education office or other institutions that handle education in Indonesia can compare the quality of education across provinces [13]. Anggraeni and Putra (2021) stated that K-Means clustering gives optimal cluster accuracy, but it takes longer to run because the centroids must be determined accurately for the output to be as expected [14]. Gates and Ahn (2019) used the Random Clustering method for clustering cancer types, where the clustering process is more like hierarchical clustering than random clustering [18]. This method is considered easier to implement and works for real cluster weights q > 0, with results for several values of q obtainable in a single simulation. However, its accuracy still needs to improve for large values when q > 1.
Random Clustering performs random flat clustering on observational data, which groups objects that are similar to each other and different from objects that belong to other clusters [19].
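As a rough illustration of a random flat clustering, the sketch below assigns each observation to one of k clusters uniformly at random. This is an assumption about what "random flat clustering" means, not the RapidMiner implementation; the function name `random_flat_clustering` and the seed are hypothetical. Such an assignment ignores the data values, so it is mainly useful as a baseline against which K-Means and K-Medoids can be compared.

```python
import random


def random_flat_clustering(n_items, k, seed=0):
    """Give each of n_items observations a cluster label drawn uniformly from 0..k-1."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    return [rng.randrange(k) for _ in range(n_items)]


# 723 schools and 5 clusters, matching the sizes discussed in this study.
labels = random_flat_clustering(n_items=723, k=5)
sizes = [labels.count(c) for c in range(5)]
print(sizes)  # cluster sizes are roughly equal because labels ignore the data
```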
In this study, the best clustering model is chosen with a principle similar to the Elbow method in the research of Putu et al. (2021): cluster density performance is measured as the average within-cluster distance, computed with RapidMiner software for each clustering method with 2 to 10 clusters. The best number of clusters k is selected where the average within-cluster distance, moving close to zero, reaches the first stable slope of the curve [20].
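The Elbow-like criterion can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the RapidMiner workflow: the data are synthetic, the small `kmeans` function is a basic Lloyd's algorithm written for this example, and the average within-cluster distance is assumed to be the mean squared Euclidean distance to the assigned centroid (reported here as a positive number).

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Basic Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k observations as starting centroids

    def nearest(p):
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iters):
        labels = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return [nearest(p) for p in points], centroids


def avg_within_cluster_distance(points, labels, centroids):
    """Mean squared Euclidean distance from each point to its own centroid."""
    total = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return total / len(points)


# Synthetic two-variable data (three blobs) standing in for normalized school records.
rng = random.Random(42)
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(40)]

scores = {}
for k in range(2, 11):  # clusters 2 to 10, as in the study
    labels, centroids = kmeans(points, k)
    scores[k] = avg_within_cluster_distance(points, labels, centroids)
    print(k, round(scores[k], 3))
```

Plotting `scores` against k gives the curve on which the first stable slope is read off; the sketch leaves that visual judgment to the analyst rather than automating it.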
Education policy in Bojonegoro Regency is still considered poorly targeted because it is not yet known which groups of elementary schools need attention in terms of educational capacity and facilities. Grouping elementary schools with clustering methods can identify these groups. This research benefits the Bojonegoro education office by supporting the equalization of educational capacity and facilities across all elementary schools. Thus, this research is proposed under the title "Data Mining Study for Grouping Elementary Schools in Bojonegoro Regency Based on Educational Capacity and Facilities".

Research Design
The research uses a quantitative approach that compares three clustering methods, namely K-Means, K-Medoids, and Random Clustering. The clustering methods are applied using RapidMiner software, and the best method is chosen based on cluster density performance, measured as the average within-cluster distance.

Population and Sample
The sample in this study represents the observed population. It uses secondary data representing educational capacity and facilities, namely the number of teachers, students, classrooms, and rombel (rombongan belajar, or study groups) of elementary schools in Bojonegoro Regency, taken from the Ministry of Education, Culture, Research, and Technology in 2021.

Sampling Technique
The sampling technique used in this research is purposive sampling, which takes data from the database according to the research objectives. For this case study, the education data sample was taken from the official website of the Ministry of Education, Culture, Research, and Technology.

Research Subjects
The data sources used in this study are secondary data on the students, teachers, rombel, and classrooms of elementary schools in Bojonegoro Regency, obtained from the website of the Ministry of Education, Culture, Research, and Technology in 2021. The research variables are Teacher, Student, Classroom, and Rombel, as described in Table 1. All variables are measured on a ratio scale.

Data Analysis Technique
The analysis compares three clustering methods, namely K-Means, K-Medoids, and Random Clustering, with the help of RapidMiner software. The clustering methods are applied through the process design shown in Figure 1, which uses several operators: Retrieve data, Normalize, Multiply, Method (K-Means), Data to Similarity, a second Multiply, and Performance. Retrieve data is used to input the dataset. Normalize standardizes the observation data to Z-scores, the scale of the standard Normal distribution. Multiply splits the process into more than one path. Method (K-Means) applies the K-Means clustering method. Data to Similarity computes the similarity between the examples of a dataset. Performance measures the performance of the applied method.
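The normalization step can be illustrated with a short sketch of a Z-score transformation applied to each variable separately. The school values below are hypothetical, the helper name `z_transform` is made up for this example, and the use of the sample standard deviation is an assumption rather than a statement about RapidMiner's internals.

```python
import statistics


def z_transform(values):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in values]


# Hypothetical records for six schools; not taken from the study's dataset.
schools = {
    "Student": [12, 48, 95, 130, 210, 640],
    "Teacher": [4, 6, 8, 9, 12, 30],
    "Classroom": [3, 6, 6, 7, 9, 24],
    "Rombel": [3, 6, 6, 6, 9, 24],
}

normalized = {name: z_transform(column) for name, column in schools.items()}
for name, column in normalized.items():
    print(name, [round(z, 2) for z in column])
```

Standardizing puts all four variables on a comparable scale, so that a variable with large raw values such as the number of students does not dominate the distance calculation. It does not change the shape of the distribution, so skewed data remain skewed.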
In short, Retrieve data opens the dataset, Normalize standardizes the data, and Multiply splits one path into several. The Method operator can be set to any of the clustering methods, namely K-Means, K-Medoids, or Random Clustering. Data to Similarity also serves to anticipate repeated data. Performance tests each method with the Cluster Density Performance criterion. The data analysis in this study is carried out in stages, as described in the flow chart.

Flow Chart
A flow chart gives a detailed description of the data analysis steps in this study. Figure 2 visualizes the stages of the comparison between the three clustering methods: K-Means, K-Medoids, and Random Clustering. The process starts with the input dataset, the primary school education database, and then evaluates the three clustering methods. It continues by measuring the performance of each method and comparing the results. The output is the result of the best clustering method, followed by descriptive statistics of the dataset grouped according to that method.

Results
This study uses 723 records, one per elementary school in Bojonegoro Regency, covering the number of students, teachers, classrooms, and rombel in 2021. The elementary school data cover the period from January to December 2021. After the data selection process, 723 records were obtained, which are summarized in the descriptive statistics in Table 2. Outliers affect the mean of the observation data, so the median is used instead as the measure of center. An outlier is indicated by a Z-score outside the interval -3 < Z-score < 3. Figure 3 visualizes outlier detection using boxplots. Outliers are identified as observation points that lie above or below the whiskers of the boxplot for each research variable.
The boxplot results lead to the conclusion that there are many outliers in all research variables.
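The two outlier rules mentioned here can be sketched as follows. The student counts are hypothetical, and the boxplot rule is assumed to use the conventional whiskers at 1.5 times the interquartile range, which the text does not state explicitly.

```python
import statistics


def zscore_outliers(values, limit=3.0):
    """Values whose Z-score lies outside the interval (-limit, limit)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs((v - mean) / sd) >= limit]


def boxplot_outliers(values, whisker=1.5):
    """Values beyond the whiskers, assumed at whisker * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical student counts: twenty typical schools and one very large school.
students = list(range(40, 80, 2)) + [600]

print(zscore_outliers(students))
print(boxplot_outliers(students))
print(statistics.fmean(students), statistics.median(students))
```

In this toy column the single large school pulls the mean well above the median, which is the reason the median is preferred as the measure of center when outliers are present.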
The best performance vector value from each clustering method is used in the comparison. Based on the evaluation of the centroid-based K-Means, K-Medoids, and Random Clustering methods, the comparison results are given in Table 4, which compares the three methods by their Average Within Cluster Distance values. The best number of groups k is found by observing where the average within-cluster distance first reaches a stable slope, which occurs at k equal to 5. At 5 clusters, the performance value is -3569.258 for K-Means, -4321.351 for K-Medoids, and -9121.728 for Random Clustering. Figure 4 visualizes these results as a multi-line diagram of the average within-cluster distance values. K-Means has the value closest to zero at 5 clusters, so the best method for clustering elementary schools in Bojonegoro Regency is K-Means. For group categorization, a summary of descriptive statistics is given in Table 5.
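The comparison at 5 clusters amounts to picking the method whose reported performance value is closest to zero. A small sketch using only the three values quoted in the text (the dictionary name `performance_k5` is chosen for this example):

```python
# Average within-cluster distance at k = 5, as reported in the text.
performance_k5 = {
    "K-Means": -3569.258,
    "K-Medoids": -4321.351,
    "Random Clustering": -9121.728,
}

# The best method is the one whose value has the smallest magnitude.
best_method = min(performance_k5, key=lambda name: abs(performance_k5[name]))
print(best_method)  # K-Means
```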

Discussion
In Table 2, the variance of each observation variable differs considerably, and the data are not normally distributed. The standard deviation of all variables is large, as are the means. Outliers were identified in the observation data, but applying the K-Medoids method can handle them adequately. Figure 3 shows that the boxplots for the Students, Teachers, Classrooms, and Rombel variables all contain outliers. The largest difference between the minimum and maximum values occurs in the Students variable, so this variable needs more attention from decision-makers.
In evaluating the clustering methods, the best performance vector values of K-Means, K-Medoids, and Random Clustering all occur at k = 5 clusters. This also matches the five levels of completeness of elementary school capacity and facilities, namely Highly Complete, Complete, Fairly Complete, Less Complete, and Incomplete. In the comparison in Table 4, the best clustering method is the one with the performance vector value closest to 0 and the most stable results, which is K-Means. The clustering produces groups of elementary schools whose capacity and facilities are nearly the same within each cluster.
In Table 5, cluster_0 consists of 177 elementary schools with the lowest average number of students, teachers, classrooms, and rombel. Thus, the capacity and facilities of the elementary schools in cluster_0 are classified as Incomplete. In cluster_1, there are 310 elementary schools with the second lowest average number of students, teachers, classrooms, and rombel. Thus, cluster_1 is classified as Less Complete. In cluster_2, there are 236 elementary schools with the second highest average number of students, teachers, classrooms, and rombel. Thus, cluster_2 is classified as Complete. In cluster_3, there are 14 elementary schools with the highest average number of students, teachers, classrooms, and rombel. Thus, cluster_3 is classified as Highly Complete. In cluster_4, there are 176 elementary schools with the third highest average number of students, teachers, classrooms, and rombel. Thus, cluster_4 is classified as Fairly Complete.
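The labeling rule used above, ranking the clusters by their averages and naming them from lowest to highest, can be sketched as follows. The average values are hypothetical placeholders chosen only to reproduce the ordering described in the text; they are not the figures in Table 5.

```python
# Hypothetical average number of students per cluster (NOT the Table 5 values).
cluster_mean_students = {
    "cluster_0": 35.0,
    "cluster_1": 70.0,
    "cluster_2": 160.0,
    "cluster_3": 420.0,
    "cluster_4": 110.0,
}

# Category names from lowest to highest completeness.
categories_low_to_high = [
    "Incomplete", "Less Complete", "Fairly Complete", "Complete", "Highly Complete",
]

# Sort cluster names by their average, then pair them with the category names.
ranked = sorted(cluster_mean_students, key=cluster_mean_students.get)
category = dict(zip(ranked, categories_low_to_high))

for name in sorted(category):
    print(name, category[name])
```

In the study the four variables rank the clusters in the same order, so ranking by a single variable is enough for this illustration; if the variables disagreed, a combined score would be needed.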

CONCLUSIONS
Based on the results of this research, the following conclusions can be drawn:
1. The data on elementary school capacity and facilities contain a large number of outliers and a very wide range between the minimum and maximum values, so the observation data are far from the Normal distribution.
2. In the comparison of clustering methods, K-Means is the best method for grouping the primary school data, with 5 clusters.
3. By capacity and facilities, the schools are grouped as highly complete with 14 schools (cluster_3), complete with 236 schools (cluster_2), fairly complete with 176 schools (cluster_4), less complete with 310 schools (cluster_1), and incomplete with 177 schools (cluster_0).

ACKNOWLEDGMENT
We thank Universitas Nahdlatul Ulama Sunan Giri (UNUGIRI) for providing the computer laboratory facilities that supported this research. We also thank the Bojonegoro District Education Office for publishing the observation data on capacity and facilities on its official website.