CLUSTER ANALYSIS OF MULTIVARIATE PANEL DATA ON DATA CONTAINING OUTLIERS

  • Kristuisno Martsuyanto Kapiluka, Statistics and Data Science Study Program, School of Data Science, Mathematics, and Informatics, IPB University, Indonesia. ORCID: https://orcid.org/0009-0004-5990-7798
  • Hari Wijayanto, Statistics and Data Science Study Program, School of Data Science, Mathematics, and Informatics, IPB University, Indonesia. ORCID: https://orcid.org/0000-0002-7507-2602
  • Anwar Fitrianto, Statistics and Data Science Study Program, School of Data Science, Mathematics, and Informatics, IPB University, Indonesia. ORCID: https://orcid.org/0000-0001-7050-3082
Keywords: Calinski-Harabasz, Clustering, Outlier, Panel data, Trajectory

Abstract

K-Means Longitudinal (KML) is a clustering method for panel data that considers only a single trajectory per subject over time. To address this limitation, KML was extended to K-Means Longitudinal 3D (KML3D), which clusters joint (multivariate) longitudinal data by considering multiple trajectories measured simultaneously for each subject. Both KML and KML3D cluster panel data with a non-hierarchical K-Means approach; the multivariate variant is referred to hereinafter as KML3D K-Means. KML3D K-Means applies the K-Means algorithm to trajectories in panel data and uses the mean as the cluster centroid. Because the mean is sensitive to extreme values, K-Means is less effective on data with outliers. KML3D K-Medoids, a KML3D-based method that uses the median as the centroid, can overcome this issue. This study compares KML3D K-Means and KML3D K-Medoids for clustering multivariate panel data containing outliers. Both methods are applied to panel data on social and welfare statistics from 34 provinces observed over 8 years (2016–2023), and they are compared using the Calinski–Harabasz (CH) index. KML3D K-Medoids achieves a higher CH index than KML3D K-Means on these data. The CH index identifies three clusters (k = 3) as optimal, which can be categorized as the “more prosperous”, “moderately prosperous”, and “less prosperous” groups. Growth rate analysis reveals disparities in development trajectories across clusters: cluster 3 shows the most consistent improvements, cluster 1 shows moderate progress, and cluster 2 lags in key social and welfare indicators.
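
The contrast between mean and median centroids can be illustrated with a minimal Python sketch. It is not the kml3d R package and not the authors' implementation. The function names (`cluster_trajectories`, `calinski_harabasz`, `demo_panel`), the Euclidean distance on flattened joint trajectories, the textbook form of the Calinski–Harabasz index, and the synthetic data are all assumptions made for illustration; the study's own data are not reproduced.

```python
import numpy as np


def cluster_trajectories(X, k, center="mean", n_iter=100, seed=0):
    """Lloyd-style clustering of joint trajectories.

    X has shape (n_subjects, n_times, n_variables). Each subject's joint
    trajectory is flattened and compared with Euclidean distance (assumed).
    center="mean" gives K-Means-style centroids; center="median" uses the
    element-wise median, mirroring the median centroid named in the abstract.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    flat = X.reshape(n, -1)
    # Start from k distinct subjects chosen at random.
    centers = flat[rng.choice(n, size=k, replace=False)].copy()
    aggregate = np.mean if center == "mean" else np.median
    labels = np.full(n, -1)
    for _ in range(n_iter):
        dist = np.linalg.norm(flat[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = flat[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = aggregate(members, axis=0)
    return labels, centers.reshape((k,) + X.shape[1:])


def calinski_harabasz(X, labels):
    """Textbook CH index: (between / (k - 1)) / (within / (n - k))."""
    n = X.shape[0]
    flat = X.reshape(n, -1)
    groups = np.unique(labels)
    k = len(groups)
    overall = flat.mean(axis=0)
    between = 0.0
    within = 0.0
    for j in groups:
        members = flat[labels == j]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))


def demo_panel(seed=0):
    """Synthetic panel (hypothetical): 34 subjects, 8 periods, 3 variables."""
    rng = np.random.default_rng(seed)
    levels = np.repeat([0.0, 5.0, 10.0], [12, 11, 11])
    X = levels[:, None, None] + rng.normal(0.0, 0.5, size=(34, 8, 3))
    X[0] += 40.0  # one deliberately outlying subject
    return X


X = demo_panel()
for center in ("mean", "median"):
    labels, _ = cluster_trajectories(X, k=3, center=center)
    print(center, calinski_harabasz(X, labels))
```

In this sketch a higher CH value indicates more compact, better-separated clusters; the values it prints describe only the synthetic panel and say nothing about the study's results.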

Published
2025-11-24
How to Cite
[1] K. M. Kapiluka, H. Wijayanto, and A. Fitrianto, “CLUSTER ANALYSIS OF MULTIVARIATE PANEL DATA ON DATA CONTAINING OUTLIERS”, BAREKENG: J. Math. & App., vol. 20, no. 1, pp. 0439-0452, Nov. 2025.