A COMPARATIVE STUDY OF PIPELINE-VALIDATED MACHINE LEARNING CLASSIFIERS FOR PERMISSION-BASED ANDROID MALWARE DETECTION

Keywords: Android Malware, Classification, Gradient Boosting Machine, Logistic Regression, Permission-Based Detection, Random Forest

Abstract

The growing prevalence of Android malware distributed through third-party APK sideloading poses a significant security threat to users and developers. This study evaluates three machine learning algorithms, Logistic Regression (LR), Random Forest (RF), and Gradient Boosting Machine (GBM), for static Android malware detection based on permission features. The experiments use the publicly available Android Malware Prediction Dataset (Kaggle, accessed 2025), which contains 4,464 application samples with 328 binary permission attributes. A leakage-free CRISP-DM workflow was implemented, integrating data cleaning, automated feature selection via SelectKBest (Mutual Information), and hyperparameter optimisation using GridSearchCV with stratified 5-fold cross-validation. On the unseen hold-out test set, GBM achieved the best performance, with 96.05% accuracy and a ROC-AUC of 0.9924, outperforming both LR and RF. GBM also showed the best probability calibration (Brier score = 0.0344) and the best interpretability, as confirmed through SHAP analysis. An ablation study further showed that model performance saturates at 30–40 selected features. This research contributes a reproducible, pipeline-validated comparative framework for static Android malware detection that addresses the feature selection bias and data leakage that limited prior studies. Nevertheless, the study relies solely on static permission features and includes no dynamic behavioural data, which may restrict generalisation to evolving malware families.
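
The leakage-free workflow described in the abstract can be illustrated with a minimal scikit-learn sketch. It is not the authors' code: the data are a small synthetic binary matrix standing in for the permission features, and the label rule, grid values, split ratio, and random seeds are all assumptions chosen for illustration. The point it shows is that feature selection sits inside the pipeline, so SelectKBest is refit on the training portion of every cross-validation fold and never sees the hold-out test set.

```python
from functools import partial

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a binary permission matrix (NOT the Kaggle dataset).
rng = np.random.default_rng(0)
n_samples, n_features = 600, 60
X = rng.integers(0, 2, size=(n_samples, n_features))

# Hypothetical label rule: only the first five columns carry signal.
signal = X[:, :5].sum(axis=1) + rng.normal(0.0, 0.5, n_samples)
y = (signal > 2.5).astype(int)

# Hold out a stratified test set before any selection or tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Mutual information scorer for discrete (0/1) features.
mi_score = partial(mutual_info_classif, discrete_features=True, random_state=0)

# Selection lives inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([
    ("select", SelectKBest(score_func=mi_score)),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# Illustrative grid only; the paper's actual search space is not given here.
param_grid = {"select__k": [10, 30], "clf__n_estimators": [50, 100]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

# Evaluate the refit best pipeline once on the untouched hold-out set.
proba = search.predict_proba(X_test)[:, 1]
pred = search.predict(X_test)
results = {
    "accuracy": accuracy_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
    "brier": brier_score_loss(y_test, proba),
}
print(search.best_params_)
print(results)
```

Swapping the `clf` step for `LogisticRegression` or `RandomForestClassifier` would give the other two arms of the comparison under the same protocol. Running SelectKBest on the full dataset before splitting would instead let test labels influence which features are kept, which is the kind of leakage the pipeline design avoids.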

Published

2026-01-26

How to Cite

[1] A. R. Lubis, D. Wulandari, L. T. Adha, T. Ariyani, Y. Lase, and F. Lubis, “A COMPARATIVE STUDY OF PIPELINE-VALIDATED MACHINE LEARNING CLASSIFIERS FOR PERMISSION-BASED ANDROID MALWARE DETECTION,” BAREKENG: J. Math. & App., vol. 20, no. 2, pp. 1675–1692, Jan. 2026.