A COMPARATIVE STUDY OF PIPELINE-VALIDATED MACHINE LEARNING CLASSIFIERS FOR PERMISSION-BASED ANDROID MALWARE DETECTION
Abstract
The growing prevalence of Android malware distributed through third-party APK sideloading poses a significant security threat to users and developers. This study evaluates the effectiveness of three machine learning algorithms, Logistic Regression (LR), Random Forest (RF), and Gradient Boosting Machine (GBM), for static Android malware detection based on permission features. The experiment uses the publicly available Android Malware Prediction Dataset (Kaggle, accessed 2025), which contains 4,464 application samples with 328 binary permission attributes. A leakage-free CRISP-DM workflow was implemented, integrating data cleaning, automated feature selection via SelectKBest (Mutual Information), and hyperparameter optimisation using GridSearchCV with stratified 5-fold cross-validation. Results on the unseen hold-out test set show that GBM achieved the best performance, with 96.05% accuracy and 0.9924 ROC-AUC, outperforming LR and RF. In addition, GBM exhibited superior probability calibration (Brier Score = 0.0344) and interpretability, as confirmed through SHAP analysis. The ablation study further showed that model performance saturates at 30–40 selected features. This research contributes a reproducible, pipeline-validated comparative framework for static Android malware detection, addressing the limitations of prior studies regarding feature selection bias and data leakage. Nevertheless, the study is limited by its reliance on static permission features and the absence of dynamic behavioural data, which may restrict generalisation to evolving malware families.
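The leakage-free workflow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, replaces the Kaggle dataset with small synthetic binary "permission" data, and uses a hypothetical, much smaller hyperparameter grid. The function names `make_synthetic_permissions` and `run_sketch` are invented for this example. The key point is that the feature selector sits inside the same `Pipeline` as the classifier, so it is refit on the training portion of every cross-validation fold.

```python
from functools import partial

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline


def make_synthetic_permissions(n_samples=600, n_features=40, seed=0):
    """Hypothetical stand-in for the real data: random 0/1 'permissions'."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_samples, n_features))
    # Label is a majority vote of the first three columns, with 5% label noise.
    y = (X[:, :3].sum(axis=1) >= 2).astype(int)
    flip = rng.random(n_samples) < 0.05
    y[flip] = 1 - y[flip]
    return X, y


def run_sketch(seed=0):
    X, y = make_synthetic_permissions(seed=seed)
    # Split off the hold-out test set before any fitting, so it stays unseen.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Mutual-information scorer configured for discrete (binary) features.
    mi = partial(mutual_info_classif, discrete_features=True, random_state=seed)
    # Selector and classifier share one Pipeline, so the selector is refit on
    # the training part of each CV fold and never sees the validation rows.
    pipe = Pipeline([
        ("select", SelectKBest(score_func=mi)),
        ("gbm", GradientBoostingClassifier(random_state=seed)),
    ])
    # Assumed toy grid; the grid used in the study is not given in the abstract.
    grid = {"select__k": [5, 20], "gbm__n_estimators": [50, 100]}
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=cv)
    search.fit(X_train, y_train)
    # Evaluate the refit best pipeline once on the hold-out set.
    proba = search.predict_proba(X_test)[:, 1]
    return {
        "best_params": search.best_params_,
        "accuracy": accuracy_score(y_test, (proba >= 0.5).astype(int)),
        "roc_auc": roc_auc_score(y_test, proba),
        "brier": brier_score_loss(y_test, proba),
    }


if __name__ == "__main__":
    print(run_sketch())
```

Because `select__k` is tuned inside the grid search, repeating the run over a wider range of `k` values would give a simple form of the feature-count ablation mentioned above. The scores printed here come from synthetic data and are not comparable to the reported results.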
Copyright (c) 2026 Arif Ridho Lubis, Dewi Wulandari, Lilis Tiara Adha, Tika Ariyani, Yuyun Lase, Fahdi Saidi Lubis

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this Journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work’s authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal’s published version of the work (e.g. acknowledgement of its initial publication in this journal).
- Authors are permitted and encouraged to post their work online (e.g. in institutional repositories or on their websites) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published works.






