EVALUATING ROBERTA AND GPT-BASED MODELS FOR SDG MULTICLASS TEXT CLASSIFICATION ACROSS DIFFERENT DOCUMENT LENGTHS

  • Uswatun Hasanah, Statistics and Data Science, School of Data Science, Mathematics, and Informatics, IPB University, Indonesia. ORCID: https://orcid.org/0009-0009-4689-527X
  • Agus Mohamad Soleh, Statistics and Data Science, School of Data Science, Mathematics, and Informatics, IPB University, Indonesia. ORCID: https://orcid.org/0000-0002-2732-1985
  • Cici Suhaeni, Statistics and Data Science, School of Data Science, Mathematics, and Informatics, IPB University, Indonesia. ORCID: https://orcid.org/0009-0001-0347-3810
  • Anwar Fitrianto, Statistics and Data Science, School of Data Science, Mathematics, and Informatics, IPB University, Indonesia. ORCID: https://orcid.org/0000-0001-7050-3082
Keywords: Fine-tuning, GPT, RoBERTa, SDGs, Text Classification

Abstract

Multiclass text classification remains difficult, primarily due to semantic ambiguity and variation in input length. This study evaluates RoBERTa and GPT-based models for multiclass text classification, focusing on how prompting strategies and document length affect accuracy and robustness. Experiments were conducted on the OSDG Community Dataset, which contains approximately 15,000 labeled samples. The dataset was partitioned into four subsets by input length: short, medium, long, and all lengths combined. Three GPT variants (zero-shot, few-shot, and fine-tuned) were compared against a RoBERTa baseline. Fine-tuning was implemented via OpenAI’s supervised fine-tuning API with prompt-response formatting. Performance was assessed using macro F1-score, precision, recall, and balanced accuracy. Fine-tuned GPT achieved the strongest results across all settings, with a macro F1-score of 0.9204 on the all-combined dataset, a 4.61% improvement over RoBERTa. Consistent gains were also observed on short (8.63%), medium (3.83%), and long (20.31%) texts. The largest improvement occurred on long documents, while medium-length inputs yielded the most stable performance across models. These findings highlight the effectiveness of task-specific fine-tuning in enhancing GPT’s ability to classify SDG-related texts across diverse input lengths.
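
As a concrete illustration of the prompt-response formatting mentioned above, the sketch below converts labeled samples into the chat-formatted JSONL that OpenAI's supervised fine-tuning endpoint accepts. The system prompt, file name, and sample rows are illustrative assumptions, not the authors' exact pipeline.

    # Minimal sketch (assumed details, not the authors' exact setup):
    # write labeled SDG samples as chat-formatted JSONL for OpenAI's
    # supervised fine-tuning endpoint.
    import json

    SYSTEM_PROMPT = ("Classify the text into one of the 17 Sustainable "
                     "Development Goals. Answer with the goal number only.")

    samples = [  # stand-ins for rows from the OSDG Community Dataset
        {"text": "Access to safe drinking water remains limited in rural districts.", "sdg": 6},
        {"text": "Primary school enrollment has risen steadily over the decade.", "sdg": 4},
    ]

    with open("sdg_train.jsonl", "w", encoding="utf-8") as f:
        for s in samples:
            record = {"messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": s["text"]},
                {"role": "assistant", "content": str(s["sdg"])},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

Likewise, the reported evaluation measures (macro F1-score, precision, recall, and balanced accuracy) correspond directly to scikit-learn functions; the label vectors below are hypothetical placeholders for gold and predicted SDG classes.

    # Hedged sketch of the reported metrics using scikit-learn.
    from sklearn.metrics import (balanced_accuracy_score, f1_score,
                                 precision_score, recall_score)

    y_true = [6, 4, 13, 6, 4]   # hypothetical gold SDG labels
    y_pred = [6, 4, 13, 4, 4]   # hypothetical model predictions

    print("macro F1:         ", f1_score(y_true, y_pred, average="macro"))
    print("macro precision:  ", precision_score(y_true, y_pred, average="macro"))
    print("macro recall:     ", recall_score(y_true, y_pred, average="macro"))
    print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))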

Published
2026-04-08
How to Cite
[1]
U. Hasanah, A. Mohamad Soleh, C. Suhaeni, and A. Fitrianto, “EVALUATING ROBERTA AND GPT-BASED MODELS FOR SDG MULTICLASS TEXT CLASSIFICATION ACROSS DIFFERENT DOCUMENT LENGTHS”, BAREKENG: J. Math. & App., vol. 20, no. 3, pp. 2645–2664, Apr. 2026.