OPTIMIZING LONG TEXT CLASSIFICATION PERFORMANCE THROUGH KEYWORD-BASED SENTENCE SELECTION: A CASE STUDY ON ONLINE NEWS CLASSIFICATION FOR INDONESIAN GDP GROWTH-RATE DETECTION

  • Dinda Pusparahmi Sholawatunnisa, Statistical Computing Department, Politeknik Statistika STIS, Indonesia
  • Lya Hulliyyatus Suadaa, Statistical Computing Department, Politeknik Statistika STIS, Indonesia, https://orcid.org/0000-0001-5949-7873
Keywords: Extractive Summarization, Keyword-Based Sentence Selection, Long-Text Handling

Abstract

Efficient handling of lengthy text, particularly online news, is crucial to the performance of long-text classification. This study explores ways to streamline the Gross Domestic Product (GDP) computation process using modern data analytics, Natural Language Processing (NLP), and online news sources. Online news provides real-time information that promises to improve the accuracy and timeliness of economic indicators such as GDP. However, the complexity of extensive textual data is a challenge that demands advanced NLP techniques. To address it, this research shifts from traditional word-weight-based methods to keyword-based extractive summarization. This tailored approach ensures that the selected sentences align with keywords relevant to the research case, here GDP growth-rate detection. The study emphasizes the need to adapt summarization methods so that they effectively capture the information that matters in a specific research context. The classification results show that sentence selection improved classification accuracy, with an average increase of 0.0226 for machine learning models and 0.0164 for transfer learning models. Sentence selection also shortened processing time during hyperparameter tuning and fine-tuning when run on the same computational resources.
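
The keyword-based sentence selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword list, the sample article, and the function names `split_sentences` and `select_sentences` are assumptions made for illustration, and the sentence splitter is deliberately naive. The idea is to keep only the sentences that mention at least one task-relevant keyword, so that the classifier receives a shorter and more focused input.

```python
import re

# Hypothetical keyword list for illustration only; the keywords actually
# used in the study are not given in the abstract.
KEYWORDS = ["pdb", "pertumbuhan ekonomi"]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def select_sentences(text, keywords, max_sentences=None):
    """Keep sentences that contain at least one keyword, in original order."""
    # Whole-word, case-insensitive match for each keyword or key phrase.
    patterns = [
        re.compile(r"\b" + re.escape(k) + r"\b", re.IGNORECASE)
        for k in keywords
    ]
    selected = [
        s for s in split_sentences(text)
        if any(p.search(s) for p in patterns)
    ]
    if max_sentences is not None:
        selected = selected[:max_sentences]
    return " ".join(selected)


# Invented sample text, not taken from the study's dataset.
article = (
    "Cuaca di Jakarta cerah hari ini. "
    "PDB Indonesia tumbuh pada kuartal ini. "
    "Pertandingan sepak bola berakhir imbang. "
    "Pemerintah optimistis pertumbuhan ekonomi tetap terjaga."
)

summary = select_sentences(article, KEYWORDS)
print(summary)
```

In this sketch the two sentences about weather and football are dropped, and the shortened text would then be passed to the classifier in place of the full article. A real pipeline would likely need a language-aware sentence tokenizer and a curated keyword list.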

Published
2024-05-25
How to Cite
D. Sholawatunnisa and L. Suadaa, “OPTIMIZING LONG TEXT CLASSIFICATION PERFORMANCE THROUGH KEYWORD-BASED SENTENCE SELECTION: A CASE STUDY ON ONLINE NEWS CLASSIFICATION FOR INDONESIAN GDP GROWTH-RATE DETECTION”, BAREKENG: J. Math. & App., vol. 18, no. 2, pp. 1081-1094, May 2024.