CABLE NEWS NETWORK (CNN) ARTICLES CLASSIFICATION USING RANDOM FOREST ALGORITHM WITH HYPERPARAMETER OPTIMIZATION

  • Dewi Retno Sari Saputro Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Sebelas Maret, Indonesia
  • Krisna Sidiq Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Sebelas Maret, Indonesia
Keywords: Classification, Hyperparameter, Random forest

Abstract

News articles on the internet grow rapidly and in large volumes, so they need to be grouped into categories to remain easy to access. One way of grouping news articles is classification, and one classification method is the random forest, which is built from decision trees. This research discusses the application of the random forest algorithm to classify news articles into six categories: business, entertainment, health, politics, sport, and news. The data used are Cable News Network (CNN) articles from 2011 to 2022. The data are textual and large in volume, so careful handling is needed to avoid overfitting and underfitting. The random forest is well suited to these data because the algorithm performs well on large datasets. However, a random forest is difficult to interpret when the combination of parameters is not appropriate for the data processing. Therefore, hyperparameter optimization is needed to discover the best combination of parameters for the random forest. This research uses the search cross-validation (SearchCV) method to optimize the random forest hyperparameters by testing candidate combinations one by one and validating each of them. The resulting classifier assigns news articles to the six categories with an accuracy of 0.81 on the training set and 0.76 on the testing set.
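To make the workflow concrete, the following is a minimal sketch of random forest text classification with grid-search cross-validation in scikit-learn. It is not the authors' exact code: the dataset file name, column names, TF-IDF settings, and the hyperparameter grid are illustrative assumptions.

    # Sketch (assumed setup, not the authors' code): random forest on TF-IDF features,
    # with GridSearchCV testing each hyperparameter combination via k-fold cross-validation.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline

    # Hypothetical CSV with the article text and one of the six category labels
    df = pd.read_csv("cnn_articles_2011_2022.csv")  # assumed file name
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["category"],
        test_size=0.2, random_state=42, stratify=df["category"],
    )

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english", max_features=20000)),
        ("rf", RandomForestClassifier(random_state=42)),
    ])

    # Candidate combinations; SearchCV evaluates each one and keeps the best validated model
    param_grid = {
        "rf__n_estimators": [100, 300, 500],
        "rf__max_depth": [None, 20, 50],
        "rf__max_features": ["sqrt", "log2"],
    }
    search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
    search.fit(X_train, y_train)

    print("best parameters:", search.best_params_)
    print("training accuracy:", search.score(X_train, y_train))
    print("testing accuracy:", search.score(X_test, y_test))

Under this kind of setup, the gap between training and testing accuracy (0.81 versus 0.76 in the paper) is the usual indicator that the tuned forest is not badly overfitting.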

Published
2023-06-11
How to Cite
[1] D. Saputro and K. Sidiq, “CABLE NEWS NETWORK (CNN) ARTICLES CLASSIFICATION USING RANDOM FOREST ALGORITHM WITH HYPERPARAMETER OPTIMIZATION”, BAREKENG: J. Math. & App., vol. 17, no. 2, pp. 0847-0854, Jun. 2023.