CABLE NEWS NETWORK (CNN) ARTICLES CLASSIFICATION USING RANDOM FOREST ALGORITHM WITH HYPERPARAMETER OPTIMIZATION

ABSTRACT


INTRODUCTION
The number of news articles on the internet has grown rapidly in a short period [1]. Such a large volume of articles needs to be grouped into categories so that readers can access them easily [2]. Articles can be grouped manually, but doing so takes a long time. Classification, the process of identifying and grouping objects or ideas into predetermined categories, automates this grouping. News article classification falls under text mining because it requires preparatory steps that convert unstructured data into structured information [3].
Research on the classification of news articles has been carried out using various algorithms and methods. Dion and Kartika (2015) classified news articles using the Naïve Bayes algorithm and support vector machine (SVM), reporting accuracies of 82.2% and 88.1%, respectively [4]. Fanny (2018) used the k-nearest neighbor (KNN) algorithm with correlation similarity and obtained 86.11% in the stemmer evaluation [5]. However, all three algorithms have shortcomings when classifying large amounts of news article data. Naïve Bayes assumes strong independence between features and therefore ignores correlations between features that play an important role in determining categories. SVM has difficulty processing large amounts of data, which frequently leads to overfitting or underfitting. Meanwhile, KNN does not learn well if the type of attribute used does not suit the dataset.
The dataset used in this research is Cable News Network (CNN) news articles from 2011 to 2022. CNN is a multi-national cable news channel that provides 24-hour news coverage. The dataset has a large size and strong correlation between its features, so it is not appropriate to use algorithms such as Naïve Bayes, SVM, and KNN for classification. Therefore, in this research, the random forest algorithm is used, which is an ensemble and multi-class classification algorithm. Random forest is an algorithm that is built from a combination of various decision trees [6]. The advantage of random forest is that it can work very well on large amounts of data. In addition, random forests can estimate features that are important in the classification process and provide experimental methods to detect correlations between features [7].
The random forest algorithm is prone to overfitting or underfitting if the combination of parameters used is not appropriate for the data. Therefore, in this research, hyperparameter optimization was carried out to find the best combination of parameters for the random forest algorithm. This research uses the search cross-validation (SearchCV) method, which selects a combination of parameters for the random forest algorithm by testing the combinations one by one and validating each of them [8].

RESEARCH METHODS
This research was conducted in four main stages, which are data acquisition, data preprocessing, classification, and evaluation.

Data Acquisition
This research uses data from Cable News Network (CNN) news articles from 2011 to 2022 [9]. Before preprocessing, the dataset contains 37,949 rows. The dataset has 11 columns: Index, Author, Date published, Category, Section, Url, Headline, Description, Keywords, Second headline, and Article text. Table 1 shows the top five rows of data used in this research before data preprocessing.

Data source: Kaggle
Based on Table 1, the variables used in this research are Index, Category, and Article Text. The Article Text variable, which contains text data, is classified into the classes available in the Category variable. News articles were grouped into six categories, namely news, sport, politics, business, health, and entertainment; the vr, travel, and style categories were excluded because they have a very small number of rows. Imbalanced data can skew the algorithm's accuracy toward the majority class [10]. Therefore, the categories with the fewest rows are removed and the remaining categories are balanced by sampling.

Data Preprocessing
At the data preprocessing stage, various data handling processes are carried out to ensure good data quality before the data is analyzed. The qualities to be ensured are accuracy, completeness, consistency, timeliness, reliability, and interpretability [11]. In this research, data preprocessing was carried out on text data and included case folding, converting numbers to words, removing special characters, removing stop words, lemmatization, stemming, splitting the data into training and testing sets, vectorization with TF-IDF, and data balancing.
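The first few text-cleaning steps listed above can be sketched in plain Python. This is only an illustrative sketch, not the code used in this research: the `preprocess` function, the tiny stop-word list, and the digit-by-digit number conversion are all assumptions made for the example, and lemmatization and stemming are omitted because they would normally rely on an NLP library.

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a full list.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "in", "on", "of", "to", "and"}

# Digit-by-digit mapping (assumption) standing in for number-to-word conversion.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def preprocess(text):
    """Return the list of cleaned tokens for one article."""
    text = text.lower()  # case folding
    # Replace every ASCII digit with its word, padded with spaces.
    text = re.sub(r"[0-9]", lambda m: " " + DIGIT_WORDS[m.group()] + " ", text)
    # Remove special characters: keep only lowercase letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    # Remove stop words.
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The U.S. economy grew 3% in 2021!")
print(tokens)
```

Converting whole numbers (for example "2021" to "two thousand twenty-one") would need a dedicated routine; the digit-wise mapping here is only meant to show where that step sits in the sequence.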

Classification
The random forest algorithm is built from a collection of decision trees. A decision tree is a tree-shaped structure with several parts, namely the root node, which is used to collect data, the inner nodes, which contain questions about the data, and the leaf nodes, which are used to solve problems and make decisions [12]. The prediction of the random forest is the class that receives the most votes across the decision trees (voting for classification), as shown in Figure 1. For a random forest consisting of $N$ trees, Equation (1) is used to predict the class label $l$ of a case $x$ through voting [7].
$$l(x) = \arg\max_{c} \sum_{n=1}^{N} I\big(h_n(x) = c\big) \qquad (1)$$
where $I$ is the indicator function and $h_n$ is the $n$-th tree of the random forest. The random forest algorithm estimates the error rate more accurately than a single decision tree. In particular, the error rate has been proven mathematically to always converge as the number of trees increases. To produce accurate and stable predictions, random forest applies the bagging method (bootstrap aggregation). Bagging is an ensemble meta-algorithm that aims to improve the accuracy of machine learning algorithms [13]. Random forest can work efficiently on large-scale datasets with high accuracy and easy-to-understand results. However, random forest requires the right combination of parameters for the dataset to avoid overfitting or underfitting. Ramadhan et al. used the SearchCV method to find the best combination of parameters for a random forest model [8]. In addition, Adnan et al. also used SearchCV to improve the performance of their classification model. SearchCV is a method of selecting a combination of parameters and models by testing the combinations one by one and validating each combination to produce the best model performance [14].
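The voting rule in Equation (1) can be illustrated with a few lines of Python. This is a minimal sketch: `majority_vote` is a hypothetical helper, and the five tree predictions are made-up values rather than outputs of the model in this research.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    votes = Counter(tree_predictions)  # counts how many trees chose each class
    # most_common(1) returns [(label, count)] for the top-voted label;
    # on a tie, the label that was seen first is returned.
    return votes.most_common(1)[0][0]


# Hypothetical predictions of five trees for a single article.
predictions = ["sport", "news", "sport", "politics", "sport"]
winner = majority_vote(predictions)
print(winner)  # sport
```

Counting the votes per class corresponds to the sum of indicator functions in Equation (1), and picking the most common label corresponds to the arg max over classes.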

Evaluation
Evaluation is the stage of measuring the performance of a trained model so that the best model can be chosen. In this research, the evaluation metrics used are accuracy and the confusion matrix. Accuracy is obtained by dividing the number of correct predictions by the total number of predictions. However, accuracy can be misleading if there is a large class imbalance, so the confusion matrix is used as an additional evaluation metric. The confusion matrix displays and compares the actual values with the predicted values in a classification case. According to Hossin and Sulaiman, the confusion matrix shows in detail how correct and incorrect predictions are distributed across the classification classes [15].
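Both metrics can be computed by hand on a small example. The sketch below assumes three classes and six made-up predictions; the helper names `accuracy` and `confusion_matrix` are chosen for illustration and are not taken from this research.

```python
def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Made-up labels for illustration only.
labels = ["news", "sport", "politics"]
y_true = ["news", "news", "sport", "sport", "politics", "politics"]
y_pred = ["news", "sport", "sport", "sport", "politics", "news"]

acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels)
print(acc)
for row in cm:
    print(row)
```

The diagonal of the matrix holds the correctly classified cases per class, so the accuracy equals the sum of the diagonal divided by the total number of cases.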

RESULTS AND DISCUSSION
In this research, the data is divided into training data and testing data with a proportion of 85:15, and then vectorization is carried out. Vectorization is the process of converting text data into numeric data. The vectorization method used is TF-IDF (Term Frequency-Inverse Document Frequency) [16]. Before training, the proportion of each category/class must be checked to avoid data imbalance. The results in Table 2 show that the data is not balanced because it is dominated by the "news" and "sport" categories. Data imbalance can cause overfitting or underfitting in the classification process, which lowers accuracy. Overfitting occurs when the model fits the training data too closely and fails to generalize, while underfitting occurs when the model does not learn the training data well enough. Therefore, the data needs to be balanced. The library used in this study is "Imblearn", which provides methods for balancing the data so that every class has the same number of rows. The method used here is random undersampling, which reduces the data in the majority categories/classes [17].
The random forest algorithm used for classification is set to random_state=5. Figure 2 shows metrics for evaluating the classification results using the random forest algorithm without hyperparameter optimization. Numbers 0, 1, 2, 3, 4, and 5 in the classification report indicate the index of each category, as shown in Table 2.
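The idea behind random undersampling can be sketched without any external library. The function below is a plain-Python stand-in, not the "Imblearn" implementation used in this research; the function name, the seed value, and the toy labels are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def random_undersample(samples, labels, seed=5):
    """Randomly drop rows so that every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so the result is reproducible
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = min(len(items) for items in by_class.values())  # minority class size

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        kept = rng.sample(items, target)  # sampling without replacement
        out_samples.extend(kept)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy imbalanced data: 6 "news", 4 "sport", 2 "health" articles.
labels = ["news"] * 6 + ["sport"] * 4 + ["health"] * 2
samples = [f"article {i}" for i in range(len(labels))]

X_bal, y_bal = random_undersample(samples, labels)
print(Counter(y_bal))
```

After undersampling, every class has as many rows as the smallest class, at the cost of discarding rows from the larger classes.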

Figure 2. Metrics for evaluating random forest algorithm classification results without hyperparameter optimization
Figure 2 shows that the accuracy on the training data is perfect, at 1, while the accuracy on the testing data is 0.79. It can be concluded that the random forest model is overfitting. Random forest is indeed suitable for large amounts of data. However, without proper tuning it often overfits or underfits. Therefore, the random forest algorithm needs to be tuned, for example by optimizing hyperparameters [18]. Hyperparameter optimization is a technique for finding the combination of parameters with which the random forest algorithm gives the best classification results [19]. Table 3 shows the combination of parameters used in the random forest algorithm training. Table 3 shows that, by default, the random forest algorithm sets the combination of parameters itself.
In this research, a search for the best combination of parameters for the random forest algorithm is carried out. The method used to find the combination of parameters is search cross-validation (SearchCV) [8]. The parameters tuned are n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf, and bootstrap. The best combination of parameters is then fitted to the random forest algorithm. Table 4 compares various combinations of parameters and the accuracy of their classification results. The value set for each parameter has a significant impact on accuracy. Trying a larger number of values for each parameter tends to produce better accuracy but takes longer to search, and vice versa. Table 4 shows that Combination 5 has the best accuracy, and the gap between its training and testing accuracy is neither too large nor too small.
Figure 3 shows the evaluation metrics for the classification results using Combination 5. Figure 3 shows that the classification of news articles into six categories has been carried out successfully. The accuracy is 0.81 on training and 0.76 on testing, which indicates that the model neither overfits nor underfits. This value also shows that the model classifies well: 76% of new data is classified correctly.
In addition, Figure 4 shows the confusion matrix, which gives the number of correct and incorrect predictions in each classification class. The sport category has the highest number of correctly predicted rows, namely 59, while the news category has the smallest number of correctly predicted rows, namely 31. Overall, the model classifies with fairly good accuracy, as can be seen from the dark color of the diagonal.
Thus, the random forest algorithm with hyperparameter optimization can classify news articles into six categories.
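The hyperparameter search described in this section can be sketched with scikit-learn's `RandomizedSearchCV`, assuming that library is available. The candidate values, the number of iterations, the number of folds, and the synthetic data below are assumptions made for illustration; they are not the combinations in Table 4, and the synthetic features merely stand in for the TF-IDF vectors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the vectorized articles (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=5)

# 85:15 split, as in this research.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=5)

# Hypothetical candidate values for the six tuned parameters.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_features": ["sqrt", "log2"],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Sample 5 random combinations and score each with 3-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=5),
    param_distributions=param_distributions,
    n_iter=5, cv=3, scoring="accuracy", random_state=5)
search.fit(X_train, y_train)

train_acc = search.score(X_train, y_train)  # accuracy of the refitted best model
test_acc = search.score(X_test, y_test)
print(search.best_params_)
print(train_acc, test_acc)
```

Comparing `train_acc` with `test_acc` mirrors the check made in Table 4: a training accuracy far above the testing accuracy suggests overfitting, while low values on both suggest underfitting.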

CONCLUSIONS
News articles need to be grouped into categories so that readers can access them easily. One of the algorithms that can be used for classification problems is random forest. In this research, the random forest algorithm worked well in classifying CNN news articles into six categories, namely news, sport, politics, business, health, and entertainment. Hyperparameter optimization of the random forest algorithm had a significant impact on the classification results. The method used was randomized search cross-validation, which searches for the best combination of parameters by sampling combinations at random. Note, however, that the search finds better combinations when the values tried for each parameter are varied and numerous, at the cost of a longer run time. In addition, hyperparameter optimization aims to avoid overfitting, underfitting, and inappropriate processing of training data. This research obtained classification results with an accuracy of 0.81 on training and 0.76 on testing. Thus, this research can be used as a reference for classifying news articles for easy access.