PERFORMANCE COMPARISON OF GRADIENT-BASED CONVOLUTIONAL NEURAL NETWORK OPTIMIZERS FOR FACIAL EXPRESSION RECOGNITION

ABSTRACT
A convolutional neural network (CNN) is a machine learning model that has achieved excellent success in recognizing human facial expressions. Technological developments have given birth to many optimizers that can be used to train CNN models. Therefore, this study focuses on implementing and comparing 14 gradient-based CNN optimizers to classify facial expressions in two datasets, namely the Advanced Computing Class 2022 (ACC22) and Extended Cohn-Kanade (CK+) datasets. The 14 optimizers are classical gradient descent, traditional momentum, Nesterov momentum, AdaGrad, AdaDelta, RMSProp, Adam, Radam, AdaMax, AMSGrad, Nadam, AdamW, OAdam, and AdaBelief. This study also provides a review of the mathematical formulas of each optimizer. Using the best default parameters of each optimizer, the CNN model is trained on the training data to minimize the cross-entropy value for up to 100 epochs. The accuracy of the trained CNN model is then measured on both training and testing data. The results show that the Adam, Nadam, and AdamW optimizers provide the best performance in model training and testing in terms of minimizing cross-entropy and the accuracy of the trained model. The three optimizers produce a cross-entropy of around 0.1 at the 100th epoch with an accuracy of more than 90% on both training and testing data. Furthermore, the Adam optimizer provides the best accuracy on the testing data for the ACC22 and CK+ datasets, namely 100% and 98.46%, respectively. Therefore, the Adam optimizer is the most appropriate optimizer for training the CNN model in the case of facial expression recognition.


INTRODUCTION
Facial expression recognition is a classification problem and is still a challenging area of research [1]. Besides being needed in human-to-human communication, facial expression recognition also plays an essential role in human-computer communication, including human-robot interaction [2]. In an automated system, for example, services can be provided according to the emotions of each customer. In other applications, facial expression recognition can also be used in virtual reality [3], augmented reality [4], and mental disease diagnosis [5].
Many methods have been developed to recognize facial expressions. In general, an automatic facial expression recognition system usually consists of four stages: data pre-processing, feature extraction, feature selection, and classification [6]. One method that gives good results is the convolutional neural network (CNN) [7]. The advantage of CNN is that the feature extraction and selection stages are carried out automatically at the feature learning stage using convolutional and sub-sampling layers. After that, the outcome of the feature learning is flattened and then used as input for the classification using fully-connected layers. Although it gives good results, the main problem of classification using CNN is the large number of parameters, so the training process requires a long computational time [8].
Various optimizers that can be used to train CNN models have been developed by many experts. The earliest and most common optimization method is gradient descent [9]. The gradient descent method is easy to implement, but its convergence slows as it approaches the optimal solution. Convergence can be accelerated by considering momentum, either traditional or Nesterov momentum. Subsequently, an optimizer with an adaptive learning rate was developed, known as AdaGrad. AdaGrad performs larger updates for infrequent parameters and smaller updates for frequent ones. AdaGrad has been extended into two methods, namely AdaDelta and RMSProp. Adaptive moment estimation (Adam) is another method that computes adaptive learning rates for each parameter and is designed to combine the advantages of AdaGrad and RMSProp. Moreover, Adam has many variations, such as rectified Adam (Radam), AdaMax, AMSGrad, Nadam, AdamW, OAdam, and AdaBelief.
With the development of many choices of optimizers that can be used to train CNN, this study aims to implement and compare 14 CNN optimizers for facial expression recognition problems. The facial expression data come from the Advanced Computing Class 2022 (ACC22) and Extended Cohn-Kanade (CK+) datasets. The performance measured is the convergence speed of each optimizer in minimizing the loss function used, namely cross-entropy. In addition, the accuracy of the CNN model on the training and testing data is also measured. The results of this study are expected to suggest an excellent optimizer for training CNN models, especially in the case of facial expression recognition.

Data Collection
This study uses two facial expression datasets, i.e., the dataset collected from applied mathematics master's degree students taking the 2022 advanced computing class (ACC22) and the dataset from the extended Cohn-Kanade (CK+) dataset. The details of each dataset are as follows.

Advanced Computing Class 2022 (ACC22) Dataset
The first dataset was collected from IPB University applied mathematics master's students taking the advanced computing course in 2022, called the Advanced Computing Class 2022 (ACC22) dataset. The data consist of images of seven students expressing three types of expressions: neutral, smile, and grin [10]. The data were augmented into 567 labelled images, then stored in grayscale at 48×48 pixels, so that in WHCN (width, height, channel, batch size) order the size of the ACC22 data is 48×48×1×567.

Extended Cohn-Kanade (CK+) Dataset
The CK+ dataset is an extension of the CK dataset. The data were collected from 123 subjects aged 18 to 50 years [11]. This study extracted the last three frames from each sequence in the CK+ dataset, resulting in 981 images labelled with seven different types of expressions, i.e., angry, disgust, fear, happy, sad, surprise, and neutral. The CK+ data were also converted to grayscale at 48×48 pixels, so that in WHCN order the size of the CK+ data is 48×48×1×981.

Convolutional Neural Network (CNN)
A convolutional neural network (CNN) is a feedforward neural network that can extract features from image data automatically using convolution structures. CNN has many advantages compared to general artificial neural networks: 1) Local connection. Each neuron in a layer is connected to only a small number of neurons in the previous layer, effectively reducing parameters and accelerating convergence; 2) Weight sharing. A group of connections can share the same weights, which reduces the number of parameters even further; 3) Sub-sampling dimensionality reduction [12]. In general, there are two stages in the CNN model, i.e., feature learning and classification. Feature learning usually consists of convolution and pooling layers, while the classification stage consists of fully-connected layers.
Convolution is essential for feature learning, producing outputs called feature maps. Four components must be defined in the convolution layer: padding, stride, kernel, and activation function. When setting the convolution kernel to a specific size, the information on the border will be lost. Therefore, padding was introduced to enlarge the input with zeros on each edge, which can adjust the output size indirectly. The number of additional borders depends on the needs and the size of the specified kernel. Once the padding is set, the convolution operation is performed according to the fixed kernel. In a two-dimensional CNN, the kernel is a matrix of a specified size containing parameter values. Each sub-matrix of the input is multiplied point by point by the kernel, then summed and passed through an activation function. The sub-matrix is shifted right and down so that each value in the input matrix is convolved. The length of the shift step is called the stride, which is employed to control the density of convolving: the larger the stride, the lower the density. After convolution, the feature maps consist of a large number of features, which is prone to causing overfitting. As a result, sub-sampling (max-pooling and average pooling) is used to avoid redundancy. After all the convolution and sub-sampling processes have been carried out, the last feature maps are flattened and used as input to the fully-connected layers for the classification stage. Figure 1 illustrates the procedure of a convolution layer with max-pooling in a two-dimensional CNN.
AlexNet comprises eight processing layers, including five convolution layers and three fully-connected layers. The hallmark of AlexNet is three successive convolution layers after two convolution and sub-sampling layers. The sub-sampling uses max-pooling, and the activation function used is the rectified linear unit (ReLU), given by f(x) = x^+ = \max(0, x).
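To make padding, stride, kernel, and pooling concrete, here is a minimal NumPy sketch of one convolution layer with ReLU followed by 2×2 max-pooling. This is an illustrative toy (the 4×4 input and the all-ones 3×3 kernel are arbitrary choices), not the Flux.jl implementation used in this study.

```python
import numpy as np

def conv2d(x, k, stride=1, pad=0):
    # Zero-pad the input on each edge, then slide the kernel with the given stride.
    x = np.pad(x, pad)
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * k)  # point-by-point multiply, then sum
    return np.maximum(out, 0)  # ReLU activation

def maxpool2d(x, size=2):
    # Keep the maximum of each non-overlapping size x size window.
    oh, ow = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
feat = conv2d(img, np.ones((3, 3)), stride=1, pad=1)  # 'same' padding: 4x4 output
pooled = maxpool2d(feat)                              # 2x2 after max-pooling
```

With padding 1 and stride 1, the feature map keeps the 4×4 input size, and pooling halves each spatial dimension, matching the size bookkeeping described above.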
In deep learning, if the estimation of parameters uses limited data, it can lead to high variance and overfitting. Therefore, dropout is employed on fully-connected layers to prevent overfitting and improve the generalization ability of the network [15].
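The inverted-dropout mechanism can be sketched as follows; the rate p = 0.5 and the fixed random seed are illustrative assumptions, not values taken from the study's model.

```python
import numpy as np

def dropout(a, p=0.5, rng=None, train=True):
    # Training: zero each activation with probability p and scale survivors by
    # 1/(1-p) so the expected activation is unchanged. Testing: identity.
    if not train or p == 0.0:
        return a
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

acts = np.ones(10)
out = dropout(acts, p=0.5)  # entries are either 0.0 (dropped) or 2.0 (rescaled)
```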

Optimizer
Before the CNN model is ready to be used, the model must be trained using training data to minimize the loss function L(\theta), which is a measure of the proximity of the model output to the actual label. This study uses cross-entropy as the loss function, given by

L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} y_i \cdot \log f(x_i; \theta), \qquad (1)

where N is the number of training samples, f is the CNN model to be learned, and \theta is the parameter to be optimized [16]. This study uses 14 optimizers to train the CNN model, with the details of each optimizer as follows.
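As an illustration of the cross-entropy computation, here is a small Python sketch with two hypothetical samples over three classes; the predicted probabilities are made up for the example and are not outputs of the study's model.

```python
import numpy as np

def cross_entropy(probs, onehot):
    # Mean negative log-likelihood of the true class over N samples.
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.sum(onehot * np.log(probs + eps), axis=1))

# Two samples, three classes (e.g. neutral / smile / grin).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
onehot = np.array([[1, 0, 0],
                   [0, 1, 0]])
loss = cross_entropy(probs, onehot)  # -(ln 0.7 + ln 0.8) / 2
```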

Gradient Descent
Gradient descent is one of the most popular algorithms for optimizing neural networks [17]. Gradient descent computes the gradient of the loss function w.r.t. the parameters over the training dataset, and the parameters are updated in the negative gradient direction to minimize the loss function:

\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta L(\theta_t), \qquad (2)

where \eta is the learning rate, which determines the step size in each iteration and thus influences the number of iterations needed to reach the optimal value [18].
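The update rule can be demonstrated on a toy one-parameter problem, minimizing f(θ) = (θ − 3)² with gradient 2(θ − 3). This Python sketch is purely illustrative; the learning rate 0.1 and iteration count are arbitrary choices, not settings from the study.

```python
def gd_step(theta, grad, lr=0.1):
    # theta_{t+1} = theta_t - eta * gradient
    return theta - lr * grad

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = 0.0
for _ in range(100):
    theta = gd_step(theta, 2 * (theta - 3))
# theta converges toward the minimizer at 3
```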

Momentum Gradient Descent
In classic gradient descent, the update speed is fixed and does not take into account updates in previous iterations (epochs). Momentum gradient descent accelerates convergence by accumulating an exponentially decaying average of past gradients, i.e.,

v_t = \gamma v_{t-1} + \eta \nabla_\theta L(\theta_t), \qquad \theta_{t+1} = \theta_t - v_t,

where \gamma is called the momentum factor and is usually set to 0.9 or similar.
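A sketch of the momentum update on the toy quadratic f(θ) = (θ − 3)², whose gradient is 2(θ − 3); the learning rate 0.01 and iteration count are illustrative choices.

```python
def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad   # accumulate an exponentially decaying velocity
    return theta - v, v

theta, v = 0.0, 0.0
for _ in range(300):
    theta, v = momentum_step(theta, v, 2 * (theta - 3))
```

The velocity term lets the iterate keep moving in a persistent descent direction, trading a little oscillation for faster progress than plain gradient descent at the same learning rate.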

Nesterov Accelerated Gradient Descent
The Nesterov accelerated gradient descent (NAG) is an improvement over the traditional momentum method [21]. In Nesterov momentum, the momentum term \gamma v_{t-1} is also applied to \theta_t when calculating the gradient of the loss function. Therefore, the updating formula of Nesterov momentum is given by

v_t = \gamma v_{t-1} + \eta \nabla_\theta L(\theta_t - \gamma v_{t-1}), \qquad \theta_{t+1} = \theta_t - v_t.

The improvement of Nesterov momentum is that it uses the gradient of the approximate future position instead of the current position.
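The look-ahead gradient evaluation can be sketched as follows on a toy quadratic f(θ) = (θ − 3)²; parameter values are illustrative only.

```python
def nag_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    # Gradient is evaluated at the look-ahead point theta - gamma * v.
    v = gamma * v + lr * grad_fn(theta - gamma * v)
    return theta - v, v

theta, v = 0.0, 0.0
for _ in range(300):
    theta, v = nag_step(theta, v, lambda t: 2 * (t - 3))
```

Note that the update needs the gradient as a callable rather than a precomputed value, since it must be evaluated at the shifted point.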

AdaGrad
Adaptive gradient descent (AdaGrad) is a refinement of the gradient descent method that adjusts the learning rate dynamically based on the historical gradients in previous iterations [22]. The updating formula of AdaGrad is given by

G_t = G_{t-1} + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \, g_t,

where \theta_t is the parameter at iteration t, g_t = \nabla_\theta L(\theta_t) is its gradient, and G_t accumulates the squared gradients. Since the learning rate changes in each iteration based on the accumulated gradients of the previous iterations, manual learning rate tuning is no longer necessary.
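A scalar sketch on the toy quadratic f(θ) = (θ − 3)² follows. The learning rate 0.5 is deliberately large for this toy, since AdaGrad's effective step size shrinks as the squared gradients accumulate; it is not a recommended default.

```python
def adagrad_step(theta, G, grad, lr=0.5, eps=1e-8):
    G = G + grad ** 2  # accumulate squared gradients over all iterations
    return theta - lr * grad / ((G + eps) ** 0.5), G

theta, G = 0.0, 0.0
for _ in range(2000):
    theta, G = adagrad_step(theta, G, 2 * (theta - 3))
```

Because G only ever grows, the step size decays monotonically, which is exactly the weakness that AdaDelta and RMSProp address next.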

AdaDelta
The main problem with AdaGrad is that the learning rate goes to zero as iterations increase, causing parameter updates to become ineffective. AdaDelta focuses only on the gradients in a window over a period and uses an exponential moving average to calculate the second-order cumulative momentum [23]. The updating formula of AdaDelta is given by

E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2,
\Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t,
E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1 - \rho) \Delta\theta_t^2,
\theta_{t+1} = \theta_t + \Delta\theta_t,

where \rho is the exponential decay parameter. With AdaDelta, there is no need to set the learning rate \eta, as it has been removed from the update rule.
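A scalar sketch on the toy quadratic f(θ) = (θ − 3)²; note that no learning rate appears anywhere in the update. The decay ρ = 0.95 and ε = 10⁻⁶ are illustrative assumptions, and AdaDelta is known to start with very small steps on such toys.

```python
def adadelta_step(theta, Eg, Ed, grad, rho=0.95, eps=1e-6):
    Eg = rho * Eg + (1 - rho) * grad ** 2               # EMA of squared gradients
    delta = -((Ed + eps) ** 0.5) / ((Eg + eps) ** 0.5) * grad
    Ed = rho * Ed + (1 - rho) * delta ** 2              # EMA of squared updates
    return theta + delta, Eg, Ed

theta, Eg, Ed = 0.0, 0.0, 0.0
for _ in range(2000):
    theta, Eg, Ed = adadelta_step(theta, Eg, Ed, 2 * (theta - 3))
```

The ratio of the two running averages acts as a self-tuned, unit-consistent step size, which is why the explicit η can be dropped.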

RMSProp
RMSProp and AdaDelta were developed independently at about the same time. Both were designed to cope with the learning rate of AdaGrad, which rapidly diminishes to zero. With a learning procedure very similar to AdaDelta, the updating formula of RMSProp is given by

E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t,

where \rho is the exponential decay parameter, usually set to 0.9 with a learning rate of 0.001 [24]. The difference between RMSProp and AdaDelta is that RMSProp still requires the learning rate to be set.
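A scalar sketch using the commonly cited defaults (ρ = 0.9, η = 0.001) on the toy quadratic f(θ) = (θ − 3)²; the iteration count is an illustrative choice.

```python
def rmsprop_step(theta, Eg, grad, lr=0.001, rho=0.9, eps=1e-8):
    Eg = rho * Eg + (1 - rho) * grad ** 2  # EMA of squared gradients
    return theta - lr * grad / (Eg ** 0.5 + eps), Eg

theta, Eg = 0.0, 0.0
for _ in range(5000):
    theta, Eg = rmsprop_step(theta, Eg, 2 * (theta - 3))
```

Dividing by the root of a decaying average (rather than AdaGrad's ever-growing sum) keeps the effective step size from collapsing to zero.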

Adam
Adaptive moment estimation (Adam) is another advanced gradient descent method that combines the adaptive learning rate and momentum methods [25]. In addition to storing an exponentially decaying average of past squared gradients, like AdaDelta and RMSProp, Adam also keeps an exponentially decaying average of past gradients, similar to the momentum method. The updating formula of Adam is given by

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t,

where \beta_1 and \beta_2 are exponential decay rates. The default values of \beta_1, \beta_2, and \epsilon are suggested to be 0.9, 0.999, and 10^{-8}, respectively.
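The two moment estimates and their bias corrections can be sketched in scalar form on the toy quadratic f(θ) = (θ − 3)², using the suggested defaults; the iteration count is illustrative.

```python
def adam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # EMA of gradients (first moment)
    v = b2 * v + (1 - b2) * grad ** 2   # EMA of squared gradients (second moment)
    mhat = m / (1 - b1 ** t)            # bias corrections (t starts at 1)
    vhat = v / (1 - b2 ** t)
    return theta - lr * mhat / (vhat ** 0.5 + eps), m, v

theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adam_step(theta, m, v, 2 * (theta - 3), t)
```

The bias corrections matter mostly in early iterations, when the zero-initialized moment estimates would otherwise understate the true gradient statistics.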

Radam
Due to the limited number of samples in the early stage of model training, the adaptive learning rate has an undesirably large variance, which can cause the model to converge to suspicious or bad local optima. Rectified Adam (Radam) is a variation of Adam that overcomes this issue [26]. Radam considers the existence of high variance by estimating the degrees of freedom \rho_t in each iteration, i.e.,

\rho_\infty = \frac{2}{1 - \beta_2} - 1, \qquad \rho_t = \rho_\infty - \frac{2 t \beta_2^t}{1 - \beta_2^t}.

If the variance is tractable, i.e., \rho_t > 4, then the updating formula of Radam is given by

r_t = \sqrt{\frac{(\rho_t - 4)(\rho_t - 2)\,\rho_\infty}{(\rho_\infty - 4)(\rho_\infty - 2)\,\rho_t}}, \qquad \theta_{t+1} = \theta_t - \eta \, r_t \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}.

Otherwise, Radam falls back to the un-adapted momentum update \theta_{t+1} = \theta_t - \eta \hat{m}_t.
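The two-branch logic can be sketched in scalar form on the toy quadratic f(θ) = (θ − 3)². The defaults follow Adam's; the iteration count is generous because the rectification term r_t keeps early adaptive steps small.

```python
def radam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999):
    rho_inf = 2 / (1 - b2) - 1
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    mhat = m / (1 - b1 ** t)
    rho = rho_inf - 2 * t * b2 ** t / (1 - b2 ** t)  # estimated degrees of freedom
    if rho > 4:  # variance is tractable: rectified adaptive update
        r = ((rho - 4) * (rho - 2) * rho_inf
             / ((rho_inf - 4) * (rho_inf - 2) * rho)) ** 0.5
        step = lr * r * mhat / ((v / (1 - b2 ** t)) ** 0.5 + 1e-8)
    else:        # early iterations: fall back to un-adapted momentum
        step = lr * mhat
    return theta - step, m, v

theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 10001):
    theta, m, v = radam_step(theta, m, v, 2 * (theta - 3), t)
```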

AdaMax
The factor v_t in Adam scales the gradient inversely proportionally to the \ell_2 norm of the past and current gradients. AdaMax is a variant of Adam that considers the \ell_\infty norm rather than \ell_2 [25]. In AdaMax, the value of v_t is replaced with u_t, so that the updating formula of AdaMax is given by

u_t = \max(\beta_2 u_{t-1}, |g_t|), \qquad \theta_{t+1} = \theta_t - \frac{\eta}{1 - \beta_1^t} \, \frac{m_t}{u_t}.

Bias correction for u_t is not necessary because u_t relies on the max operation and is therefore not biased towards zero, unlike m_t and v_t in Adam.
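A scalar sketch on the toy quadratic f(θ) = (θ − 3)²; the learning rate 0.002 follows the AdaMax paper's suggestion, while the iteration count and the small denominator guard are illustrative additions.

```python
def adamax_step(theta, m, u, grad, t, lr=0.002, b1=0.9, b2=0.999):
    m = b1 * m + (1 - b1) * grad
    u = max(b2 * u, abs(grad))  # infinity-norm accumulator; no bias correction
    return theta - (lr / (1 - b1 ** t)) * m / (u + 1e-8), m, u

theta, m, u = 0.0, 0.0, 0.0
for t in range(1, 4001):
    theta, m, u = adamax_step(theta, m, u, 2 * (theta - 3), t)
```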

AMSGrad
AMSGrad slightly modifies the Adam optimizer to provide the algorithm with long-term memory of past gradients [27], [28]. AMSGrad changes the correction of v_t, denoted by \tilde{v}_t to avoid confusion with Adam. Thus, the updating formula of AMSGrad is given by

\tilde{v}_t = \max(\tilde{v}_{t-1}, v_t), \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\tilde{v}_t} + \epsilon} \, m_t.

Note that AMSGrad does not use the bias-corrected version of m_t, but m_t itself.
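A scalar sketch on the toy quadratic f(θ) = (θ − 3)². Because the running maximum keeps the denominator at its historical peak, final convergence on this toy is slow, hence the generous (and purely illustrative) iteration count.

```python
def amsgrad_step(theta, m, v, vmax, grad, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    vmax = max(vmax, v)  # long-term memory: denominator never decreases
    return theta - lr * m / (vmax ** 0.5 + eps), m, v, vmax

theta, m, v, vmax = 0.0, 0.0, 0.0, 0.0
for _ in range(20000):
    theta, m, v, vmax = amsgrad_step(theta, m, v, vmax, 2 * (theta - 3))
```

The monotone denominator guarantees a non-increasing effective learning rate, which is the property used in AMSGrad's convergence argument.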

Nadam
Nesterov-accelerated adaptive moment estimation (Nadam) is a combination of the Nesterov momentum and Adam optimizers. The updating formula of Nadam is given by

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1) g_t}{1 - \beta_1^t} \right).

The derivation of the Nadam formula can be studied further in [29], [18].
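A scalar sketch of the blended update on the toy quadratic f(θ) = (θ − 3)², with Adam's defaults; iteration count is illustrative.

```python
def nadam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    # Nesterov-style blend of the corrected momentum and the current gradient.
    m_bar = b1 * mhat + (1 - b1) * grad / (1 - b1 ** t)
    return theta - lr * m_bar / (vhat ** 0.5 + eps), m, v

theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = nadam_step(theta, m, v, 2 * (theta - 3), t)
```

Mixing in the current gradient plays the same look-ahead role that evaluating at θ − γv plays in plain Nesterov momentum.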

AdamW
Adam with decoupled weight decay (AdamW) is a variant of Adam that repairs its weight decay regularization. The updating formula of AdamW is given by

\theta_{t+1} = \theta_t - \eta_t \left( \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right),

where \lambda is the weight decay value. Note that, unlike Adam with \ell_2 regularization, which adds the decay term to the gradient of the parameter \theta (i.e., g_t = \nabla L(\theta_t) + \lambda \theta_t), AdamW decouples the weight decay from the gradient-based update and applies it directly to the parameters. To account for possible scheduling of both \eta and \lambda, AdamW introduces a scaling factor \eta_t delivered by a user-defined procedure SetScheduleMultiplier(t) [30]. Within the i-th run, the value of \eta_t decays according to a cosine annealing learning rate schedule for each batch [31].
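A scalar sketch of the decoupled update on the toy quadratic f(θ) = (θ − 3)². For simplicity this follows the common implementation form (decay scaled by the learning rate, constant schedule multiplier of 1); the decay value 0.01 and iteration count are illustrative assumptions.

```python
def adamw_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    # Decoupled weight decay: shrink the weights directly, not via the gradient.
    return theta - (lr * mhat / (vhat ** 0.5 + eps) + lr * wd * theta), m, v

theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adamw_step(theta, m, v, 2 * (theta - 3), t)
```

Because the decay bypasses the adaptive denominator, all weights are regularized at the same relative rate, which is the point of the decoupling.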

OAdam
Optimistic Adam (OAdam) is a variant of Adam that adds an "optimistic" term suitable for adversarial training. The updating formula of OAdam is given by

\theta_{t+1} = \theta_t - 2\eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \eta \, \frac{\hat{m}_{t-1}}{\sqrt{\hat{v}_{t-1}} + \epsilon}.

Optimistic Adam is claimed to be able to achieve high inception scores after very few epochs of training [32].
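A scalar sketch of the optimistic update on the toy quadratic f(θ) = (θ − 3)²; the hyperparameters follow Adam's defaults and the iteration count is illustrative. When successive adapted gradients agree, the two terms collapse to a plain Adam step.

```python
def oadam_step(theta, m, v, prev, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    cur = (m / (1 - b1 ** t)) / ((v / (1 - b2 ** t)) ** 0.5 + eps)
    # Optimistic step: twice the current adapted gradient minus the previous one.
    return theta - 2 * lr * cur + lr * prev, m, v, cur

theta, m, v, prev = 0.0, 0.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v, prev = oadam_step(theta, m, v, prev, 2 * (theta - 3), t)
```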

AdaBelief
The AdaBelief optimizer is another variant of the well-known Adam optimizer that introduces no extra parameters. Specifically, in Adam, the update direction is \hat{m}_t / \sqrt{\hat{v}_t}, where v_t is the EMA of g_t^2; in AdaBelief, the update direction is \hat{m}_t / \sqrt{\hat{s}_t}, where s_t is the EMA of (g_t - m_t)^2 [33]. Thus, the updating formula of AdaBelief is given by

s_t = \beta_2 s_{t-1} + (1 - \beta_2)(g_t - m_t)^2 + \epsilon, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{s}_t} + \epsilon} \, \hat{m}_t,

where \hat{s}_t = s_t / (1 - \beta_2^t). AdaBelief adaptively scales the step size by the difference between the predicted and observed gradients. AdaBelief is claimed to be the first optimizer to achieve three goals simultaneously: fast convergence as in adaptive methods, good generalization as in SGD, and training stability in complex settings such as GANs [33].
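A scalar sketch on the toy quadratic f(θ) = (θ − 3)², with Adam's defaults; the iteration count is illustrative. Because the gradient here varies smoothly, the "belief" term (g − m)² stays small and AdaBelief takes larger steps than Adam would.

```python
def adabelief_step(theta, m, s, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * (grad - m) ** 2 + eps  # EMA of the gradient 'surprise'
    mhat = m / (1 - b1 ** t)
    shat = s / (1 - b2 ** t)
    return theta - lr * mhat / (shat ** 0.5 + eps), m, s

theta, m, s = 0.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, s = adabelief_step(theta, m, s, 2 * (theta - 3), t)
```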
The CNN model in Figure 2 was trained using the 14 optimizers mentioned above to classify facial expressions in the ACC22 and CK+ datasets. The training process uses 80% of the data, and the remaining 20% is used for testing. The hyperparameter values of each optimizer use their respective best default settings, as in Table 1. Moreover, the value of \epsilon is set to 10^{-8} for the optimizers that require it.

Performance Metrics
After the model is trained, its performance is assessed using a confusion matrix based on the training and testing data. For the case of n-class classification, the confusion matrix obtained can be seen in Figure 3. The value of c_{1,1} indicates the amount of data that is actually class-1 and is also predicted by the model as class-1, while c_{1,2} shows data that is actually class-1 but is predicted as class-2 by the model. The same interpretation applies to the values c_{1,3}, c_{2,1}, up to c_{3,3}. The model's accuracy can be calculated using the resulting confusion matrix. For the case of classification with n classes, the accuracy is determined by

\text{accuracy} = \frac{\sum_{i=1}^{n} c_{i,i}}{\sum_{i=1}^{n} \sum_{j=1}^{n} c_{i,j}}, \qquad (18)

i.e., the ratio between the sum of the main diagonal (trace) and the sum of all elements of the confusion matrix.
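The accuracy computation from a confusion matrix can be sketched as follows; the 3-class matrix shown is hypothetical, not one of the study's results.

```python
import numpy as np

def accuracy(conf):
    # trace / total: correctly classified samples over all samples
    return np.trace(conf) / conf.sum()

# Hypothetical 3-class confusion matrix (rows: actual class, cols: predicted).
C = np.array([[22, 3, 0],
              [0, 25, 0],
              [1, 0, 24]])
acc = accuracy(C)  # (22 + 25 + 24) / 75
```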

Device Requirement
The entire computing process in this study was carried out using a Lenovo Ideapad 330-14AST with an AMD A4-9125 Radeon R3 processor, 4 cores (2C+2G) at 2.30 GHz, and 8 GB RAM. This study uses the Julia programming language version 1.6.5, which is fast, dynamic, and open source. The pre-processing stage was carried out using the Images.jl package, while the CNN model construction and training used the Flux.jl package available in Julia.

Pre-processing
The datasets are randomly partitioned into two parts for training and testing by setting the seed of the random generator to 1. With an 80:20 ratio, the training and testing data obtained are 454 and 113 images for ACC22, and 785 and 196 images for CK+, respectively.

Performance Loss Function
Using the training data, each optimizer is used to train the CNN model to minimize the cross-entropy value. The training process is carried out for up to 100 epochs, and the same procedure was run 10 times to avoid computational bias. The following are the results of the cross-entropy (loss function) propagation and the computational time required for the ACC22 and CK+ datasets. Figure 4 compares all optimizers in minimizing the cross-entropy value in the first and second runs. Both figures (4A and 4B) show that Adam, AdamW, and Nadam are the three fastest optimizers in training the CNN model for facial expression recognition on the ACC22 dataset. In only 100 epochs, the three optimizers obtained a cross-entropy of less than 0.1, although the fastest optimizer differed between the first and second runs: Adam is the fastest optimizer in the first run but slower than Nadam and AdamW in the second run. Furthermore, several optimizers can train the model at moderate convergence speeds, including AMSGrad, AdaMax, RMSProp, OAdam, Radam, and AdaDelta. There are behavioural differences in the AdaBelief optimizer, which provides a low convergence speed in the first run and a moderate speed in the second run. Classic optimizers such as gradient descent, momentum, Nesterov, and AdaGrad provide very slow convergence speeds. Even worse, AdaGrad could not train the CNN model, as indicated by an increase in the cross-entropy, which became stuck around the value of 10. The same pattern is also seen in Figure 5, which shows the epoch-to-epoch cross-entropy propagation on the CK+ dataset. The Adam, Nadam, and AdamW optimizers remain the fastest. The behaviour of the AdaBelief optimizer also looks unstable on this dataset: in the first run, AdaBelief belongs to the slow optimizer group, but it has a fast convergence speed in the second run.
Even better, the cross-entropy value it obtained at the 100th epoch is smaller than those of the Adam, Nadam, and AdamW optimizers. Differences are also seen in the AMSGrad optimizer. On the ACC22 dataset, AMSGrad can train the CNN model at a moderate speed, but it is very slow on the CK+ dataset, where the cross-entropy value becomes stuck at 1.8. To further generalize the results obtained, the cross-entropy value at the 100th epoch of each optimizer over the 10 runs is visualized as a boxplot, as shown in Figure 6. Figures 6A and 6B re-emphasize that the most consistently fast optimizers for training CNN models in facial expression recognition are Adam, Nadam, and AdamW. Although it occasionally achieved good results, the AdaBelief optimizer generally worked slowly. The same holds for AMSGrad, which had a moderate convergence speed on the ACC22 dataset but was very slow on the CK+ dataset. Moreover, Figure 6C shows the minimum computational time required for each optimizer to train the CNN model. The minimum (fastest) time is reported rather than all times because slow computation can be caused by other factors, such as interference on the device. Based on Figure 6C, the computation times of the optimizers to reach the 100th epoch are not significantly different. On the ACC22 dataset, the minimum computation time is around 500 s, while the CK+ dataset requires a minimum computational time of around 900 s.

Performance Accuracy
After comparing the resulting loss function after 100 epochs, the accuracy of each optimizer is calculated using Equation 18. The accuracy results from the 10 runs are shown in the boxplots in Figure 7 for both training and testing data. The accuracy value strongly correlates with the loss function (cross-entropy), which is used as the target in model training: the lower the cross-entropy value, the better the accuracy of the model. The best accuracy after 100 epochs is obtained using the Adam, Nadam, and AdamW optimizers. These optimizers provide accuracy values above 90% on both training and testing data for the ACC22 and CK+ datasets. Furthermore, optimizers with moderate convergence speed, such as RMSProp, AdaMax, and Radam (and additionally AMSGrad for the ACC22 dataset), yield an accuracy of around 80%. The best accuracy of each optimizer on the testing data can be seen in Table 2, both for the ACC22 and CK+ datasets. Referring to Table 2, the three optimizers Adam, Nadam, and AdamW were able to classify facial expressions in the ACC22 dataset with 100% accuracy on the testing data in the 3rd, 2nd, and 4th runs, respectively. Meanwhile, the Adam and AdaBelief optimizers provide the highest accuracy on the CK+ dataset, 98.46%, in the 3rd and 2nd runs, respectively. However, even though it can give good results, AdaBelief does not provide them consistently; thus, the Adam optimizer is more reliable for training CNN models. On the ACC22 dataset, the best optimizer produces an accuracy of 100%, so the confusion matrix is a diagonal matrix, indicating that each facial expression is predicted accurately. Meanwhile, the confusion matrix of the best result on the CK+ dataset, obtained using the Adam optimizer, is shown in Table 3. According to Table 3, misclassification occurs when predicting angry facial expressions: three of the 25 angry test images were predicted as sad by the CNN model.

CONCLUSIONS
This study focuses on implementing and comparing the performance of gradient-based CNN model optimizers for facial expression recognition. Using the ACC22 and CK+ datasets, the Adam, Nadam, and AdamW optimizers provide the best performance in model training and testing in terms of minimizing cross-entropy and the accuracy of the trained model. The three optimizers produce a cross-entropy of around 0.1 at the 100th epoch with an accuracy of more than 90% on both training and testing data. Furthermore, the Adam optimizer provides the best accuracy on the testing data for the ACC22 and CK+ datasets, namely 100% and 98.46%, respectively. Therefore, the Adam optimizer is the most appropriate optimizer for training the CNN model in the case of facial expression recognition.