COMPARING GAUSSIAN KERNEL AND QUADRATIC SPLINE OF NONPARAMETRIC REGRESSION IN MODELING INFECTIOUS DISEASES

ABSTRACT


INTRODUCTION
The Sustainable Development Goals (SDGs) contain several objectives, one of which is to "ensure healthy lives and promote well-being for all at all ages", and one of its targets is to "end infectious diseases" [1], [2]. Due to the pandemic and other ongoing crises, progress toward universal health coverage is being hampered, exacerbating existing health inequalities and jeopardizing the realization of this goal. Health systems in low- and middle-income countries have been particularly challenged because they already had inadequate resources prior to the outbreak [1]. Since Indonesia is a middle-income country, it must pay close attention to infectious diseases [3]. Tuberculosis, diarrhoeal disease, and pneumonia (lower respiratory infections) are reported to be the top three infectious diseases causing death in Indonesia for both sexes and all ages [4]. In addition to these three diseases, coronavirus disease (COVID-19), which WHO declared a pandemic on March 11, 2020, is another infectious disease included in the disease control category by the Indonesian Ministry of Health [5].
Nonparametric regression can be used to analyze infectious diseases since they do not follow specific assumptions or patterns. Regression analysis is a statistical technique to explore and model relationships between variables in a dataset [6]. In regression, the estimated function is referred to as the regression function or regression curve, and it describes the relationship between the dependent and independent variables [7], [8]. According to [9], for the proper interpretation of data, observational errors must be minimized to focus on the essential details of the mean dependence between the independent and dependent variables. The mean function can be approximated in two ways, namely with parametric and nonparametric techniques [9].
Based on [8], the parametric regression model presumes that the form of f(x) is known and can depend on the parameters in a linear or nonlinear fashion. The fitted curve is chosen from the family of curves that the model allows. Essentially, experimenters select a set of curves from the collection of all curves and input them into the inferential process. Consequently, the data provide limited information about the model beyond its assumed parametric form [8].
In contrast, as stated by [8], nonparametric regression techniques rely on the data to determine f(x). The regression curve is assumed to belong to some infinite-dimensional collection of functions; for instance, f(x) may be assumed to have a square-integrable second derivative. This allows great flexibility in the form of the regression curve and, in particular, makes no assumptions regarding a parametric model. When specifying this regression model, the experimenter chooses a function space that contains the unknown regression function, which is usually motivated by the assumption that the regression function possesses some smoothness (i.e., continuity and differentiability). The data are then used to determine an element of this function space that is representative of the unknown curve.
Compared to other kernel functions, the Gaussian kernel is considered smoother and has an optimal criterion for selecting the bandwidth for normal or near-normal data, which enables a fine and precise balance between fitting and smoothing [33], [36]. Meanwhile, for splines, [37] indicates that linear splines are generally used for simple data patterns, whereas quadratic splines are usually used for complex data patterns. This statement is consistent with the research in [38], which found quadratic splines more appropriate than linear splines for modeling toddler growth data (a complex set of data with outliers). To date, no research has compared the two nonparametric regressions, namely the Gaussian kernel and the quadratic spline, in infectious disease studies. Thus, this research compares both regressions for four infectious diseases in Indonesia in 2021. The research aims to provide insight into nonparametric regression in infectious disease case studies.

Data
The data are secondary data from the Ministry of Health and the Central Bureau of Statistics (BPS) in Indonesia. This study examines infectious diseases in Indonesia in 2021, and the sample consists of four of them: tuberculosis, diarrhoeal disease, pneumonia, and COVID-19.

a. Tuberculosis
Tuberculosis (T) is caused by a bacterium called Mycobacterium tuberculosis, which attacks the lungs and other organs in the body; pulmonary tuberculosis is responsible for 80% of all cases [39], [40]. Gender is one of the factors that influence the number of tuberculosis cases: men are twice as likely as women to contract this disease because their high mobility makes them more likely to be exposed. In addition, Indonesia had a higher number of men than women in 2021, both nationally and provincially [5], [41], [42]. Thus, tuberculosis cases are the dependent variable (YT), while the male population is the independent variable (XT).

b. Diarrhoeal
Diarrhoeal disease (D), the second leading cause of death in children under five years of age, is defined as the passing of loose stools together with an increase in stool frequency, weight, or volume, and it is usually associated with long-term health issues such as malnutrition, stunted growth, and immune system defects [43]- [45]. According to previous research, exclusive breastfeeding up to 6 months of age lowers diarrhea risk, whereas most children under 2 consume infant formula [44], [46]. In addition, infants in the lowest and medium breastfeeding performance index (BPI) categories were 2.22 and 2.15 times more likely, respectively, to develop diarrhea than infants with the highest BPI [47]. Thus, diarrhoeal cases are the dependent variable (YD), while exclusively breastfed infants are the independent variable (XD).

c. Pneumonia
Pneumonia (P), the leading cause of death for children under five each year, is a viral, bacterial, or fungal respiratory tract infection that attacks lung tissues [48], [49]. Previous research found that low birth weight was associated with severe pneumonia and increased mortality risk for children under five [50]- [52]. Thus, pneumonia cases are the dependent variable (YP), while low birth weight infants are the independent variable (XP).

d. COVID-19
COVID-19 (C) is a highly contagious disease caused by a type of coronavirus that has been spreading worldwide, forcing most countries to recommend or require restrictive measures such as home isolation and mask-wearing [53]- [55]. Females, rather than males, account for most COVID-19 cases in many countries, including Indonesia [5], [56]- [58]. In addition, females were found to be more likely than males to suffer from long COVID [59], [60]. Thus, COVID-19 cases are the dependent variable (YC), while the female population is the independent variable (XC).

Correlation and Scatter Plot
A correlation analysis was conducted to determine whether there was a significant relationship between the dependent (Y) and independent (X) variables. According to [61]- [63], the correlation coefficient, commonly expressed as the Pearson product-moment correlation, measures the strength of an association in which two variables play a similar role and can be substituted for each other. For n paired observations (x_i, y_i), it is written as follows:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (1)$$
It takes values between -1 and 1, where -1 or 1 implies a perfect linear relationship, and the association is characterized as weak, moderate, or strong according to the magnitude of r.
A scatter plot was constructed to determine whether the form of the regression function is known. It displays the pattern of the relationship between X and Y [65]. The regression model is written as follows:
$$y_i = \mu(t_i) + \varepsilon_i, \quad i = 1, \ldots, n \qquad (2)$$
with μ as an unknown function (the regression function or curve), t_i as the i-th observed value of the independent variable, and ε_i as an observation error. Regression analysis aims to obtain a reasonable approximation of the unknown function μ, and one of the ways to do so is through nonparametric techniques, such as kernel or spline regression [8], [9].
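As an illustration of the Pearson product-moment correlation described above, the following minimal sketch computes the coefficient directly from its definition. The paper's analysis was carried out in R; Python is used here only for illustration, and the function name `pearson_r` and the demo values are assumptions, not the study data.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only (not the study data)
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 8.1, 9.8]
r_demo = pearson_r(x_demo, y_demo)
```

A value of `r_demo` close to 1 indicates a strong positive linear association between the two hypothetical variables.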

Gaussian Kernel
Kernel regression is a nonparametric regression that uses a kernel weighting function (or kernel function) to estimate the conditional expectation of the dependent variable given the independent variable, and the estimate at any point t is expressed as a weighted sum of the responses [8], [67]. Based on [8], [9], the kernel function in one dimension is constructed using the following formula:
$$K_\lambda(u) = \frac{1}{\lambda} K\!\left(\frac{u}{\lambda}\right) \qquad (3)$$
with K defining the kernel function and λ defining its size (referred to as the bandwidth or smoothing parameter) [8], [9]. There are several kernel functions, including the Gaussian kernel. According to [9], [68], the Gaussian kernel is written as follows:
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right), \quad -\infty < u < \infty \qquad (4)$$
where, in the estimator, the kernel in Equation (4) is evaluated at u = (t - t_i)/λ, with t_i as the i-th observed value of the independent variable. Meanwhile, the estimator obtained from Equation (3) is written as follows:
$$\hat{\mu}_\lambda(t) = \sum_{i=1}^{n} W_{\lambda i}(t)\, y_i \qquad (5)$$
with the kernel weights proposed by Nadaraya and Watson in 1964, written as follows:
$$W_{\lambda i}(t) = \frac{K_\lambda(t - t_i)}{\sum_{j=1}^{n} K_\lambda(t - t_j)} \qquad (6)$$
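The Nadaraya-Watson estimator with a Gaussian kernel described above can be sketched as follows. This is a minimal illustration in standard-library Python (the paper's analysis used R), assuming the usual weighted-average form of the estimator; the function names and the demo data are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nw_estimate(t, t_obs, y_obs, bandwidth):
    """Nadaraya-Watson estimate of the regression curve at point t.

    The 1/bandwidth scaling of the kernel appears in both the numerator
    and the denominator, so it cancels and is omitted here.
    """
    weights = [gaussian_kernel((t - ti) / bandwidth) for ti in t_obs]
    total = sum(weights)
    if total == 0.0:
        raise ZeroDivisionError("all kernel weights underflowed; increase the bandwidth")
    return sum(w * yi for w, yi in zip(weights, y_obs)) / total

# Hypothetical data for illustration only (not the study data)
t_demo = [0.0, 1.0, 2.0, 3.0, 4.0]
y_demo = [1.0, 3.0, 2.0, 5.0, 4.0]
fit_demo = [nw_estimate(t, t_demo, y_demo, bandwidth=1.0) for t in t_demo]
```

Because each estimate is a weighted average of the observed responses, every fitted value lies between the smallest and largest observed response; a very small bandwidth reproduces the data almost exactly, while a very large one approaches the overall mean.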

Quadratic Spline
Spline regression is a nonparametric regression that employs continuous segmented (truncated) polynomials to estimate the behavior of data that changes across intervals [67]. Generally, a truncated polynomial spline function μ of order m = g + 1 with r knots ξ_1, ..., ξ_r (joint points indicating changes in the data's behavior) is constructed using the following formula:
$$\mu(t) = \sum_{j=0}^{g} \beta_j t^j + \sum_{k=1}^{r} \beta_{k+g} (t - \xi_k)_+^g \qquad (7)$$
where the truncated function is written as follows:
$$(t - \xi_k)_+^g = \begin{cases} (t - \xi_k)^g, & t \ge \xi_k \\ 0, & t < \xi_k \end{cases} \qquad (8)$$
with β_j as a polynomial coefficient, β_{k+g} as a truncated coefficient, and ξ_k as the k-th knot. The quadratic spline regression has three as its order (g = 2) [67]. From Equation (7), the quadratic spline can be written as follows:
$$\mu(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \sum_{k=1}^{r} \beta_{k+2} (t - \xi_k)_+^2 \qquad (9)$$
and Equation (9) can be substituted into Equation (2); hence, it can be written as follows:
$$y_i = \beta_0 + \beta_1 t_i + \beta_2 t_i^2 + \sum_{k=1}^{r} \beta_{k+2} (t_i - \xi_k)_+^2 + \varepsilon_i \qquad (10)$$
with i = 1, ..., n [37], [72]. The coefficients β are estimated using Ordinary Least Squares (OLS) [67].
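The truncated quadratic spline and its OLS fit described above can be sketched as follows. This is a minimal standard-library Python illustration (the paper's analysis used R), assuming the truncated power basis 1, t, t^2, (t - ξ_k)_+^2 and solving the normal equations by Gaussian elimination; all function names, the knot location, and the demo data are hypothetical.

```python
def truncated_quadratic(t, knot):
    """Truncated power function (t - knot)_+^2."""
    d = t - knot
    return d * d if d > 0 else 0.0

def design_row(t, knots):
    """Basis values 1, t, t^2 and one truncated term per knot."""
    return [1.0, t, t * t] + [truncated_quadratic(t, k) for k in knots]

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_quadratic_spline(t_obs, y_obs, knots):
    """OLS coefficients from the normal equations (X'X) beta = X'y."""
    X = [design_row(t, knots) for t in t_obs]
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(X, y_obs)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def predict(t, beta, knots):
    """Evaluate the fitted spline at t."""
    return sum(b * v for b, v in zip(beta, design_row(t, knots)))

# Hypothetical noise-free data generated from a known one-knot spline
knots_demo = [2.0]
beta_true = [1.0, 0.5, -0.2, 0.7]
t_demo = [0.5 * i for i in range(9)]
y_demo = [predict(t, beta_true, knots_demo) for t in t_demo]
beta_demo = fit_quadratic_spline(t_demo, y_demo, knots_demo)
```

With one knot the model has four coefficients, and with three knots it has six, which matches the coefficient counts reported for the fitted models. Solving the normal equations directly is adequate for a small sketch; for poorly conditioned data, a QR-based least-squares routine would be a more robust choice.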

Generalized Cross Validation
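The regression curve can be obtained by selecting the optimum bandwidth for the kernel and the optimum knots for the spline (number and location of knot points) [14], [36], [69]. Generalized Cross Validation (GCV) can be used to select both, where the model with the lowest GCV score is chosen for the subsequent analysis [69], [73]. GCV is written as follows:
$$\mathrm{GCV}(\lambda) = \frac{\mathrm{MSE}(\lambda)}{\left( n^{-1}\, \mathrm{tr}\left[ I - H(\lambda) \right] \right)^2} \qquad (11)$$
where
$$\mathrm{MSE}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
Here λ denotes the bandwidth (kernel) or the knot configuration (spline), ŷ_i is the fitted value for observation i, I is the n × n identity matrix, and H(λ) is the smoother (hat) matrix that maps the observed responses to the fitted values.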
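The GCV-based selection described above can be sketched for the Gaussian kernel as follows. This is a minimal standard-library Python illustration (the paper's analysis used R), assuming the common form GCV = MSE / (1 - tr(H)/n)^2 with H the Nadaraya-Watson smoother matrix; the function name `gcv_score`, the candidate grid, and the demo data are hypothetical.

```python
import math

def gcv_score(t_obs, y_obs, bandwidth):
    """GCV score of a Gaussian-kernel Nadaraya-Watson smoother.

    Row i of the smoother matrix H holds the normalized kernel weights
    used to compute the fitted value at t_obs[i]. The kernel's constant
    factor cancels in the normalization and is therefore omitted.
    """
    n = len(t_obs)
    sse = 0.0
    trace_h = 0.0
    for i in range(n):
        w = [math.exp(-0.5 * ((t_obs[i] - tj) / bandwidth) ** 2) for tj in t_obs]
        total = sum(w)  # at least 1.0 because w[i] == exp(0) == 1
        fitted = sum(wj * yj for wj, yj in zip(w, y_obs)) / total
        sse += (y_obs[i] - fitted) ** 2
        trace_h += w[i] / total  # diagonal element H[i][i]
    denom = (1.0 - trace_h / n) ** 2
    if denom == 0.0:
        # The smoother interpolates the data; treat it as the worst candidate
        return float("inf")
    return (sse / n) / denom

# Hypothetical data and candidate bandwidths for illustration only
t_demo = [float(i) for i in range(10)]
y_demo = [1.0, 1.8, 3.2, 3.9, 5.1, 5.8, 7.2, 8.1, 8.8, 10.1]
grid_demo = [0.25, 0.5, 1.0, 2.0, 4.0]
best_bandwidth = min(grid_demo, key=lambda h: gcv_score(t_demo, y_demo, h))
```

The same criterion applies to the spline, with the hat matrix of the OLS fit in place of the kernel smoother matrix and the knot configurations in place of the bandwidth grid.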

Analysis Steps
The analysis steps for this research are: (i) exploring the data; (ii) fitting the Gaussian kernel model; (iii) fitting the quadratic spline model; and (iv) comparing both models. All steps are performed using R software.

Data Exploration
Data exploration in this research consists of performing a correlation analysis and making scatter plots. Table 1 presents the result of the correlation analysis using Equation (1). As illustrated in Table 1, all four infectious diseases demonstrate a strong correlation between the X and Y variables, indicating that the dependent and independent variables are indeed correlated and can be used to perform a regression analysis. The scatter plots for the four infectious diseases are shown in Figure 1.

Gaussian Kernel
An optimal bandwidth (λ) must be selected for the Gaussian kernel model. With a small bandwidth the curve is undersmoothed (rough), whereas with a large bandwidth the curve is oversmoothed [69]. For this reason, the optimal bandwidth is selected using the GCV score in Equation (11); the optimal bandwidth is the one with the lowest GCV score. Table 2 presents the GCV scores for the four infectious diseases, sorted from the lowest GCV (only three iterations are shown as an example). As illustrated in Table 2, the optimum bandwidths for tuberculosis, diarrhoeal disease, pneumonia, and COVID-19 are 73.5, 10.5, 28, and 1, respectively; these bandwidths are the smoothing parameters of the kernel. Therefore, taking tuberculosis as an example, a model for the Gaussian kernel can be written as follows:

Quadratic Spline
A spline model can be obtained by selecting the optimal knot (ξ) points based on the number and location of knots. If there are too many knots in a model, the curve tends to be overfitted and rough, whereas if there are too few knots, the curve becomes oversmoothed and cannot describe the data distribution [76]. Accordingly, the optimal knots are selected using the GCV score in Equation (11); the optimal knot configuration is the one with the lowest GCV score. Table 3, Table 4, Table 5, and Table 6 present the GCV scores for the four infectious diseases, sorted from the smallest GCV (only three iterations are shown as an example). In order to determine the best quadratic spline model, the R² value is then calculated for each number of knots. Table 7 presents the R² results for the four infectious diseases. As illustrated in Table 7, the best quadratic spline model for tuberculosis has three knots and six regression coefficients, with the highest R² of 92.17%.

Comparison of Kernel and Spline Regression
According to [79], R² is in many cases more informative than SMAPE, MAPE, MAE, MSE, and RMSE. Thus, the R² value is used to identify the best nonparametric regression. Table 8 presents the R² results of the Gaussian kernel, quadratic spline, and linear regression for the four infectious diseases. As illustrated in Table 8, the Gaussian kernel is the most suitable regression technique for modeling the four infectious diseases in Indonesia in 2021, as it has a higher R² value for each disease than the quadratic spline and linear regression.
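As an illustration of the comparison criterion, the following minimal sketch computes R² from observed and fitted values. It assumes the usual definition R² = 1 - SS_res / SS_tot, which the paper does not state explicitly; the function name `r_squared` and the demo values are hypothetical, and Python is used only for illustration since the paper's analysis was done in R.

```python
def r_squared(y_obs, y_fit):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    if len(y_obs) != len(y_fit):
        raise ValueError("observed and fitted values must have the same length")
    mean_y = sum(y_obs) / len(y_obs)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_obs, y_fit))
    ss_tot = sum((y - mean_y) ** 2 for y in y_obs)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and fitted values for illustration only
y_obs_demo = [2.0, 4.0, 6.0, 8.0]
y_fit_demo = [2.5, 3.5, 6.5, 7.5]
r2_demo = r_squared(y_obs_demo, y_fit_demo)
r2_percent = 100.0 * r2_demo  # the paper reports R^2 as a percentage
```

A fit that reproduces every observation gives an R² of 1 (100%), while predicting the mean of the observations for every point gives 0.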

CONCLUSIONS
Based on the analysis findings, the R² values of the Gaussian kernel for tuberculosis, diarrhoeal disease, pneumonia, and COVID-19 are 99.85%, 100%, 99.91%, and 99.99%, respectively. Therefore, the most suitable regression technique for modeling the four infectious diseases in Indonesia in 2021 is the Gaussian kernel, since it has higher R² values than the two other regression techniques. This research was limited to one independent variable for each infectious disease. Therefore, further research can add independent variables when comparing the Gaussian kernel and the quadratic spline.
GCV is simple and efficient in its calculations, asymptotically optimal, invariant to transformation, and does not require information on σ², in contrast to other techniques such as Cross Validation, Unbiased Risk, and Generalized Maximum Likelihood [14], [75].

Figure 1. Scatter plot of Y and X for four infectious diseases: (a) Tuberculosis, (b) Diarrhoeal, (c) Pneumonia, and (d) COVID-19

As shown in Figure 1, none of the scatter plots follows a specific pattern. Therefore, nonparametric regression is used to analyze the data.

Table 7. R² Values of Quadratic Spline Model for Four Infectious Diseases
Meanwhile, for diarrhoeal disease, pneumonia, and COVID-19, models with one knot have R² values similar to those with two or three knots. Furthermore, according to [65], [72], [75], [77], [78], models with low oscillation and parsimony (a model that consists of few parameters and is capable of producing a high R² value) are recommended for modeling. Therefore, for parsimony reasons, the best quadratic spline model for diarrhoeal disease, pneumonia, and COVID-19 consists of one knot and has four regression coefficients, with R² of 78.14%, 58.71%, and 55.43%, respectively. The quadratic spline model for four infectious diseases can be written in Equation