ASSESSMENT AND COMPARISON OF MACHINE LEARNING ALGORITHM CAPABILITY IN SPATIAL MODELING OF DENGUE FEVER VULNERABILITY BASED ON LANDSAT 8 OLI/TIRS IMAGERY

The spread of dengue fever in Indonesia has become a major health problem. Spatial modeling of the distribution of dengue fever vulnerability is an important step in supporting the planning and mitigation of dengue fever in Indonesia. This study aims to assess and compare the capability of two machine learning algorithms to create a spatial model of dengue fever vulnerability. The research was conducted in Baubau City, Southeast Sulawesi Province, using 129 cases that occurred from 2015 to February 2016. The model was created in R using two machine learning algorithms, support vector machine (SVM) and random forest (RF). The six modeling variables were land use/cover, BLFEI, NDVI, LST, rainfall, and humidity, extracted from Landsat 8 OLI/TIRS imagery and from BMKG (Meteorological, Climatological, and Geophysical Agency of Indonesia) and BWS climate data. The model's capability was assessed using the area under the receiver operating characteristic curve (AUC-ROC). The results show that both algorithms provide excellent model accuracy, with AUC values of 1 for SVM and 0.997 for RF, making SVM the best algorithm for modeling dengue fever in Baubau City.

Available at http://jurnal.unimed.ac.id/2012/index.php/geo
e-ISSN: 2549–7057 | p-ISSN: 2085–8167


INTRODUCTION
The world currently faces various serious problems, and climate change ranks first among the top ten (World Economic Forum, 2017). Climate change also plays a role in the transmission and expansion of infectious diseases such as dengue hemorrhagic fever (DHF), which is spread by mosquito species such as Aedes aegypti and Aedes albopictus (Lee et al., 2018).
DHF has become a serious problem because it puts more than half of the world's population at risk and is a cause of hospitalization and death (Gubler, 2012). WHO (2019) stated that DHF has become the main arboviral disease in the world, with the number of sufferers increasing threefold in the last five decades. Another worrying fact is that the worst dengue outbreaks occurred in Southeast Asia and the Western Pacific, where cases reached 2.2 million between 2008 and 2010 (WHO, 2019).
Indonesia is one of the endemic areas, with the third largest incidence rate in Southeast Asia and the Western Pacific (Ministry of Health, 2017). According to the Ministry of Health of the Republic of Indonesia (2019), as of February 2016 dengue fever had caused 169 deaths out of 16,692 new cases.
Suboptimal implementation of control efforts (WHO, 2007) and the absence of vaccines or appropriate control measures (Wilder-Smith et al., 2019) have allowed the spread and outbreaks of DHF to keep increasing, especially in tropical areas including Indonesia (WHO, 2019). Prevention is therefore considered the main step in reducing incidence.
One form of prevention in the spatial field is spatial modeling of regional vulnerability, based on physical environmental conditions (Ding et al., 2018), socio-economic conditions (Yue et al., 2018), or a combination of both (Adzan & Danoedoro, 2012; Widayani, 2010).
Spatial modeling is an activity to simplify complex spatial phenomena in the real world with the aim of understanding how spatial phenomena are formed. Furthermore, the best accuracy of DHF modeling can vary from one region to another, depending on several factors, one of which is the complexity of the modeling method (Pineda-Cortel et al., 2019).
Remote sensing technology is very helpful in spatial modeling because it can overcome the limited availability of physical environmental data, which are generally difficult to obtain. Landsat 8 OLI/TIRS imagery is one of the remote sensing products that is widely used in fields ranging from natural resource management to public health (NASA, 2019).
Along with the growth in spatial data collection tools and sensors, the term geospatial big data has emerged, which is the forerunner of the field of geospatial artificial intelligence (geoAI). GeoAI is a combination of spatial science, machine learning, data mining, and advanced computational methods used to reveal more meaningful information from geospatial big data.
In line with this, Dempsey (2012) states that the increasing number and variety of data formats in geospatial big data presents its own challenges in terms of storing, managing, processing, analyzing, visualizing, and checking the quality of data. The presence of machine learning methods in spatial data analysis is therefore both a challenge and a new opportunity in spatial modeling of dengue fever. Because no single modeling method is suitable for all regions, emerging methods in spatial modeling need to be examined in more depth. Various spatial modeling methods have now been applied, one of which is machine learning. Machine learning algorithms, especially support vector machine (SVM) and random forest (RF), which represent a novel analytical approach in spatial modeling and particularly in geospatial big data, therefore need to be assessed for their capability to model dengue fever.
Previous machine learning studies using the SVM and RF algorithms are generally found in disaster modeling, such as landslides (Motevalli et al., 2018) and forest fires (Ngoc Thach et al., 2018), and have not been widely carried out in disease epidemiology, especially dengue fever (Ding et al., 2018). Apart from being little explored, previous studies on spatial modeling of DHF have only compared the capabilities of algorithms on a global scale. A study of spatial modeling of DHF on a local scale that focuses on the assessment and comparison of model capabilities is therefore needed.
This study aims to assess and compare the capability of two machine learning algorithms to produce a spatial model of regional vulnerability to dengue fever in Baubau City, Southeast Sulawesi Province (Figure 1). The research location was selected based on the incidence rate (IR), which is used to assess how severely an area is affected by DHF. According to the Indonesian Ministry of Health (2018), Southeast Sulawesi was the province with the third highest IR nationally. Moreover, according to data from the Southeast Sulawesi Provincial Health Office (2018), Baubau City was designated as an area with an extraordinary occurrence (KLB) of DHF, with the third largest incidence among the 17 regencies/cities in Southeast Sulawesi. Administratively, the research area covers the districts of Sorawolio, Bungi, Lea-Lea, Wolio, Betoambari, Kokalukuna, Murhum, and Batupuaro (Statistics Indonesia of Baubau City, 2018).

Tools and Materials
The tools and materials used to carry out this research are described in Tables 1 and 2.

Extraction of Modeling Variables
The spatial variables were selected through an in-depth literature search that explored how closely each candidate variable is related to the incidence of dengue fever. The selected variables are land cover/use (Tiong et al., 2015; Vanwambeke et al., 2007), surface temperature (Kamimurai et al., 2002), humidity (Costa et al., 2010), rainfall (Iriani, 2012; Valdez et al., 2018), the built-up land index (BLFEI), and the vegetation density index (NDVI) (Palaniyandi, 2014; Widayani et al., 2018).

Classification of Land Cover and Use
Land cover/use classification followed Danoedoro (2012), taking into account three main factors that affect classification quality: the classification scheme, the sample criteria, and the algorithm. The classification scheme used was SNI-7645-1:2014 on land cover classification, part 1: small and medium scale (BSN, 2014), with a maximum likelihood classification algorithm. This algorithm requires samples to be evenly distributed over the image coverage, each with an area of at least 10-40 pixels.

Vegetation Index (NDVI)
The vegetation density index was calculated using the equation used by Ganie & Nusrath (2016):

$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{RED}}{\mathrm{NIR} + \mathrm{RED}}$$

In which NIR and RED are the near infrared channel (band 5) and the red channel (band 4) of the Landsat 8 OLI/TIRS image, with wavelengths of 0.851-0.879 µm and 0.636-0.673 µm, respectively.
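The NDVI calculation can be illustrated with a minimal sketch. The study itself used R; Python is used here only for illustration, and the reflectance values, the function names `ndvi` and `ndvi_grid`, and the handling of zero denominators are assumptions rather than part of the original workflow.

```python
def ndvi(nir, red):
    """Return NDVI for one pixel, or None where NIR + RED is zero."""
    total = nir + red
    if total == 0:
        return None  # assumption: undefined pixels are flagged, not set to 0
    return (nir - red) / total


def ndvi_grid(nir_band, red_band):
    """Apply ndvi() pixel by pixel to two equally sized 2-D lists."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical band 5 (NIR) and band 4 (RED) reflectance values.
nir_band = [[0.45, 0.50], [0.30, 0.05]]
red_band = [[0.05, 0.06], [0.10, 0.04]]
ndvi_result = ndvi_grid(nir_band, red_band)
```

Dense vegetation reflects strongly in the near infrared and absorbs red light, so vegetated pixels move toward +1 while built-up land and water move toward 0 or below.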

Built-up Land Index (BLFEI)
Meanwhile, the appearance of built-up land, open land, and water bodies was calculated using the built-up land index (BLFEI).

Surface temperature was derived from band 10 of the image using:

$$T = \frac{K_2}{\ln\left(\dfrac{K_1}{L_\lambda} + 1\right)}$$

In which $K_1$ and $K_2$ are the band 10 spectral radiance calibration constant and absolute temperature calibration constant, and $L_\lambda$ is the TOA spectral radiance or the corrected spectral radiance value in units of Watts/(m2*srad*µm). The corrected spectral radiance was obtained by calibrating the band 10 radiance value using the object emissivity value ($\varepsilon$), which was calculated using the equation of Coll et al. (2010) in Fawzi (2014), with inputs obtained through http://atmcorr.gsfc.nasa.gov/.
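The temperature step relies on the band 10 calibration constants and spectral radiance described above. A minimal sketch of the standard radiance-to-temperature conversion follows, written in Python for illustration only. The constant values are assumed to be those distributed in the Landsat 8 scene metadata for band 10, the function name is hypothetical, and the emissivity and atmospheric correction used in the study is not reproduced here.

```python
import math

# Assumed band 10 thermal constants, as distributed in Landsat 8 metadata.
K1_BAND10 = 774.8853   # W/(m2*srad*um)
K2_BAND10 = 1321.0789  # Kelvin


def brightness_temperature_celsius(radiance, k1=K1_BAND10, k2=K2_BAND10):
    """Convert spectral radiance to temperature: T = K2 / ln(K1 / L + 1)."""
    if radiance <= 0:
        raise ValueError("radiance must be positive")
    kelvin = k2 / math.log(k1 / radiance + 1.0)
    return kelvin - 273.15  # Kelvin to degrees Celsius


# Hypothetical radiance value for a single pixel.
example_celsius = brightness_temperature_celsius(10.0)
```

Passing a corrected radiance instead of the raw TOA radiance gives a surface temperature rather than an at-sensor brightness temperature; the conversion formula itself is unchanged.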

Interpolation of Rainfall and Humidity
Rainfall and humidity data were obtained from field measurements at the climate observation stations of the Meteorological, Climatological, and Geophysical Agency (BMKG) and BWS. Data from these two agencies were daily climate records, which were tabulated into annual averages and then transformed into spatial data by interpolation using ArcGIS 10.6 software. The interpolation method chosen was inverse distance weighting (IDW), following Kurniadi et al. (2018).
$$z_0 = \frac{\sum_{i=1}^{s} z_i \, d_i^{-k}}{\sum_{i=1}^{s} d_i^{-k}}$$

In which $z_0$ is the estimated value at point 0; $z_i$ is the value of z at control point i; $d_i$ is the distance between point i and point 0; $k$ is the influence of neighboring points; and $s$ is the number of station points used.
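The interpolation can be sketched in a few lines. This is an illustration in Python, not the ArcGIS implementation used in the study: the station coordinates are hypothetical, the power `k = 2` is an assumed default, and only the four rainfall values are taken from the results reported in this paper.

```python
import math


def idw(x0, y0, stations, k=2.0):
    """Estimate z at (x0, y0) from (x, y, z) stations by inverse distance weighting."""
    numerator = 0.0
    denominator = 0.0
    for x, y, z in stations:
        distance = math.hypot(x - x0, y - y0)
        if distance == 0.0:
            return z  # the point coincides with a station
        weight = 1.0 / distance ** k
        numerator += weight * z
        denominator += weight
    return numerator / denominator


# Hypothetical coordinates; rainfall values as reported for the four stations.
stations = [
    (0.0, 0.0, 140.66),
    (10.0, 0.0, 151.84),
    (0.0, 10.0, 111.39),
    (10.0, 10.0, 122.02),
]
center_estimate = idw(5.0, 5.0, stations)
```

Because every weight is positive, an IDW estimate always lies between the smallest and largest station values, which is consistent with the interpolated rainfall range matching the station extremes.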

Data Analysis Techniques
Vulnerability classification based on machine learning algorithms was carried out using R software. R is a statistical and graphical computing program that has become known as one of the most reliable tools for data analysis and data science (R Foundation, 2020). Its reliability stems from its open-source nature, which allows the source code to be checked, modified, extended, and shared by anyone, so that the range of available packages continues to grow in line with data science needs.
A package is a collection of scripts, generally in the form of functions or data, that can be used for specific needs (Hidayatuloh, 2020). The packages involved in this research include raster, rgdal, rgeos, ggplot2, RSNNS, e1071, and kernlab (Hijmans & Elith, 2019).
The dengue fever modeling in this study is aimed at assessing and comparing the capabilities of two machine learning algorithms that have been widely used and perform relatively strongly in various spatial models, namely the support vector machine (SVM) and random forest (RF) algorithms.

SVM Algorithm
SVM was first introduced by Vapnik (1995) in (Motevalli et al., 2018) as a statistical learning theory. Since then, SVM has been widely used in environmental and natural disaster studies (Motevalli et al., 2018; Nguyen et al., 2018) and even in health studies, especially those related to the environment (Ding et al., 2018). The advantage of the SVM method is its good ability to classify even with limited training data (Mountrakis et al., 2011).
Consider the training dataset $T = \{(x_i, y_i)\}_{i=1}^{n}$, in which $x_i \in \mathbb{R}^{d}$ is the input variable (land cover/use, NDVI, BLFEI, LST, annual rainfall, and humidity), $n$ is the number of samples in the training data, $d$ is the dimension of the training data, and $y_i \in \{-1, +1\}$ is the class label. The best decision function can then be formed using equation (5).

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b\right) \quad (5)$$

In which $b$ is the offset value, $\alpha_i$ is a Lagrange multiplier, and $K(x_i, x)$ is the kernel function. The ability of the SVM model depends on the kernel function, which in this study was the radial basis function (RBF).
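The decision function can be made concrete with a small sketch. It is written in Python for illustration (the study used R) and assumes the common RBF parametrization exp(-γ‖x_i − x‖²); the support vectors, multipliers, and offset below are hypothetical values chosen by hand, not fitted parameters from the study.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel, assuming the form exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, labels, alphas, offset, gamma):
    """Return +1 or -1 from sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    total = sum(
        alpha * label * rbf_kernel(vector, x, gamma)
        for vector, label, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if total + offset >= 0 else -1


# Hypothetical two-variable example with hand-picked parameters.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
labels = [-1, 1]
alphas = [1.0, 1.0]
prediction_a = svm_decision([1.0, 1.0], support_vectors, labels, alphas, 0.0, 0.06)
prediction_b = svm_decision([0.0, 0.0], support_vectors, labels, alphas, 0.0, 0.06)
```

A larger γ makes the kernel decay faster with distance, so each support vector influences a smaller neighborhood; this is why the kernel width has to be tuned together with the regularization parameter.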
Since RBF was chosen as the kernel function, the ability of SVM is controlled by two main parameters, namely the regularization parameter (C) and the kernel width (γ), which must be chosen correctly for the SVM algorithm to reach its maximum capability. The values were selected by trial and error, based on ranges of values used by previous researchers who had obtained them with the grid search method.
For the trial and error, the range proposed by Ding et al. (2018) consisted of six regularization values (C), namely 2, 4, 8, 16, 24, and 32, while the kernel width (γ) values tested ranged from 0.01 to 0.08. Combining the two sets gave 48 combinations of values to be tested. The values proposed by Nguyen et al. (2018) were 44 for regularization and 0.075 for kernel width.
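The 48 combinations can be enumerated explicitly. The sketch below is in Python for illustration; it assumes the eight kernel width values are spaced 0.01 apart (which is what six C values and 48 combinations imply), and the scoring function is a toy stand-in, not the AUC evaluation performed in the study.

```python
from itertools import product

C_VALUES = [2, 4, 8, 16, 24, 32]
# Assumption: gamma runs from 0.01 to 0.08 in steps of 0.01 (8 values).
GAMMA_VALUES = [round(0.01 * step, 2) for step in range(1, 9)]

parameter_grid = list(product(C_VALUES, GAMMA_VALUES))


def best_combination(grid, score):
    """Return the (C, gamma) pair with the highest score."""
    return max(grid, key=lambda pair: score(*pair))


def toy_score(c, gamma):
    """Placeholder score; a real run would train an SVM and return its AUC."""
    return -abs(c - 16) - abs(gamma - 0.06)


selected = best_combination(parameter_grid, toy_score)
```

In practice `toy_score` would be replaced by a function that fits the model with the given C and γ and returns the AUC on held-out cases, so that the same loop reproduces the trial-and-error search.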

RF Algorithm
Random forest (RF) is one of the most popular techniques in machine learning and is widely used for classification and regression (Breiman, 2001). Its advantages are very high classification accuracy and a fast computational process, which have brought RF increasing attention in various fields, including remote sensing (Yue et al., 2018).
Conceptually, RF is a learning approach that makes predictions based on a set of decision tree classifiers. Several subsets of the data are generated randomly by resampling from the training data set, and each subset is then used to build a decision tree using CART (classification and regression tree) (Breiman, 2001). The capability of RF is therefore influenced by two parameters, namely the number of decision trees (N-tree) and the number of modeling variables (N-fact), which must be selected correctly. The values of the two parameters were determined by trial and error from the values proposed by previous researchers.
For the number of variables (N-fact), this study used the same six input variables as the SVM model: land use/cover, NDVI, BLFEI, LST, rainfall, and humidity. For the number of decision trees (N-tree), several values have been proposed by previous researchers. Lawrence et al. (2006) and Stevens et al. (2015) proposed an N-tree of 500; Motevalli et al. (2018) found that an N-tree of 1,000 and an mtry of 3 gave maximum results for landslide modeling; and Ding et al. (2018) stated that 500 was the best N-tree value for modeling the distribution of Ae. aegypti and Ae. albopictus mosquitoes. The mtry parameter controls the number of variables that are randomly sampled as candidates at each split when building a decision tree (Liaw & Wiener, 2002). The range of mtry values tested in this study was 1 to 8.
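The two sources of randomness in RF, resampling of the training data and random selection of candidate variables at each split, can be sketched as follows. This is an illustration in Python, not the R implementation used in the study; the function names are hypothetical, sampling with replacement follows Breiman's bootstrap, and capping mtry at the number of available variables is an assumption of this sketch.

```python
import random

VARIABLES = ["land_cover", "NDVI", "BLFEI", "LST", "rainfall", "humidity"]


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_variables(variables, mtry, rng):
    """Pick the variables considered at one split, without replacement."""
    mtry = min(mtry, len(variables))  # assumption: cap mtry at the variable count
    return rng.sample(variables, mtry)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
tree_rows = bootstrap_indices(129, rng)  # 129 cases, as in this study
split_candidates = candidate_variables(VARIABLES, 3, rng)
```

With six variables, an mtry of 6 or more means every variable is a candidate at every split, so the trees differ only through their bootstrap samples.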

Assessment and Comparison of Model Capability
The capability of the dengue fever vulnerability models generated by the two algorithms was assessed using the ROC-AUC (area under the receiver operating characteristic curve). The qualitative relationship between ROC-AUC and model capability is 0.9-1 (excellent), 0.8-0.9 (very good), 0.7-0.8 (good), 0.6-0.7 (moderate), and 0.5-0.6 (poor) (Swets, 1988). Figure 2 shows step by step how this research was carried out. Broadly speaking, the research was grouped into three main phases: (1) satellite image processing and spatial data analysis to produce the research variables, (2) training of the algorithms based on historical data and the research variables, and (3) testing of the resulting models against historical DHF data.
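The AUC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one. The sketch below computes it this way in Python for illustration; the scores and labels are hypothetical, the function names are not from the study, and the top class is labeled "excellent" as in the results of this paper.

```python
def auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def qualitative_class(auc_value):
    """Map an AUC value to the qualitative scale quoted in the text."""
    if auc_value >= 0.9:
        return "excellent"
    if auc_value >= 0.8:
        return "very good"
    if auc_value >= 0.7:
        return "good"
    if auc_value >= 0.6:
        return "moderate"
    if auc_value >= 0.5:
        return "poor"
    return "below scale"  # assumption: values under 0.5 are outside the scale


# Hypothetical predicted vulnerabilities and observed case labels.
example_auc = auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
example_class = qualitative_class(example_auc)
```

An AUC of 1 therefore means every case location was scored above every non-case location in the test data.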

RESULTS AND DISCUSSION
The classification results show 13 types of land cover/use at the study site (Figure 4a). Based on the classification scheme used, the 13 types of land cover/use can be grouped into built -

The NDVI value of the research location ranges from 0.66 to 0.86, with values of 0.75-0.81 having the highest number of pixels (Figures 3a and 4b). Values of 0.7-0.8 have the most pixels because the research location is dominated by forest land cover, natural/semi-natural vegetation, and permanently cultivated vegetated land, which together cover 87.62% of the total area (3).
Based on the processing results (Figure 4b), the BLFEI pixel value of the research location ranged from -1.05 to 0.84, as shown in Figure 5b. The histogram also shows that the highest pixel frequencies occurred between -0.9 and -0.6. Referring to the histogram generated by Bouhennache et al. (2018), an initial conclusion can be drawn that the land cover of the study site is dominated by vegetation. The number of pixels in the built-up histogram (Figure 4c) began to increase in the value range of -0.6 to -0.2, with a pixel frequency of about 200 pixels.

The surface temperature value was obtained by correcting the band 10 radiance value using atmospheric parameters, including the downwelling and upwelling atmospheric radiation and the object's emissivity (Fawzi, 2014). The emissivity of objects at the study site ranged from 0.96 to 0.97 (Figure 5d).
The calculated surface temperature at the research location ranged from 14.64 °C to 40.43 °C, with the most common values found between 24 °C and 32 °C, as seen in Figure 4d.
The average annual rainfall at the Betoambari, Kaisabu, Ngkari-ngkari, and Post Rain Pada stations was 140.66, 151.84, 111.39, and 122.02 mm/year, respectively. These values were then interpolated using the IDW method, which shows that rainfall in the study area ranged from 111.39 to 151.84 mm/year.

Several scripts were run in R to classify vulnerability using the SVM and RF algorithms. These scripts were written in advance during the preparation stage. For SVM, a total of 48 combinations of regularization (C) and kernel width (γ) values were tested by trial and error, and the results are presented in Table 3. Based on Table 3f, when the kernel width was set at 0.06 and the regularization parameter at 16, the best AUC value of 1 was obtained with a correlation of 0.96. Kernel width values of 0.05 and 0.08 combined with the regularization parameter values generally also give a maximum AUC value of 1, but with a correlation below 0.96, as shown in Tables 3e and 3h. Meanwhile, testing the values proposed by (Ding et al., 2018) gives an AUC value of 0.99 with a correlation of 0.94.
As with SVM, the trial-and-error method was also used for the RF algorithm. There were 16 combinations, obtained from N-tree values of 500 and 1,000 and mtry values of 1 to 8. Based on Table 3a, the highest AUC value was obtained when the N-tree and mtry parameters were set at 500 and 1, respectively. The highest AUC value obtained was 0.99 with a correlation of 0.94. The smallest AUC value obtained was 0.98, when the N-tree and mtry parameters were set to 1,000 and 8, respectively. This smallest AUC value was also found in several combinations of N-tree values with mtry values above 3, namely 5, 7, and 8. In terms of capability comparison, the model generated by the SVM algorithm was considered better than the model generated by the RF algorithm, with AUC values of 1 (Figure 6a) and 0.99 (Figure 6b), respectively. Although the AUC of RF was lower than that of SVM, it still falls in the "excellent" class; values of 0.8-0.9 are very good, 0.7-0.8 good, 0.6-0.7 moderate, and 0.5-0.6 poor.

CONCLUSION
This study provides two main conclusions, covering the assessment and the comparison of the ability of the SVM and RF algorithms to model DHF. In terms of assessment, the SVM algorithm provides maximum capability when the kernel width (γ) is set at 0.06 and the regularization value (C) at 16, while the RF algorithm performs best when the number of decision trees (N-tree) is set at 500 and the mtry parameter at 1, with 6 variables (N-fact). In terms of comparison, the SVM algorithm is considered to provide better capability than the RF algorithm, with an AUC value of 1 and a correlation of 0.96. Although the RF algorithm does not outperform the SVM algorithm, its AUC value of 0.99 with a correlation of 0.94 still falls in the "excellent" category of the qualitative relationship between the AUC value and model capability.