GIS-based non-grain cultivated land susceptibility prediction using data mining methods

Research flow

The NCL susceptibility prediction study includes four main parts: (1) screening and analysis of the factors influencing NCL; (2) construction of the NCL susceptibility prediction model; (3) NCL susceptibility prediction; and (4) evaluation of the prediction results. The research flow is shown in Fig. 2.

The NCL locations were obtained from Google Earth interpretation, field surveys and data released by the local government, yielding a total of 184 NCL locations. The non-NCL locations were determined in GIS software, where 184 locations were randomly selected. To reduce modelling bias, the non-NCL points were generated at a distance of 200 m from the NCL points. The data were then divided into training and testing samples at a ratio of 7:3, forming the training dataset and the testing dataset (Fig. 3).
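As a minimal illustration of this sampling step (the paper itself does not publish code), the split below uses scikit-learn; the file name "samples.csv" and its columns are hypothetical placeholders for the 184 NCL and 184 non-NCL points with their factor values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

samples = pd.read_csv("samples.csv")      # hypothetical table: one row per NCL / non-NCL point
X = samples.drop(columns=["label"])       # NCLSCF values at each point
y = samples["label"]                      # 1 = NCL, 0 = non-NCL

# 70 % training / 30 % testing, stratified to keep the 1:1 class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```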

Currently, there is no unified consensus on the factors influencing NCL. Therefore, based on historical research materials and on-site field investigations24,25,26,27,28, 16 appropriate non-grain cultivated land susceptibility conditioning factors (NCLSCFs) were chosen for modelling NCL susceptibility, covering topographical, geological, hydrological, climatological and environmental conditions. In addition, a systematic literature review of NCL modelling was performed to help identify the most suitable NCLSCFs for this study. The NCLSCF maps are shown in Fig. 4.

Typical NCL factors map: (a) Slope; (b) Aspect; (c) Plan curvature; (d) Profile curvature; (e) TWI; (f) SPI; (g) Rainfall; (h) Drainage density; (i) Distance from river; (j) Lithology; (k) Fault density; (l) Distance from fault; (m) Landuse; (n) Soil; (o) Distance from road.

(1) Topographical factors

The occurrence of NCL and its frequency depend strongly on the topographical factors of an area. Several topographical factors, such as slope, elevation and curvature, are triggering parameters for the development of NCL activities29. Here, six topographical factors were chosen: altitude, slope, aspect, plan curvature, profile curvature and the topographic wetness index (TWI). All of these factors play a considerable part in NCL development in the study area. They were prepared from Shuttle Radar Topography Mission (SRTM) digital elevation model (DEM) data with 30 m resolution in the ArcGIS software. The resulting altitude ranges from 895 to 3289 m (Fig. 3), slope from 0 to 261.61%, aspect has nine classes (flat, north, northeast, east, southeast, south, southwest, west, northwest), plan curvature ranges from −12.59 to 13.40, profile curvature from −13.05 to 12.68 and TWI from 4.96 to 24.75. The following equation was applied to compute TWI:

$$TWI = \ln\frac{\alpha}{\tan\beta + C}$$

(1)

where α specifies the flow accumulation, β specifies the slope and C is a constant (0.01).
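A minimal sketch of Eq. (1), assuming the flow-accumulation and slope rasters have already been extracted from the SRTM DEM as NumPy arrays; the function name and the log(0) guard are illustrative additions.

```python
import numpy as np

def twi(flow_accumulation, slope_degrees, c=0.01):
    """Topographic wetness index: TWI = ln(alpha / (tan(beta) + C))."""
    alpha = np.maximum(flow_accumulation, 1.0)   # guard against log(0) on cells with no inflow
    beta = np.radians(slope_degrees)
    return np.log(alpha / (np.tan(beta) + c))
```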

(2) Hydrological factors

Sub-surface hydrology is treated as the activating mechanism for the occurrence of NCL, as water plays a significant part in soil moisture content. Therefore, four hydrological factors, namely drainage density, distance from river, stream power index (SPI) and annual rainfall, were chosen for modelling NCL susceptibility30. SRTM DEM data with 30 m spatial resolution were used to map the first three hydrological variables. The drainage density and distance-from-river maps were prepared using the line density and Euclidean distance tools, respectively, in the GIS platform. The following formula was applied to compute SPI:

$$SPI = A_s \times \tan\beta$$

(2)

where As specifies the specific catchment area in square metres and β specifies the slope angle in degrees. The precipitation map of the area was derived from the records of 19 climatological stations around the province over a statistical period of 25 years, using the kriging interpolation method in the GIS platform. The output drainage density ranges from 0 to 1.68 km/km2, the distance from river ranges from 0 to 9153.93 m, the average annual rainfall varies from 175 to 459.98 mm and SPI ranges from 0 to 8.44.
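A companion sketch for Eq. (2), under the same assumption that the specific catchment area (square metres) and slope (degrees) rasters are already available as arrays.

```python
import numpy as np

def spi(specific_catchment_area, slope_degrees):
    """Stream power index: SPI = A_s * tan(beta)."""
    return specific_catchment_area * np.tan(np.radians(slope_degrees))
```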

(3) Geological factors

The characteristics of the rock mass, i.e., the lithological characteristics of an area, have a significant impact on NCL activities31. Therefore, geological factors are commonly used as input parameters in NCL susceptibility studies to optimize the prediction assessment. In the current study, three geological factors (lithology, fault density and distance from fault) were chosen. The lithological map and fault lines were derived from the geological map of the study area, obtained from the local government at a scale of 1:100,000. The fault density and distance-from-fault maps were prepared using the line density and Euclidean distance tools, respectively, in the GIS platform. In this area, the fault density varies from 0 to 0.54 km/km2 and the distance from fault ranges from 0 to 28,247.1 m. The lithological map of the area is presented in Fig. 4b.
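The distance-from-fault layer can be approximated outside a GIS package with a Euclidean distance transform; the sketch below is only an analogue of the GIS tool used in the paper, and the fault raster is a hypothetical placeholder.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

cell_size = 30.0                                   # raster resolution in metres (matching the SRTM DEM)
fault_mask = np.zeros((1000, 1000), dtype=bool)    # placeholder raster; True marks cells crossed by a fault
fault_mask[500, :] = True                          # e.g. a single horizontal fault trace

# Distance from every cell to the nearest fault cell, converted to metres
distance_from_fault = distance_transform_edt(~fault_mask) * cell_size
```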

(4) Environmental factors

Several environmental factors can also be significant triggering factors for NCL occurrence in mountainous or hilly regions32. Here, land use/land cover (LULC), soil and distance from road were selected as environmental variables for predicting NCL susceptibility. The LULC map was derived from Landsat 8 OLI satellite images using the maximum likelihood algorithm in ENVI. The soil texture map was prepared based on the soil map of the study area. The road map of the area was digitized from the topographical map produced by the local government. The output LULC factor was classified into six land use classes, the soil map was classified into eight soil texture groups and the distance from road ranges from 0 to 31,248.1 m.

Because the NCLSCFs are selected manually, and their dimensions and quantification methods are derived through mathematical operations before being used as model input, potential multicollinearity problems may exist among the NCLSCFs33. Such problems arise from exact or highly correlated relationships between NCLSCFs, which can distort the model or make it difficult to estimate. To avoid potential multicollinearity, this study therefore examines the variance inflation factor (VIF) and tolerance (TOL) index to assess whether multicollinearity exists among the NCLSCFs.

The multicollinearity (MC) analysis was conducted among the chosen NCLSCFs to optimize the NCL susceptibility model and its predictions34. The TOL and VIF statistics were used to test for MC in the SPSS software. Studies indicate that a multicollinearity problem exists if the VIF value is > 5 and the TOL value is < 0.10. TOL and VIF were computed using the following formulas:

$$TOL = 1 - R_j^2$$

(3)

$$VIF = \frac{1}{TOL}$$

(4)

where $R_j^2$ represents the coefficient of determination obtained by regressing the j-th factor on all the other factors.
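The same check can be reproduced outside SPSS; a sketch with statsmodels is shown below, where "ncl_factors.csv" and its 16 NCLSCF columns are hypothetical placeholders.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

factors = pd.read_csv("ncl_factors.csv")          # hypothetical table: one column per NCLSCF
X = add_constant(factors)                         # intercept term for the auxiliary regressions
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=factors.columns, name="VIF",
)
tol = 1.0 / vif                                   # TOL = 1 - R_j^2 = 1 / VIF
flagged = vif[(vif > 5) | (tol < 0.10)].index     # factors that indicate multicollinearity
```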

This section details the machine learning models of GBM and XGB, as used in NCL susceptibility studies.

GBM is one of the most popular machine learning methods for prediction performance analysis, frequently applied by researchers in different fields and treated as a supervised classification technique. A variety of classification and regression problems are solved with the GBM method, which was first proposed by Friedman35. The model is an ensemble of weak prediction models, such as decision trees, and is therefore considered one of the most important prediction models. Three components are required in a GBM model: a loss function, a weak learner, and an optimization of the loss function in which an additive function incorporates the weak learners into the model. In addition to these components, three important tuning parameters are required to build a GBM model: n-tree, tree depth and shrinkage, i.e., the maximum number of trees, the highest possible interaction among the independent variables and the learning rate, respectively36. The advantage of such a model is its capacity to determine the loss function and weak learners in a precise way. It is difficult to obtain the optimal estimate directly with the loss function ψ(y, f) and weak learner h(x, θ); to solve this problem, a new function h(x, θt) is fitted to the negative gradient {gt(xi)}, i = 1, ..., N, of the observed data:

$$g_t(x) = E_y\left[\frac{\partial \psi(y, f(x))}{\partial f(x)}\bigg|x\right]_{f(x)=f^{t-1}(x)}$$

(5)

This new function is highly correlated with $-g_t(x)$. The algorithm then allows a least-squares minimization to be derived by applying the following equation:

$$(\rho_t, \theta_t) = \arg\min_{\rho,\theta} \sum_{i=1}^{N} \left[-g_t(x_i) + \rho h(x_i, \theta)\right]^2$$

(6)
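Since the paper does not specify its implementation, the sketch below only illustrates the three tuning parameters named above using scikit-learn's gradient boosting classifier; X_train, y_train and X_test refer to the hypothetical 7:3 split sketched earlier, and the parameter values are placeholders.

```python
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=500,     # n-tree: maximum number of boosted trees
    max_depth=3,          # tree depth: highest interaction order among the factors
    learning_rate=0.05,   # shrinkage
    random_state=42,
)
gbm.fit(X_train, y_train)                               # fit on the training dataset
gbm_susceptibility = gbm.predict_proba(X_test)[:, 1]    # NCL susceptibility scores for the test set
```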

Chen and Guestrin later introduced the XGB algorithm. It represents an advanced machine learning method and is more efficient than the others37. The XGB algorithm is based on classification trees and the gradient boosting framework, which it implements through parallel tree boosting. The algorithm is chiefly applied to boost the performance of different classification trees. A classification tree is usually made up of various rules that classify each input factor as a function of the predictor variables in a tree structure. The tree is grown as an individual tree and its leaves are assigned scores, which convey and select the respective factor class, i.e., categorical or ordinal. A loss function is used in the XGB algorithm to train the ensemble model; this is known as regularization, which deals specifically with the complexity of the trees38. This regularization can significantly enhance prediction performance by alleviating over-fitting problems. The boosting method, combining weak learners, is used in the XGB algorithm to optimally predict the result. Three groups of parameters (general, task and booster) are used to configure XGB models. The weighted averages of several tree models are then combined to form the XGB output. The following optimization function was applied to form the XGBoost model:

$$OF(\theta) = \sum_{i=1}^{n} l\left(y_i, \overline{y}_i\right) + \sum_{k=1}^{K} \omega(f_k)$$

(7)

where $\sum_{i=1}^{n} l\left(y_i, \overline{y}_i\right)$ is the optimization loss function over the training dataset, $\sum_{k=1}^{K} \omega(f_k)$ is the regularization term that controls over-fitting, K is the number of individual trees, $f_k$ is the k-th tree in the ensemble, and $y_i$ and $\overline{y}_i$ denote the actual and predicted output variables, respectively.
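A corresponding sketch with the xgboost package (assumed installed); the regularization arguments stand in for the ω(f_k) penalty of Eq. (7), and the parameter values are illustrative rather than those used in the paper.

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=500,      # number of boosted trees (booster parameter)
    max_depth=4,           # tree depth
    learning_rate=0.05,    # shrinkage
    reg_lambda=1.0,        # L2 penalty on leaf weights, part of the regularization term
    reg_alpha=0.0,         # L1 penalty on leaf weights
    eval_metric="logloss", # task parameter: training loss to monitor
)
xgb.fit(X_train, y_train)
xgb_susceptibility = xgb.predict_proba(X_test)[:, 1]
```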

Kennedy, an American social psychologist, developed the PSO algorithm based on the way birds search for food and their feeding behaviour39. It is a meta-heuristic simulation of a social model, often applied in behavioural studies of fish schooling, bird flocking and swarming theory. Non-linear problems encountered in everyday research can be solved by applying the PSO method. The PSO algorithm has been widely applied to determine the best achievable path to food, specifically for bird and fish intelligence. Here, birds are treated as particles, and they always search for an optimal solution to the problem. In this model, a bird is considered an individual, and the swarm is treated as a group, as in other evolutionary algorithms. The particles try to locate the best possible solution to a given problem in an n-dimensional space, where n is the number of parameters of the problem40. PSO relies on two fundamental quantities, position and speed, which govern the movement of each particle.

Hence, $x_i^t = (x_{i1}^t, x_{i2}^t, \ldots, x_{in}^t)$ and $v_i^t = (v_{i1}^t, v_{i2}^t, \ldots, v_{in}^t)$ denote the position and speed of the i-th particle in the t-th iteration. The following formulas give the speed and position of the i-th particle in the (t + 1)-th iteration:

$$v_i^{t+1} = \omega v_i^t + c_1 r_1 \left(p_i^t - x_i^t\right) + c_2 r_2 \left(g^t - x_i^t\right)$$

$$x_i^{t+1} = x_i^t + v_i^{t+1}$$

where $x_i^t$ is the previous position of the i-th particle; $p_i^t$ is its best position found so far; $g^t$ is the best position found by the whole swarm; $r_1$ and $r_2$ are random numbers between 0 and 1; ω is the inertia weight; $c_1$ is the cognitive coefficient and $c_2$ is the social coefficient. Several methods are available for assigning the weights of the particles. Among them, standard PSO 2011 is the most popular and has been widely used by previous researchers. Here, standard PSO 2011 was used to set the particle weights with the following formula:

$$\omega = \frac{1}{2\ln 2} \quad \text{and} \quad c_1 = c_2 = 0.5 + \ln 2$$

(8)
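A bare-bones sketch of the velocity and position updates above, using the standard-PSO-2011 coefficients of Eq. (8); the quadratic objective is a placeholder, whereas in the paper PSO tunes the GBM and XGB hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(42)
w = 1.0 / (2.0 * np.log(2.0))            # inertia weight, Eq. (8)
c1 = c2 = 0.5 + np.log(2.0)              # cognitive and social coefficients, Eq. (8)

def objective(x):                        # placeholder cost: minimise a simple quadratic
    return np.sum((x - 3.0) ** 2, axis=1)

n_particles, dims, iters = 30, 2, 100
x = rng.uniform(-10.0, 10.0, (n_particles, dims))   # particle positions
v = np.zeros_like(x)                                # particle velocities
pbest, pbest_cost = x.copy(), objective(x)          # personal best positions and costs
gbest = pbest[np.argmin(pbest_cost)]                # global best position

for _ in range(iters):
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # velocity update
    x = x + v                                                   # position update
    cost = objective(x)
    improved = cost < pbest_cost
    pbest[improved], pbest_cost[improved] = x[improved], cost[improved]
    gbest = pbest[np.argmin(pbest_cost)]
```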

Evaluation is an important step to quantify the accuracy of each output model. In other words, the quality of the output model is established through a validation assessment41. Studies indicate that several statistical techniques can be applied to evaluate the accuracy of the algorithms; among them, the most frequently used is the receiver operating characteristic curve and the area under it (ROC-AUC). Here, the statistical measures sensitivity (SST), specificity (SPF), positive predictive value (PPV), negative predictive value (NPV) and ROC-AUC were all applied to validate and assess the accuracy of the models. These measures were computed from four indices: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN)42. Correctly and incorrectly identified NCL susceptibility zones are represented by TP and FP, and correctly and incorrectly identified non-NCL susceptibility zones are represented by TN and FN, respectively. The ROC curve is widely used as a standard procedure to evaluate the accuracy of the methods; it is based on event and non-event phenomena. A higher value of these measures represents good model performance, while a lower value represents poor performance. The statistical measures applied in this study were computed with the following formulas:

$$SST = \frac{TP}{TP + FN}$$

(9)

$$SPF = \frac{TN}{FP + TN}$$

(10)

$$PPV = \frac{TP}{FP + TP}$$

(11)

$$NPV = \frac{TN}{FN + TN}$$

(12)

$$AUC = \frac{\Sigma TP + \Sigma TN}{P + N}$$

(13)
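The same metrics can be reproduced from a confusion matrix; the sketch below assumes the test labels and the predicted susceptibility scores from the earlier model sketches, and uses scikit-learn only for the ROC-AUC.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_pred = (gbm_susceptibility >= 0.5).astype(int)            # threshold scores to classes
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

sst = tp / (tp + fn)                                        # sensitivity, Eq. (9)
spf = tn / (fp + tn)                                        # specificity, Eq. (10)
ppv = tp / (fp + tp)                                        # positive predictive value, Eq. (11)
npv = tn / (fn + tn)                                        # negative predictive value, Eq. (12)
auc = roc_auc_score(y_test, gbm_susceptibility)             # ROC-AUC
```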
