Creation of a machine learning-based prognostic prediction model for various subtypes of laryngeal cancer | Scientific … – Nature.com

Data collection

The Surveillance, Epidemiology, and End Results Program (SEER) database was used to gather the study's data (Incidence-Seer Research Plus Data 17 Registries Nov 2021 Sub). Using the SEER*Stat program (version 8.4.2), we retrieved individuals who had been given a larynx carcinoma diagnosis by the third edition of the International Classification of Oncology Diseases (ICD-O-3). The period frame covers instances handled between 2000 and 2019. The following were the inclusion requirements: The behavior was identified as malignant and encoded by position and shape as "larynx".

In total, 54,613 patients with primary laryngeal malignant tumors were included. The median follow-up duration of the sample in this study is 38months. We used the following exclusion criteria to clean up the data: (1) Patients with limited follow-up information; (2) Patients without T stage (AJCC7), N stage (AJCC7), M stage (AJCC7), or AJCC stage grade information.

We selected variables that were directly related to the clinic, such as age, race, and gender, based on clinical experience. We chose the T stage, N stage, M stage, AJCC stage (AJCC stage 7), tumor size, and pathological categorization to assess the patient's health. Finally, to evaluate the patient's treatment plans, we also included radiation therapy, surgery, and chemotherapy.

A classic model for survival analysis, the Cox proportional hazards (CoxPH) model has been the most commonly applied multifactor analysis technique in survival analysis to date18,19.

CoxPH is a statistical technique for survival analysis, which is mainly used to study the relationship between survival time and one or more predictors. The core of the model is the proportional risk hypothesis.

It is expressed as h(t|x)=h0 (t) exp (|x), h(t|x) is the instantaneous risk function under the given covariable x, h0 (t) is the baseline risk function, on the other hand, exp ( x) represents the multiplicative effect of covariates on risk.

The random survival forest (RSF) model is an extremely efficient integrated learning model that can handle complex data linkages and is made up of numerous decision trees20.

RSF can improve the accuracy and robustness of the prediction, but it does not have a single expression because it is an integrated model consisting of multiple decision trees21. RSF constructs 1000 trees and calculates the importance of variables. To find the optimal model parameters, we adjust three key parameters: the maximum number of features of the tree (mtry), the minimum sample size of each node (nodesize), and the maximum depth of the tree (nodedepth). The values of these parameters are set to mtry from 1 to 10, nodesize from 3 to 30, and nodedepth from 3 to 6. We use a random search strategy (RandomSearch) to optimize the parameters. To evaluate the performance of the model under different parameter configurations, we use tenfold cross-validation and use C-index (ConcordanceIndex) as the evaluation index. The purpose of this process is to find the parameter configuration that can maximize the prediction accuracy of the model through many iterations.

One of the integrated learning methods called Boosting is the gradient boosting machine (GBM) model, which constructs a strong prediction model by combining several weak prediction models (usually decision trees). At each step, GBM adds a new weak learner by minimizing the loss function. The newly added model is trained to reduce the residual generated in the previous step, and the direction is determined by the gradient descent method. It can be expressed as Fm+1(x)=Fm(x)+mhm(x). Where the Fm(x) is a weak model newly added, and the m is the learning rate.

XGBoost is an efficient implementation of GBM, especially in optimizing computing speed and efficiency. To reuse the learner with the highest performance, it linearly combines the base learner with various weights22. eXtreme Gradient Boosting (XGBoost) is an optimization of the Gradient Boosting Decision Tree (GBDT), which boosts the algorithm's speed and effectiveness23. The neural network-based multi-task logic regression model developed by Deepsurv outperforms the conventional linear survival model in terms of performance24. DeepSurv uses a deep neural network to simulate the Cox proportional hazard model. Therefore, deepsurv can be expressed as h(t|x)=h0 (t) exp (g(x)), Where the g (x) is the output of the neural network, which represents the linear combination of the covariable x8.

We categorize five models to adapt to various variable screening techniques used with various models. The RSF, GBM, and XGBoost models are screened using the least absolute shrinkage and selection operator (LASSO) regression analysis, while the CoxPH model is screened using the traditional Univariate and multivariate Cox regression analysis25,26,27.

In contrast, the Deepsurv model can automatically extract features and handle high-dimensional data and nonlinear relationships, so variable screening is not necessary28. We randomly split the data set into t and v datasets (training set and validation set) and test set in the ratio of 9:1 using spss (version 26) to further illustrate the model's dependability. Randomly selected 10% of the data as external verification. Once more, the ratio of 7:3 is used to divide the training set and validation set, and for both splits, the log-rank test is used to evaluate any differences between the two cohorts. The mlr3 package of R (version 4.2.2) uses the grid search approach to fine-tune the hyperparameters in the RSF, GBM, and XGBoost models in the validation set and chooses the most beneficial hyperparameters to build the survival model once the variables have been filtered following the aforementioned stages. Finally, the Deepsurv model is constructed using the Python (version 3.9) sksurv package, and the model is additionally optimized using grid search.

We used the integrated Brier score (IBS), which is appropriate for 1-year, 3-year, and 5-year time points, as the major assessment metric when evaluating the prediction performance of the model in the test set. In addition, the calibration curve is drawn and the conventional time-dependent receiver operating characteristic (ROC) curve as well as the area under the curve (AUC) (1year, 3years, and 5years) are compared. By calculating the clinical net benefit to address the actual needs of clinical decisions, Decision Curve Analysis (DCA), a clinical evaluation prediction model, incorporates the preferences of patients or decision-makers into the analysis. Calculating the various clinicopathological characteristics is also required for the prognosis of contribution. We visualized the survival contribution of several clinicopathological characteristics for 1-year, 3-years, and 5-years using The Shapley Additive Explanations (SHAP) plot.

Clinically speaking, various individuals require personalized care. Consequently, it is crucial to estimate the likelihood that a single patient will survive. The survival probability of a certain patient is predicted using the ggh4x package of R (version 4.2.2), along with the contribution of several clinicopathological characteristics to survival. This has major clinical work implications.

Read more:
Creation of a machine learning-based prognostic prediction model for various subtypes of laryngeal cancer | Scientific ... - Nature.com

Related Posts

Comments are closed.