Application of ensemble machine learning algorithms on lifestyle factors and wearables for cardiovascular risk prediction | Scientific Reports -…

Data source and study population

Data used in this study was drawn from a SingHEART prospective longitudinal cohort study (ClinicalTrials.gov Identifier: NCT02791152). The study is a multi-ethnic population-based study conducted on healthy Asians, aged 2169years old without known diabetes mellitus or prior cardiovascular disease (Ischemic heart disease, stroke, peripheral vascular disease). The study complied with the Declaration of Helsinki and written informed consent were given by participants. The study was approved by the SingHealth Centralized Institutional Review Board.

We included 600 volunteers, aged of 30years with valid calcium score, into the main analysis of this study. Two hundred volunteers under the age of 30years, who did not have a calcium score were excluded, as the calcium score was the main outcome of our analysis.

Subset analysis for activity tracker data was performed on 430 out of the 600 volunteers who had adequate data. Although subjects recruited were issued an activity tracker to be worn over a period of five days with first and last days of the study being partial days, there was inconsistent wearing of the activity. Discounting the partial days, each subject would yield an activity log for three complete tracking days (or equivalent to days with>20 valid hours of steps and sleep data)24,25. For data consistency and quality, subjects with improper activity tracker usage i.e. activity reading log less than five days and/or sleep reading log less than three days were censored.

Coronary artery calcium (CAC) scoring was used as the modelling outcome. The coronary calcium is a specific marker of coronary atherosclerosis, a precursor for coronary artery disease26; it also reflects arterial age under the influence of underlying comorbidities and lifestyle. The CAC score was also regarded as the best marker for risk prediction of cardiovascular events27,28.

This study stratified subjects into two classes of CVD risk. Low risk if their coronary artery calcium score were 0, and high risk if calcium score were 100 and above. Subjects who did not fall into these 2 categories were considered intermediate risk.

The aim of this study is to look at how accurate the machine learning algorithm is in handling different data types, in the task of predicting high risk and low risk patients, based on calcium score.

Table 1 summarizes the data from SingHEART that was used in this study.

Data variables were categorized into four groups; lifestyle survey questionnaires, blood test data, 24-h ambulatory blood pressure, and activity tracking data by commercially available Fitbit Charge HR29.

Data pre-processing, transformation and imputation were performed on the raw data. Variables selected were based on their a priori knowledge from previous publications on cardiovascular risk assessment1,2,3, and physician expert advice. In total, there were 30, 17, 12 and 16 unique variables in the respective groups: survey questionnaire, 24h blood pressure and heart rate monitoring, blood tests and Fitbit data.

The Framingham 10-year risk score was computed using seven traditional risk factors: gender, age, single timepoint systolic blood pressure, Total Cholesterol (TC), High Density Lipoprotein (HDL), smoking status and presence of diabetes. A Framingham risk score of<10% is consider low risk, while20% is considered high risk30.

Figure1 shows the methodological framework of the present study. Exploratory analysis showed that ensemble MLA classifiers were superior at discriminating low risk individuals while ensemble MLA regressors performed better identifying individuals with high CVD risk. To leverage on the merits of both the classifiers and regressors MLA, we used both approaches for our model.

Modelling flow chart using ensemble MLA for cardiovascular risk prediction.

The ensemble classifiers produce a binary prediction outcome; low or non-low risk. The ensemble regressors makes a numerical prediction on the calcium score for individuals classified as non-low risk, and stratify into three bins of low, high, and intermediate risk. The predicted numerical values may range from negative to positive number. Negative predicted values were first converted to zero and subsequently the continuous predictions were converted to discrete bins using unique value percentile discretization ensuring records with the same numerical prediction are assigned the same risk category. Finally, the prediction outcome resides in a decision node build on a rule-based logic. The decision node assigns an outcome of low risk if classifiers predict an individual to be low in CVD risk, high risk if classifier predicts non-low risk and regressor predicts high risk. Patients with incongruent classifiers and regressor outcomes are considered unclassified.

The ensemble models in both classification and regression phase each fit three base learners (naive bayes (NB), random forest (RF) and support vector classifier (SVC) for classification prediction, and generalized linear regression (GLM), support vector regressor (SVR) and stochastic gradient descent (SGD) for regression prediction). These base learners were chosen based on preliminary analysis, where these models showed efficiency in handling missing values and outliers.

The ensemble model then uses majority vote to determine the class label in classification phase. For the regression phase, the ensemble model averages the normalized predictions from the base regressor models to form a numerical outcome.

All models were trained on a stratified five-fold cross-validation. As SingHEART data had an imbalanced CVD risk distribution of risk based on the calcium score (low risk 63.4%, high risk 8.3%, intermediate risk 18.7%) we oversampled the training set for the minority class labels to allow model to better learn features from the under-represented classes31. The data were first partitioned into five mutually exclusive subsets, with each subset sharing the same proportion of class label as original dataset. At each iteration, the MLAs trained on four parts (80%) and validated on the fifth, the holdout set (20%). The process repeats five times, with five different but overlapping training sets. The resulting metrics from each fold were averaged to produce a single estimate.

To simulate access to the different variable groups as per clinical workflow and ease of information availability, we assessed the performance of individual variable group, and in combination as per the following:

Model 1: Survey Questionnaire.

Model 2: 24h ambulatory blood pressure and heart rate.

Model 3: Clinical blood results.

Model 4: Model 1+Model 2.

Model 5: Model 1+Model 3.

Model 6: Model 1 to Model 3.

Model 6*: Model 1 to Model 3 with feature selection.

Model 7: Physical activity and sleep trackers (exploratory subset analysis).

Variables in model 6* were reduced using SVC recursive feature elimination with cross-validation (SVC-RFECV) method to automatically select the best set of predictors that yield the highest area under Receiver Operating Characteristic curves (AUC). Model 16 were trained using 600 subjects.

We also performed exploratory analysis using MLA on the Fitbit Charge HR data (Model 7). Model 7 was trained on a subset of 430 subjects constrained by availability of valid activity tracking data.

Since no single metric can objectively evaluate the cardiovascular risk prediction, we evaluate the performance of our models at CVD risk class level using a panel of metrics; sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1-score and Area under Receiver Operating Characteristic curves (AUC). Overall discriminative ability of the model was described by the area under received operating characteristic curve (ROC). All AUC metrics were accompanied by 95% confidence interval (CI) and standard deviation (SD).

To better understand the relative importanceof different risk factors, we conduct a post-hoc approach to rank the variables by their contribution to CVD risk prediction. Feature importance were obtained from the SVC algorithm where the relative importance was determined by the absolute size of the coefficients in relation to others. All statistical analyses were conducted on Python version 3.7 environment and all MLAs and evaluation metrics were constructed using Scikit-learn libraries.

Continue reading here:
Application of ensemble machine learning algorithms on lifestyle factors and wearables for cardiovascular risk prediction | Scientific Reports -...

Related Posts

Comments are closed.