Machine Learning Algorithms for Coronary Heart Disease Prediction
Coronary heart disease (CHD), or simply heart disease, involves the reduction of blood flow to the heart muscle due to a build-up of plaque in the arteries of the heart. It is the most common of the cardiovascular diseases. Risk factors include high blood pressure, smoking, diabetes, a sedentary lifestyle, obesity, high blood cholesterol, poor diet, depression, and alcohol consumption.
CHD is a major cause of death in the UK and, worldwide, the most common cause of death in industrialized countries. It is sometimes called coronary artery disease.
The prediction of heart disease is considered one of the most important topics in health care because the disease often involves other pathologies. With data mining and machine learning algorithms applied to large amounts of health data, it is possible to extract information that can help doctors make more accurate decisions and predictions.
Predicting CHD is a very complex challenge: according to a WHO survey, medical professionals can correctly predict heart disease with only 67% accuracy.
The aim of this work is to find the most effective machine learning models for predicting the risk of CHD using the South African Heart Disease dataset.
The data for this article can be found here. It is a sample of males from a region of the Western Cape, South Africa, where many cases of heart disease have been recorded.
The task is to build a binary classification model that predicts heart disease in people. In binary classification, we classify data into one of two groups, in this case 0 and 1. The target column to predict is ‘chd’, where chd = 1 is a positive response and chd = 0 is a negative response.
First step: Understanding the data
sbp: systolic blood pressure, measured when the heart is contracting
tobacco: cumulative tobacco (kg)
ldl: low density lipoprotein cholesterol
adiposity: measured as percent of body fat
famhist: family history of heart disease (Present=1, Absent=0)
typea: type-A behavior, characteristic of a person who is competitive
obesity: represented as Body Mass Index (BMI)
alcohol: current alcohol consumption
age: age at onset
chd: coronary heart disease (yes=1 or no=0)
The dataset provides the patients’ information. It includes 462 records and 10 attributes, as listed above. For sophisticated statistical analysis, such as machine learning, 462 records are very few, but the purpose of this work is purely didactic.
Out of 462 people examined, only 160 are positive for ‘chd’ (chd = 1), which corresponds to 34.6%. The dataset is slightly unbalanced, but later I will show how to address this problem.
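The class balance quoted above can be checked with a few lines of arithmetic. This is only a sketch using the two counts reported in the text. The variable names are my own.

```python
# Counts reported in the article: 462 records, 160 of them with chd = 1.
n_total = 462
n_positive = 160
n_negative = n_total - n_positive

# Share of positive cases and number of negatives per positive case.
positive_share = n_positive / n_total
negatives_per_positive = n_negative / n_positive

print(f"positive cases: {n_positive} ({positive_share:.1%})")
print(f"negative cases: {n_negative}")
print(f"negatives per positive: {negatives_per_positive:.2f}")
```

Roughly two negative cases for every positive one is a mild imbalance, which is why accuracy alone can be misleading on this dataset.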
Histograms help me better understand the distribution of the data.
In addition to our target variable ‘chd’, we have only one categorical variable, ‘famhist’. The majority of the subjects are between 40 and 60 years old and have low levels of alcohol and tobacco consumption. But let’s take a closer look at the statistics of our dataset.
As can be seen from the statistics in the table above, those who are positive for ‘chd’ also have high values of ‘obesity’ and ‘adiposity’, as well as a higher average ‘age’ than the sample as a whole.
Correlation matrix and heat map.
Let’s take a look at some of the more significant correlations. It is worth remembering that correlation coefficients only measure linear relationships. In the matrix, no feature has a correlation greater than 0.5 with ‘chd’, which suggests that no single feature is a strong linear predictor on its own. The features with the highest correlation with ‘chd’ are ‘age’, ‘tobacco’ and ‘famhist’. Additionally, there are a couple of highly correlated features, such as ‘obesity’ and ‘adiposity’ (as expected).
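To illustrate why correlation coefficients only capture linear relationships, here is a minimal sketch of the Pearson coefficient in plain Python. The toy data are invented for the example and are not taken from the dataset. A perfectly linear relationship gives a coefficient of 1, while a perfectly quadratic one on symmetric inputs gives 0, even though the second variable is fully determined by the first.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical, for illustration only).
xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
quadratic = [x * x for x in xs]    # y depends on x, but not linearly

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(f"linear: {r_linear:.2f}, quadratic: {r_quadratic:.2f}")
```

A low coefficient with ‘chd’ therefore does not rule out that a feature is useful to a non-linear model.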
Based on the correlation matrix, I want to take a closer look at the density graphs for ‘tobacco’ and ‘age’, because they have higher correlation values with ‘chd’ than the other features. A density graph displays the distribution of data over a continuous interval or period of time. It is a variation of a histogram that uses kernel smoothing to plot values, giving smoother distributions by attenuating noise. The peaks of a density graph show where the values are concentrated in the range.
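The kernel smoothing behind a density graph can be sketched in a few lines. The example below assumes a Gaussian kernel with a hand-picked bandwidth. The ages are hypothetical values, not rows from the dataset, and the function name is my own.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centred on each sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical ages, for illustration only.
ages = [38, 42, 45, 47, 50, 52, 55, 58, 60, 63]
density = gaussian_kde(ages, bandwidth=5.0)

# Evaluate the estimate on a grid from 20 to 80 in steps of 0.5.
step = 0.5
grid = [20 + i * step for i in range(121)]
values = [density(x) for x in grid]

peak_age = grid[values.index(max(values))]  # where the values concentrate
area = sum(values) * step                   # should be close to 1
print(f"peak at age {peak_age}, area under curve {area:.3f}")
```

A smaller bandwidth follows the individual samples more closely, while a larger one gives a smoother curve. Plotting libraries usually choose it automatically.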
Considering that we have little data available and only 9 predictor attributes, it is not advisable in this case to apply feature selection algorithms or principal component analysis, because useful information could be lost.
Synthetic Minority Oversampling Technique for Imbalanced Classification
Imbalanced classification involves developing predictive models on classification datasets that have a severe class imbalance. One technique for dealing with unbalanced datasets is to oversample the minority class. The simplest approach duplicates examples in the minority class (chd = 1), which adds no new information to the model. Instead, new examples can be synthesized from existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique (SMOTE).
As previously mentioned, the positive cases, where chd = 1, make up 34.6% of the total. After applying this technique, the resulting dataset was much more balanced, with 44.5% positive cases, although it should be remembered that the dataset available is very small.
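The core idea of SMOTE, synthesizing a new minority example on the segment between an existing example and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration, not the implementation used for the results in this article. The function name, the value of k and the sample points are all assumptions. In practice a library such as imbalanced-learn, which provides a SMOTE class, would normally be used.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical (age, tobacco) pairs for chd = 1 cases, for illustration only.
minority = [(52.0, 4.1), (58.0, 7.5), (61.0, 3.0), (47.0, 6.2), (55.0, 9.0)]
new_points = smote_sketch(minority, n_new=4)
for point in new_points:
    print(point)
```

Each synthetic point lies between two real minority examples, so it adds variety without leaving the region already occupied by the minority class.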
After balancing the dataset with SMOTE, the next step is to scale the data to improve the training of the classifiers. I then split the data into a training set and a test set with a ratio of 80% to 20%, which is a standard subdivision.
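Scaling and splitting can be sketched in plain Python as below. Note one deliberate difference from the order described above: the sketch splits first and computes the scaling statistics on the training portion only, a common variation that keeps test-set information out of training. That choice, the helper names and the blood pressure values are assumptions made for the example. With scikit-learn, train_test_split and StandardScaler would normally do this work.

```python
import random
import statistics


def train_test_split_sketch(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into training and test lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def standardize(train_col, test_col):
    """Z-score both columns using the training mean and standard deviation."""
    mean = statistics.fmean(train_col)
    std = statistics.pstdev(train_col)
    train_scaled = [(v - mean) / std for v in train_col]
    test_scaled = [(v - mean) / std for v in test_col]
    return train_scaled, test_scaled


# Hypothetical systolic blood pressure values, for illustration only.
sbp = [float(v) for v in range(110, 160, 5)]  # 10 values
train, test = train_test_split_sketch(sbp)
train_scaled, test_scaled = standardize(train, test)
print(len(train), "training values,", len(test), "test values")
```

After scaling, the training column has mean 0 and standard deviation 1, so features measured in different units contribute on a comparable scale.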
In this study, using our training set I trained 5 machine learning classification algorithms:
1) Logistic Regression is one of the simplest and most commonly used machine learning algorithms for two-class classification. It is easy to implement and can be used as the baseline for any binary classification problem. It measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.
2) Random Forest builds decision trees on randomly selected data samples, gets a prediction from each tree and selects the final answer by voting. By averaging over many trees it improves predictive accuracy and helps control overfitting. It can handle a large number of features and provides a good indication of which variables are important in the underlying data being modeled.
3) Support Vector Machine (SVM) is a supervised machine learning technique that is widely used in pattern recognition and classification problems, particularly when the data has exactly two classes. In practice, an SVM constructs a hyperplane or a set of hyperplanes that best divides a dataset into two classes.
4) Naive Bayes is a probabilistic classifier based on Bayes’ theorem. It is very easy to build and particularly useful for very large datasets. It calculates the probability of each label for a given object by looking at its characteristics, and chooses the label with the greatest probability.
5) Gradient Boosting is a machine learning algorithm used for regression and classification problems. It produces a predictive model in the form of an ensemble of weak predictive models, typically decision trees. In each training cycle, a new weak learner learns from the previous predictors, and its predictions are compared with the correct result we expect. The distance between observation and prediction represents the error of our model.
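As a small illustration of the first model in the list above, the sketch below shows how a logistic function turns a weighted sum of features into a probability and then into a 0/1 prediction. The weights, bias and patient values are hypothetical and were not learned from the dataset. A trained model would estimate them from the training set.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Estimated probability of chd = 1 for one patient."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for standardized (age, tobacco, famhist).
weights = [0.8, 0.4, 0.5]
bias = -0.7

patient = [1.2, 0.5, 1.0]  # hypothetical standardized feature values
p = predict_proba(patient, weights, bias)
label = int(p >= 0.5)      # classify as positive above the 0.5 threshold
print(f"estimated probability: {p:.3f}, predicted chd: {label}")
```

The 0.5 threshold is the usual default. Moving it trades false positives against false negatives, which is what the ROC curve discussed later explores.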
Evaluation of the models under examination
I trained each model and fine-tuned its hyper-parameters using grid search. Each model has a series of hyper-parameters that must be set, and it is not advisable to rely on the default values, which often give poor results. I then evaluated and compared the models’ performance via their Accuracy, Confusion Matrix, ROC Curves and F1 score. But let’s see in detail what these metrics tell us. To do this we need to start from the Confusion Matrix.
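Grid search simply tries every combination of candidate hyper-parameter values and keeps the one with the best score. The sketch below shows that loop in plain Python. The grid values are hypothetical, and toy_score is a stand-in for "train the model with these settings and return its validation score". In practice scikit-learn's GridSearchCV does the same job, with cross-validation built in.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical scoring function that peaks at 200 trees of depth 5."""
    return (
        -abs(params["n_estimators"] - 200) / 1000
        - abs(params["max_depth"] - 5) / 10
    )


# Hypothetical grid for a Random Forest.
param_grid = {"n_estimators": [100, 200, 500], "max_depth": [3, 5, 10]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

The number of combinations grows multiplicatively with each added hyper-parameter (here 3 x 3 = 9), so grids are usually kept small.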
The confusion matrix reports the counts of TP (true positives), TN (true negatives), FP (false positives) and FN (false negatives) in the predictions of a classifier.
From the confusion matrix we can derive the Accuracy, which is given by the number of correct predictions divided by the total number of predictions:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
And the F1 score, which is the harmonic mean of Precision and Recall, where:
Precision = TP / (TP + FP) and Recall = TP / (TP + FN)
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
F1 score is usually more useful than accuracy, especially if you have an uneven class distribution as in the above case.
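The metrics above can be computed directly from the four confusion matrix counts. The following sketch uses ten hypothetical labels and predictions, not results from the models in this article. scikit-learn offers equivalent functions such as confusion_matrix, accuracy_score and f1_score.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


# Hypothetical true labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (recall * precision) / (recall + precision)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy case accuracy (0.80) is higher than F1 (0.75) because the six negatives are the majority class and are mostly classified correctly, which is the effect described above for uneven class distributions.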
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve, and AUC (Area Under the Curve) represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
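One way to make the "separability" reading of AUC concrete is that AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it that way on hypothetical scores. It is an illustration rather than the routine used for the results here, and scikit-learn's roc_auc_score computes the same quantity.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

Here 11 of the 12 positive/negative pairs are ordered correctly. A model that ranks at random would score around 0.5, and a perfect ranking scores 1.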
Now let’s see the results of the confusion matrices and Area Under the Curve obtained from our models.
Comparing all the metrics, we see that the best model for the Coronary Heart Disease dataset is Random Forest. Compared to the other models, it records higher values of AUC and F1 score, with fewer FP and FN, as can be seen from the confusion matrices. In addition, the Random Forest recorded: Precision = 83%, Sensitivity = 81%, Recall = 73% and Specificity = 89%. With more instances available, the trained models would likely have learned better from the data and recorded better values.
Artificial intelligence algorithms at the service of medicine are valuable allies for doctors, supporting early screening and identification of diseases and better management of resources within the health system. Furthermore, these algorithms could be implemented within smartwatch apps to allow people to keep their health under control.
I hope you found this article useful and understandable. Suggestions are welcome.
Sources of inspiration for this work and related articles
§ Prediction of Coronary Heart Disease using Machine Learning: Experimental Analysis. Amanda H. Gonsalves, Fadi Thabtah, Rami Mustafa A. Mohammad
§ NHS https://www.nhs.uk/