Abstract
Objectives
Cardiovascular predictions are related to patients' quality of life and health. Therefore, a risk prediction model for cardiovascular conditions is needed.
Methods
In this paper, we propose a cardiovascular disease prediction model using the sixth Korea National Health and Nutrition Examination Survey (KNHANES-VI) 2013 dataset to analyze cardiovascular-related health data. First, statistical analysis was performed to find variables related to cardiovascular disease using health data related to cardiovascular disease. Second, a model of cardiovascular risk prediction by learning based on the deep belief network (DBN) was developed.
Cardiovascular diseases include hyperlipidemia, myocardial infarction, and angina pectoris. Cardiovascular disease is diagnosed by electrocardiography, ultrasound, blood tests, angiography, and so on. These methods are time-consuming and costly because they require many different tests. Recently, a cardiovascular disease prediction technique using machine learning has been developed to replace these diagnostic methods [1].
Medical IT combined with machine learning technology has increased the accuracy of disease prediction using predictive models generated from disease-related learning data [2]. However, since complex data is analyzed, a deep learning technique is required [34].
Many studies have been conducted on cardiovascular disease using machine learning. Khatib and Montazer [5] developed a heart disease risk prediction model based on the Dempster-Shafer evidence theory by designing a fuzzy-evidential hybrid inference engine. Krishnaiah et al. [6] developed a cardiovascular risk prediction system using fuzzy K-nearest neighbor (K-NN) classifiers for measured values to remove uncertainty. However, research on a prediction model for domestic cardiovascular disease is lacking [78].
In recent years, attention has focused on how to construct a prediction model based on big data and the development of deep learning technology.
Prediction models are based on artificial intelligence (AI), and many methods using machine learning, data mining, databases, and statistics have been proposed [9]. Prediction models using these cutting-edge techniques have been used in many fields, and their value in the medical industry is gradually increasing.
A deep belief network (DBN) is an advanced learning method using artificial neural networks which involves a high level of technology and performs well [10]. A DBN consists of several layers of controlling restricted Boltzmann machine (RBM). It then performs supervised learning using backpropagation after unsupervised learning [11]. DBN has broad applications in various medical fields and is widely used for medical research because it performs well [121314].
In this paper, we propose a cardiovascular disease prediction model. The sixth Korea National Health and Nutrition Examination Survey (KNHANES-VI) 2013 data set [15] was used to find cardiovascular-related health data. First, statistical analysis was performed to find variables related to cardiovascular disease using health data related to cardiovascular disease. Second, a model of cardiovascular risk prediction by learning based on the DBN was developed. Thus, variables were selected using statistical techniques, and learning with DBN was conducted using the selected variables.
The remainder of this article is structured as follows. Section II describes the dataset and proposes the method. Section III outlines the system implementation and compares its ability to discriminate cardiovascular risk and probability tables. Finally, Section IV presents conclusions and specifies further directions for future research.
The research structure of this study is presented in Figure 1. First, the dataset was defined and data was preprocessed. Second, the dataset was statistically analyzed. We uses statistical techniques to select variables to be used for learning. The analyzed dataset was divided into a training set (70%) and a testing set (30%) (Table 1). Third, learning was done based on the DBN using the training set. Finally, the performance of the cardiovascular prediction model generated through learning was measured.
The KNHANES-VI contains data from the Korea Centers for Disease Control and Prevention. KNHANES identifies the health and nutritional status of the population and selects the vulnerable groups that must be prioritized to calculate the statistics necessary to assess whether health policies and projects are being effectively delivered. It also provides statistical data on smoking, drinking, physical activity, and obesity requested by the World Health Organization (WHO) and the Organization for Economic Cooperation and Development (OECD) [15].
The Framingham risk score (FRS) has been used as a standard guideline for predicting cardiovascular risk for 10 years. Therefore, the attributes in these guidelines were used as a reference for data extraction [16].
Input variables for learning included age, gender, total cholesterol, high-density lipoprotein (HDL), systolic blood pressure (SBP), diastolic blood pressure (DBP), smoking, and diabetes. Output variables included cardiovascular diseases: hypertension, hyperlipidemia, myocardial infarction, and angina pectoris.
There are 8,108 experimental records in KNHANES-VI. Of these, 2,474 were from uncertain (non-respondent, null value) respondents, while 1,390 were records of people less than 30 years old. The final dataset comprised 4,244 records. Figure 2 shows the data preprocessing procedure.
The statistical techniques for feature selection used the nonparametric Mann-Whitney U-test and chi-square. The age, SBP, DBP, total cholesterol, and HDL cholesterol variables were analyzed using the U-test. Chi-square testing was used to analyze the gender, diabetes, and smoking variables. Here, any variable whose p-value was less than or equal to 0.05 was excluded.
IBM SPSS Statistics ver. 22.0 (IBM, Armonk, NY, USA) was used for statistical analysis. Several statistical analysis methods with several preoperative variables were compared to determine the most effective method to predict cardiovascular risk.
A confusion matrix and receiver operating characteristic (ROC) curve were used to compare predictive ability. A confusion matrix is a measure to evaluate the performance of the classifier. As shown in Figure 3, accuracy, sensitivity, and specificity were measured. The matrix was constructed for output variables (low risk, high risk) in the testing dataset for each analysis. The limit of significance for all tests was p < 0.05.
A DBN is a deep learning technique that learns by composing multiple RBM layers. MATLAB R2016b was used for the DBN in this research. The DeepLearnToolbox by R. B. Palm was used for the DBN library [17].
The RBM, which is based on the Hopfield network, employs the energy function and obtains unit values probabilistically (using a Boltzmann distribution). The RBM is shown in Figure 4. The RBM consists of a visible unit layer and a hidden unit layer, and its internal connection intensity is 0 [18]. The DBN, in which structures of the RBM are connected sequentially, is shown in Figure 5 [10].
In the structure, the forefront hidden unit layer acts as the previous visible unit layer. DBN learning is done by configuring the visible layer and hidden layer 1 into a single RBM [19]. Once learning is complete, hidden layers 1 and 2 are trained via the RBM by giving a new input as a value of hidden layer 1. As such, learning is sequential up to the last layer.
A supervised learning-based classification technique using the DBN is the back propagation algorithm, which is configured in the uppermost layer in the DBN [20]. A classification prediction model using the backpropagation-DBN was created for this paper.
The distribution of preoperative parameters among low- and high-risk records is shown in Table 2. The p-value >0.05 were gender and total cholesterol. In other words, there were six variables related to cardiovascular disease risk.
The DBN constructed a learning model using a training set. Six input variables (age, SBP, DBP, HDL, diabetes, smoking) and 1 output data were used. The DBN consisted of two steps. The first phase was the construction of the RBM network using unsupervised learning. The RBM settings were epoch, batch size, and momentum at 200, 12, and 0, respectively. In the second phase, the RBM network learned the backpropagation algorithm of supervised learning. The backpropagation options were epoch and batch size at 200 and 12, respectively.
The performance of the model differed depending on the number of nodes constituting the DBN. The error rate according to the number of nodes is shown in Table 3. Six nodes with one layer [4 8] showed the lowest error rate (0.2013). Therefore, it is best for construction of a DBN [4 8] (see Figure 6).
We compared the performance of the proposed DBN with that of various machine learning techniques. The comparison models were naïve Bayesian (NB), logistics regression (LR), back propagation network (BPN), support vector machine (SVM), random forest (RF), DBN (using nine-variable input), and the proposed statistical DBN (using six-variable input) method. The confusion matrix results appear in Table 4. The ROC curve results are shown in Table 5.
Experimental results show that the proposed statistical DBN achieved the highest sensitivity, accuracy, and ROC curve performance. Specificity was 100% for SVM, and that for all others was low. In other words, SVM was effective in measuring low risk, but it could not predict important high risk. Sensitivity to measure high-risk prediction was highest at 87.6% for the proposed statistical DBN. Also, the existing DBN showed low performance because it does not consider unnecessary variables. It can be seen that unnecessary variables have a great influence on the measurement of cardiovascular disease. Therefore, the proposed model is able to achieve higher performance because it considers important variables.
This paper investigated methods that can be applied to predict the risk of cardiovascular disease. The existing methods for diagnosing cardiovascular disease are time-consuming and costly. However, cardiovascular disease risk can be predicted using various types of measured data when machine learning is applied.
In this paper, we implemented a risk prediction model using KNHANES-VI data. A DBN was used to implement the cardiovascular disease risk prediction model. Data analysis using statistical techniques showed that age, SBP, DBP, HDL, smoking, and diabetes were associated with cardiovascular risk. In other words, the prediction system can predict risk using six variables. The prediction model utilizes a DBN. The DBN consists of four input variables and one output variable. The experimental results show that it performed better than other methods. The method proposed in this paper appears to be effective for the risk prediction of cardiovascular disease and is expected to be particularly applicable to cardiovascular disease prediction in Koreans.
Future research will focus on deep learning research to improve the performance of DBN node optimization and prediction.
References
1. Austin PC, Tu JV, Ho JE, Levy D, Lee DS. Using methods from the data-mining and machine-learning literature for disease classification and prediction: a case study examining classification of heart failure subtypes. J Clin Epidemiol. 2013; 66(4):398–407.
2. Song MH, Kim SH, Park DK, Lee YH. A multi-classifier based guideline sentence classification system. Healthc Inform Res. 2011; 17(4):224–231.
3. Tomar D, Agarwal S. Feature selection based least square twin support vector machine for diagnosis of heart disease. Int J Biosci Biotechnol. 2014; 6(2):69–82.
4. Kim JK, Rho MJ, Lee JS, Park YH, Lee JY, Choi IY. Improved prediction of the pathologic stage of patient with prostate cancer using the CART-PSO optimization analysis in the Korean population. Technol Cancer Res Treat. 2016; 12. 16. [Epub] http://journals.sagepub.com/doi/abs/10.1177/1533034616681396.
5. Khatibi V, Montazer GA. A fuzzy-evidential hybrid inference engine for coronary heart disease risk assessment. Expert Syst Appl. 2010; 37(12):8536–8542.
6. Krishnaiah V, Narsimha G, Chandra NS. Heart disease prediction system using data mining technique by fuzzy K-NN approach. In : Satapathy S, Govardhan A, Raju K, Mandal J, editors. Emerging ICT for Bridging the Future-Proceedings of the 49th Annual Convention of the Computer Society of India (CSI) Volume 1. Cham: Springer International Publishing;2015. p. 371–384.
7. Lee DY, Rhee EJ, Choi ES, Kim JH, Won JC, Park CY, et al. Comparison of the predictability of cardiovascular disease risk according to different metabolic syndrome criteria of American Heart Association/National Heart, Lung, and Blood Institute and International Diabetes Federation in Korean men. Korean Diabetes J. 2008; 32(4):317–327.
8. Kim JK, Lee JS, Park DK, Lim YS, Lee YH, Jung EY. Adaptive mining prediction model for content recommendation to coronary heart disease patients. Clust Comput. 2014; 17(3):881–891.
9. Litjens G, Kooi T, Bejnordi BE, Setio AA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Ithaca (NY): arXiv.org;c2017. cited at 2017 Jul 1. Available: https://arxiv.org/abs/1702.05747.
10. Hinton GE, Osindero S, Teh YW. A fast learning algorithm for deep belief nets. Neural Comput. 2006; 18(7):1527–1554.
11. Dahl GE, Yu D, Deng L, Acero A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans Audio Speech Lang Process. 2012; 20(1):30–42.
12. Abdel-Zaher AM, Eldeib AM. Breast cancer classification using deep belief networks. Expert Syst Appl. 2016; 46:139–144.
13. Tamilselvan P, Wang P. Failure diagnosis using deep belief learning based health state classification. Reliab Eng Syst Saf. 2013; 115:124–135.
14. Fakoor R, Ladhak F, Nazi A, Huber M. Using deep learning to enhance cancer diagnosis and classification. In : Proceedings of the International Conference on Machine Learning; 2013 Jun 17-19; Atlanta, GA.
15. Korea Center for Disease Control and Prevention. The sixth Korea National Health & Nutrition Examination Survey (KNHANES-VI) 2013 [Internet]. Cheongju: Korea Center for Disease Control and Prevention;c2017. cited at 2017 Jul 1. Available: http://knhanes.cdc.go.kr/.
16. Ankle Brachial Index Collaboration. Fowkes FG, Murray GD, Butcher I, Heald CL, Lee RJ, et al. Ankle brachial index combined with Framingham Risk Score to predict cardiovascular events and mortality: a metaanalysis. JAMA. 2008; 300(2):197–208.
17. Palm RB. DeepLearnToolbox: a MATLAB toolbox for deep learning [Internet]. San Francisco (CA): GitHub Inc.;c2017. cited at 2017 Jul 1. Available:https://github.com/rasmusbergpalm/DeepLearnToolbox.
18. Hinton GE. A practical guide to training restricted Boltzmann machines. Toronto: University of Toronto;2010.
19. Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006; 313(5786):504–507.
20. Salama MA, Hassanien AE, Fahmy AA. Deep belief network for clustering and classification of a continuous data. In : Proceedings of 2010 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT); 2010 Dec 15-18; Luxor, Egypt. p. 473–477.