Journal List > J Korean Soc Med Inform > v.15(1) > 1035515

Kang, Chung, and Suh: Prediction of Hospital Charges for the Cancer Patients with Data Mining Techniques



Predicting hospital charges for cancer patients is important because such predictions provide a basis for allocating medical resources within hospitals and for establishing national medical policies. Previous studies, however, relied mainly on statistical analyses that used only a small portion of the available medical data, which limited their predictive power. We therefore developed four data mining models, two artificial neural network (ANN) models and two classification and regression tree (CART) models, to predict both the total hospital charge and the amount paid by insurance for cancer patients, and compared their performance.


The data were drawn from 400,625 medical records of 1,605 cancer patients who had been hospitalized at Kyung Hee University Hospital from March 1, 2003 to February 29, 2004. Clementine 8.1 was used to build four data mining prediction models, two for the total amount and two for the amount paid by insurance. The variables included all data fields of the standard medical record form of Korea. The neural network models used a feed-forward back-propagation method with 2 hidden layers. For the decision tree models, the RELIEFF method was used and the maximum tree depth was set to 30. We divided the dataset into a training set (67%) and a test set (33%) using stratified sampling. The models were compared by linear correlation coefficient and gain chart.


The ANN models showed higher linear correlation coefficients than the CART models in predicting both the total amount (0.824 vs. 0.791) and the amount paid by insurance (0.838 vs. 0.699). The estimated accuracy of the ANN models was more than 98% for both the total amount and the amount paid by insurance. In the CART model for the total amount, the variables with the highest relative importance were duration of admission (0.073), number of consultations (0.061), and treatment group 16 (0.06). In the CART model for the amount paid by insurance, the variables with the highest relative importance were duration of admission (0.09), number of ICU admissions (0.063), and number of consultations (0.062). The ANN model showed better percent gain than the CART model in predicting the total amount, but in predicting the amount paid by insurance the two models showed a similar pattern.


The ANN models showed better prediction accuracy than the CART models. However, the CART models, which provide different information from the ANN models, can be used to allocate limited medical resources effectively and efficiently. For establishing medical policies and strategies, using the two kinds of model together is warranted.

I. Introduction

According to the 2007 report of the Health Insurance Review Agency1), the estimated number of medical insurance claims was 967,735,494 and the total expense was 32,258,975 million Won. The estimated number of claims associated with malignant neoplasm was 707,478 and the total expense for those claims was 1,604,788 million Won. Although cancer-related claims account for only 0.07% of all claims, they account for about 4.97% of total expenses. Moreover, the average expense per claim related to malignant neoplasm was 2,268,323 Won, which is about 68 times the average expense of 33,334 Won across all medical claims. Hospital charges related to cancer have thus grown substantially2). It has therefore become very important to predict cancer-related hospital charges for the proper allocation of medical resources and the establishment of medical policies in hospitals.
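The shares and averages quoted above follow directly from the four reported totals. The short check below recomputes them; it is only an arithmetic sketch, the variable names are chosen here for illustration, and small differences in the last digit can arise because the expenses are reported in rounded millions of Won.

```python
# Totals quoted in the Introduction (2007 report); names are illustrative.
total_claims = 967_735_494
total_expense_mwon = 32_258_975      # million Won
cancer_claims = 707_478
cancer_expense_mwon = 1_604_788      # million Won

# Share of claims and share of expense attributable to malignant neoplasm (%).
claim_share_pct = cancer_claims / total_claims * 100
expense_share_pct = cancer_expense_mwon / total_expense_mwon * 100

# Average expense per claim, in Won.
avg_all_won = total_expense_mwon * 1_000_000 / total_claims
avg_cancer_won = cancer_expense_mwon * 1_000_000 / cancer_claims

# How many times larger the average cancer claim is.
ratio = avg_cancer_won / avg_all_won

print(f"claim share:   {claim_share_pct:.2f}%")
print(f"expense share: {expense_share_pct:.2f}%")
print(f"avg per claim (all):    {avg_all_won:,.0f} Won")
print(f"avg per claim (cancer): {avg_cancer_won:,.0f} Won")
print(f"ratio: {ratio:.1f}")
```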
Medical data, meanwhile, are difficult to analyze because of their large volume, heterogeneity, temporal nature, and high frequency of null values. It is not uncommon for a patient's record to exceed 1000 fields when data are collected for longer than one year3). In addition, because medical data consist of various types of data such as images, numbers, videos and electrical signals (EKG, EEG), they are more difficult to analyze than data from other domains. Moreover, medical data are collected very irregularly because disease occurs unexpectedly. This irregularity produces many null values over time, and these null values hinder the building of proper prediction models. For these reasons, previous studies have shown low prediction accuracy, because they could use only limited parts of the large and heterogeneous body of medical data.
Studies of hospital expenses remain relatively few. In particular, studies that predict hospital expenses with data mining techniques, which are reported to give better predictions than other methods4), are hard to find. We therefore built models to predict the expenses of cancer patients using artificial neural network and decision tree methods and compared their performance.

II. Materials and Methods

In general, an appropriately selected feature subset yields better accuracy than the full feature set. The authors used RELIEFF, an extension of RELIEF, as the feature selection method, as suggested by Demsar5). A domain expert then verified the selected features. The selected features were used as split criteria in decision trees and in a naïve Bayes classifier for building models.
The dataset is based on the records of cancer patients who were treated at Kyung Hee University Hospital from March 1, 2003 to February 29, 2004. During this period the hospital had more than 130,000 admissions, 4,000,000 outpatient visits and 5,000 newly diagnosed cancer patients. Among them, 1622 patients who had been hospitalized at least once for the treatment of cancer were enrolled. Data from 17 patients who had no personal identification were excluded. Finally, 400,625 records from 1605 patients were used for the analysis. The variables included all the fields of the standard medical record form of Korea. The dataset comprised 66 variables in total (65 input variables and 1 output variable) (Table 1).
The output variable was set once to predict 'the total amount of hospital charge' and then to predict 'the amount paid by insurance'.
We removed null values and performed variable selection using the RELIEFF algorithm with the help of medical domain experts, because building models with a subset of appropriate variables gives better accuracy than using the total set5). For example, the original fields 'operation_1' to 'operation_10' each consist of a two-digit code identifying the surgeon and a two-digit code identifying the operation number. They had many null values, because a patient rarely receives more than five operations during one admission. We therefore derived a new field, no. of operations, which simply stores the number of operations performed on a patient, thereby reducing both the number of null values and the number of fields. Disease codes other than cancer were divided into 19 fields, each of which denotes the number of diseases in one disease group. As a consequence, 65 input fields were created in total. The 19 disease groups were generated according to the Korean Classification of Diseases, and 16 treatment groups were generated according to the ICD-9CM classification. Each kind of cancer was stored in one of twelve fields.
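The derivation of the no. of operations field described above can be sketched as follows. This is a minimal illustration, not the preprocessing actually run in Clementine: the record layout (a dictionary with keys 'operation_1' to 'operation_10'), the output field name `no_of_operations`, and the sample values are assumptions made here.

```python
# The ten sparse source fields named in the text.
OPERATION_FIELDS = [f"operation_{i}" for i in range(1, 11)]


def count_operations(record):
    """Count the operation fields that hold a value (not None or empty)."""
    return sum(1 for name in OPERATION_FIELDS
               if record.get(name) not in (None, ""))


def derive_record(record):
    """Replace the ten operation fields with one count field.

    'no_of_operations' is a hypothetical name for the derived field.
    """
    derived = {k: v for k, v in record.items() if k not in OPERATION_FIELDS}
    derived["no_of_operations"] = count_operations(record)
    return derived


# Hypothetical admission record: two operations, the rest null.
sample = {
    "patient_id": "P001",
    "operation_1": "0312",   # assumed surgeon code + operation number
    "operation_2": "0701",
    "operation_3": None,
}
result = derive_record(sample)
print(result)
```

Collapsing ten mostly empty fields into one count removes the null values without discarding how many operations the patient received.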
Clementine 7.0 (SPSS, Chicago, Illinois, USA) was used to build the data mining models. After feature selection with the RELIEFF method, we built the models. A feed-forward back-propagation method was used to build the neural network models. Seventy percent of the original dataset was used as the training dataset and the rest as the test dataset. Two neural network models were created using the training dataset: one to predict the total amount of hospital charge and another to predict the amount paid by insurance. Similarly, two CART models were built using the same input variables selected by the RELIEFF method as were used for the neural network models. For the CART models, we set the maximum tree depth to 30. The same training and test datasets were used as for the neural network models. The Gini index, which indicates the level of impurity of a node, was used as the basis for splitting nodes. All the models were built using Clementine 8.1.
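To make the splitting criterion concrete, the sketch below computes the Gini index of a node and the weighted Gini index after a threshold split. It assumes class labels at each node (for example, charges binned into "low" and "high", which is a hypothetical binning); the text does not describe how Clementine applies the index to a continuous charge target, so this illustrates the measure itself rather than the tool's internals.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini_after_split(rows, threshold):
    """Weighted Gini index of the two children of a threshold split.

    rows is a list of (value, label) pairs; values below the threshold
    go to the left child, the rest to the right child.
    """
    left = [label for value, label in rows if value < threshold]
    right = [label for value, label in rows if value >= threshold]
    n = len(rows)
    return (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)


# Hypothetical rows: (days of admission, binned charge class).
rows = [(3, "low"), (10, "low"), (20, "high"), (40, "high")]
before = gini_impurity([label for _, label in rows])
after = weighted_gini_after_split(rows, threshold=14.5)
print(before, after)
```

A tree builder of this kind would prefer the candidate split that gives the largest drop from the parent's impurity to the weighted impurity of the children.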

III. Results

To predict the total amount, an ANN model was created with 55 input neurons and 3 hidden layers (Fig. 1). For the total amount, the ANN model with feature selection showed a better linear correlation than the model without feature selection: the linear correlation coefficients with and without feature selection were 0.824 and 0.794, respectively. To predict the amount paid by insurance, an ANN model was created with 53 input neurons and 3 hidden layers. Here too, the ANN model with feature selection showed a better linear correlation than the model without it: the linear correlation coefficients with and without feature selection were 0.838 and 0.82, respectively (Table 2). The estimated accuracy of the neural network models for the total amount and the amount paid by insurance was 98.3% and 98.7%, respectively. The relative weights of the factors that affect hospital charge were analyzed. The most important factors in predicting the total amount were duration of admission (0.074), number of consultations (0.062) and treatment group 16 (0.061) (Table 3). Treatment group 16 denotes miscellaneous diagnostic and therapeutic procedures. The most important factors in predicting the amount paid by insurance were duration of admission (0.091), number of ICU admissions (0.063) and number of consultations (0.063). Physician-related variables such as Doctor ID did not contribute to the relative importance.
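The linear correlation coefficient used to compare the models is taken here to be the Pearson correlation between actual and predicted charges on the test set; that reading is an assumption, since the text does not spell out the formula. A minimal sketch with made-up numbers:

```python
import math


def pearson_r(actual, predicted):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_p)


# Hypothetical charges in Won, for illustration only (not study data).
actual_charges = [1_200_000, 3_100_000, 2_500_000, 5_400_000]
predicted_charges = [1_500_000, 2_900_000, 2_700_000, 5_000_000]
r = pearson_r(actual_charges, predicted_charges)
print(f"r = {r:.3f}")
```

A value near 1 means the predictions rise and fall with the actual charges; it does not by itself show that the predicted amounts are close in absolute terms.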
In the CART model for the total amount, the most important variable was the duration of admission, and the first branch was split at 14.5 days of admission. The second most important variable was the number of operations, at the left branch of the tree. The nodes at other levels were then split on number of operations, treatment group 16, and treatment group 9. The resulting rule set contained eleven rules, and these rules classified the cases with high hospital expenses well. For example, consider this rule: IF "(1) days of admission ≥14.5 and (2) days of admission <55.5 and (3) the number of ICU admission <0.5" THEN "3,125,038". The correlation coefficients of the CART models were 0.791 for the total amount of hospital charge and 0.699 for the amount paid by insurance, regardless of feature selection. In the CART model for the amount paid by insurance, the duration of admission was again the most important variable, but department, rather than number of operations, was the second most important variable. The other variables were number of ICU admissions and treatment group 16 (Fig. 3).
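The example rule above can be written as a small function. Only this one rule is given in the text, so the sketch returns None for patients it does not cover instead of guessing at the other ten rules; the function and parameter names are illustrative.

```python
def predict_total_charge(days_of_admission, icu_admissions):
    """Apply the single CART rule quoted in the text.

    Returns the predicted charge in Won when the rule fires, otherwise
    None (the remaining rules of the tree are not given in the text).
    """
    if 14.5 <= days_of_admission < 55.5 and icu_admissions < 0.5:
        return 3_125_038
    return None


# A 20-day admission with no ICU stay falls under the quoted rule.
print(predict_total_charge(days_of_admission=20, icu_admissions=0))
# A 10-day admission is handled by a branch not described in the text.
print(predict_total_charge(days_of_admission=10, icu_admissions=0))
```

Rules in this form are what make the tree readable to clinicians and administrators: each prediction can be traced to a few explicit thresholds.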
The ANN model showed better percent gain than the CART model in predicting the total amount, but in predicting the amount paid by insurance the two models showed a similar pattern (Fig. 4).
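A gain chart of this kind can be sketched as follows, under the assumption that a "hit" is a binary flag (for example, a patient whose charge exceeds some threshold) and that samples are ranked by the model's score. How Clementine defined hits for these models is not stated in the text, so the hit definition and the data below are hypothetical.

```python
def cumulative_gain(hits, scores, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Percent of all hits captured in the top-scored fraction of samples.

    hits is a list of 0/1 flags and scores the model output used for
    ranking (higher score = ranked earlier).
    """
    if len(hits) != len(scores):
        raise ValueError("hits and scores must have the same length")
    total_hits = sum(hits)
    if total_hits == 0:
        raise ValueError("gain is undefined when there are no hits")
    # Indices of the samples sorted by descending score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    gains = []
    for frac in fractions:
        k = round(frac * len(order))
        captured = sum(hits[i] for i in order[:k])
        gains.append(100.0 * captured / total_hits)
    return gains


# Hypothetical example with four patients.
hits = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
gains = cumulative_gain(hits, scores)
print(gains)
```

A model whose curve rises faster captures more of the hits within the same share of selected samples, which is the sense in which one model shows better percent gain than another.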

IV. Discussion

With the development of information technologies, it has become possible to record and search the historical states of patients through databases. As a result, a tremendous amount of medical data of various types has accumulated in the databases of medical information systems. Because of their complexity, however, medical data are difficult to analyze. When data are collected for longer than one year, the record of a single patient frequently has more than 1000 fields3). In addition, medical data consist of various types of data such as images, numbers and video, and are thus more difficult to analyze than the simpler data collected in other domains. The characteristics of medical data include 1) huge volume, 2) heterogeneity, 3) temporal (historical) data, and 4) a relatively high frequency of null values.
There have been several studies on predicting the hospital charges of cancer patients using statistical analyses such as regression or ANOVA6-8). Since most of these studies were based on a small number of the many variables that affect hospital charges, their prediction accuracy was not satisfactory. These regression models can therefore hardly be used to predict hospital expenses. In this respect, data mining has emerged as an analytical method that can discover interesting knowledge from very large datasets in diverse domains using techniques such as pattern recognition, statistics, databases and machine learning9). Data mining can discover interesting knowledge from a large amount of data in the form of rules, patterns or trends, which may be difficult to obtain using traditional statistical methods10). In the medical field, data mining techniques such as association rules, artificial neural networks, decision trees and genetic algorithms have been used for various objectives, and several data mining studies concerning medical cost have been performed. Marshall et al.11) built a conditional phase-type distribution model to predict elderly patients' outcomes and duration of stay. They identified a strong relationship between Barthel grade, patient outcome and length of stay. Chae et al. examined the characteristics of knowledge discovery and data mining algorithms to demonstrate how they can be used to predict health outcomes and provide policy information for hypertension management using the Korea Medical Insurance Corporation database12). They built logistic regression, CHAID and C5.0 models from a dataset of hypertensive and non-hypertensive subjects and compared their performance with one another. They reported that the CHAID algorithm performed better than logistic regression in predicting hypertension, and that C5.0 had the lowest predictive power.
These studies pioneered the use of data mining models to predict medical costs. Even so, data mining models for the costs of cancer patients are still rarely found.
We therefore set out to build prediction models for the hospital charges of cancer patients, because very few studies have addressed the cost of cancer patients in Korea, where the medical insurance system is unique and administered by the government. The current study, which builds data mining models to predict the costs of cancer patients as a whole, may be the first in Korea. Although the authors have previously reported data mining models to predict the costs of cancer patients, that work was limited to colorectal cancer13). The current study nevertheless has some limitations that should be taken into account in later research. The dataset used to build the predictive models did not contain records of all the treatments and examinations a patient had undergone, because these had not been digitized at the time of data collection. We also could not make use of cancer staging information, which might have enabled us to build more exact predictive models. If more digitized records become available in the coming years, as we plan, we will be able to build more accurate predictive models of hospital expenses. Further, as more cases of cancer patients accumulate in medical information systems, other data mining techniques such as case-based reasoning may be applied to the medical data and may give similar or better results.
In this study, we included all kinds of cancer as inputs so that the hospital charges of cancer patients could be predicted independently of the type of cancer. We used artificial neural networks and decision trees to build the prediction models and compared their prediction accuracy because these two are the most commonly used data mining tools. Although our results showed that both models can predict cancer patients' hospital charges, the prediction accuracy of the ANN model was slightly higher than that of the CART model, and the ANN model showed better percent gain for predicting the total amount (Fig. 4). Generally, ANN models have shown higher sensitivity, specificity and prediction accuracy than other data mining techniques. In previous research, the authors compared the performance of ANN and CART models in predicting the hospital charges of colorectal cancer patients13), and the ANN model performed better than the CART model. The current results indicate that, with a complicated database, the ANN model predicts better than other models. Chien et al.14) applied three data mining techniques to improve the prediction of post-operative complications of gastric cancer: artificial neural networks (ANN), decision tree (DT) and logistic regression (LR). The results indicated that ANN was a better technique than DT and LR in predicting post-operative complications. Goss et al.15) compared a traditional decision support approach, binary logit regression (BLR), with a non-parametric neural network (NN) model to provide objective measures of the likelihood of intensive care unit (ICU) recovery. The study showed that the NN technique predicts mortality rates more correctly than BLR and offers a promising non-parametric alternative to parametric methodologies in hospital settings. For cancer patients, Fogel et al. were the first to apply neural networks and linear classifiers to a breast cancer patients' dataset16). They used cross-validation to estimate the error rate and relied on evolutionary computation to mitigate the black-box problem. However, the inability of neural network models to give doctors an adequate explanation of their results is a serious drawback, because doctors hardly accept the results of 'black box' classifiers unless their performance far exceeds that of other classifiers17). To overcome such limitations and to adjust weights in neural networks, genetic algorithms have been used. Bojarczuk et al.18) proposed a new constrained-syntax genetic programming (GP) algorithm for discovering classification rules in five medical data sets: chest pain, Ljubljana breast cancer, dermatology, Wisconsin breast cancer, and pediatric adrenocortical tumor. The proposed GP algorithm obtained good results with respect to predictive accuracy and rule comprehensibility, by comparison with C4.5 and Boolean inputs (BGP).
Meanwhile, the decision tree is one of the most frequently used techniques for classification and prediction tasks, not only in medical data mining but also in other areas. It enables one to predict prognoses and diagnoses using tree-structured models and to identify the features that play an important role in such predictions. Demšar et al.5) built models to predict whether a severe trauma patient would survive. They found that the features selected as split criteria in the decision tree corresponded to factors that other researchers had found to influence the survival of patients with severe trauma. However, their dataset was so small (68 cases) that their models could not be used as a prediction model for the survival of severe trauma patients. Breault et al.19) analyzed diabetes patients' data with CART and discovered that a patient's age, rather than the presence of other diseases, was associated with blood sugar control.
Although the ANN models predicted cost better, neural networks are 'black box' classifiers that cannot adequately explain their results. CART models have their own advantages and uses, so both kinds of model are needed to build sound hospital strategies and national policies.

Figures and Tables

Figure 1
Structure of Artificial Neural Network. The ANN model had two hidden layers. The ANN model for total amount included 56 input neurons and the model for amount paid by insurance included 53 neurons.
Figure 2
Decision Tree Model for Total Amount. The duration of admission was the most important variable to split.
Figure 3
Decision Tree Model for Amount Paid by Insurance. The duration of admission was the most important variable to split.
Figure 4
The y-axis Shows the Percentage of Gain. The x-axis shows the percentage of samples selected based on the data mining model, which is a fraction of total samples selected. ANN model shows better %gain than CART to predict total amount (upper). But ANN and CART showed similar pattern to predict amount paid by insurance (lower).
Table 1
Input Variables Used for Analysis

*Patient groups were classified by payment method. Group 1 has insurance, group 2 has a government warrant, group 3 has no insurance and group 4 has private insurance.

Disease group and treatment group denote the number of diseases and treatments which belong to a patient according to the Korean Standard Classification of Disease-4 and ICD-9CM.

Table 2
Linear Correlations of Each Data Mining Model

*FS: Feature Selection

Table 3
Relative Importance of Each Input Variable for the Amount

ICU: Intensive Care Unit

DG: Disease group according to the Korean Standard Classification of Disease (Appendix 1)

TG: Treatment group according to the ICD-9 (Appendix 2)


This research was supported by the Kyung Hee University Research Fund in 2008. (KHU-20080383)


Appendix 1

Disease Group according to KSD-4


Appendix 2

Classification of Treatment Group according to ICD-9

The CART models for the total amount and the amount paid by insurance were generated. Figures 2 and 3 show the CART decision trees for the total amount and the amount paid by insurance, respectively. The CART models showed the same linear correlation coefficients regardless of feature selection (Table 2). The CART model for the total amount had 6 layers, and impurity was measured using the Gini index. Unlike the ANN models, the CART models were not influenced by feature selection. In the resulting decision tree for the total amount, the root node was split on the value of duration of admission (Fig. 2).


1. National Health Insurance Statistics 2007. Korea HIRaAS. 2008. updated 2008; cited 2008. Available from:
2. Yoon SJ, Lee H, Shin Y, Kim YI, Kim CY, Chang H. Estimation of the burden of major cancers in Korea. J Korean Med Sci. 2002. 10. 17(5):604–610.
3. Hirano S, Tsumoto S. Multiscale analysis of long time-series medical databases. AMIA Annu Symp Proc. 2003. 289–293.
4. Ismael MB, Eisenstein EL, Hammond WE. A comparison of neural network models for the prediction of the cost of care for acute coronary syndrome patients. Proc AMIA Symp. 1998. 533–537.
5. Demsar J, Zupan B, Aoki N, Wall MJ, Granchi TH, Robert Beck J. Feature mining and predictive model construction from severe trauma patient's data. Int J Med Inform. 2001. 09. 63(1-2):41–50.
6. Brooks SE, Ahn J, Mullins CD, Baquet CR, D'Andrea A. Health care cost and utilization project analysis of comorbid illness and complications for patients undergoing hysterectomy for endometrial carcinoma. Cancer. 2001. 08. 15. 92(4):950–958.
7. Penberthy L, Retchin SM, McDonald MK, McClish DK, Desch CE, Riley GF, et al. Predictors of Medicare costs in elderly beneficiaries with breast, colorectal, lung, or prostate cancer. Health Care Manag Sci. 1999. 07. 2(3):149–160.
8. Tollestrup K, Frost FJ, Stidley CA, Bedrick E, McMillan G, Kunde T, et al. The excess costs of breast cancer health care in Hispanic and non-Hispanic female members of a managed care organization. Breast Cancer Res Treat. 2001. 03. 66(1):25–31.
9. Dayhoff JE, DeLeo JM. Artificial neural networks: opening the black box. Cancer. 2001. 04. 91(8):Suppl. 1615–1635.
10. Goss E, Vozikis G. Improving Health Care Organizational Management Through Neural Network Learning. Health Care Manag Sci. 2002. 5(3):221–227.
11. Marshall AH, McClean SI, Millard PH. Addressing bed costs for the elderly: a new methodology for modelling patient outcomes and length of stay. Health Care Manag Sci. 2004. 02. 7(1):27–33.
12. Chae YM, Ho SH, Cho KW, Lee DH, Ji SH. Data mining approach to policy analysis in a health insurance domain. Int J Med Inform. 2001. 07. 62(2-3):103–111.
13. Lee SM, Kang JO, Suh YM. Comparison of hospital charge prediction models for colorectal cancer patients: neural network vs. decision tree models. J Korean Med Sci. 2004. 10. 19(5):677–681.
14. Chien CW, Lee YC, Ma T, Lee TS, Lin YC, Wang W, et al. The application of artificial neural networks and decision tree model in predicting post-operative complication for gastric cancer patients. Hepatogastroenterology. 2008. May-Jun. 55(84):1140–1145.
15. Goss EP, Vozikis GS. Improving health care organizational management through neural network learning. Health Care Manag Sci. 2002. 08. 5(3):221–227.
16. Fogel DB, Wasson EC 3rd, Boughton EM, Porto VW. Evolving artificial neural networks for screening features from mammograms. Artif Intell Med. 1998. 11. 14(3):317–326.
17. Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artif Intell Med. 2001. 08. 23(1):89–109.
18. Bojarczuk CC, Lopes HS, Freitas AA, Michalkiewicz EL. A constrained-syntax genetic programming system for discovering classification rules: application to medical data sets. Artif Intell Med. 2004. 01. 30(1):27–48.
19. Breault JL, Goodall CR, Fos PJ. Data mining a diabetic data warehouse. Artif Intell Med. 2002. Sep-Oct. 26(1-2):37–54.