Abstract
Many hospitals have recently adopted clinical data warehouses (CDWs) alongside electronic medical records. These hospital information systems generate very large amounts of clinical data that may be useful for further analysis. However, the electronic clinical data in a CDW are usually byproducts of clinical practice rather than products of research. They therefore contain inconsistent and sometimes erroneous information, often without the specific context of the clinical situation in which they were recorded. Data miners come from varied academic backgrounds, such as electronics, informatics, statistics, biomedicine, and public health. Investigators who do not understand the complex circumstances surrounding clinical data may have difficulty assessing the information they encounter when mining data in clinical fields. Here, we introduce basic principles of data mining in clinical fields, including legal and ethical considerations as well as technical concerns.
Acknowledgement
The authors would like to acknowledge Professor Kyi Young Lee, Dept. of Biomedical Informatics, School of Medicine, Ajou University for his detailed review of the manuscript.