Identifying Disease of Interest With Deep Learning Using Diagnosis Code

Yoon-Sik Cho; Eunsun Kim; Patrick L. Stafford; Min-hwan Oh; Younghoon Kwon

doi:10.3346/jkms.2023.38.e77


Original Article

Medical Informatics

Journal of Korean Medical Science 2023; 38(11): e77.

Published online: 3 March 2023

DOI: https://doi.org/10.3346/jkms.2023.38.e77


Yoon-Sik Cho¹, Eunsun Kim², Patrick L. Stafford³, Min-hwan Oh⁴, Younghoon Kwon⁵

¹Department of Artificial Intelligence, Chung-Ang University, Seoul, Korea.

²Department of Data Science, Sejong University, Seoul, Korea.

³Department of Medicine, University of Virginia, Charlottesville, VA, USA.

⁴Graduate School of Data Science, Seoul National University, Seoul, Korea.

⁵Department of Medicine, University of Washington, Seattle, WA, USA.

Address for Correspondence: Yoon-Sik Cho, PhD. Department of Artificial Intelligence, School of Computer Science and Engineering, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Korea. yoonsik@cau.ac.kr

Received 11 August 2022 Accepted 18 December 2022

The Korean Academy of Medical Sciences

https://creativecommons.org/licenses/by-nc/4.0/

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

An autoencoder (AE) is a deep learning technique that uses an artificial neural network to reconstruct its input data in the output layer. We constructed a novel supervised AE model and tested its performance in predicting the co-existence of a disease of interest using only diagnostic codes.

Methods

Diagnostic codes of one million randomly sampled patients listed in the Korean National Health Information Database in 2019 were used to train, validate, and test the prediction model. The first model used the AE solely as a feature engineering tool for the input of a classifier: a supervised Multi-Layer Perceptron (sMLP) was added to train a classifier to predict a binary label with the latent representation as input (AE + sMLP). The second model simultaneously updated the parameters in the AE and the connected MLP classifier during the learning process (End-to-End Supervised AE [EEsAE]). We tested the performance of these two models against baseline models, eXtreme Gradient Boosting (XGB) and naïve Bayes, in the prediction of a co-existing gastric cancer diagnosis.

Results

The proposed EEsAE model yielded the highest F1-score and highest area under the curve (0.86). The EEsAE and AE + sMLP gave the highest recalls. XGB yielded the highest precision. Ablation study revealed that iron deficiency anemia, gastroesophageal reflux disease, essential hypertension, gastric ulcers, benign prostate hyperplasia, and shoulder lesion were the top 6 most influential diagnoses on performance.

Conclusion

A novel EEsAE model showed promising performance in the prediction of a disease of interest.


Keywords: Deep Learning, Gastric Cancer, Machine Learning, Prediction, Diagnosis Code

INTRODUCTION

Deep learning (DL) is a type of machine learning-based analysis that employs a layered algorithmic architecture. Recent advances in DL have yielded marked success across various domains including, but not limited to, speech recognition, natural language processing, computer vision, and recommendation systems. In the healthcare domain, DL techniques are increasingly applied to medical image processing and analysis, natural language processing of large-scale medical text data, precision medicine, clinical decision support, and predictive analytics.1 DL-based fast computation has dramatically reduced the time for genome analysis facilitating faster drug discovery and development.2 3 The utilization of machine learning tools to predict disease has been increasingly described.4

Autoencoder (AE) is one of the newer DL techniques that uses an artificial neural network to reconstruct its input data as an output while learning the latent encoding of unlabeled input data. AE’s capability in dimensionality reduction, feature extraction, and reconstruction of the data makes it a particularly effective DL technique to learn complex latent representation.5 Leveraging this strength, one application of AE is DL-based recommender systems (RS) where RS can utilize latent features of recommended items (or latent user-item pair representation) which are learned by an AE. RS has gained popularity in commercial domains such as personalized movie recommendations in streaming services.6 7

In healthcare, by analyzing patient-level data, RS can be applied to inferring diagnosis or recommending treatments.8 9 RS can be extended to predictive modeling on a fixed problem such as predicting the relationship between the users and the given item. When built from a large patient dataset, DL-based RS can potentially allow healthcare providers to predict how likely a disease or diagnosis of interest would coexist or occur in the future in a given patient. Such information may prompt healthcare providers and patients to be more vigilant to screen for a disease of interest or to take preventative measures.

However, a challenge stems from the fact that patient-level data are not uniformly and readily available, as they depend heavily on each patient’s utilization of healthcare. For example, if a person rarely utilizes healthcare services and thereby provides insufficient input data, DL or other machine learning techniques, even if well trained and validated, would not be suitable for predicting the disease or condition of interest. Further, even if data are available, the complex logistics of accessing and pre-processing raw healthcare data often make DL applications impractical. Given these challenges, our intention was to test the feasibility of predicting the outcome of interest (disease of interest) using data that are more readily available.

Specifically, we sought to demonstrate how the AE-based RS concept can be applied to predicting a disease of interest. In this study, we tested the hypothesis that gastric cancer (GC) can be predicted by AE-based RS solely based on other documented comorbid conditions available in the National Health Information Database of the Korean National Health Insurance Service (NHIS). The motivation for choosing GC as the disease of interest was to investigate a disease that is common and relevant to the community from which the data originate. GC is the second most prevalent cancer and the third leading cause of cancer-related death in South Korea.10 South Korea has the highest age-standardized rate of GC per 100,000 in the world. Approximately 244,000 new cases of GC were reported in South Korea in 2018.11 We describe the construction and the performance of a novel AE-based DL model in the prediction of the GC diagnostic code using other diagnostic codes provided by the NHIS.

METHODS

Dataset

We used a medical diagnosis history dataset provided by the Korean NHIS, the government agency that provides universal healthcare to every South Korean citizen. All medical diagnoses, claims, and bills are submitted to the Korean NHIS, where the data are managed and governed.12 This open data spans 18 years (from 2002 to 2019), with sample sets differing each year. More recently, the Korean NHIS has begun releasing the dataset to the public yearly.13 Each release is produced by randomly sampling one million patients and extracting the medical diagnosis history of each patient for the given year. For this study, the latest data, from fiscal year 2019, were used. The medical diagnosis history dataset includes demographic profiles and diagnoses based on the International Statistical Classification of Diseases and Related Health Problems (ICD) codes. ICD includes codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or diseases. ICD codes used in this study were from the Korean Standard Classification of Disease (KCD), the Korean translation of the ICD-10.14 Table 1 summarizes the columns of the dataset.

Table 1

Medical diagnosis history column from the Korean National Health Information Database

Column name	Description
IDV_ID	Unique patient ID
KEY_SEQ	Unique ID for each diagnosis
SEX	Gender of the subject
AGE_GROUP	Age-group (5 year-window) of the subject
DSBJT_CD	Medical department information
...	...
MAIN_SICK	Disease classification code (main)
SUB_SICK	Disease classification code (other than main)
...	...
RECU_FR_DT	Date of patient’s visit

We used one million randomly sampled patients from the fiscal year 2019 dataset. The ICD codes collected in 2019 are based on the KCD7 (KCD 7th edition). KCD7 follows the latest ICD-10 from 2014 and has been used since 2016. KCD7 has 22 chapters, which are divided into 267 categories and 2,093 sub-categories. The KCD, analogous to ICD, includes primary diagnosis (“MAIN_SICK”) and secondary diagnosis (“SUB_SICK”). Both primary and secondary ICD codes were considered in the model. The rationale for using only ICD codes in our model was to test whether the model would be sufficiently robust to make predictions using only readily available objective data that are uniformly documented and simple to use. Since the ICD code does not specify the timing of the ‘onset’ of the diagnosis, the proposed model does not consider the temporal sequence of the predictors (other diagnoses from year 2019 ICD codes) and outcome (GC diagnosis from year 2019 ICD code). Therefore, the “prediction” of the model described herein refers to the “identification” of the co-existing GC diagnosis using other ICD codes listed in the database (i.e., without considering the temporal sequence of the predictors and the outcome).

DL model

AE

AE is one of the most widely used unsupervised deep neural network models that learns the compression of the input data. It consists of an encoder and a decoder as shown in Fig. 1. The encoder learns how to compress the input data into an encoded representation, while the decoder learns how to reconstruct the original data from the encoded representation. In the simplest form of AE with one hidden layer, the encoder maps the input x to the bottleneck z as shown below:

$$z = f_\theta(x) = s_f(W_e x + b_e) \in \mathbb{R}^{d_z}$$

where θ represents the parameters of the encoder (W_e and b_e).

Fig. 1

Autoencoder architecture.

Input x passes through the encoder, where it is multiplied by the weight matrix W_e and offset by the bias b_e, followed by the activation function s_f(·). The output from the encoder, z, has lower dimensionality than the input, x, and in this sense can be regarded as a compressed representation of x, or a bottleneck. As shown in Fig. 1, the encoder is paired with the decoder, which tries to reconstruct the original input data from the compressed data z. The function for the decoder is provided below:

$$y = g_\Phi(z) = s_g(W_d z + b_d)$$

where W_d and b_d represent the parameters of the decoder, collectively denoted Φ.

The decoder maps the latent representation (z) to y. An AE is optimized by minimizing the error between the input data x and the reconstructed data y. In the DL framework, this can be achieved with backpropagation by minimizing the error or loss given below:

$$L(x, y) = \lVert x - y \rVert^2$$
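The encoder, decoder, and reconstruction loss above can be traced with a small pure-Python sketch. The sigmoid activation for both s_f and s_g, the toy dimensions (V = 8, d_z = 3), and the random initial weights are assumptions made for illustration only; the helper names are hypothetical and are not taken from the authors' implementation.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, x, b):
    """Return W x + b for a row-major weight matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def encode(x, W_e, b_e):
    # z = s_f(W_e x + b_e); sigmoid stands in for s_f (an assumption).
    return [sigmoid(v) for v in affine(W_e, x, b_e)]

def decode(z, W_d, b_d):
    # y = s_g(W_d z + b_d); sigmoid stands in for s_g (an assumption).
    return [sigmoid(v) for v in affine(W_d, z, b_d)]

def reconstruction_loss(x, y):
    # L(x, y) = ||x - y||^2
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
V, d_z = 8, 3                      # toy input and bottleneck sizes
W_e, b_e = random_matrix(d_z, V, rng), [0.0] * d_z
W_d, b_d = random_matrix(V, d_z, rng), [0.0] * V

x = [1, 0, 0, 1, 0, 1, 0, 0]       # a toy multi-hot input
z = encode(x, W_e, b_e)            # compressed representation
y = decode(z, W_d, b_d)            # reconstruction of x
loss = reconstruction_loss(x, y)   # the quantity training would minimize
```

Training would adjust W_e, b_e, W_d, and b_d by backpropagation to reduce `loss`; that step is omitted here.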

A single-layer AE with a linear activation function is nearly equivalent to Principal Component Analysis (PCA), another popular dimensionality reduction method. The non-linear activation function in AE adds flexibility, allowing higher performance than PCA. Moreover, AE can be extended in several ways, by adding more layers or by jointly optimizing it with other objective functions.

Extension of AE

Deep AE

To achieve additional performance gain, we added more layers both to the encoder and decoder. More hidden layers generally yield better compression with smaller error than the vanilla AE. To avoid over-fitting caused by the deep dense layers, various techniques have been proposed including regularizations and drop-out layers. Fig. 2A shows the structure of the deep AE we used throughout our experiments. The output of the encoder (or the input of the decoder) with dimension 20 is the bottleneck layer in Fig. 1.

Fig. 2

Structure and flow of autoencoder. (A) Deep autoencoder. (B) End-to-End Supervised Autoencoder. The “?” represents the batch size.

Supervised AE

Another way of extending the AE is to use it to learn feature representations, which can then serve as input to a supervised learning model (Fig. 2B).15 16 A Multi-Layer Perceptron (MLP) can be used for supervised learning with a neural network and has shown promising results in many areas.17 The AE and the MLP are each optimized to minimize the loss in their own network. The AE typically learns the latent representations in an unsupervised manner, without any knowledge of the associated labels. In our setting, the MLP was used to train a classifier to predict a binary label with latent representations as input (AE + supervised MLP [sMLP]). Another novel way of extending the AE to supervised learning is to connect the AE and the classifier and minimize a custom loss function, which should include both the loss function from the classifier and the loss function from the AE. The classifier and the AE are then updated simultaneously in an end-to-end fashion (End-to-End Supervised AE [EEsAE]). Both supervised AE approaches were used and the results were compared.

Model architecture

Our proposed model is based on a supervised AE framework as shown in Fig. 3. Each patient’s ICD codes are fed into the input of the AE and compared with the output of the decoder. The occurrence of each ICD code is represented as 0 or 1, so that each patient u ∈ {1,...,U} has a binary vector (or multi-hot vector) of size V, where U and V are the total numbers of patients and ICD codes, respectively. As depicted in Fig. 3, the supervised AE consists of two major components: the AE and the classifier connected to the bottleneck layer. The AE learns the latent representation from the input data, where the latent space dimension is smaller than the input data dimension, V. Our main objective was to test the performance of the model in predicting GC (KCD C16 series). To perform (true vs. false) prediction on a given disease, we designed a classifier using an MLP with one output neuron for prediction. We designed our classifiers by extending the AE with an MLP in two ways. In our first approach, we utilized the AE as a tool for constructing features for the MLP input. Instead of feeding the raw multi-hot vectors to the MLP, we performed dimensionality reduction using the AE as a feature engineering tool (AE + sMLP). Our second approach combined the AE and the MLP by updating the model end-to-end while simultaneously minimizing the reconstruction error and the prediction error (EEsAE). In this second approach, we updated our model by minimizing a custom loss function, which is a linear sum of the two loss functions from the AE and the MLP classifier. We also optimized the model with an extra parameter that controls the balance between the two losses; the model worked best when the weights of the AE loss and the classifier loss were set to 0.8 and 0.2, respectively.
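The custom loss described above, a weighted linear sum of the AE loss and the classifier loss, might be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, squared error is assumed for the reconstruction term (following the AE loss defined earlier), and plain binary cross-entropy is assumed for the classifier term.

```python
import math

def binary_cross_entropy(label, p, eps=1e-7):
    """Classifier loss for one patient; p is the predicted probability of the disease."""
    p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def squared_error(x, x_hat):
    """Reconstruction loss ||x - x_hat||^2 for one multi-hot vector."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def eesae_loss(x, x_hat, label, p_label, ae_weight=0.8):
    """Linear sum of both losses; 0.8 / 0.2 is the balance reported in the text."""
    recon = squared_error(x, x_hat)
    clf = binary_cross_entropy(label, p_label)
    return ae_weight * recon + (1.0 - ae_weight) * clf

# Toy example: a 3-code patient vector, its reconstruction, and a predicted probability.
total = eesae_loss([1, 0, 1], [0.9, 0.2, 0.7], label=1, p_label=0.6)
```

In the AE + sMLP variant the two terms would instead be minimized separately, first the reconstruction loss and then the classifier loss on the frozen latent features.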

Fig. 3

Supervised autoencoder with two loss functions.

Ethics statement

The present study was Institutional Review Board exempt by fitting both of the exempt criteria of 45 CFR 46.101(b)(4). These exempt criteria are: 1) the research involves the collection or study of existing data, documents, records, pathological specimens or diagnostic specimens; and 2) the data sources are publicly available or the data is recorded by the investigator in an anonymous manner such that subjects cannot be identified directly or through identifiers linked to the subject.

RESULTS

Model application to dataset

For each patient, all diagnoses from MAIN_SICK and SUB_SICK were aggregated into a list to build a binary patient-diagnosis matrix, analogous to the user-item matrix in RS. We removed duplicated disease codes for each patient so that each entry is binary. For example, if a patient had a GC (C16) ICD code in more than one encounter during 2019, the model counted it once as GC (yes). For data pre-processing, we included patients with at least 6 different ICD codes, with each code found in at least 50 different patients. After this procedure, the dataset comprised 712,050 patients and 910 distinct ICD codes. We then constructed a binary user-item (patient-code in our setting) matrix, in which each row represents a patient and each column a disease code. The matrix entry M(i, j) ∈ {0, 1} encodes the individual diagnosis record (true or false) of patient i for disease code j. In the machine learning and artificial intelligence literature, lower-dimensional representations of the user-item matrix have been extensively studied for RS, with the aim of better predicting unobserved user-item interactions. We extended this framework to predict a patient’s unobserved ICD codes in a manner similar to RS. We built an ICD code-based predictive model using the AE, analogous to item-based RS, and implemented it using the open-source TensorFlow 2. The dataset used for our experiments was randomly split into training, validation, and test sets with a ratio of 0.8:0.1:0.1. In other words, 80% of the patients were used to learn the latent representation of patients and ICD codes. We evaluated two model variants. The first model (AE + sMLP) is a supervised learning model whose input is learned by the unsupervised learning model; it used the AE as a feature engineering tool for the input of a classifier. The classifier predicts whether or not the patient would be diagnosed with C16. After data selection, C16 ICD codes were found in 0.44% of patients.
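The pre-processing steps described above (de-duplicating codes per patient, filtering by the 6-code and 50-patient thresholds, building the binary matrix, and splitting 0.8:0.1:0.1) can be sketched as below. The function names and the toy records are hypothetical, and the order of the two filters (rare codes first, then patients, in a single pass) is an assumption, since the text does not specify it.

```python
import random
from collections import Counter

def build_patient_code_matrix(records, min_codes=6, min_patients=50):
    """Build a binary patient-by-code matrix from (patient_id, icd_code) pairs."""
    codes_by_patient = {}
    for patient_id, code in records:
        # A set keeps each code once per patient, however many encounters list it.
        codes_by_patient.setdefault(patient_id, set()).add(code)

    # Assumed order: drop rare codes first, then patients with too few codes.
    code_counts = Counter(c for codes in codes_by_patient.values() for c in codes)
    kept_codes = sorted(c for c, n in code_counts.items() if n >= min_patients)
    kept = set(kept_codes)

    patients, matrix = [], []
    for patient_id in sorted(codes_by_patient):
        codes = codes_by_patient[patient_id] & kept
        if len(codes) >= min_codes:
            patients.append(patient_id)
            matrix.append([1 if c in codes else 0 for c in kept_codes])
    return patients, kept_codes, matrix

def split_patients(patient_ids, seed=0):
    """Shuffle and split into 80% training, 10% validation, 10% test."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10
    n_val = len(ids) // 10
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Toy records with lowered thresholds so the effect of both filters is visible.
records = [("p1", "A"), ("p1", "A"), ("p1", "B"), ("p2", "A"),
           ("p2", "B"), ("p3", "A"), ("p3", "C")]
patients, codes, M = build_patient_code_matrix(records, min_codes=2, min_patients=2)
```

In this toy run, code "C" is dropped because only one patient has it, and patient "p3" is dropped because only one of its codes survives the code filter.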
Given the extreme class imbalance between those with and without C16 (i.e., 0.44% vs. 99.56%), we optimized our MLP classifier using weighted binary cross-entropy. The second model, on the other hand, simultaneously updated the parameters in the AE and the connected MLP classifier during the learning process (EEsAE). This model was tested under settings similar to those of the first model to compare predictive performance. Using the 10% validation data, we tried various dimensions of the hidden layers, where each layer is densely connected. The Scaled Exponential Linear Unit, an activation function that induces self-normalizing properties, was found to be most effective. The output layer consists of a single neuron activated by a sigmoid function. The output value of this neuron can be interpreted as the probability of the occurrence of GC (C16). To avoid over-fitting, we added a dropout layer with a dropout rate of 0.5 and L1 regularization. In the remaining 10% test data, we hid the C16 information during testing and compared our predictions with the ground truth.
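To illustrate how weighting the binary cross-entropy counteracts the 0.44% vs. 99.56% imbalance, the following sketch uses inverse-frequency ("balanced") class weights. The text does not report which weights were actually used, so this weighting scheme and the function names are assumptions.

```python
import math

def balanced_class_weights(positive_rate):
    """Inverse-frequency weights so both classes contribute equally in expectation."""
    return {1: 0.5 / positive_rate, 0: 0.5 / (1.0 - positive_rate)}

def weighted_bce(labels, probs, weights, eps=1e-7):
    """Mean binary cross-entropy with a per-class weight on each sample."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * loss
    return total / len(labels)

weights = balanced_class_weights(0.0044)          # 0.44% of patients have C16

# Equally confident mistakes, one on a positive and one on a negative patient.
missed_positive = weighted_bce([1], [0.1], weights)
missed_negative = weighted_bce([0], [0.9], weights)
```

With these weights a missed C16 patient is penalized far more heavily than a false alarm of the same confidence, which pushes the classifier away from always predicting the majority class.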

We additionally included two baseline models, eXtreme Gradient Boosting (XGB) and naïve Bayes, for comparison with our proposed model. XGB is a scalable tree boosting algorithm that has been used by many winning teams in machine learning competitions. The naïve Bayes algorithm is a traditional, interpretable machine learning algorithm. In naïve Bayes, one can compute the probability of occurrence of a disease of interest given other diagnosis codes under an independence assumption. Table 2 summarizes the 4 included models. We evaluated the performance of the four models using various metrics. When testing, we sub-sampled the patients without C16 so that the numbers of positive and negative samples were equal. With the sampled test data, we computed precision (positive predictive value), recall (sensitivity), and F1-score by comparing our predictions with the ground truth. The F1-score is the harmonic mean of precision and recall and is a measure of a model's accuracy on a dataset. We also computed the area under the receiver operating characteristic (ROC) curve (AUC); the ROC curve was generated by plotting the true-positive rate against the false-positive rate at various classifier thresholds. Finally, to better understand the effect of each disease code in the model, an ablation study was performed. Each disease code was removed one at a time from the input neurons of the AE. Based on the degree of impact on the performance measure, we derived the top 6 most influential ICD codes.
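The evaluation protocol above (balanced sub-sampling of negatives, then precision, recall, F1-score, and ROC AUC) can be sketched in plain Python. The function names, the 0.5 decision threshold, and the toy labels and scores are assumptions for illustration; the AUC is computed through its rank interpretation rather than by tracing the curve.

```python
import random

def balanced_subsample(labels, seed=0):
    """Indices of all positives plus an equal number of randomly chosen negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return sorted(pos + rng.sample(neg, len(pos)))   # assumes negatives outnumber positives

def precision_recall_f1(labels, scores, threshold=0.5):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
precision, recall, f1 = precision_recall_f1(labels, scores)
auc = roc_auc(labels, scores)
kept = balanced_subsample([1, 0, 0, 0, 1, 0])
```

Note that precision measured on a balanced sub-sample is not the precision one would observe at the true 0.44% prevalence, whereas recall and AUC do not depend on the class ratio.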

Table 2

Summary of 4 included models

Models	Description
AE + sMLP	Feature engineering with Autoencoder + supervised Multi-Layer Perceptron
EEsAE	Supervised Autoencoder with End-to-End Learning
XGB	Regularizing gradient boosting framework
Naïve Bayes	Probabilistic classifier based on Bayes’ theorem
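As an illustration of the naïve Bayes baseline in Table 2, the sketch below computes the probability of a disease of interest from binary code indicators under the independence assumption. The Bernoulli variant with Laplace smoothing and the function names are assumptions; the text does not state which naïve Bayes variant was used.

```python
import math

def fit_bernoulli_nb(rows, labels, alpha=1.0):
    """Estimate P(class) and P(code present | class) with Laplace smoothing."""
    model = {}
    n_features = len(rows[0])
    for c in (0, 1):                     # assumes both classes occur in the data
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        prior = len(class_rows) / len(rows)
        probs = [(sum(r[j] for r in class_rows) + alpha) / (len(class_rows) + 2 * alpha)
                 for j in range(n_features)]
        model[c] = (prior, probs)
    return model

def predict_proba(model, row):
    """Posterior probability of class 1 given a binary code vector."""
    log_post = {}
    for c, (prior, probs) in model.items():
        ll = math.log(prior)
        for x, p in zip(row, probs):
            # Independence assumption: per-code likelihoods simply multiply.
            ll += math.log(p if x else 1.0 - p)
        log_post[c] = ll
    m = max(log_post.values())           # subtract the max for numerical stability
    e = {c: math.exp(v - m) for c, v in log_post.items()}
    return e[1] / (e[0] + e[1])

rows = [[1, 1], [1, 0], [0, 1], [0, 0]]  # toy patients, two codes each
labels = [1, 1, 0, 0]                    # toy disease-of-interest labels
model = fit_bernoulli_nb(rows, labels)
p_with_first_code = predict_proba(model, [1, 0])
p_without_first_code = predict_proba(model, [0, 1])
```

The per-code conditional probabilities stored in `model` are what make this baseline interpretable: each one can be read directly as how much more common a code is among patients with the disease.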

Model performance

The performance of each model in the testing dataset (N = 1,000,000) is shown in Table 3. The proposed EEsAE model yielded the highest F1-score. AE + sMLP and EEsAE were the two models with the highest recalls. XGB yielded the highest precision. The AUC of the EEsAE model was 0.862, the highest among the models, followed by XGB (0.840). Fig. 4 shows the ROC curves and AUC of each model, demonstrating that the proposed EEsAE outperforms the baseline models in every region. Ablation study results identified the top 6 codes affecting performance the most (Table 4): iron deficiency anemia (IDA), gastroesophageal reflux disease (GERD), essential hypertension, gastric ulcers, benign prostate hyperplasia, and shoulder lesion, in order of impact on the model performance.

Table 3

Performance of tested models

Models	ROC-AUC	Recall	Precision	F1-score
EEsAE	0.862	0.817	0.739	0.776
AE + sMLP	0.787	0.790	0.705	0.775
XGB	0.840	0.602	0.812	0.692
Naïve Bayes	0.734	0.714	0.744	0.729

ROC = receiver operating characteristic, AUC = area under the curve, EEsAE = End-to-End Supervised Autoencoder, AE = autoencoder, sMLP = supervised Multi-Layer Perceptron, XGB = eXtreme Gradient Boosting.

The bold number in each performance metric indicates the best one among all the models.
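As a worked check of the F1-score definition (the harmonic mean of precision and recall), the EEsAE row of Table 3 can be recomputed from its reported precision and recall:

```latex
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
    = \frac{2 \times 0.739 \times 0.817}{0.739 + 0.817}
    = \frac{1.2075}{1.556}
    \approx 0.776
```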

Fig. 4

ROC-AUC comparison.

ROC = receiver operating characteristic, AUC = area under the curve, XGB = eXtreme Gradient Boosting.

Table 4

ICD codes that most influence gastric cancer prediction

ICD code	Disease	Importance
D50	Iron deficiency anemia	1st
K21	Gastroesophageal reflux	2nd
I10	Essential hypertension	3rd
K25	Gastric ulcer	4th
N40	Hyperplasia of prostate	5th
M75	Shoulder lesions	6th

ICD = International Statistical Classification of Diseases and Related Health Problems.
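The ablation procedure behind Table 4 can be sketched as a loop that removes one code at a time and records the drop in a performance measure. Zeroing the input column as the way of "removing" a code, accuracy as the measure, the fixed linear scorer, and all names below are assumptions made so that the example is self-contained; they do not reproduce the authors' model or results.

```python
def ablation_ranking(score_fn, rows, labels, code_names):
    """Rank codes by how much the score drops when each one is removed."""
    baseline = score_fn(rows, labels)
    drops = []
    for j, name in enumerate(code_names):
        # "Remove" code j by zeroing its column in every patient vector.
        ablated = [row[:j] + [0] + row[j + 1:] for row in rows]
        drops.append((baseline - score_fn(ablated, labels), name))
    return sorted(drops, key=lambda item: item[0], reverse=True)

# A fixed toy scorer standing in for a trained model (hypothetical weights).
TOY_WEIGHTS = [2.0, 0.5, 0.0]

def toy_accuracy(rows, labels):
    preds = [1 if sum(w * x for w, x in zip(TOY_WEIGHTS, row)) >= 1.0 else 0
             for row in rows]
    return sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)

rows = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
labels = [1, 1, 0, 0]
ranking = ablation_ranking(toy_accuracy, rows, labels, ["code_a", "code_b", "code_c"])
```

Here removing `code_a` halves the toy accuracy while the other two codes have no effect, so `code_a` is ranked first; the same ordering logic applied to a trained model would yield a table such as Table 4.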

DISCUSSION

We demonstrated the construction and the performance of a novel supervised AE model in combination with MLP classifier in the prediction of disease of interest. The proposed model resulted in superior performances for the prediction of GC compared to other baseline models. Using documented ICD codes as the only input, the model yielded a recall of up to 0.82 in its identification of GC diagnosis. To the best of our knowledge, this is the first application of AE in the prediction of medical diagnosis based on diagnosis code inputs only.

Diagnostic codes are widely used in healthcare systems for the purposes of reimbursement, quality evaluation, public health reporting, and outcomes research. A set of ICD codes typically represents a clinician’s culminating impression that results from various combinations of patient interviews, examinations, and objective tests, such as laboratory or radiological studies. Despite the rich information that each ICD code may contain, the complex way in which each ICD code is linked to other ICD codes calls for a DL approach.

We posited that our proposed model would learn this complex construct and extract salient information sufficiently robust to predict the disease of interest. In doing so, we applied the RS concept of predicting relations between items and users to the problem of predicting the occurrence of a disease (item) among patients (users). Herein, the medical history (neighboring ICD codes) represents the input for the RS and the disease of interest represents the fixed problem. The two AE models demonstrated herein are in line with a classic DL technique in which a model is fed raw data and develops its own representations across multiple layers. Traditionally, an AE takes the form of an unsupervised deep neural network model that learns the compression of the input data. We modified this by extending the AE to a supervised learning setting and combining it with the sMLP. The model thereby takes advantage of each component as it learns to optimize the network. This enables better capture of the complex latent structure of the raw data by lower-dimensional representations. We further refined the model by updating the classifier and the AE simultaneously in an end-to-end fashion, aiming to minimize a custom loss function combining the losses of both the AE and the MLP.

The superior performance of our proposed model in recall over baseline models points to the effectiveness of lower-dimensional representations. The AE mapped similar ICD codes close to each other in the low-dimensional latent space and let the classifier discover patterns for predicting the target disease of interest, GC in this study. We also found that when the AE and MLP are updated simultaneously, as in our proposed EEsAE, the latent representation was learned even better and led to superior performance. The F1-score of 0.78, recall of 0.82, and AUC of 0.86 illustrate the potential of the model for similar predictions. Recall (sensitivity) is a more relevant metric in the prediction of a disease of interest, particularly where screening is important. A recall of 0.82 implies that about 4 out of 5 people with GC would be detected by the model using coexisting ICD codes alone. One can speculate that such a predictive model can help clinicians maintain a high index of suspicion for a certain disease of interest, in this case, GC. For example, the model can prompt a clinician to perform more proactive screening for GC. Alternatively, the model may lead to more proactive prevention efforts.

Currently, screening for GC is controversial, and recommendations for screening depend on the incidence of GC. In a country where the incidence of GC is very high, such as Korea, universal population-based screening is being implemented (e.g., upper endoscopy every 2 years) for individuals aged 40 to 75 years. However, in many other countries where the incidence of GC is low, selective screening of high-risk subgroups is advocated. Such high-risk groups include those with gastric adenomas, pernicious anemia, or familial adenomatous polyposis, some of which require endoscopy. In that sense, the findings of our study provide a glimpse into how DL, specifically an AE model, might help identify a potentially high-risk group of patients for GC, who may benefit from more proactive screening with endoscopy, especially where the incidence of GC is low.

Prior works have indeed used AE for medical diagnosis.18 19 However, the majority of these prior studies applied AE to the analysis of medical imaging rather than ICD codes. To our knowledge, our model, particularly the EEsAE model, and its application to diagnosis using ICD codes alone is the first of its kind. In the testing phase of the evaluation of our model, all occurrences and non-occurrences of the GC code were hidden. The disease codes were aggregated by patient ID without information on the temporal sequence of occurrences. Therefore, such an inference is still considered prediction. The prediction model is still meaningful in two ways. First, our goal was to better understand the similarity across different diagnostic codes, not the causal relationship in reference to the GC diagnostic code. Given the major clinical implications of a diagnosis like GC, and the challenge of early diagnosis, an AE model that “recommends” GC in this setting can be valuable for clinicians. Second, although the temporal sequence cannot be considered due to the nature of the data, this actually reflects real-world clinical practice, where diagnosis is often delayed or even missed. Indeed, the timing of the diagnosis (i.e., first appearance of diagnostic codes) does not equate to the timing of the true onset of the condition.

Despite the potential implications, the findings of the study should be interpreted in the context of how the data were pre-processed. Since model construction included only people with at least 6 ICD codes, with each code found in at least 50 people in the dataset, model performance is expected to be poor in people with few ICD codes or with relatively rare ICD codes. Moreover, we validated our model in a subcohort from the same year, but have not tested its performance on data from a different year due to the limited availability of the data. Future studies should examine whether a similar level of prediction can be achieved when the model is applied to datasets from other years and to other diseases of interest, both in data from the same year and in data from different years.

Another limitation of the dataset used in this study is inherent to the limitation of the administrative data. The validity of ICD code varies across different conditions and healthcare systems and depends on how data is collected.20 ICD based diagnosis information is not comprehensive. Moreover, an ICD code does not necessarily reveal the onset of the condition. Code assignment is a demanding process with many potential sources of errors.21 Moreover, ICD code assignment practice is variable among clinicians.

It is equally important to understand the inherent challenge in the interpretation of the results beyond the prediction. While some of the top influencing ICD codes identified in the ablation study appear to have associations with GC, the exact relationship is difficult to interpret. For example, the ablation study revealed IDA as the top influencer in the model, meaning that the largest performance reduction was noted when this particular ICD code was excluded. While it seems plausible that there might be a link between IDA and GC, possibly via loss of blood in the gastrointestinal tract in the setting of gastric pathology, such a link has rarely been documented. One recent study revealed a high prevalence of IDA, up to 40%, at the time of GC diagnosis.22 Among those who received chemotherapy for GC, about 20% developed IDA during their treatment course.23 Prior studies in Korea have simply reported the prevalence of IDA in patients with GC before and after gastrectomy.24 Despite the potential link, specifically how this particular ICD code stood out as the top influencing variable, outperforming other variables, is unclear. GERD is most closely related to esophageal adenocarcinoma and to a lesser extent to GC arising from the cardia of the stomach.25 Since only the main category of GC, C16, was used without consideration of subcategory information, such as the location of the cancer, it is impossible to further speculate how GERD is linked to GC. Unlike other predictors, an association between gastric ulcer and GC can be more readily inferred through common risk factors (i.e., mainly Helicobacter pylori infection).26 Essential hypertension, benign prostate hyperplasia, and shoulder lesions are rather unanticipated diagnoses with little known association with GC. It is very possible that these entities are somehow linked through their associations with common risk factors, confounders, or the consequences of GC treatment.
It is important to note that these factors would be more pertinent to the population where the model was trained. Nonetheless, elucidation of influencing factors can potentially enable researchers to explore associations of unanticipated factors with the disease of interest.

This study represents a benchmarking framework for future studies in which additional patient-level characteristics can be added to further enhance the model performance. For example, while this study only included ICD codes in the model, various data including demographics, medical history beyond the diagnosis, and test results can be included to construct even higher performing models.

In conclusion, we describe a novel supervised AE and its application to the prediction of a disease of interest. We showed that neighboring ICD code information alone predicted the co-existence of GC with high accuracy. Specifically, the proposed EEsAE model and AE-sMLP outperformed the XGB and Naïve Bayes models. While the utility of the AE methods in the prediction of a disease of interest (vs. identification of a coexisting disease of interest) needs to be evaluated in a prospective study design, the high performance of the proposed model encourages us to further explore its utility and performance in other healthcare domains, including future disease prediction.

Notes

Funding: Yoon-Sik Cho was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean Government (MSIT) (No.2021-0-01341, Artificial Intelligence Graduate School Program of Chung-Ang University). Yoon-Sik Cho was also partly supported by the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIT) (No. 2021R1F1A1063389). Younghoon Kwon was supported by NIH R21HL150502, R21AG070576, and R01HL158765.

Disclosure: The authors have no conflicts of interest to disclose.

Data Availability Statement: Individual participant data that underlie the results reported in this article, after deidentification, will be made available on reasonable request to achieve aims in the proposal if approved. This will be granted to researchers who provide a methodologically sound proposal. Proposals should be directed to the corresponding author.

Author Contributions:

Conceptualization: Cho YS, Kim E, Kwon Y.
Data curation: Cho YS.
Formal analysis: Cho YS, Kim E, Oh MH.
Funding acquisition: Cho YS.
Investigation: Cho YS, Kim E, Oh MH, Kwon Y, Stafford PL.
Methodology: Cho YS, Kwon Y, Stafford PL.
Project administration: Cho YS.
Resources: Cho YS.
Software: Cho YS.
Supervision: Cho YS.
Validation: Cho YS, Kim E, Oh MH.
Visualization: Cho YS.
Writing - original draft: Cho YS, Kim E, Oh MH, Kwon Y, Stafford PL.
Writing - review & editing: Cho YS, Kwon Y, Stafford PL.

References

1. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nat Med. 2019; 25(1):24–29. PMID: 30617335.

2. Trabelsi A, Chaabane M, Ben-Hur A. Comprehensive evaluation of deep learning architectures for prediction of DNA/RNA sequence binding specificities. Bioinformatics. 2019; 35(14):i269–i277. PMID: 31510640.

3. Montesinos-López OA, Montesinos-López A, Pérez-Rodríguez P, Barrón-López JA, Martini JW, Fajardo-Flores SB, et al. A review of deep learning applications for genomic selection. BMC Genomics. 2021; 22(1):19. PMID: 33407114.

4. Uddin S, Khan A, Hossain ME, Moni MA. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inform Decis Mak. 2019; 19(1):281. PMID: 31864346.

5. Singhal A, Sinha P, Pant R. Use of deep learning in modern recommendation system: a summary of recent works. Int J Comput Appl. 2017; 108(7):17–22.

6. Wang Z, Yu X, Feng N, Wang Z. An improved collaborative movie recommendation system using computational intelligence. J Vis Lang Comput. 2014; 25(6):667–675.

7. Davidson J, Liebald B, Liu J, Nandy P, Van Vleet T, Gargi U, et al. The YouTube video recommendation system. In: Proceedings of the Fourth ACM Conference on Recommender; 2010 Sep 26–30; Barcelona, Spain. New York, NY, USA: ACM Press; 2010. p. 293–296.

8. Chen RC, Huang YH, Bau CT, Chen SM. A recommendation system based on domain ontology and SWRL for anti-diabetic drugs selection. Expert Syst Appl. 2012; 39(4):3995–4006.

9. Doulaverakis C, Nikolaidis G, Kleontas A, Kompatsiaris I. Panacea, a semantic-enabled drug recommendations discovery framework. J Biomed Semantics. 2014; 5(1):13. PMID: 24602515.

10. Shin A, Kim J, Park S. Gastric cancer epidemiology in Korea. J Gastric Cancer. 2011; 11(3):135–140. PMID: 22076217.

11. Kweon SS. Updates on cancer epidemiology in Korea, 2018. Chonnam Med J. 2018; 54(2):90–100. PMID: 29854674.

12. Shin DW, Cho B, Guallar E. Korean National Health Insurance Database. JAMA Intern Med. 2016; 176(1):138.

13. Lee J, Lee JS, Park SH, Shin SA, Kim K. Cohort profile: the National Health Insurance Service-National Sample Cohort (NHIS-NSC), South Korea. Int J Epidemiol. 2017; 46(2):e15. PMID: 26822938.

14. Lee YS, Lee YR, Chae Y, Park SY, Oh IH, Jang BH. Translation of Korean medicine use to ICD-codes using National Health Insurance Service-National Sample Cohort. Evid Based Complement Alternat Med. 2016; 2016:8160838. PMID: 27069494.

15. Simidjievski N, Bodnar C, Tariq I, Scherer P, Andres Terre H, Shams Z, et al. Variational autoencoders for cancer data integration: design principles and computational practice. Front Genet. 2019; 10:1205. PMID: 31921281.

16. Miotto R, Li L, Kidd BA, Dudley JT. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci Rep. 2016; 6(1):26094. PMID: 27185194.

17. Azimi-Sadjadi MR, Citrin S, Sheedvash S. Supervised learning process of multi-layer perceptron neural networks using fast least squares. Proc IEEE Int Conf Acoust Speech Signal Process. 1990; 3:1381–1384.

18. Chen R, Song Y, Huang J, Wang J, Sun H, Wang H. Rapid diagnosis and continuous monitoring of intracerebral hemorrhage with magnetic induction tomography based on stacked autoencoder. Rev Sci Instrum. 2021; 92(8):084707. PMID: 34470442.

19. Li D, Fu Z, Xu J. Stacked-autoencoder-based model for COVID-19 diagnosis on CT images. Appl Intell. 2021; 51(5):2805–2817.

20. Quan H, Li B, Saunders LD, Parsons GA, Nilsson CI, Alibhai A, et al. Assessing validity of ICD-9-CM and ICD-10 administrative data in recording clinical conditions in a unique dually coded database. Health Serv Res. 2008; 43(4):1424–1441. PMID: 18756617.

21. O’Malley KJ, Cook KF, Price MD, Wildes KR, Hurdle JF, Ashton CM. Measuring diagnoses: ICD code accuracy. Health Serv Res. 2005; 40(5 Pt 2):1620–1639. PMID: 16178999.

22. Tang GH, Hart R, Sholzberg M, Brezden-Masley C. Iron deficiency anemia in gastric cancer: a Canadian retrospective review. Eur J Gastroenterol Hepatol. 2018; 30(12):1497–1501. PMID: 30179903.

23. Jeong O, Park YK, Ryu SY. Prevalence, severity, and evolution of postsurgical anemia after gastrectomy, and clinicopathological factors affecting its recovery. J Korean Surg Soc. 2012; 82(2):79–86. PMID: 22347709.

24. Jung MJ, Kim HI, Cho HW, Yoon HY, Kim CB. Pre- and post-gastrectomy anemia in gastric cancer patients. Korean J Clin Oncol. 2011; 7(2):88–95.

25. Kim JJ. Upper gastrointestinal cancer and reflux disease. J Gastric Cancer. 2013; 13(2):79–85. PMID: 23844321.

26. Hansson LE. Risk of stomach cancer in patients with peptic ulcer disease. World J Surg. 2000; 24(3):315–320. PMID: 10658066.
