Introduction
Recently, advances in information technology have increased the value of big data on cancer and, with it, the demand for such data in cancer research [1]. However, the health and medical information being accumulated includes a large amount of sensitive personal health information, which has led to many privacy protection restrictions on the use of healthcare big data. The Personal Information Protection Act was revised in 2020 to promote data usage. Under the revised act, pseudonymized data that cannot identify individuals can be used for statistics, scientific research, and public records without individual consent. Furthermore, an amendment to the Cancer Control Act was implemented in 2021 to reinforce cancer data collection and sharing, with tasks delegated to the National Cancer Data Center (NCDC). The National Cancer Center was designated as the NCDC in the same year.
The Korean Ministry of Health and Welfare initiated the Korean Clinical Data Utilization Network for Research Excellence (K-CURE) project in 2022 based on the Personal Information Protection and Cancer Control Acts. This project aims to establish an ecosystem for combining and utilizing clinical and public cancer data. The Cancer Public Library Database (CPLD), established under the K-CURE project, combines data from four major population-based public sources: the Korea National Cancer Incidence Database (KNCI DB) in the Korea Central Cancer Registry (KCCR), cause-of-death data in Statistics Korea, National Health Information Database (NHID) in the National Health Insurance Service (NHIS), and National Health Insurance Research Database (NHIRD) in the Health Insurance Review & Assessment Service (HIRA).
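As a rough illustration of what combining such sources involves, the sketch below joins toy records on a shared pseudonymized key. The key name `pid`, the field names, and all sample values are hypothetical; this is not the actual CPLD linkage procedure or schema, only a minimal picture of a one-to-one link (registry to death record) and a one-to-many link (registry to claims).

```python
from collections import defaultdict

# Hypothetical toy records; field names and values are illustrative only.
registry = [  # one row per patient (cancer registry-like source)
    {"pid": "A1", "dx_date": "2015-03-02", "site": "stomach"},
    {"pid": "B2", "dx_date": "2018-11-20", "site": "lung"},
]
deaths = [  # at most one row per patient (cause-of-death-like source)
    {"pid": "B2", "death_date": "2020-06-01", "cause": "lung cancer"},
]
claims = [  # many rows per patient (claims-like source)
    {"pid": "A1", "claim_date": "2015-03-10", "type": "inpatient"},
    {"pid": "A1", "claim_date": "2015-04-01", "type": "outpatient"},
    {"pid": "B2", "claim_date": "2018-12-01", "type": "outpatient"},
]


def link_sources(registry, deaths, claims):
    """Attach death and claims records to each registry patient by `pid`."""
    death_by_pid = {row["pid"]: row for row in deaths}
    claims_by_pid = defaultdict(list)
    for row in claims:
        claims_by_pid[row["pid"]].append(row)

    linked = {}
    for patient in registry:
        pid = patient["pid"]
        linked[pid] = {
            "registry": patient,
            "death": death_by_pid.get(pid),        # None if no death record
            "claims": claims_by_pid.get(pid, []),  # may be empty
        }
    return linked


linked = link_sources(registry, deaths, claims)
```

Starting from the registry-like table keeps the linked set restricted to registered patients with cancer, with death and claims records attached where they exist.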
This study aimed to offer a comprehensive profile of CPLD data, highlighting its representation of the entire patient population with cancer in Korea. We presented descriptive statistics detailing the number of patients included in the CPLD, their demographics, medical usage, and mortality. Furthermore, this study emphasized the potential value of the CPLD in cancer research by presenting its available data.
Results
Table 2 presents the number of patients with cancer based on their sociodemographic characteristics and diagnosis year. Of the 1,983,488 patients, the largest group was in their 60s (23%), followed by the 70-79 and 50-59 age groups. By health insurance premium decile at cancer diagnosis, the 8-10 decile group was the largest. The distribution of patients with cancer based on the SEER summary stage was as follows: 40.9% had localized cancer, 27.1% belonged to the regional group, 16.1% belonged to the distant group, and 15.8% were categorized as unknown.
Fig. 2 shows the top five cancers by sex from 2012 to 2019. Among the 996,209 men, stomach, lung, colorectal, prostate, and liver cancer were the top five cancers, accounting for 16.1%, 14.0%, 13.3%, 9.6%, and 9.3% of all cancer cases diagnosed, respectively. The proportion of lung and prostate cancers in men steadily increased from 2012 to 2019, while that of stomach, colorectal, and liver cancers decreased. The most common cancers in women were thyroid (20.4%), breast (16.6%), colorectal (9.0%), stomach (7.8%), and lung (6.2%) cancers. The proportion of breast and lung cancers in women steadily increased from 2012 to 2019, while that of stomach cancer steadily decreased. Thyroid cancer accounted for about 30% of cancer cases in women in 2012, but its share gradually decreased thereafter, reaching 16.9% of all cancer cases in 2019. The number of incident cancer cases from 2012 to 2019 by cancer type in men and women is available in S1 Table.
Among these patients with cancer, 571,285 died between 2012 and 2020, with 89.2% of the deaths attributed to cancer and 10.8% to other causes (Table 3). Lung cancer caused the highest number of deaths in both sexes, with 91,437 deaths in men and 29,707 in women. Liver (14.4%), stomach (9.6%), colorectal (8.3%), and pancreatic (6.1%) cancers had the highest number of deaths among the men after lung cancer. Colorectal (11.3%), pancreatic (9.5%), stomach (9.1%), and liver (8.9%) cancers caused the most deaths among the women.
Table 4 presents the medical service utilization patterns during the 1 year before and after cancer diagnosis, as well as during the 1 year before death. During the 1 year before cancer diagnosis, 93% of the patients with cancer had outpatient claims and 43% had inpatient hospitalization claims. During the 1 year after diagnosis, almost all patients with cancer (92%) had at least one outpatient claim, and the majority (89%) had at least one inpatient claim. Furthermore, of the 571,285 patients who died between 2012 and 2020, 98% had outpatient and inpatient hospitalization claims in the 1 year before death. The average number of outpatient visits and inpatient hospitalizations per patient was higher during the 1 year after diagnosis than during the 1 year before. The frequency of inpatient hospitalization claims increased from 1.9 to 4.5. Medical care use increased during the last year of life, with an average of 38.7 outpatient visits and 7.8 inpatient hospitalizations. Furthermore, 41% and 34% of patients used dental and oriental medicine services, respectively, in the 1 year before cancer diagnosis. However, fewer patients with cancer used dental and oriental medicine services in the 1 year after diagnosis and in the 1 year before death.
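To make the window definitions concrete, the following is a minimal sketch of how claims might be counted in the 1 year before and after diagnosis for a single patient. The function names, the 365-day window length, and the choice to count the diagnosis date in the "after" window are assumptions for illustration, not the definitions used to produce Table 4.

```python
from datetime import date, timedelta


def count_claims_in_window(claim_dates, start, end):
    """Count claims whose date satisfies start <= claim date < end."""
    return sum(1 for d in claim_dates if start <= d < end)


def utilization_around_diagnosis(claim_dates, dx_date, days=365):
    """Return claim counts for the windows before and after diagnosis.

    The 365-day window and the half-open boundaries are assumptions
    made for this sketch.
    """
    window = timedelta(days=days)
    before = count_claims_in_window(claim_dates, dx_date - window, dx_date)
    after = count_claims_in_window(claim_dates, dx_date, dx_date + window)
    return {"before": before, "after": after}


# Hypothetical outpatient claim dates for one patient.
claims = [date(2014, 9, 1), date(2015, 2, 20), date(2015, 3, 2),
          date(2015, 6, 15), date(2016, 5, 1)]
result = utilization_around_diagnosis(claims, dx_date=date(2015, 3, 2))
```

The same helper could be reused for the 1 year before death by passing the death date and reading only the "before" count.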
Discussion
The CPLD has several strengths. The CPLD encompasses 96.7% of all cancer incidence cases, as published in the annual report of cancer statistics of KCCR [9], ensuring a comprehensive representation of the population. This is advantageous because previous studies using NHIS claims data faced challenges in accurately defining patients with cancer using disease and procedure codes, which led to the underestimation or overestimation of cancer incidence or prevalence [7,10,11]. Consequently, the CPLD is a valuable resource for overcoming the limitations of defining cancer diagnoses in research.
The key features of the CPLD include patient demographics (including age and sex), detailed clinical cancer characteristics (including diagnosis date, site, histology, and summary stage), extensive healthcare service utilization, and cost information. These features facilitate the identification and comparison of cancer treatments and outcomes among the included populations. Moreover, the longitudinal nature of the CPLD, covering the periods before and after cancer diagnosis, facilitates the calculation of time-dependent measures such as comorbidity indices, a comprehensive analysis of various treatments (including surgery, radiation, chemotherapy, immunotherapy, and other treatments), and the assessment of outcomes (including time to subsequent events or death). Additionally, these longitudinal data offer valuable insights into the long-term outcomes of cancer survivors.
The CPLD is similar to the SEER-Medicare database in the United States, which combines SEER cancer registry data with Medicare enrollment and claims data [12]. The SEER-Medicare database offers advantages, including a substantial number of cancer cases, detailed tumor characteristics, population-based data sources, longitudinal Medicare data, an extensive range of covered services, and biennial linkage updates [12]. Additionally, the SEER-Medicare linkage encompasses non-cancer control groups and incorporates ancillary linkage data sources, such as the Medicare Health Outcome survey and the Medicare Consumer Assessment of Healthcare Providers and Systems survey. However, findings from the SEER-Medicare analyses may not be generalizable to younger populations owing to its focus on linking with Medicare data, primarily including individuals aged 65 years and older [12].
The CPLD has some limitations. First, a time lag of 2-3 years exists between the generation of individual data and their availability for research. The CPLD released in 2023 included patients with cancer diagnosed through 2019, deaths through 2020, and claims through 2021. This time lag is primarily driven by the KNCI DB, for which the delay is necessary to ensure the completeness of cancer registration [3]. Therefore, researchers should account for this lag when designing studies using the CPLD.
Second, claims data from the NHID and NHIRD do not encompass all health-related information. For example, clinically observed information, which may be present in medical records, is excluded from the CPLD. Furthermore, services such as cosmetic surgical procedures or over-the-counter drugs not covered by the NHIS are absent in the CPLD because claims data are generated to reimburse healthcare services covered by the NHIS. The CPLD includes medical procedure codes to indicate that specific tests are conducted; however, the CPLD lacks information on test results (such as imaging test results, biomarker data, and laboratory values). Additionally, certain health conditions, such as mental illness, suicide, sexually transmitted diseases, and miscarriage, are not available because of privacy concerns. Therefore, researchers should consider these constraints when selecting study topics.
Third, researchers should understand the structure and characteristics of the CPLD. Claims data in the CPLD comprise diverse file types, each with one-to-many linkage relationships. Furthermore, the CPLD contains left- and right-truncated data. Therefore, caution should be exercised when interpreting trends in cancer incidence, prevalence, and mortality rates. Additionally, specialized knowledge of NHIS billing and coding is essential for properly manipulating and interpreting data.
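One practical consequence of truncation is that a look-back or follow-up window may fall partly outside the period for which claims are available. The sketch below checks this for a single diagnosis date. `CLAIMS_START` is a hypothetical value chosen for illustration, `CLAIMS_END` follows the statement above that the 2023 release includes claims through 2021, and the function name and window lengths are likewise assumptions.

```python
from datetime import date, timedelta

# Assumed observation limits for claims. CLAIMS_START is hypothetical;
# CLAIMS_END reflects "claims through 2021" for the 2023 release.
CLAIMS_START = date(2011, 1, 1)
CLAIMS_END = date(2021, 12, 31)


def observed_windows(dx_date, lookback_days=365, followup_days=365):
    """Report whether look-back and follow-up windows are fully observed."""
    lookback_start = dx_date - timedelta(days=lookback_days)
    followup_end = dx_date + timedelta(days=followup_days)
    return {
        "lookback_complete": lookback_start >= CLAIMS_START,
        "followup_complete": followup_end <= CLAIMS_END,
        # Days of follow-up observable before the claims cutoff.
        "observable_followup_days": max(
            0, (min(followup_end, CLAIMS_END) - dx_date).days
        ),
    }


# A 2-year look-back for an early diagnosis reaches before CLAIMS_START.
early = observed_windows(date(2012, 3, 1), lookback_days=2 * 365)
# A 5-year follow-up for a late diagnosis extends past CLAIMS_END.
late = observed_windows(date(2019, 12, 1), followup_days=5 * 365)
```

Flagging incomplete windows in this way lets an analysis either exclude such patients or treat their follow-up as censored at the cutoff, rather than reading missing claims as an absence of care.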
Finally, information related to diagnoses and diseases, excluding cancer, may not accurately reflect disease occurrence and prevalence because it primarily comes from the claims data used for reimbursement [8]. Moreover, administrative claims data alone do not provide insight into the decision-making process for cancer care and other patient-reported outcomes. These limitations are not exclusive to the CPLD, but are common in databases relying on claims data, which are primarily gathered for administrative rather than research purposes.
In conclusion, the CPLD provides a unique resource for a wide range of cancer research, enabling the investigation of medical usage patterns before a cancer diagnosis, during the period of initial diagnosis and treatment, and during long-term follow-up. This facilitates expanded insights into healthcare delivery across the cancer continuum, from screening to end-of-life care. Partners from the NCDC, Statistics Korea, KCCR, NHIS, and HIRA ensure the continual enhancement and maintenance of the CPLD. There are plans to add data on newly diagnosed cancer patients and update data on existing cancer patients annually. Furthermore, there are plans to expand the range of public agency data based on researchers' needs, including the coronavirus disease 2019 DB of the Korea Disease Control and Prevention Agency. Finally, with continuous cooperation and efforts, the CPLD can contribute to the development of future insights into cancer research in South Korea.