Ann Lab Med. 44(6)

Shin and Kim: Customized Quality Assessment of Healthcare Data
Data are a valuable resource in industrial societies and are becoming increasingly important in the healthcare field. The relevance of big data analysis in the healthcare industry is growing because of paradigm shifts in healthcare services, increasing social needs, and technological advancements. Unlike typical big data characterized by the “5Vs” (volume, variety, velocity, value, and veracity), healthcare data are characterized by heterogeneity, incompleteness, timeliness, longevity, privacy, and ownership [1]. In particular, healthcare data comprise heterogeneous (multimodal) data types, including structured data as well as unstructured data, such as computed tomography scans, magnetic resonance images, and X-ray images [1]. Additionally, numerous challenges exist in terms of data collection in the healthcare domain, including issues related to privacy protection and ownership, and in terms of healthcare data management, including data storage, sharing, and security [1]. Given the inherent complexity of healthcare data, which are not collected for research or industrial purposes, the lack of systematic data collection methods makes it difficult to ensure data quality.
The definition of “data quality” varies, but the most widely accepted definition is “fitness for use” [2], which refers to data that are current, accurate, interconnected, and provide tangible value to users. Low-quality data can lead to operational losses, delays, and increased costs associated with data cleansing, potentially resulting in higher prices for products or services in data-related industries [3]. Additionally, the quality of data can significantly impact the performance of artificial intelligence (AI) models [4]. Particularly in healthcare, the use of low-quality data in areas such as treatment, surgery, research, and policy decisions can result in significant losses. In the United States, the use of big data in healthcare may save approximately $300 billion in economic costs annually [5]. In this context, low-quality healthcare data pose a threat to public life and health and can lead to inefficiencies in the healthcare system. Therefore, high-quality healthcare data are essential for ongoing healthcare data research.
Various data quality management methods are being researched and reported. In 2003, the WHO published guidelines to enhance the quality of healthcare data [6]. These guidelines provide directives for healthcare professionals and information managers, enhancing understanding of all aspects of data collection and creation. In Korea, organizations such as the National Information Society Agency (NIA) [7] have established guidelines and directives for data quality management, and the Korea Data Industry Promotion Agency (KData) (http://dataqual.co.kr/bbs/content.php?co_id=quality_concept) has been operating a data quality certification system since 2006. This highlights the national and social recognition of the need for systematic management to consistently maintain and improve data quality from the user’s perspective.
The NIA is building and releasing AI learning data. After quality assessment by the Telecommunications Technology Association (TTA)—an external organization that does not build the data—according to the guidelines and directives for data quality management, the data are released via AI-Hub (https://aihub.or.kr). In big data research using laboratory data, external quality assessment results are used to evaluate data quality or accuracy [8, 9]. However, current national and international research on healthcare data quality lacks clearly defined standards or criteria. Specifically, standards and quality management criteria reflecting the unique characteristics of healthcare data, such as imaging information and biometric signals, remain insufficient. This gap highlights the need to develop quality management measures and ensure the production and utilization of high-quality healthcare data.
Research on quality indicators and systems that align with the specific nature of healthcare data is limited [10, 11]. Furthermore, healthcare data are not only voluminous but also varied in type, form, and attributes [12]. Therefore, a quality management system suited to the characteristics of healthcare data is necessary for effective management and utilization. We propose three stages of quality management direction (Fig. 1), as follows.
First, customized quality assessment indicators for healthcare data must be established. Numerous studies on quality indicators for measuring data quality have been conducted [11]. However, the terminology and definitions of quality assessment indicators lack consistency, leading to confusion and trial-and-error because different studies adopt different quality elements and measurement items. Hwang, et al. [12] searched the literature from 1990 to 2023 using the search terms “data quality,” “data quality assessment,” and “data quality dimensions” in Google Scholar to examine the diversity of quality indicators. Their search yielded 23 publications containing 80 different data quality assessment indicators with various definitions. In other words, studies use the same term with different definitions or different terms with the same definition (Table 1). Therefore, a consensus on the definitions of data quality assessment indicators is needed to quantify data quality. In particular, quality assessment indicator research that reflects the characteristics of healthcare data is needed: to ensure the validity and reliability of healthcare data quality assessment, the terms used in measuring quality must first be clearly conceptualized. Therefore, further studies are required to redefine customized quality assessment criteria for healthcare data.
Second, a method for quantifying customized quality indicators for healthcare data and benchmarks for assigning quality evaluation scores must be established. For example, Lee and Shin [13] reported that labeling accuracy, one of the quality indicators, can be set at different levels depending on the similarity among the characteristic variables of the data and the level of class imbalance. This implies that the level of quality indicators for healthcare data quality assessment should be determined based on characteristics such as data similarity, sample size, and imbalance. Data quality indicators are evaluated both quantitatively and qualitatively. Quantitative evaluation is based on factors such as data completeness, validity, consistency, and accuracy, and quality levels are defined in four classes (ace, high, middle, low) using the Six Sigma concept [7]. In qualitative evaluation, evaluators subjectively judge criteria through a checklist for each indicator by answering yes-or-no questions [7].
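The quantitative side of this evaluation can be sketched in a few lines of code. The following is a minimal illustration, not the NIA procedure itself: it computes a completeness rate for a structured dataset and maps an error rate to an illustrative four-class level (ace, high, middle, low) using the Six Sigma notion of defects per million opportunities. The field names and class thresholds are hypothetical assumptions for the example.

```python
def completeness_rate(records, required_fields):
    """Fraction of required cells that are present (non-empty)."""
    total = filled = 0
    for rec in records:
        for field in required_fields:
            total += 1
            if rec.get(field) not in (None, ""):
                filled += 1
    return filled / total if total else 0.0

def quality_class(error_rate):
    """Map an error rate to four classes; thresholds are hypothetical,
    loosely inspired by Six Sigma defects-per-million levels."""
    dpmo = error_rate * 1_000_000  # defects per million opportunities
    if dpmo <= 3.4:
        return "ace"
    elif dpmo <= 233:
        return "high"
    elif dpmo <= 6_210:
        return "middle"
    else:
        return "low"

# Hypothetical laboratory records: one missing HbA1c value.
records = [
    {"patient_id": "P1", "hba1c": 6.1},
    {"patient_id": "P2", "hba1c": None},
]
rate = completeness_rate(records, ["patient_id", "hba1c"])
print(rate)                     # 0.75
print(quality_class(1 - rate))  # low
```

In practice, each indicator (completeness, validity, consistency, accuracy) would be computed by its own rule set, and the class boundaries would come from the applicable guideline rather than fixed constants.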
Additionally, the measurement of quality indicators varies depending on the data type (structured/unstructured). As healthcare data are multimodal, it is necessary to review the appropriateness of the criteria for quantifying quality indicators considering the data characteristics. Therefore, further studies are required to establish customized quality assessment criteria for healthcare data.
Third, a unified quality score presentation should be established to facilitate an intuitive understanding of data quality by the end user. Moges, et al. [14] analyzed the relative importance of quality indicators by surveying private companies in various countries, which revealed that the importance of quality items varies by industry. The relative importance of quality indicators for healthcare data can be analyzed to establish a unified quality score. Hence, studies on calculating a unified score through the quantification of quality assessments are needed.
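A unified score of this kind is commonly computed as a weighted aggregate of per-indicator scores, with weights reflecting the relative importance elicited from domain experts (e.g., by survey, as in Moges, et al. [14]). The sketch below assumes hypothetical indicator scores and weights; it is an illustration of the aggregation idea, not a validated scoring scheme.

```python
def unified_score(scores, weights):
    """Weighted average of indicator scores; weights are normalized
    over the indicators actually present in `scores`."""
    total_w = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_w

# Hypothetical per-indicator scores (0-1) and expert-derived weights.
scores  = {"completeness": 0.95, "accuracy": 0.90, "consistency": 0.99}
weights = {"completeness": 0.5,  "accuracy": 0.3,  "consistency": 0.2}

print(round(unified_score(scores, weights), 3))  # 0.943
```

Because the survey evidence shows that indicator importance varies by industry, the weight vector would need to be re-estimated specifically for healthcare data rather than reused from other domains.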
Data quality assessment requires a multidisciplinary approach to reflect the demand for sustainable, high-quality data across various domains. Therefore, research on healthcare data quality should not be limited to specific domain knowledge areas but rather conducted as a joint study by experts in fields such as medicine, data science, statistics, and information technology. A creative approach to data quality necessitates effective interdisciplinary collaboration among experts from various fields [15].
In conclusion, we emphasize the necessity of high-quality data and present a direction for establishing quality indicators and systems suitable for the characteristics of healthcare data. Developing a clear and consistent definition of data quality, along with systematic methods and approaches for data quality assessment, requires more extensive research.

ACKNOWLEDGEMENTS

None.

Notes

AUTHOR CONTRIBUTIONS

Shin J and Kim JY contributed to the study conceptualization, methodology, investigation, visualization, and project administration; Kim JY acquired funding and supervised the study; Shin J wrote the original draft; and Kim JY reviewed and edited the paper. Both authors have read and approved the final manuscript.

CONFLICTS OF INTEREST

None declared.

RESEARCH FUNDING

This study was conducted as part of the National Balanced Development Special Account K-Health National Medical AI Service and Industrial Ecosystem Construction Project funded by the Ministry of Science and ICT and the Korea Information and Communications Promotion Agency (grant No. H0503-24-1001).

References

1. Hong L, Luo M, Wang R, Lu P, Lu W, Lu L. 2018; Big data in health care: applications and challenges. Data Inf Manag. 2:175–97. DOI: 10.2478/dim-2018-0014.
2. Nikiforova A. 2020; Definition and evaluation of data quality: user-oriented data object-driven approach to data quality assessment. Balt J Mod Comput. 8:391–432. DOI: 10.22364/bjmc.2020.8.3.02.
3. Dasu T, Johnson T. Shewart WA, Wilks SS, editors. 2003. Data quality: techniques and algorithms. Exploratory data mining and data cleaning. New York: John Wiley & Sons; p. 139–88. DOI: 10.1002/0471448354.ch5.
4. Bernardi FA, Alves D, Crepaldi N, Yamada DB, Lima VC, Rijo R. 2023; Data quality in health research: integrative literature review. J Med Internet Res. 25:e41446. DOI: 10.2196/41446. PMID: 37906223. PMCID: PMC10646672.
5. Kim J, Kim H, Son K, Song Y, Yoon J, Lim H, et al. 2014; Medical utilization of big data. Inf Sci Manag. 32:18–26. DOI: 10.3139/9783446441774.002.
6. WHO. 2003. Improving data quality: a guide for developing countries. https://iris.who.int/handle/10665/206974.
7. National Information Society Agency (NIA). 2024. Big data platform and center data quality management guideline v3.1. https://aihub.or.kr/aihubnews/qlityguidance/view.do?pageIndex=1&nttSn=10269&currMenu=&topMenu=&searchCondition=&searchKeyword=.
8. Cho EJ, Jeong TD, Kim S, Park HD, Yun YM, Chun S, et al. 2023; A new strategy for evaluating the quality of laboratory results for big data research: using external quality assessment survey data (2010-2020). Ann Lab Med. 43:425–33. DOI: 10.3343/alm.2023.43.5.425. PMID: 37080743. PMCID: PMC10151270.
9. Kim S, Cho EJ, Jeong TD, Park HD, Yun YM, Lee K, et al. 2023; Proposed model for evaluating real-world laboratory results for big data research. Ann Lab Med. 43:104–7. DOI: 10.3343/alm.2023.43.1.104. PMID: 36045065. PMCID: PMC9467825.
10. Bae SH, Lim IH. 2012; A study on 3G networked pulse measurement system using optical sensor. J Kor Inst Electron Commun Sci. 7:1555–60.
11. Hinrichs H. 2002. Datenqualitätsmanagement in data warehouse-systemen [Data quality management in data warehouse systems]. Doctoral dissertation. Universität Oldenburg; DOI: 10.1007/978-3-642-56687-5_15.
12. Hwang P, Lee W, Ryu K, Jung W, Shim S, Kim JY, et al. Research for data quality dimensions of medical data. In: 2023 Fall Academic Conference of the Korean Society of Medical Informatics; Nov 30, 2023; Gyeonggi-do, Korea. https://www.kosmi.org/bbs/download.php?bo_table=sub4_2&wr_id=86&no=4. Updated on Feb 2024.
13. Lee JH, Shin J. 2024; AI performance based on learning-data labeling accuracy. J Ind Converg. 22:177–83. DOI: 10.22678/JIC.2024.22.1.177.
14. Moges HT, Dejaeger K, Lemahieu W, Baesens B. 2012; A multidimensional analysis of data quality for credit risk management: new insights and challenges. Inf Manag. 50:43–58. DOI: 10.1016/j.im.2012.10.001.
15. Keller S, Korkmaz G, Orr M, Schroeder A, Shipp S. 2017; The evolution of data quality: understanding the transdisciplinary origins of data quality concepts and approaches. Annu Rev Stat Appl. 4:85–108. DOI: 10.1146/annurev-statistics-060116-054114.

Fig. 1

Three stages of quality management direction for customized healthcare data.

Table 1

Data quality evaluation indicators collected by Hwang, et al. [12] via a literature search

Group* Indicators Definitions
1 Coherence The extent to which data are consistent over time and across providers
Compliance The extent to which data adhere to standards or regulations
Conformity The extent to which data are presented following a standard format
Consistency The extent to which data are presented following the same rule, format, and/or structure
Directionality The extent to which data are consistently represented in the graph
Identifiability The extent to which data have an identifier, such as a primary key
Integrability The extent to which data follow the same definitions so that they can be integrated
Integrity The extent to which the data format adheres to criteria
Isomorphism The extent to which data are modeled in a compatible way
Joinability Whether a table contains a primary key of another table
Punctuality Whether the data are available or reported within the promised time frame
Referential integrity Whether the data have unique and valid identifiers
Representational adequacy The extent to which operationalization is consistent
Structuredness The extent to which data are structured in the correct format and structure
Validity The extent to which data conform to appropriate standards
2 Ambiguity The extent to which data are presented properly to prevent data from being interpreted in more than one way
Clarity The extent to which data are clear and easy to understand
Comprehensibility The extent to which data concepts are understandable
Definition The extent to which data are interpreted
Granularity The extent to which data are detailed
Interpretability The extent to which data are defined clearly and presented appropriately
Naturalness The extent to which data are expressed using conventional, typified terms and forms according to a general-purpose reference source
Presentation, Readability The extent to which data are clear and understandable
Understandability The extent to which data have attributes that enable them to be read, interpreted, and understood easily
Vagueness The extent to which data are unclear or unspecific
3 Accuracy The extent to which data are close to the real-world or correct value (by experts)
Believability The extent to which data are credible
Correctness The extent to which data are true
Credibility The extent to which data are true and correct to the content
Plausibility The extent to which the data make sense based on external knowledge
Precision The extent to which data are exact
Reliability Whether the data represent reality accurately
Transformation The error rate due to data transformation
Typing Whether the data are typed properly
Verifiability The extent to which data can be demonstrated to be correct
4 Concise representation The extent to which data are represented in a compact manner
Complexity The extent of data complexity
Redundancy The extent to which data have a minimum content that represents the reality
5 Currency The extent to which data are old
Freshness The extent to which replicas of data are up-to-date
Timeliness The extent to which data are up-to-date
Distinctness The extent to which duplicate values exist
Duplication The extent to which data contain the same entity more than once
Uniqueness The extent to which data have duplicates
6 Ease of manipulation The extent to which data are applicable according to a task
Rectifiability Whether data can be corrected
Versatility The extent to which data can be presented using alternative representations
7 Accessibility The extent to which data are retrieved easily and quickly
Availability The extent to which data can be accessed
8 Authority The extent to which the data source is credible
License Whether the data source license is clearly defined
Reputation The extent to which data are highly regarded in terms of their source or content
9 Cohesiveness The extent to which the data content is focused on one topic
Fitness The extent to which data match the theme
10 Confidentiality The extent to which data are for authorized users only
Security The extent to which data are restricted in terms of access
11 Performance The latency time and throughput for coping with data with increasing requests
Storage penalty The time spent for storage
12 History The extent to which the data user can be traced
Traceability The extent to which access to and changes made to data can be traced
13 Appropriate amount of data The extent to which the data volume is appropriate for the task
14 Completeness The extent to which data do not contain missing values
15 Concordance The extent to which there is agreement between data elements (e.g., a diagnosis of diabetes but all HbA1c results are normal)
16 Connectedness The extent to which datasets are combined at the correct resource
17 Fragmentation The extent to which data are in one place in the record
18 Objectivity The extent to which data are not biased
19 Provenance Whether data contain sufficient metadata
20 Volatility How long the information is valid in the context of a specific activity
21 Volume Percentage of values contained in data with respect to the source from which they are extracted
22 Cleanness The extent to which data are clean and not polluted with irrelevant information, not duplicated, and formed in a consistent way
23 Normalization Whether data are compatible and interpretable
24 Referential correspondence Whether the data are described using accurate labels, without duplication
25 Appropriateness The extent to which data are appropriate for the task
26 Efficiency The extent to which data can be processed and provide the expected level of performance
27 Portability The extent to which data can be preserved in existing quality under any circumstance
28 Recoverability The extent to which data have attributes that allow the preservation of quality under any circumstance
29 Relevancy The extent to which data match the user requirements
30 Usability The extent to which data satisfy the user requirements
31 Value-added The extent to which data are beneficial

*Indicators with similar meaning were grouped into one group.

Searched using the search terms “data quality,” “data quality assessment,” and “data quality dimensions” in Google Scholar (publication period: 1990–2023). The 80 data quality assessment indicators were obtained from 23 reports [12].
