Statistical Methods: Reliability Assessment and Method Comparison

Kyoung Ae Kong

doi:10.12771/emj.2017.40.1.9

Review Article

The Ewha Medical Journal 2017; 40(1): 9-16.

Published online: 31 January 2017

Clinical Trial Center, Ewha Womans University Mokdong Hospital, Seoul, Korea.

Corresponding author: Kyoung Ae Kong. Clinical Trial Center, Ewha Womans University Medical Center, 1071 Anyangcheon-ro, Yangcheon-gu, Seoul 07985, Korea. Tel: 82-2-2650-2069, Fax: 82-2-2650-6141, E-mail: kkong@ewha.ac.kr

Received 29 December 2016; Accepted 4 January 2017

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The reliability of clinical measurements is critical to medical research and clinical practice. Newly proposed methods are assessed in terms of their reliability, which includes repeatability and intra- and interobserver reproducibility. In general, new methods that provide repeatable and reproducible results are then compared with established methods used clinically. This paper describes common statistical methods for assessing reliability and agreement between methods, including the intraclass correlation coefficient, coefficient of variation, Bland-Altman plot, limits of agreement, percent agreement, and the kappa statistic. These methods are more appropriate for estimating reliability than hypothesis testing or simple correlation methods. However, some reliability indices, especially unscaled ones, do not clearly define the acceptable level of error in the actual size and units of the measurement. The Bland-Altman plot is more useful for method comparison studies because it assesses the relationship between the differences and the magnitude of paired measurements, the bias (as the mean difference), and the degree of agreement (as the limits of agreement) between two methods or conditions (e.g., observers). Caution is needed when handling heteroscedasticity of the differences between two measurements, when using the means of repeated measurements by each method in method comparison studies, and when comparing reliability across different studies. Additionally, independence in the measuring processes, the combined use of different forms of estimates, clear descriptions of the calculations used to produce indices, and clinical acceptability should be emphasized when assessing reliability and conducting method comparison studies.

Keywords: Validation studies, Reliability, Reproducibility of results, Agreement, Method comparison
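
To illustrate why the abstract favors the intraclass correlation coefficient over simple correlation, the following minimal sketch compares Pearson's correlation with one common ICC form, the two-way random-effects, absolute-agreement, single-measure ICC(2,1) of Shrout and Fleiss. The choice of this particular form, the function names, and the example readings are assumptions made for illustration; they are not taken from the article.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    ratings: one row per subject, each row holding k observer values.
    """
    n = len(ratings)          # number of subjects
    k = len(ratings[0])       # number of observers
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-subject mean square
    msc = ss_cols / (k - 1)                 # between-observer mean square
    mse = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical readings: observer B is always 5 units above observer A.
observer_a = [10, 12, 14, 16, 18]
observer_b = [15, 17, 19, 21, 23]

r = pearson_r(observer_a, observer_b)
icc = icc_2_1(list(zip(observer_a, observer_b)))
print(f"Pearson r = {r:.3f}, ICC(2,1) = {icc:.3f}")
```

With these made-up readings the two observers are perfectly correlated, yet the constant offset between them lowers the ICC well below 1, which is the kind of systematic disagreement that a simple correlation cannot detect.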

Figures and Tables

Fig. 1

Intraclass correlation coefficient and Pearson's correlation coefficient as indices for intra- or interobserver reliability. ICC, intraclass correlation coefficient; correlation coefficient, Pearson's correlation coefficient.

Fig. 2

Graphical presentation of agreement. A case where the difference between measurements increases with the magnitude of the measurements.

Fig. 3

Graphical presentation of agreement. A case where the variability of the differences increases with the magnitude of the measurements.

Fig. 4

Measurements of pulmonary nodule size using two radiological methods (shown is a Bland-Altman plot).
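
The quantities drawn on a Bland-Altman plot can be sketched as follows: the bias is the mean of the paired differences, and the 95% limits of agreement are the bias plus and minus 1.96 standard deviations of the differences. The function name and the paired values below are hypothetical and are not the nodule data shown in Fig. 4; the sketch also assumes the differences are roughly normal with constant spread, which is the very assumption the abstract warns about under heteroscedasticity.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return the points and summary lines of a Bland-Altman plot.

    x-axis: mean of each pair; y-axis: difference of each pair (A minus B).
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    means = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    bias = mean(diffs)      # systematic difference between the methods
    sd = stdev(diffs)       # sample SD of the differences
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "sd": sd,
        "lower": bias - 1.96 * sd,   # lower 95% limit of agreement
        "upper": bias + 1.96 * sd,   # upper 95% limit of agreement
    }


# Hypothetical paired measurements from two methods (same units).
method_a = [10.2, 8.1, 12.5, 6.4, 9.9, 11.3]
method_b = [9.8, 8.4, 12.0, 6.0, 10.1, 11.0]

result = bland_altman(method_a, method_b)
print(f"bias = {result['bias']:.2f}")
print(f"limits of agreement = {result['lower']:.2f} to {result['upper']:.2f}")
```

Whether limits of this width are acceptable is a clinical judgment rather than a statistical one, and plotting `diffs` against `means` is what reveals a trend or a widening spread across the measurement range.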

Table 1

Agreement between observers A and B on binary measurements

Table 2

Agreement between methods A and B on measurements with four-category results

Number in parentheses indicates the weight used for calculation of the weighted kappa.
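
For categorical results such as those summarized in Tables 1 and 2, percent agreement and the kappa statistic can be computed from a square contingency table, as in the sketch below. The counts are invented for illustration, and the linear weighting scheme is an assumption: the weights actually used in Table 2 are not reproduced here.

```python
def percent_agreement(table):
    """Proportion of subjects on the diagonal of a square contingency table."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n


def linear_weights(k):
    """Agreement weights: 1 on the diagonal, falling linearly to 0."""
    return [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]


def kappa(table, weights=None):
    """Cohen's kappa; pass a weight matrix to obtain the weighted kappa."""
    k = len(table)
    n = sum(sum(row) for row in table)
    if weights is None:
        # Unweighted kappa: only exact agreement counts.
        weights = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]

    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Observed and chance-expected (weighted) agreement
    p_obs = sum(weights[i][j] * table[i][j] / n
                for i in range(k) for j in range(k))
    p_exp = sum(weights[i][j] * row_p[i] * col_p[j]
                for i in range(k) for j in range(k))
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical binary ratings: rows are observer A, columns are observer B.
binary = [[40, 10],
          [5, 45]]
print(f"percent agreement = {percent_agreement(binary):.2f}")
print(f"kappa = {kappa(binary):.2f}")

# Hypothetical four-category results from two methods.
four_cat = [[20, 5, 1, 0],
            [4, 18, 6, 1],
            [1, 5, 15, 4],
            [0, 2, 3, 15]]
print(f"weighted kappa = {kappa(four_cat, linear_weights(4)):.2f}")
```

Kappa discounts the agreement expected by chance from the marginal totals, so it is lower than the raw percent agreement; the weighted version additionally gives partial credit to near misses between ordered categories.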
