The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements

Physical Therapy - Volume 85, Issue 3 - Pages 257-268 - 2005
Julius Sim1, Christine Wright2
1J Sim, PhD, is Professor, Primary Care Sciences Research Centre, Keele University, Keele, Staffordshire ST5 5BG, United Kingdom
2CC Wright, BSc, is Principal Lecturer, School of Health and Social Sciences, Coventry University, Coventry, United Kingdom

Abstract

Purpose. This article examines and illustrates the use and interpretation of the kappa statistic in musculoskeletal research. Summary of Key Points. The reliability of clinicians' ratings is an important consideration in areas such as diagnosis and the interpretation of examination findings. Often, these ratings lie on a nominal or an ordinal scale. For such data, the kappa coefficient is an appropriate measure of reliability. Kappa is defined, in both weighted and unweighted forms, and its use is illustrated with examples from musculoskeletal research. Factors that can influence the magnitude of kappa (prevalence, bias, and nonindependent ratings) are discussed, and ways of evaluating the magnitude of an obtained kappa are considered. The issue of statistical testing of kappa is considered, including the use of confidence intervals, and appropriate sample sizes for reliability studies using kappa are tabulated. Conclusions. The article concludes with recommendations for the use and interpretation of kappa.
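The kappa coefficient described in the abstract takes the form κ = (p_o − p_e)/(1 − p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from the raters' marginal distributions; the weighted form credits partial agreement between adjacent ordinal categories. A minimal sketch of both forms, assuming two raters and a square agreement table (the function name and the illustrative data are not from the article):

```python
def cohen_kappa(table, weighted=False):
    """Cohen's kappa for two raters.

    `table[i][j]` counts subjects placed in category i by rater A and
    category j by rater B. With weighted=True, linear weights
    w_ij = 1 - |i - j| / (k - 1) give partial credit to near-agreement
    on an ordinal scale (the weighted kappa of Cohen, 1968).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Marginal proportions for each rater.
    row_m = [sum(table[i][j] for j in range(k)) / n for i in range(k)]
    col_m = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    def w(i, j):
        if not weighted:
            return 1.0 if i == j else 0.0
        return 1.0 - abs(i - j) / (k - 1)

    # Observed and chance-expected (weighted) agreement.
    po = sum(w(i, j) * table[i][j] / n for i in range(k) for j in range(k))
    pe = sum(w(i, j) * row_m[i] * col_m[j] for i in range(k) for j in range(k))
    return (po - pe) / (1 - pe)

# Hypothetical example: two clinicians rate 100 patients positive/negative.
# Observed agreement is (40 + 45)/100 = 0.85; chance agreement is 0.50,
# so kappa = (0.85 - 0.50) / (1 - 0.50) = 0.70.
table = [[40, 10],
         [5, 45]]
print(round(cohen_kappa(table), 3))
```

Because p_e depends on the marginal totals, the same observed agreement can yield very different kappa values under different prevalence or rater bias, which is the point developed in the article's discussion of prevalence and bias effects.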

Keywords


References

Toussaint, 1999, Sacroiliac joint diagnostics in the Hamburg Construction Workers Study, J Manipulative Physiol Ther, 22, 139, 10.1016/S0161-4754(99)70126-0

Fritz, 2000, The use of a classification approach to identify subgroups of patients with acute low back pain, Spine, 25, 106, 10.1097/00007632-200001010-00018

Riddle, 2002, Evaluation of the presence of sacroiliac joint region dysfunction using a combination of tests: a multicenter intertester reliability study, Phys Ther, 82, 772, 10.1093/ptj/82.8.772

Petersen, 2004, Inter-tester reliability of a new diagnostic classification system for patients with non-specific low back pain, Aust J Physiother, 50, 85, 10.1016/S0004-9514(14)60100-8

Fjellner, 1999, Interexaminer reliability in physical examination of the cervical spine, J Manipulative Physiol Ther, 22, 511, 10.1016/S0161-4754(99)70002-3

Hawk, 1999, Preliminary study of the reliability of assessment procedures for indications for chiropractic adjustments of the lumbar spine, J Manipulative Physiol Ther, 22, 382, 10.1016/S0161-4754(99)70083-7

Smedmark, 2000, Inter-examiner reliability in assessing passive intervertebral motion of the cervical spine, Man Ther, 5, 97, 10.1054/math.2000.0234

Hayes, 2001, Reliability of assessing end-feel and pain and resistance sequences in subjects with painful shoulders and knees, J Orthop Sports Phys Ther, 31, 432, 10.2519/jospt.2001.31.8.432

Kilpikoski, 2002, Interexaminer reliability of low back pain assessment using the McKenzie method, Spine, 27, E207, 10.1097/00007632-200204150-00016

Hannes, 2002, Multisurgeon assessment of coronal pattern classification systems for adolescent idiopathic scoliosis, Spine, 27, 762, 10.1097/00007632-200204010-00015

Speciale, 2002, Observer variability in assessing lumbar spinal stenosis severity on magnetic resonance imaging and its relation to cross-sectional spinal canal area, Spine, 27, 1082, 10.1097/00007632-200205150-00014

Richards, 2003, Comparison of reliability between the Lenke and King classification systems for adolescent idiopathic scoliosis using radiographs that were not premeasured, Spine, 28, 1148, 10.1097/01.BRS.0000067265.52473.C3

Sim, 2000, Research in Health Care: Concepts, Designs and Methods

Cohen, 1960, A coefficient of agreement for nominal scales, Educ Psychol Meas, 20, 37, 10.1177/001316446002000104

Daly, 2000, Interpretation and Uses of Medical Statistics, 10.1002/9780470696750

Conger, 1980, Integration and generalization of kappas for multiple raters, Psychol Bull, 88, 322, 10.1037/0033-2909.88.2.322

Haley, 1989, Kappa coefficient calculation using multiple ratings per subject: a special communication, Phys Ther, 69, 970, 10.1093/ptj/69.11.970

Fleiss, 1971, Measuring nominal scale agreement among many raters, Psychol Bull, 76, 378, 10.1037/h0031619

Fritz, 2001, Examining diagnostic tests: an evidence-based perspective, Phys Ther, 81, 1546, 10.1093/ptj/81.9.1546

Feinstein, 1990, High agreement but low kappa, I: the problems of two paradoxes, J Clin Epidemiol, 43, 543, 10.1016/0895-4356(90)90158-L

Bartko, 1976, On the methods and theory of reliability, J Nerv Ment Dis, 163, 307, 10.1097/00005053-197611000-00003

Fleiss, 1979, Large sample variance of kappa in the case of different sets of raters, Psychol Bull, 86, 974, 10.1037/0033-2909.86.5.974

Hartmann, 1977, Considerations in the choice of interobserver reliability estimates, J Appl Behav Anal, 10, 103, 10.1901/jaba.1977.10-103

Rigby, 2000, Statistical methods in epidemiology, V: towards an understanding of the kappa coefficient, Disabil Rehabil, 22, 339, 10.1080/096382800296575

Cohen, 1968, Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit, Psychol Bull, 70, 213, 10.1037/h0026256

Lantz, 1997, Application and evaluation of the kappa statistic in the design and interpretation of chiropractic clinical research, J Manipulative Physiol Ther, 20, 521

McKenzie, 1981, The Lumbar Spine: Mechanical Diagnosis and Therapy

Kraemer, 2002, Kappa coefficients in medical research, Stat Med, 21, 2109, 10.1002/sim.1180

Brennan, 1992, Statistical methods for assessing observer variability in clinical measures, BMJ, 304, 1491, 10.1136/bmj.304.6840.1491

Donner, 1994, Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement, Biometrics, 50, 550, 10.2307/2533400

Bartfay, 2000, The effect of collapsing multinomial data when assessing agreement, Int J Epidemiol, 29, 1070, 10.1093/ije/29.6.1070

Maclure, 1987, Misinterpretation and misuse of the kappa statistic, Am J Epidemiol, 126, 161, 10.1093/aje/126.2.161

Streiner, 2003, Health Measurement Scales: A Practical Guide to their Development and Use

Stratford, 1997, Use of the standard error as a reliability index: an applied example using elbow flexor strength, Phys Ther, 77, 745, 10.1093/ptj/77.7.745

Bland, 1986, Statistical methods for assessing agreement between two methods of clinical measurement, Lancet, 1, 307, 10.1016/S0140-6736(86)90837-8

Byrt, 1993, Bias, prevalence and kappa, J Clin Epidemiol, 46, 423, 10.1016/0895-4356(93)90018-V

Bannerjee, 1997, Interpreting kappa values for two-observer nursing diagnosis data, Res Nurs Health, 20, 465, 10.1002/(SICI)1098-240X(199710)20:5<465::AID-NUR10>3.0.CO;2-8

Thompson, 1988, A reappraisal of the kappa coefficient, J Clin Epidemiol, 41, 949, 10.1016/0895-4356(88)90031-5

Shoukri, 2004, Measures of Interobserver Agreement

Brennan, 1981, Coefficient kappa: some uses, misuses, and alternatives, Educ Psychol Meas, 41, 687, 10.1177/001316448104100307

Hoehler, 2000, Bias and prevalence effects on kappa viewed in terms of sensitivity and specificity, J Clin Epidemiol, 53, 499, 10.1016/S0895-4356(99)00174-2

Cicchetti, 1990, High agreement but low kappa, II: resolving the paradoxes, J Clin Epidemiol, 43, 551, 10.1016/0895-4356(90)90159-M

Lantz, 1996, Behavior and interpretation of the κ statistic: resolution of the two paradoxes, J Clin Epidemiol, 49, 431, 10.1016/0895-4356(95)00571-4

Gjørup, 1988, The kappa coefficient and the prevalence of a diagnosis, Methods Inf Med, 27, 184, 10.1055/s-0038-1635539

Landis, 1977, The measurement of observer agreement for categorical data, Biometrics, 33, 159, 10.2307/2529310

Fleiss, 1981, Statistical Methods for Rates and Proportions

Altman, 1991, Practical Statistics for Medical Research

Shrout, 1998, Measurement reliability and agreement in psychiatry, Stat Methods Med Res, 7, 301, 10.1177/096228029800700306

Dunn, 1989, Design and Analysis of Reliability Studies: The Statistical Evaluation of Measurement Errors

Brenner, 1996, Dependence of weighted kappa coefficients on the number of categories, Epidemiology, 7, 199, 10.1097/00001648-199603000-00016

Haas, 1991, Statistical methodology for reliability studies, J Manipulative Physiol Ther, 14, 119

Soeken, 1986, Issues in the use of kappa to estimate reliability, Med Care, 24, 733, 10.1097/00005650-198608000-00008

Knight, 1998, The validity of self-reported cocaine use in a criminal justice treatment sample, Am J Drug Alcohol Abuse, 24, 647, 10.3109/00952999809019614

Posner, 1990, Measuring interrater reliability among multiple raters: an example of methods for nominal data, Stat Med, 9, 1103, 10.1002/sim.4780090917

Petersen, 1998, Using the kappa coefficient as a measure of reliability or reproducibility, Chest, 114, 946, 10.1378/chest.114.3.946-a

Main, 1992, The distress and risk assessment method: a simple patient classification to identify distress and evaluate the risk of poor outcome, Spine, 17, 42, 10.1097/00007632-199201000-00007

Sim, 1999, Statistical inference by confidence intervals: issues of interpretation and utilization, Phys Ther, 79, 186, 10.1093/ptj/79.2.186

Flack, 1988, Sample size determinations for the two rater kappa statistic, Psychometrika, 53, 321, 10.1007/BF02294215

Donner, 1992, A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation, Stat Med, 11, 1511, 10.1002/sim.4780111109

Walter, 1998, Sample size and optimal designs for reliability studies, Stat Med, 17, 101, 10.1002/(SICI)1097-0258(19980115)17:1<101::AID-SIM727>3.0.CO;2-E