Abstract
Most inter-rater reliability studies that use nominal scales involve two populations of inference: the population of subjects (the objects or persons to be rated) and the population of raters. The sampling variance of an inter-rater reliability coefficient therefore reflects the combined effect of sampling subjects and sampling raters. However, all variance estimators proposed in the literature account only for subject sampling variability and ignore the additional variance due to the sampling of raters, even though the latter may be the largest variance component. Such estimators support statistical inference to the subject universe only. This paper proposes variance estimators that permit inference to both the subject and the rater universes. Their consistency is proved, as is their validity for confidence interval construction. The results apply only to fully crossed designs, in which every rater rates every subject. A small Monte Carlo simulation study demonstrates the accuracy of the large-sample approximations on reasonably small samples.
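To make the two sources of sampling variance concrete, the sketch below computes a multi-rater agreement coefficient on a fully crossed ratings table and then applies a delete-one jackknife twice: once over subjects and once over raters. This is not the estimator derived in the paper. The jackknife is a swapped-in, generic resampling technique, Fleiss's kappa is used only as a familiar example coefficient, and adding the two components is a rough heuristic assumed here for illustration. The function names and the toy data are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss-type kappa for a fully crossed design.

    ratings[i][j] is the category that rater j assigned to subject i.
    Undefined (division by zero) if every rating falls in one category.
    """
    n = len(ratings)        # number of subjects
    r = len(ratings[0])     # number of raters
    totals = Counter()
    pa = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        # Share of agreeing rater pairs for this subject.
        pa += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
    pa /= n
    # Chance agreement from the pooled category proportions.
    pe = sum((c / (n * r)) ** 2 for c in totals.values())
    return (pa - pe) / (1 - pe)


def jackknife_variance(leave_one_out):
    """Delete-one jackknife variance from leave-one-out estimates."""
    m = len(leave_one_out)
    mean = sum(leave_one_out) / m
    return (m - 1) / m * sum((v - mean) ** 2 for v in leave_one_out)


def two_way_jackknife(ratings):
    """Return (subject component, rater component, heuristic total).

    Assumes at least 3 raters, so that kappa stays defined when one
    rater is dropped.
    """
    n, r = len(ratings), len(ratings[0])
    drop_subject = [fleiss_kappa(ratings[:i] + ratings[i + 1:]) for i in range(n)]
    drop_rater = [
        fleiss_kappa([row[:j] + row[j + 1:] for row in ratings]) for j in range(r)
    ]
    v_subjects = jackknife_variance(drop_subject)
    v_raters = jackknife_variance(drop_rater)
    # Heuristic assumption: treat the two samples as independent and add.
    return v_subjects, v_raters, v_subjects + v_raters


# Toy data (hypothetical): 6 subjects rated by 4 raters into categories a/b/c.
demo = [
    ["a", "a", "a", "a"],
    ["a", "a", "b", "a"],
    ["b", "b", "b", "b"],
    ["b", "a", "b", "b"],
    ["c", "c", "c", "b"],
    ["a", "a", "a", "b"],
]
v_subjects, v_raters, v_total = two_way_jackknife(demo)
print(f"kappa={fleiss_kappa(demo):.3f}  subjects={v_subjects:.4f}  "
      f"raters={v_raters:.4f}  total={v_total:.4f}")
```

Reporting only the subject component corresponds to inference to the subject universe alone. The rater component shows how much the coefficient moves when the rater panel changes, which is the variability the abstract says existing estimators ignore.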
Cite this article
Gwet, K.L. Variance Estimation of Nominal-Scale Inter-Rater Reliability with Random Selection of Raters. Psychometrika 73, 407–430 (2008). https://doi.org/10.1007/s11336-007-9054-8
DOI: https://doi.org/10.1007/s11336-007-9054-8