Variance Estimation of Nominal-Scale Inter-Rater Reliability with Random Selection of Raters

Gwet, Kilem Li

doi:10.1007/s11336-007-9054-8

Theory and Methods
Published: 17 January 2008

Psychometrika, Volume 73, pages 407–430 (2008)

Abstract

Most inter-rater reliability studies using nominal scales suggest the existence of two populations of inference: the population of subjects (collection of objects or persons to be rated) and that of raters. Consequently, the sampling variance of the inter-rater reliability coefficient can be seen as a result of the combined effect of the sampling of subjects and raters. However, all inter-rater reliability variance estimators proposed in the literature only account for the subject sampling variability, ignoring the extra sampling variance due to the sampling of raters, even though the latter may be the biggest of the variance components. Such variance estimators make statistical inference possible only to the subject universe. This paper proposes variance estimators that will make it possible to infer to both universes of subjects and raters. The consistency of these variance estimators is proved as well as their validity for confidence interval construction. These results are applicable only to fully crossed designs where each rater must rate each subject. A small Monte Carlo simulation study is presented to demonstrate the accuracy of large-sample approximations on reasonably small samples.
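
The point about rater sampling adding variance can be illustrated with a small simulation. The sketch below is not the variance estimator proposed in the paper. It uses the mean pairwise percent agreement as a stand-in coefficient, and it assumes a hypothetical binary rating model in which each rater reports a subject's true category with a rater-specific accuracy. The function names, the accuracy range 0.6 to 1.0, and the fixed accuracy 0.8 are all illustrative assumptions.

```python
import random
from itertools import combinations
from statistics import pvariance


def pairwise_agreement(ratings):
    """Mean proportion of agreeing rater pairs, averaged over subjects.

    ratings[i][r] is the category rater r gave subject i
    (fully crossed design: every rater rates every subject).
    """
    per_subject = []
    for row in ratings:
        pairs = list(combinations(row, 2))
        per_subject.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_subject) / len(per_subject)


def simulate_ratings(accuracies, n_subjects, rng):
    """Assumed model: each rater reports the true binary category
    with probability equal to that rater's accuracy."""
    ratings = []
    for _ in range(n_subjects):
        truth = rng.randint(0, 1)
        ratings.append(
            [truth if rng.random() < acc else 1 - truth for acc in accuracies]
        )
    return ratings


def replicate_variance(n_reps, n_subjects, n_raters, resample_raters, seed=0):
    """Empirical variance of the agreement coefficient over replications.

    resample_raters=False: the same raters (accuracy 0.8) are reused, so
    only subjects vary between replications.
    resample_raters=True: a new set of raters is drawn each time, so both
    subjects and raters vary.
    """
    rng = random.Random(seed)
    fixed = [0.8] * n_raters  # hypothetical fixed rater panel
    values = []
    for _ in range(n_reps):
        if resample_raters:
            acc = [rng.uniform(0.6, 1.0) for _ in range(n_raters)]
        else:
            acc = fixed
        values.append(pairwise_agreement(simulate_ratings(acc, n_subjects, rng)))
    return pvariance(values)
```

Calling `replicate_variance(2000, 50, 3, resample_raters=False)` and then the same call with `resample_raters=True` gives the two empirical variances to compare. Under these assumed settings the second should come out larger, because it includes the between-rater component that subject-only variance estimators leave out.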

Author information

Authors and Affiliations

STATAXIS Consulting, Sr. Statistical Consultant, 20315 Marketree Place, Montgomery Village, MD, 20886, USA
Kilem Li Gwet

Corresponding author

Correspondence to Kilem Li Gwet.

About this article

Cite this article

Gwet, K.L. Variance Estimation of Nominal-Scale Inter-Rater Reliability with Random Selection of Raters. Psychometrika 73, 407–430 (2008). https://doi.org/10.1007/s11336-007-9054-8

Received: 28 October 2004
Revised: 28 October 2007
Published: 17 January 2008
Issue Date: September 2008
DOI: https://doi.org/10.1007/s11336-007-9054-8
