Introduction

Item response theory (IRT), also known as latent trait theory, is a statistical paradigm for analyzing tests [1,2,3]. IRT can be used to evaluate test items and test takers quantitatively, to perform quality assurance of tests, and to equate test results. Statistical analysis with IRT is performed on the results of binary classifications (e.g., correct and incorrect answers to test items).

IRT can be implemented using probabilistic programming languages (e.g., WinBUGS, JAGS, and Stan) [4,5,6], which also make it possible to extend IRT. For example, the graded response model and the nominal response model (NRM), which are multiclass extensions of IRT, have been implemented in Stan [7]. In previous studies, Bayesian IRT and NRM models implemented in Stan were used for statistical analyses [7, 8].

In medical diagnosis, the results of various diagnostic procedures are frequently defined as binary classifications: the presence or absence of a disease, the presence or absence of distant metastasis of cancer, the prediction results of cancer death, etc. However, not all medical diagnoses can be expressed as binary classifications, and multiclass classifications are sometimes necessary. For example, the T, N, and M factors of the TNM classification system are expressed as multiclass classifications for many cancer types [9,10,11]. The statistical analysis of these multiclass classification results requires multiclass analysis (such as NRM).

IRT can be applied to medical diagnosis by considering (i) the patient as the test item, (ii) the doctor as the test taker, and (iii) the results of the binary classification obtained through medical diagnosis as the test results. For example, a previous study used Bayesian IRT to statistically evaluate improvements in radiologists’ diagnostic performance [8]. However, to the best of our knowledge, Bayesian NRM has not been applied to the medical diagnosis of multiclass classifications.

The purpose of this study was to use the Bayesian NRM for multiclass medical diagnosis. Although a Stan implementation of the Bayesian NRM has already been reported [7], it does not work stably [12]. Therefore, we extended the conventional NRM in this study. There are two major differences between the conventional NRM and our extended NRM. First, whereas the conventional NRM is frequently implemented by extending the 2PL-IRT, our NRM is implemented by extending the 1PL-IRT. We speculate that the existing Stan implementation of NRM produced unstable results because it extends the 2PL-IRT. Second, our NRM evaluates the ability of test takers using multidimensional parameters rather than a single parameter. Because this study used the results of multiclass classifications, it is natural to evaluate test takers’ abilities with multidimensional parameters. Our Stan implementation of the Bayesian NRM proposed in this study is available as open source on GitHub (https://github.com/jurader/MDNRM).

Materials and methods

Binary and multiclass classification

Generally, the results of binary classification are summarized by a 2 × 2 confusion matrix between the ground truth and prediction. This 2 × 2 confusion matrix is composed of true positive, true negative, false positive, and false negative. Figure 1 shows the 2 × 2 confusion matrix. In a previous study [8], Bayesian IRT was applied by considering true positive and true negative as "correct" and false positive and false negative as "incorrect".

Fig. 1

2 × 2 confusion matrix between ground truth and prediction. TP true positive, TN true negative, FP false positive, FN false negative

In general, the results of multiclass classification are summarized by an N × N confusion matrix. As an example, a 3 × 3 confusion matrix is shown in Fig. 2. Here, it is possible to define the diagonal values of the N × N confusion matrix as "correct" and the other values as "incorrect" and apply Bayesian IRT to these values. However, this method does not allow class-by-class evaluations. To solve this problem, conventional NRM was extended in this study.
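
The collapse of an N × N confusion matrix into "correct" and "incorrect" responses described above can be illustrated with a minimal Python sketch. The function names and the toy data are hypothetical and are not taken from the study; classes are coded 0, 1, 2 for convenience.

```python
def confusion_matrix(truth, pred, n_classes):
    """Count responses by (ground truth, prediction); rows are ground truth."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for s, t in zip(truth, pred):
        matrix[s][t] += 1
    return matrix


def collapse_to_binary(truth, pred):
    """Score each response as 1 (diagonal cell) or 0 (off-diagonal cell)."""
    return [1 if s == t else 0 for s, t in zip(truth, pred)]


# Hypothetical toy data (assumption): six cases, three classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(truth, pred, 3)
binary = collapse_to_binary(truth, pred)
```

After the collapse, the information about which wrong class was chosen is lost, which is why class-by-class evaluation is no longer possible.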

Fig. 2

3 × 3 confusion matrix between ground truth and prediction for 3-class classifications. (a–i) represent the frequencies of cells

IRT and NRM

IRT is a statistical model for analyzing the results of binary classifications. There are several types of IRT models. Here, 1PL-IRT and 2PL-IRT are shown [1]. In addition, conventional NRM, as a multiclass extension of 2PL-IRT, is also described [7]. Finally, an extended NRM is proposed. Hereafter, “case” and “radiologist” are used instead of “test item” and “test taker” of IRT, respectively.

1PL-IRT

In 1PL-IRT, one parameter (\({\beta }_{i}\)) is used to represent case i, and another parameter (\({\theta }_{j}\)) is used to represent radiologist j. The following equations represent the 1PL-IRT.

$$\mathrm{Pr}\left({r}_{ij}=1\right)= \frac{1}{1+\mathrm{exp}({-z}_{ij })}$$
$${z}_{ij }= {\theta }_{j}-{\beta }_{i}$$

Here,

  • \({r}_{ij}\) is the response (prediction) of radiologist j to case i,

  • \({r}_{ij}=1\) means that the response \({r}_{ij}\) is correct,

  • \(\mathrm{Pr}\left({r}_{ij}=1\right)\) represents the probability that the response of radiologist j to case i is correct,

  • \({\beta }_{i}\) is the difficulty parameter of case i,

  • \({\theta }_{j}\) is the ability parameter of the radiologist j.

The equations of 1PL-IRT show that the logit (\({z}_{ij}\)) is converted to a probability by the sigmoid function, as in logistic regression. In the 1PL-IRT, \({\theta }_{j}\,\, \text{and} \,\, {\beta }_{i}\) are estimated based on the prediction results of the radiologists.
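
As a minimal sketch of the 1PL-IRT equations, the following Python function evaluates the probability of a correct response for given parameter values. The function name and the numerical values are illustrative assumptions; in the actual analysis the parameters are estimated rather than fixed.

```python
import math


def irt_1pl(theta_j, beta_i):
    """Probability that radiologist j answers case i correctly (1PL-IRT)."""
    z = theta_j - beta_i  # logit z_ij
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid function


# Illustrative values (assumptions, not estimates from the study).
p_equal = irt_1pl(theta_j=0.0, beta_i=0.0)   # ability equals difficulty
p_strong = irt_1pl(theta_j=1.5, beta_i=0.0)  # ability above difficulty
p_hard = irt_1pl(theta_j=0.0, beta_i=1.5)    # difficulty above ability
```

When ability equals difficulty the probability is 0.5, and it rises above or falls below 0.5 as the ability exceeds or falls short of the difficulty.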

2PL-IRT

In the 2PL-IRT, two parameters (\({\alpha }_{i}\) and \({\beta }_{i}\)) are used to represent case i. The following equations represent 2PL-IRT.

$$\mathrm{Pr}\left({r}_{ij}=1\right)= \frac{1}{1+\mathrm{exp}({-z}_{ij })}$$
$${z}_{ij }= {\alpha }_{i}\left({\theta }_{j}-{\beta }_{i}\right)$$

Here,

  • \({\alpha }_{i}\) and \({\beta }_{i}\) are the discrimination and difficulty parameters of case i.

In 2PL-IRT, \({\theta }_{j}, {\alpha }_{i}\), and \({\beta }_{i}\) are estimated based on the prediction results of radiologists as in 1PL-IRT.
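
The role of the discrimination parameter can be seen in a short Python sketch of the 2PL-IRT equations. The function name and parameter values are hypothetical and serve only as an illustration.

```python
import math


def irt_2pl(theta_j, alpha_i, beta_i):
    """Probability that radiologist j answers case i correctly (2PL-IRT)."""
    z = alpha_i * (theta_j - beta_i)  # logit scaled by the discrimination
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative values (assumptions): the same ability gap of 0.5
# yields a larger probability for the more discriminating case.
p_low_disc = irt_2pl(theta_j=0.5, alpha_i=1.0, beta_i=0.0)
p_high_disc = irt_2pl(theta_j=0.5, alpha_i=2.0, beta_i=0.0)
p_at_difficulty = irt_2pl(theta_j=0.3, alpha_i=2.0, beta_i=0.3)
```

With \({\alpha }_{i}=1\) the expression reduces to the 1PL-IRT form.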

Conventional NRM

Conventional NRM can be viewed as an extension of 2PL-IRT for multiclass classifications. The following equation represents the conventional NRM.

$$\mathrm{Pr}\left({r}_{ij}=k\right)= \frac{\mathrm{exp}\left({\alpha }_{ik}{\theta }_{j}+{\beta }_{ik}\right)}{\sum_{h=1}^{c}\mathrm{exp}\left({\alpha }_{ih}{\theta }_{j}+{\beta }_{ih}\right)}$$

Here,

  • \(\mathrm{Pr}\left({r}_{ij}=k\right)\) represents the probability that the response of radiologist j to case i is class k,

  • The number of classes is c,

  • \({\alpha }_{ik}\) and \({\beta }_{ik}\) are the two parameters of case i on class k,

  • \({\theta }_{j}\) is the ability parameter of radiologist j.

The equation of conventional NRM can be rewritten as follows:

$$\mathrm{Pr}\left({r}_{ij}=k\right)= \frac{\mathrm{exp}\left({z}_{ijk}\right)}{\sum_{h=1}^{c}\mathrm{exp}\left({z}_{ijh}\right)}$$
$${z}_{ijk }= {\alpha }_{ik}{\theta }_{j}+{\beta }_{ik}$$

This means that the NRM uses the softmax function to convert logit (\({z}_{ijk}\)) to probability. Based on the results of the multiclass classification, \({\alpha }_{ik}, {\beta }_{ik}, \,\, \text{and} \,\,{\theta }_{j}\) are estimated. Although a previous study provided the Stan code for conventional NRM (Bayesian NRM) [7], its calculation results were unstable [12].
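
The softmax form of the conventional NRM can be sketched in Python as follows. The helper names and the parameter values are assumptions made for illustration; the sketch only evaluates the response probabilities for fixed parameters and does not perform Bayesian estimation.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the maximum for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nrm_probabilities(theta_j, alpha_i, beta_i):
    """Pr(r_ij = k) for k = 1..c under the conventional NRM.

    alpha_i and beta_i are lists of length c holding alpha_ik and beta_ik.
    """
    logits = [a * theta_j + b for a, b in zip(alpha_i, beta_i)]
    return softmax(logits)


# Illustrative values (assumptions) for a 3-class case.
probs = nrm_probabilities(
    theta_j=0.5,
    alpha_i=[1.0, 0.0, -1.0],
    beta_i=[0.2, 0.0, -0.2],
)
```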

Extended NRM (1PL-NRM and multidimensional NRM (MD-NRM))

To stabilize the results of the Bayesian NRM, we extended the conventional NRM. First, the discrimination parameter (α) is removed from conventional NRM. In other words, our extended NRM is based on 1PL-IRT. This deletion stabilizes the calculation results of our Bayesian NRM. Hereafter, this extension of NRM is referred to as 1PL-NRM. The following equations represent 1PL-NRM.

$$\mathrm{Pr}\left({r}_{ij}=k\right)= \frac{\mathrm{exp}\left({z}_{ijk}\right)}{\sum_{h=1}^{c}\mathrm{exp}\left({z}_{ijh}\right)}$$
$${z}_{ijk }= {\theta }_{j}-{\beta }_{ik}$$

In this equation, \({\theta }_{j}-{\beta }_{ik}\) is used instead of \({\theta }_{j}+{\beta }_{ik}\), to strengthen the meaning of the difficulty parameter of \({\beta }_{ik}\).

In both conventional NRM and 1PL-NRM, the probability of \({r}_{ij}=k\) (the response of radiologist j to case i is class k) depends on only one parameter of the radiologist (\({\theta }_{j}\)). This assumption is unnatural, and the equations of both conventional NRM and 1PL-NRM indicate that the difficulty parameter of the case has a greater influence on which class a radiologist chooses for the diagnosis than their ability parameter. Because there are c classes in multiclass classification, the probability should be dependent on several parameters of the radiologists. To address this problem, we extended 1PL-NRM.

Herein, we propose a multidimensional NRM (MD-NRM) based on 1PL-NRM, which is represented by the following equations:

$$\mathrm{Pr}\left({r}_{ij}=t| ground\, truth\, of\, case\, i=s\right)= \frac{\mathrm{exp}\left({z}_{ijst}\right)}{\sum_{h=1}^{c}\mathrm{exp}\left({z}_{ijsh}\right)}$$
$${z}_{ijst }= {\theta }_{jst}-{\beta }_{is}$$

Here,

  • The number of classes is c,

  • s is the ground truth of case i,

  • \({r}_{ij}=t\) means that the response of radiologist j to case i is t,

  • \({\theta }_{jst}\) is the ability parameter of radiologist j in class t when the ground truth of the case is s (\({\theta }_{j..}\) can be represented by a matrix (c × c)),

  • \({\beta }_{is}\) is the difficulty parameter of case i on class s.

In this extended NRM (MD-NRM), the multidimensional ability parameters of a radiologist are represented by a c × c matrix, which corresponds to the confusion matrix of multiclass classification.
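
A minimal Python sketch of the MD-NRM equations is given below. It assumes zero-based class indices, a c × c list of lists for the ability matrix of one radiologist (rows indexed by the ground truth s, columns by the response t), and purely illustrative parameter values; none of these names or numbers come from the Stan code of this study.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the maximum for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def md_nrm_probabilities(theta_j, beta_is, s):
    """Pr(r_ij = t | ground truth of case i = s) for t = 0..c-1.

    theta_j: c x c ability matrix of radiologist j (theta_jst = theta_j[s][t]).
    beta_is: difficulty parameter of case i, whose ground truth is s.
    """
    c = len(theta_j)
    logits = [theta_j[s][t] - beta_is for t in range(c)]
    return softmax(logits)


# Illustrative 3 x 3 ability matrix (assumption, not an estimate).
theta_j = [
    [2.0, 0.0, -1.0],
    [0.0, 1.5, 0.0],
    [-1.0, 0.5, 1.0],
]
probs = md_nrm_probabilities(theta_j, beta_is=0.3, s=0)
```

Only row s of the ability matrix enters the probabilities for a case whose ground truth is s, which mirrors one row of the confusion matrix.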

Relationship between IRT and NRM

The 2PL-IRT can be derived from the conventional NRM as a special case [13]. Here, we use only two classes (k or k′) for the NRM. Additionally, we consider the conditional probability for a response in class k given that the response is one of classes k or k′. Generally, the conditional probability is given by the following equation:

$$\mathrm{Pr}\left(r=k \mid k, k^{\prime}\right)= \frac{\mathrm{Pr}(r=k)}{\mathrm{Pr}\left(r=k\right)+\mathrm{Pr}\left(r={k}^{\prime}\right)}$$

Here, the conventional NRM is used for Pr(r = k) and Pr(r = k′). The conditional probability is then represented by the following equations (the indices of case i and radiologist j are omitted for brevity):

$$\mathrm{Pr}\left(k \mid k, k^{\prime}\right)=\frac{1}{1+\mathrm{exp}\left(-{\alpha }_{k}^{c}\theta +{\beta^{\prime}}_{k}^{c}\right)}$$
$${\beta^{\prime}}_{k}^{c}= {\beta }_{{k}^{\prime}}-{\beta }_{k}$$
$${\alpha }_{k}^{c}= {\alpha }_{k}-{\alpha }_{{k}^{\prime}}$$
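
As an intermediate step (a sketch that uses only the symbols defined above), the common denominator of the conventional NRM cancels in the ratio, and dividing the numerator and denominator by \(\mathrm{exp}\left({\alpha }_{k}\theta +{\beta }_{k}\right)\) gives:

$$\mathrm{Pr}\left(k \mid k, k^{\prime}\right)=\frac{\mathrm{exp}\left({\alpha }_{k}\theta +{\beta }_{k}\right)}{\mathrm{exp}\left({\alpha }_{k}\theta +{\beta }_{k}\right)+\mathrm{exp}\left({\alpha }_{{k}^{\prime}}\theta +{\beta }_{{k}^{\prime}}\right)}=\frac{1}{1+\mathrm{exp}\left(-\left({\alpha }_{k}-{\alpha }_{{k}^{\prime}}\right)\theta +\left({\beta }_{{k}^{\prime}}-{\beta }_{k}\right)\right)}$$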

Further, this equation can be converted to the following equations.

$$\mathrm{Pr}\left(k \mid k, k^{\prime}\right)=\frac{1}{1+\mathrm{exp}\left(-{\alpha }_{k}^{c}\left(\theta -{\beta }_{k}^{c}\right)\right)}$$
$${\beta }_{k}^{c}= \frac{{\beta^{\prime}}_{k}^{c}}{{\alpha }_{k}^{c}}$$

These equations represent the 2PL-IRT. The same derivation (conventional NRM to 2PL-IRT) can be applied to our 1PL-NRM, in which case the conditional probability is represented by the 1PL-IRT. This is one of the major reasons for using the extended NRM.
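
The reduction from the conventional NRM to the 2PL-IRT form can also be checked numerically. The following Python sketch compares the conditional probability computed directly from the two NRM logits with the 2PL-IRT expression; the function names and parameter values are arbitrary assumptions, and the check requires \({\alpha }_{k}\ne {\alpha }_{{k}^{\prime}}\) so that the division is defined.

```python
import math


def nrm_conditional(theta, alpha_k, beta_k, alpha_k2, beta_k2):
    """Pr(r = k | r is k or k') computed directly from the NRM logits."""
    e_k = math.exp(alpha_k * theta + beta_k)
    e_k2 = math.exp(alpha_k2 * theta + beta_k2)
    return e_k / (e_k + e_k2)


def irt_2pl_form(theta, alpha_k, beta_k, alpha_k2, beta_k2):
    """The same probability written in the 2PL-IRT form."""
    alpha_c = alpha_k - alpha_k2        # alpha_k^c
    beta_prime_c = beta_k2 - beta_k     # beta'_k^c
    beta_c = beta_prime_c / alpha_c     # beta_k^c (needs alpha_c != 0)
    return 1.0 / (1.0 + math.exp(-alpha_c * (theta - beta_c)))


# Arbitrary illustrative values (assumptions).
args = dict(theta=0.7, alpha_k=1.2, beta_k=-0.4, alpha_k2=0.3, beta_k2=0.5)
direct = nrm_conditional(**args)
via_2pl = irt_2pl_form(**args)
```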

Experiments

Our institutional review board approved this retrospective study and waived the requirement for informed consent.

It is assumed that MD-NRM can be applied to data from medical diagnoses of multiclass classification. We applied MD-NRM to the classification results obtained by radiologists in a previous study [14]. In the previous study, there were three classes of medical diagnosis: novel coronavirus pneumonia (COVID), non-novel-coronavirus pneumonia (PNEUMONIA), and normal (NORMAL). Therefore, the classification results were summarized as a 3 × 3 confusion matrix. In total, 150 cases (50 COVID, 50 PNEUMONIA, and 50 NORMAL) were reported in the previous study. Six radiologists each assigned one of the three classes to every case based on visual evaluation of chest X-ray images. Therefore, 900 (150 × 6) nominal responses were obtained. The Supplementary material shows the ground truth of each case and the classification results of the six radiologists.

In this study, MD-NRM was applied to multiclass classification results to analyze the ability of six radiologists for the three classes. For all ability and difficulty parameters, a normal distribution with mean = 0 and standard deviation = 2 was used as the prior distribution. The following parameters were used for sampling in Stan: chains = 8, iter = 8000, warmup = 4000, thin = 1, adapt_delta = 0.9, and max_treedepth = 15. The convergence check of the MD-NRM was performed by evaluating the Rhat values of all parameters [7, 8, 15]. Since we focused on the ability parameters, the estimation results of the difficulty parameters were omitted in this study. We used the following software to implement MD-NRM: R, version 4.1.1; Stan, version 2.21.0; rstan, version 2.21.2; shinystan, version 2.5.0; and tidybayes, version 3.0.1.
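
To illustrate the convergence check, the following Python sketch computes the basic Gelman-Rubin potential scale reduction factor from several chains of draws of one parameter. This is a simplified illustration under stated assumptions: Stan reports split-chain variants of this statistic, so the values need not match Stan's output exactly, and the toy chains are hypothetical. The draw count assumes that iter includes the warmup iterations, as in rstan.

```python
import math


def rhat(chains):
    """Basic (non-split) Gelman-Rubin Rhat for equally long chains."""
    m = len(chains)       # number of chains
    n = len(chains[0])    # draws per chain
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    b = n * sum((mu - grand_mean) ** 2 for mu in means) / (m - 1)
    w = sum(
        sum((x - mu) ** 2 for x in c) / (n - 1)
        for c, mu in zip(chains, means)
    ) / m
    var_hat = (n - 1) / n * w + b / n
    return math.sqrt(var_hat / w)


# Post-warmup draws implied by the sampling settings in the text.
n_chains, n_iter, n_warmup, thin = 8, 8000, 4000, 1
post_warmup_draws = n_chains * (n_iter - n_warmup) // thin

# Hypothetical toy chains: two that agree and two that do not.
rhat_mixed = rhat([[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]])
rhat_separated = rhat([[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]])
```

Chains that explore the same region give a value close to or below 1, whereas chains stuck in different regions give a value well above the 1.10 threshold used in this study.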

Results

Figure 3 shows the convergence check of the Stan results for all the parameters. The Rhat values for all parameters (difficulty and ability parameters) were less than 1.10, which means that the MD-NRM converged successfully [7, 8, 15]. Thus, 1PL-NRM is useful for stabilizing Bayesian NRM.

Fig. 3

Evaluation of convergence in MD-NRM. For all parameters, their Rhat values are less than 1.10

Table 1 shows the estimation results of the MD-NRM for the ability parameters of radiologist 1. In the MD-NRM, radiologist ability is represented in a 3 × 3 matrix, so one radiologist has nine ability parameters. In Table 1, if \({\theta }_{111}\), \({\theta }_{122}\), and \({\theta }_{133}\) are high, the diagnostic performance of radiologist 1 is high for NORMAL, PNEUMONIA, and COVID, respectively.

Table 1 Result of MD-NRM for ability parameters of radiologist 1

Figure 4 shows a 3 × 3 ability matrix created from the median values of the ability parameters of radiologist 1 in Table 1. In Fig. 4, high diagonal values of the ability matrix indicate high diagnostic ability of radiologist 1. Likewise, low off-diagonal values indicate high diagnostic ability.

Fig. 4

Ability matrix of radiologist 1. This matrix was created from the median values of the ability parameters of radiologist 1 in Table 1

Figure 5 (A)–(C) show the summary of \({\theta }_{i11}\) (i = 1, 2, …, 6), \({\theta }_{i22}\) (i = 1, 2, …, 6), and \({\theta }_{i33}\) (i = 1, 2, …, 6) for radiologists 1–6 for NORMAL, PNEUMONIA, and COVID, respectively. The figures in the Supplementary Materials show all the ability parameters of radiologists 1–6.

Fig. 5

Summary of ability parameters of six radiologists. (A) \({\theta }_{i11}\) (i = 1, 2, …, 6) for NORMAL, (B) \({\theta }_{i22}\) (i = 1, 2, …, 6) for PNEUMONIA, (C) \({\theta }_{i33}\) (i = 1, 2, …, 6) for COVID. Note: In (A–C), “theta[i,j,k]” represents \({\theta }_{ijk}\). For example, “theta[1,1,1]” represents \({\theta }_{111}\). Circles, thick bars, and thin bars represent the median values, 50% credible interval (interquartile range), and 95% credible interval, respectively

Our Stan and R code for the MD-NRM is shown in the Supplementary materials. In addition, the code is available as open source on GitHub (https://github.com/jurader/MDNRM).

Discussion

Using the MD-NRM, stable convergence in the Bayesian NRM calculation was achieved with our Stan code. This is a significant improvement because stable convergence cannot be obtained with the conventional NRM Stan code [7]. In addition, compared to the conventional NRM, MD-NRM makes it possible to determine the multidimensional ability parameters of a doctor. These multidimensional parameters allow MD-NRM to provide a more detailed evaluation of a doctor's ability than a single parameter does.

In IRT and NRM, parameters (ability and difficulty) are evaluated in the latent space [1, 8]. Because of this characteristic, IRT is known as the latent trait theory. Logistic regression is frequently used to analyze the results of medical diagnoses [16]. In logistic regression, the effect of covariates on the latent space is statistically evaluated. Therefore, IRT, NRM, and logistic regression share similar functionalities in the evaluation of the parameters.

In receiver operating characteristic (ROC) analysis, it is common to assign a multilevel score to each case to statistically analyze the results of binary classifications [17, 18]. For example, a radiologist’s ability to differentiate between benign and malignant tumors is frequently analyzed by assigning a score to each case on a 5-point scale [19]. However, there is no commonly used method of ROC analysis for multiclass classification results. Instead, multiclass classifications are relatively often analyzed by performing multiple ROC analyses of binary classifications in a one-vs-rest fashion [20]. In contrast, MD-NRM can analyze the results of multiclass classifications in a single analysis, which is a major difference between MD-NRM and ROC analysis. In addition, MD-NRM does not require multilevel scoring (please see the data of 150 cases in the Supplementary material), which is another major difference from ROC analysis.
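
The one-vs-rest reduction mentioned above can be sketched in Python as follows. The function names and toy labels are hypothetical. Because the responses are hard class labels without multilevel scores, each one-vs-rest problem yields only a single operating point (sensitivity and specificity) rather than a full ROC curve.

```python
def one_vs_rest(labels, positive_class):
    """Binarize multiclass labels: 1 for the positive class, 0 for the rest."""
    return [1 if label == positive_class else 0 for label in labels]


def sensitivity_specificity(truth, pred, positive_class):
    """Sensitivity and specificity of one class against all other classes."""
    y = one_vs_rest(truth, positive_class)
    p = one_vs_rest(pred, positive_class)
    tp = sum(1 for a, b in zip(y, p) if a == 1 and b == 1)
    fn = sum(1 for a, b in zip(y, p) if a == 1 and b == 0)
    fp = sum(1 for a, b in zip(y, p) if a == 0 and b == 1)
    tn = sum(1 for a, b in zip(y, p) if a == 0 and b == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy responses (assumption, not data from the study).
truth = ["COVID", "COVID", "PNEUMONIA", "PNEUMONIA", "NORMAL", "NORMAL"]
pred = ["COVID", "PNEUMONIA", "PNEUMONIA", "PNEUMONIA", "NORMAL", "COVID"]

# One separate binary analysis per class.
results = {
    cls: sensitivity_specificity(truth, pred, cls)
    for cls in ("NORMAL", "PNEUMONIA", "COVID")
}
```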

As in a previous study [8], it is possible to introduce covariates into MD-NRM. This extension makes it possible to quantitatively evaluate the effect of covariates using MD-NRM; for example, the improvement in diagnostic performance with various experiences of doctors or with different diagnostic modalities/examinations.

This study had several limitations. First, only the data from 150 patients were used. It has not been evaluated whether stable convergence can be obtained in MD-NRM with smaller datasets. Second, although an extension of the MD-NRM (MD-NRM with covariates) is proposed in this paper, no actual experiments have been conducted. We expect that this extension of MD-NRM will be investigated in the future based on our open-source code. Third, because MD-NRM has more latent parameters than 1PL-NRM, stable convergence may be more difficult to obtain in MD-NRM than in 1PL-NRM.

In conclusion, the MD-NRM was proposed as a statistical analysis method for multiclass classification. The MD-NRM achieved successful convergence of parameter estimation in the Bayesian NRM. In addition, using the MD-NRM, a more detailed evaluation of the doctor's ability was possible using multidimensional ability parameters.