1 Introduction

In many scientific disciplines, quantitative information is collected on a set of variables for a number of objects or participants. When the set of variables can be divided into a set of P predictors and a set of R responses, regression-type models are needed. Researchers often neglect the multivariate nature of the response set, and consequently univariate multiple regression models are fitted, one for each response variable separately. Such an approach does not take into account that the response variables might be correlated and does not provide insight into the relationships among the response variables. In this paper, the interest lies in the case where the R response variables are binary.

For binary response variables, the typical regression model is a logistic regression (Agresti 2013), that is, a generalized linear model (McCullagh and Nelder 1989) with a logit link function and a Bernoulli or binomial distribution. In logistic regression models, the probability that person \(i\) answers yes (or 1) on the response variable Y is defined as \(\pi _{i} = P(Y_{i} = 1)\). These probabilities (or estimated values) are commonly defined in terms of the log-odds form (in GLMs the “linear predictor”), denoted by \(\theta _{i}\), that is,

$$\begin{aligned} \pi _{i} = \frac{\exp (\theta _{i})}{1 + \exp (\theta _{i})} = \frac{1}{1 + \exp (-\theta _{i})}, \end{aligned}$$

and similarly

$$\begin{aligned} 1 - \pi _{i} = \frac{1}{1 + \exp (\theta _{i})} = \frac{\exp (-\theta _{i})}{1 + \exp (-\theta _{i})}. \end{aligned}$$

Finally, the log-odds form is a (linear) function of the P predictor variables, that is

$$\begin{aligned} \theta _i = m + \sum _{p=1}^P x_{ip}a_p, \end{aligned}$$

where \(x_{ip}\) is the observed value for person i on predictor variable p, and m and \(a_p\) are the parameters that need to be estimated. The intercept m is the expected log-odds when all predictor variables equal zero, and the regression weight \(a_p\) indicates the difference in log-odds between two observations that differ by one unit in predictor variable p and have equal values on all other predictor variables.
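As a small illustration (with made-up numbers, not taken from any data set in this paper), the following R snippet computes the probability from the linear predictor for a single person with two predictors:

```r
# Hypothetical single-response example: from linear predictor to probability
m <- -0.5                 # intercept (made-up value)
a <- c(0.8, -1.2)         # regression weights a_1, a_2 (made-up values)
x <- c(1.5, 0.3)          # predictor values for person i
theta <- m + sum(x * a)   # log-odds: m + sum_p x_ip a_p
plogis(theta)             # pi_i = exp(theta) / (1 + exp(theta)), about 0.58
```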

Having R outcome variables (\(Y_r\), \(r = 1,\ldots , R\)) a multivariate model can be defined with probabilities \(\pi _{ir} = P(Y_{ir} = 1)\) that are parameterized as

$$\begin{aligned} \pi _{ir} = \frac{\exp (\theta _{ir})}{1 + \exp (\theta _{ir})} = \frac{1}{1 + \exp (-\theta _{ir})}, \end{aligned}$$

and the log-odds form is

$$\begin{aligned} \theta _{ir} = m_r + \sum _{p=1}^P x_{ip}a_{pr}. \end{aligned}$$

The intercepts can be collected in a vector \({\varvec{m}}\) and the regression weights can be collected in a matrix \({\varvec{A}}\) of size \(P \times R\).

For multivariate outcomes, Yee and Hastie (2003) proposed reduced rank vector generalized linear models, which are multivariate models with a rank constraint on the matrix of regression weights, that is,

$$\begin{aligned} {\varvec{A}} = {\varvec{BV}}' \end{aligned}$$

where \({\varvec{B}}\) is a matrix of size \(P \times S\) and \({\varvec{V}}\) a matrix of size \(R \times S\). The rank of the matrix \({\varvec{A}}\) is S, a number in the range 1 to \(\min (P, R)\), and when \(S < \min (P, R)\) the rank is reduced, hence the name reduced rank regression. The matrix \({\varvec{B}}\) has elements \(b_{ps}\) for \(s = 1,\ldots , S\) and \({\varvec{V}}\) has elements \(v_{rs}\). The S elements for the predictor variable p (i.e., the p-th row of \({\varvec{B}}\)), are collected in the column vector \({\varvec{b}}_p\); similarly, the S elements for response variable r are collected in the column vector \({\varvec{v}}_r\).
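The rank constraint is easy to verify numerically; a small R check (the dimensions below are arbitrary choices for illustration) shows that any matrix of the form \({\varvec{B}}{\varvec{V}}'\) has rank at most S:

```r
# Any A = B V' with B (P x S) and V (R x S) has rank at most S
set.seed(1)
P <- 5; R <- 4; S <- 2
B <- matrix(rnorm(P * S), P, S)
V <- matrix(rnorm(R * S), R, S)
A <- B %*% t(V)
qr(A)$rank   # 2: the reduced rank, instead of min(P, R) = 4
```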

At this point, it is instructive to take a step back to models for continuous outcomes. Reduced rank regression (Anderson 1951; Izenman 1975; Tso 1981; Davies and Tso 1982), also called redundancy analysis (Van den Wollenberg 1977), has been proposed as a multivariate tool for simultaneously predicting the responses from the set of predictors. Reduced rank regression can be motivated from a regression point of view, but also from a principal component point of view.

From the regression point of view of reduced rank regression, the goal is to predict the response variables using the set of predictor variables. We, therefore, set up a multivariate regression model

$$\begin{aligned} {\varvec{Y}} = {\varvec{1}}{\varvec{m}}' + {\varvec{XA}} + {\varvec{E}}, \end{aligned}$$

where \({\varvec{m}}\) denotes the vector with the R intercepts, \({\varvec{A}}\) is a \(P \times R\) matrix with regression weights, and \({\varvec{E}}\) is a matrix with residuals. In the usual multivariate regression model, the matrix of regression coefficients is unrestricted. In reduced rank regression, the restriction \({\varvec{A}} = {\varvec{B}}{\varvec{V}}'\) is imposed.

Reduced rank regression can also be cast as a constrained principal component analysis (PCA; Takane 2013). In PCA, the matrix with multivariate responses (\({\varvec{Y}}\)) of size \(N \times R\) is decomposed into a matrix with object scores (\({\varvec{U}}\)) of dimension \(N \times S\) and a matrix with variable loadings (\({\varvec{V}}\)) of size \(R \times S\). The rank, or dimensionality, should be smaller than or equal to \(\min (N, R)\). We can write PCA as

$$\begin{aligned} {\varvec{Y}} = {\varvec{1}}{\varvec{m}}' + {\varvec{UV}}' + {\varvec{E}}, \end{aligned}$$

where, usually, identifiability constraints are imposed, such as \({\varvec{V}}'{\varvec{V}} = {\varvec{I}}\) or \({\varvec{U}}'{\varvec{U}} = N{\varvec{I}}\). To estimate the PCA parameters, the means of the responses are usually computed first, that is, \(\hat{{\varvec{m}}} = N^{-1} {\varvec{Y}}'{\varvec{1}}\). Then, the centered response matrix \({\varvec{Y}}_c = {\varvec{Y}} - {\varvec{1}}\hat{{\varvec{m}}}'\) is computed, and subsequently \({\varvec{U}}\) and \({\varvec{V}}\) are estimated by minimizing the least squares loss function

$$\begin{aligned} L({\varvec{U}}, {\varvec{V}}) = \Vert {\varvec{Y}}_c - {\varvec{UV}}' \Vert ^2, \end{aligned}$$

where \(\Vert \cdot \Vert ^2\) denotes the squared Frobenius norm of a matrix. Eckart and Young (1936) show that this minimization can be achieved by a singular value decomposition. With predictor variables, the scores (\({\varvec{U}}\)) can be restricted to be a linear combination of these predictor variables, sometimes called external variables (i.e., \({\varvec{X}}\)), that is, \({\varvec{U}} = {\varvec{XB}}\), with \({\varvec{B}}\) a \(P \times S\) matrix to be estimated. Ten Berge (1993) shows that \({\varvec{B}}\) and \({\varvec{V}}\) can be estimated using a generalized singular value decomposition in the metrics \({\varvec{X}}'{\varvec{X}}\) and \({\varvec{I}}\) (see Appendix A for details); that is, we decompose \(({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}{\varvec{X}}'{\varvec{Y}}_c\) with a singular value decomposition,

$$\begin{aligned} ({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}{\varvec{X}}'{\varvec{Y}}_c = {\varvec{P}}{\varvec{\Phi }}{\varvec{Q}}' \end{aligned}$$
(1)

where \({\varvec{P}}'{\varvec{P}} = {\varvec{Q}}'{\varvec{Q}} = {\varvec{I}}\) and \({\varvec{\Phi }}\) is a diagonal matrix with singular values. Subsequently, define the following estimates

$$\begin{aligned} \hat{{\varvec{B}}}&= ({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}{\varvec{P}} \end{aligned}$$
(2)
$$\begin{aligned} \hat{{\varvec{V}}}&= {\varvec{Q}}{\varvec{\Phi }}. \end{aligned}$$
(3)

The intercepts \({\varvec{m}}\) are subsequently estimated as

$$\begin{aligned} \hat{{\varvec{m}}} = N^{-1} ({\varvec{Y}} - {\varvec{XBV}}')'{\varvec{1}}. \end{aligned}$$
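A minimal R sketch of this estimator, following Expressions 1, 2, and 3 and the intercept formula above (our own helper function, not an existing package routine; it assumes \({\varvec{X}}\) has full column rank):

```r
# Reduced rank regression for continuous responses via the (generalized) SVD
rrr_fit <- function(X, Y, S) {
  Yc <- scale(Y, center = TRUE, scale = FALSE)                   # centered responses
  e  <- eigen(crossprod(X), symmetric = TRUE)                    # eigendecomposition of X'X
  Kh <- e$vectors %*% diag(1 / sqrt(e$values), ncol(X)) %*% t(e$vectors)  # (X'X)^{-1/2}
  s  <- svd(Kh %*% crossprod(X, Yc))                             # Expression (1)
  B  <- Kh %*% s$u[, 1:S, drop = FALSE]                          # Expression (2)
  V  <- s$v[, 1:S, drop = FALSE] %*% diag(s$d[1:S], S, S)        # Expression (3)
  m  <- colMeans(Y - X %*% B %*% t(V))                           # intercepts
  list(m = m, B = B, V = V)
}
```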

Reduced rank regression can thus be understood from two different points of view: as a multivariate regression model with constraints on the regression weights, or as a principal component analysis with constraints on the object scores. Yee and Hastie (2003) approached logistic reduced rank regression as a multivariate regression model with a rank constraint on the regression weights. We could also approach logistic reduced rank regression from the PCA point of view. PCA for binary variables has received considerable attention lately (Collins et al. 2001; Schein et al. 2003; De Leeuw 2006; Landgraf and Lee 2020).

De Leeuw (2006) defined the log-odds term \(\theta _{ir}\) in terms of a principal component analysis, \(\theta _{ir} = m_r + {\varvec{u}}_{i}'{\varvec{v}}_{r}\), where \({\varvec{u}}_i\) is the vector with object scores for participant i. Similar to standard PCA, object scores and variable loadings are obtained. These scores and loadings, however, reconstruct the log-odds term, not the response variables themselves. De Leeuw (2006) also proposed a Majorization Minimization (MM) algorithm (Heiser 1995; Hunter and Lange 2004; Nguyen 2017) for maximum likelihood estimation of the parameters. In the majorization step, the negative log-likelihood is majorized by a least squares function. This majorizing least squares function is minimized by applying a singular value decomposition to a matrix with working responses, which are variables that function as the response variables in each iteration of the algorithm but are updated from iteration to iteration.

Yee and Hastie (2003, Sect. 3.4) note that “in general, minimization of the negative likelihood cannot be achieved by use of the singular value decomposition” and therefore proposed an alternating algorithm, where in each iteration first \({\varvec{B}}\) is estimated considering \({\varvec{V}}\) fixed, and subsequently \({\varvec{V}}\) is estimated considering \({\varvec{B}}\) fixed. For both steps a weighted least squares update is derived, where in every iteration the weights and the responses need to be redefined based on the current set of parameters. In the next section, we develop an MM algorithm for logistic reduced rank regression (note, not all reduced rank generalized linear models) based on the work of De Leeuw (2006), where in each of the iterations a generalized singular value decomposition is applied. We compare the two algorithms in terms of speed of computation in Sect. 4.

For interpretation of (logistic) reduced rank models, a researcher can inspect the estimated coefficients \({\varvec{A}} = {\varvec{BV}}'\). Coefficients in this matrix can be interpreted like the usual regression weights in (logistic) regression models. Because the number of coefficients is usually large (i.e., \(P \times R\)), it is difficult to obtain a holistic interpretation from this matrix. Visualization can help to obtain such a holistic interpretation of the reduced rank model. PCA solutions can be graphically represented by biplots (Gabriel 1971; Gower and Hand 1996; Gower et al. 2011). Biplots are generalizations of the usual scatterplots for multivariate data, where the observations (\(i = 1,\ldots , N\)) and the variables (\(r = 1,\ldots , R\)) are represented in low dimensional visualizations. The observations are represented by points, whereas the variables are represented by variable axes. These biplots have been extended for reduced rank regression (Ter Braak and Looman 1994), which not only represent observations and response variables but also predictor variables by variable axes; that is, they represent three different types of information, and therefore we call them triplots.

For visualization of the logistic reduced rank regression model, two types of triplots have been proposed. Vicente-Villardón et al. (2006) and Vicente-Villardón and Vicente-Gonzalez (2019) modified the usual biplots/triplots for the representation of binary data based on logistic models. Another type of triplot was proposed by Poole and Rosenthal (1985), Clinton et al. (2004), Poole et al. (2011), and De Rooij and Groenen (2022).

2.1 General theory about MM algorithms

The idea of MM for finding a minimum of the function \({\mathcal {L}}({\varvec{\theta }})\), where \({\varvec{\theta }}\) is a vector of parameters, is to define an auxiliary function, called a majorization function, \({\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }})\) with two characteristics

$$\begin{aligned} {\mathcal {L}}({\varvec{\vartheta }}) = {\mathcal {M}}({\varvec{\vartheta }}|{\varvec{\vartheta }}), \end{aligned}$$

where \({\varvec{\vartheta }}\) is a supporting point, and

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}) \le {\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }}). \end{aligned}$$

The two equations tell us that \({\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }})\) is a function that lies above (i.e., majorizes) the original function and touches it at the supporting point. Because of these two properties, an iterative sequence defines a convergent algorithm because, by construction,

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}^+) \le {\mathcal {M}} ({\varvec{\theta }}^+|{\varvec{\vartheta }}) \le {\mathcal {M}} ({\varvec{\vartheta }}|{\varvec{\vartheta }}) = {\mathcal {L}}({\varvec{\vartheta }}), \end{aligned}$$

where \({\varvec{\theta }}^+\) is

$$\begin{aligned} {\varvec{\theta }}^+ = \textrm{argmin}_{{\varvec{\theta }}} \ {\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }}), \end{aligned}$$

the updated parameter. A main advantage of MM algorithms is that they always converge monotonically to a (local) minimum. The challenge is to find a parametrized function family, \({\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }})\), that can be used in every step.

In our case, the original function is the negative log-likelihood. We majorize this function with a least squares function, that is, \({\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }})\) is a least squares function. We use majorization via a quadratic upper bound; that is, for a twice differentiable function \({\mathcal {L}}({\varvec{\theta }})\) and for each \({\varvec{\vartheta }}\), the function

$$\begin{aligned} {\mathcal {M}}({\varvec{\theta }}|{\varvec{\vartheta }}) = {\mathcal {L}}({\varvec{\vartheta }}) + {\mathcal {L}}'({\varvec{\vartheta }}) ({\varvec{\theta }} - {\varvec{\vartheta }}) + \frac{1}{2}({\varvec{\theta }}-{\varvec{\vartheta }})'{\varvec{A}}({\varvec{\theta }}-{\varvec{\vartheta }}) \end{aligned}$$

majorizes \({\mathcal {L}}({\varvec{\theta }})\) at \({\varvec{\vartheta }}\) when the matrix \({\varvec{A}}\) is such that

$$\begin{aligned} {\varvec{A}} - \partial ^2 {\mathcal {L}}({\varvec{\theta }}), \end{aligned}$$

is positive semi-definite. We also use the property that majorization is closed under summation, that is, when \({\mathcal {M}}_1\) majorizes \({\mathcal {L}}_1\) and \({\mathcal {M}}_2\) majorizes \({\mathcal {L}}_2\), then \({\mathcal {M}}_1 + {\mathcal {M}}_2\) majorizes \({\mathcal {L}}_1 + {\mathcal {L}}_2\).
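To make the quadratic upper bound concrete, here is a tiny R illustration (our own toy example, not part of any package) for a single logistic loss term \({\mathcal {L}}(\theta ) = -\log \sigma (q\theta )\): with the bound \(A = 1/4\), the minimizer of the majorizer is \(\vartheta - 4{\mathcal {L}}'(\vartheta )\), and the loss decreases monotonically over the iterations.

```r
# Quadratic-upper-bound MM for one logistic loss term L(theta) = -log sigma(q * theta)
q     <- 1                                            # a single observation coded +1
loss  <- function(theta) -plogis(q * theta, log.p = TRUE)
grad  <- function(theta) -q * (1 - plogis(q * theta)) # L'(theta)
theta <- -3                                           # starting value / supporting point
for (it in 1:5) {
  theta <- theta - 4 * grad(theta)                    # argmin of the quadratic majorizer
  cat("iteration", it, "loss", loss(theta), "\n")     # loss decreases monotonically
}
```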

2.2 The algorithm

Let us recap our loss function

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}) = \sum _{i=1}^N\sum _{r = 1}^R {\mathcal {L}}_{ir}(\theta _{ir}) = \sum _{i=1}^N\sum _{r = 1}^R -\log \frac{1}{1 + \exp (-q_{ir}\theta _{ir})}. \end{aligned}$$

where \(q_{ir} = 2y_{ir} - 1\) recodes the binary response \(y_{ir}\) to \(\pm 1\). Because of the summation property, we can focus on a single element, \({\mathcal {L}}_{ir}(\theta _{ir})\). The first derivative of \({\mathcal {L}}_{ir}(\theta _{ir})\) with respect to \(\theta _{ir}\) is

$$\begin{aligned} \begin{aligned} \xi _{ir} \equiv \frac{\partial {\mathcal {L}}_{ir} (\theta _{ir})}{\partial \theta _{ir}}&= -(y_{ir} - \pi _{ir}). \end{aligned} \end{aligned}$$

Filling in the derivative and using the upper bound \(A = \frac{1}{4}\) (Böhning and Lindsay 1988; Hunter and Lange 2004), we have that

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{ir}(\theta _{ir})&\le {\mathcal {L}}_{ir}(\vartheta _{ir}) + \xi _{ir}(\theta _{ir} - \vartheta _{ir}) + \frac{1}{8}(\theta _{ir} - \vartheta _{ir})(\theta _{ir} - \vartheta _{ir}) \\&\le {\mathcal {L}}_{ir}(\vartheta _{ir}) + \xi _{ir}\theta _{ir} - \xi _{ir}\vartheta _{ir} + \frac{1}{8}(\theta _{ir}^2 + \vartheta _{ir}^2 -2\theta _{ir}\vartheta _{ir}) \\&\le {\mathcal {L}}_{ir}(\vartheta _{ir}) + \frac{1}{8}\theta _{ir}^2 + \xi _{ir}\theta _{ir} -2\frac{1}{8}\theta _{ir}\vartheta _{ir} - \xi _{ir}\vartheta _{ir} + \frac{1}{8}\vartheta _{ir}^2. \end{aligned} \end{aligned}$$

Let us now define \(z_{ir} = \vartheta _{ir} - 4\xi _{ir}\) to obtain

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{ir}(\theta _{ir})&\le {\mathcal {L}}_{ir}(\vartheta _{ir}) + \frac{1}{8}\theta _{ir}^2 -2\frac{1}{8}\theta _{ir}z_{ir} + \left( \frac{1}{8}z_{ir}^2 - \frac{1}{8}z_{ir}^2 \right) - \xi _{ir}\vartheta _{ir} + \frac{1}{8}\vartheta _{ir}^2 \\&\le {\mathcal {L}}_{ir}(\vartheta _{ir}) + \frac{1}{8}(\theta _{ir} - z_{ir})^2 - \frac{1}{8}z_{ir}^2 - \xi _{ir}\vartheta _{ir} + \frac{1}{8}\vartheta _{ir}^2 \\&\le \frac{1}{8}(\theta _{ir} - z_{ir})^2 + c_{ir}, \end{aligned} \end{aligned}$$

where \(c_{ir} = {\mathcal {L}}_{ir}(\vartheta _{ir}) - \frac{1}{8}z_{ir}^2 - \xi _{ir}\vartheta _{ir} + \frac{1}{8}\vartheta _{ir}^2\) is a constant.

Now as

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}) = \sum _{i=1}^N\sum _{r = 1}^R {\mathcal {L}}_{ir}(\theta _{ir}) \end{aligned}$$

we have that

$$\begin{aligned} {\mathcal {L}}({\varvec{\theta }}) \le \sum _{i=1}^N\sum _{r = 1}^R \frac{1}{8}(\theta _{ir} - z_{ir})^2 + c, \end{aligned}$$

a least squares majorization function, with \(c = \sum _i \sum _r c_{ir}\).

For logistic principal component analysis, De Leeuw (2006) defined \(\theta _{ir} = m_r + {\varvec{u}}_i'{\varvec{v}}_r\). Collecting the elements \(z_{ir}\) in the matrix \({\varvec{Z}}\), in every iteration of the MM-algorithm he minimizes

$$\begin{aligned} \Vert {\varvec{Z}} - {\varvec{1m}}' - {\varvec{UV}}' \Vert ^2. \end{aligned}$$

For logistic reduced rank regression, we define \(\theta _{ir} = m_r + {\varvec{x}}_i'{\varvec{B}}{\varvec{v}}_r\). In every iteration of the MM-algorithm we minimize

$$\begin{aligned} \Vert {\varvec{Z}} - {\varvec{1m}}' - {\varvec{XBV}}' \Vert ^2, \end{aligned}$$

which can be done by computing the mean and a generalized singular value decomposition as in Expressions 1, 2, and 3. In detail, in every iteration we compute \({\varvec{Z}} = {\varvec{1m}}' + {\varvec{XBV}}' + 4({\varvec{Y}} - {\varvec{\Pi }})\), where \({\varvec{\Pi }}\) is the matrix with elements \(\pi _{ir}\), using the current parameter values, and update our parameters as

  • \({\varvec{m}}^+ = N^{-1} ({\varvec{Z}} - {\varvec{XBV}}')'{\varvec{1}}\)

  • \(({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}{\varvec{X}}'({\varvec{Z}} - {\varvec{1m}}') = {\varvec{P}}{\varvec{\Phi }}{\varvec{Q}}'\)

  • \({\varvec{B}}^+ = \sqrt{N} ({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}} {\varvec{P}}_S\)

  • \({\varvec{V}}^+ = (\sqrt{N})^{-1} {\varvec{Q}}_S{\varvec{\Phi }}_S\)

where \({\varvec{P}}_S\) are the singular vectors corresponding to the S largest singular values, similarly for \({\varvec{Q}}_S\), and \({\varvec{\Phi }}_S\) is the diagonal matrix with the S largest singular values.
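The following R function is a minimal sketch of these iterations (our own illustrative implementation, not the lmap code; it assumes a 0/1-coded response matrix \({\varvec{Y}}\), a full column rank \({\varvec{X}}\), and uses the change in deviance as a convergence criterion):

```r
# MM algorithm for logistic reduced rank regression (illustrative sketch)
lrrr_mm <- function(Y, X, S, maxit = 5000, tol = 1e-8) {
  N <- nrow(Y); P <- ncol(X); R <- ncol(Y)
  e  <- eigen(crossprod(X), symmetric = TRUE)
  Kh <- e$vectors %*% diag(1 / sqrt(e$values), P) %*% t(e$vectors)  # (X'X)^{-1/2}, computed once
  m  <- rep(0, R); B <- matrix(0, P, S); V <- matrix(0, R, S)       # simple starting values
  dev_old <- Inf
  for (iter in seq_len(maxit)) {
    Theta <- matrix(m, N, R, byrow = TRUE) + X %*% B %*% t(V)       # current log-odds
    Pi    <- plogis(Theta)                                          # current probabilities
    dev   <- -2 * sum(Y * log(Pi) + (1 - Y) * log(1 - Pi))          # deviance
    if (abs(dev_old - dev) < tol) break
    dev_old <- dev
    Z  <- Theta + 4 * (Y - Pi)                                      # working responses
    m  <- colMeans(Z - X %*% B %*% t(V))                            # intercept update
    s  <- svd(Kh %*% crossprod(X, Z - matrix(m, N, R, byrow = TRUE)))  # GSVD step
    B  <- sqrt(N) * Kh %*% s$u[, 1:S, drop = FALSE]
    V  <- s$v[, 1:S, drop = FALSE] %*% diag(s$d[1:S], S, S) / sqrt(N)
  }
  list(m = m, B = B, V = V, deviance = dev, iterations = iter)
}
```

Note that the \(\sqrt{N}\) factors in the updates of \({\varvec{B}}\) and \({\varvec{V}}\) cancel in the product \({\varvec{XBV}}'\), so they only affect the normalization of the two factors, not the fitted log-odds.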

3 Visualization

A rank 2 model can be visualized with a two-dimensional representation in a so-called triplot, that shows simultaneously three types of information: the predictor variables, the response variables, and the participants (objects). When a higher rank model is fitted, triplots can be constructed for any pair of dimensions.

We discuss two types of triplots for logistic reduced rank regression. The first is based on the inner product relationship where we project points, representing the participants, on response variable axes with markers indicating probabilities of responding with yes (or 1). We call this a triplot of type I. Another type of triplot was recently described in detail by De Rooij and Groenen (2023) and uses a distance representation. We call this a triplot of type D. The two triplots are equivalent in the sense that they represent the same information but in a different way. In this section, we describe the two triplots in detail, make a comparison, and propose a new hybrid type of triplot that combines the advantages of the two types of triplots.

Fig. 1 Visualization of Logistic Reduced Rank Models. a graphical representation of two predictor variables and the process of interpolation for two observations A and B; b type I representation of one response variable and the process of prediction for the two observations; c type D representation of one response variable, where the distance between the two observations A and B and the two response classes determines the probabilities

3.1 The type I triplot

This type of logistic triplot was proposed by Vicente-Villardón et al. (2006) and Vicente-Villardón and Vicente-Gonzalez (2019). The objects, or participants, are depicted as points in a two-dimensional Euclidean space with coordinates \({\varvec{u}}_i = {\varvec{B}}'{\varvec{x}}_i\).

Each of the predictor variables is represented by a variable axis through the origin of the Euclidean space with direction (slope) \(b_{p2}/b_{p1}\). Markers can be added to the variable axis representing units \(t = \pm 1, \pm 2, \ldots\), with coordinates \(t\,{\varvec{b}}_p\). We use the convention that the variable axis has a dotted and a solid part. The solid part represents the observed range of the variable in the data; the dotted part extends the variable axis to the border of the display. The variable label is printed on the side with the highest value of the variable.

Object coordinates (\({\varvec{u}}_i\)) follow directly from these predictor variable axes through the process of interpolation, as described in Gower and Hand (1996) and Gower et al. (2011).

This is illustrated in Fig. 1a, where two predictor variables are represented: one with regression weights 0.55 and -0.45 (the diagonal variable axis), the other with regression weights -0.05 and -0.75 (the almost vertical variable axis). Markers for the values -3 to 3 are added to both variable axes. Also included are two observations, A and B, with values 2 and 1 (A) and -3 and 2 (B) on the two predictor variables, respectively. The process of interpolation is illustrated by the grey dotted lines; that is, we have to add two vectors to obtain the coordinates for the two observations.
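In R, the interpolation in Fig. 1a amounts to the matrix-vector product \({\varvec{u}}_i = {\varvec{B}}'{\varvec{x}}_i\); a small check using the values given above:

```r
# Interpolation of observations A and B in Fig. 1a: u_i = B' x_i
B  <- rbind(c(0.55, -0.45),   # weights of the first (diagonal) predictor axis
            c(-0.05, -0.75))  # weights of the second (almost vertical) axis
xA <- c(2, 1); xB <- c(-3, 2)
uA <- drop(t(B) %*% xA)       # coordinates of A: (1.05, -1.65)
uB <- drop(t(B) %*% xB)       # coordinates of B: (-1.75, -0.15)
```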

Each of the response variables is also represented by a variable axis through the origin. The direction of this variable axis is \(v_{r2}/v_{r1}\). Markers can be added to these variable axes as well. We add markers that represent probabilities of responding 1 equal to \(\pi = \{0.1, 0.2, \ldots , 0.9\}\). The location of these markers is given by \(\lambda ({\varvec{v}}_r'{\varvec{v}}_r)^{-1}{\varvec{v}}_r\), where \(\lambda = \log (\pi /(1-\pi )) - {\hat{m}}_r\) (based on Gower et al. 2011, page 24). To obtain the predicted probability for an object or participant on response variable r, the process called prediction (Gower and Hand 1996; Gower et al. 2011) has to be used, where the point representing this object is projected onto the variable axis.

This is illustrated in Fig. 1b for a single response variable with \(v_{r1} = -0.8\) and \(v_{r2} = -0.5\), and with \(m_r = -1\). On the variable axis, markers indicate the expected probabilities. By projecting the points of observations A and B onto this variable axis, we obtain their expected probabilities: for observation A approximately 0.28, and for observation B approximately 0.61.
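The same probabilities can be obtained directly from the inner product rule \(\theta _{ir} = m_r + {\varvec{u}}_i'{\varvec{v}}_r\); a small check, reusing the coordinates computed for Fig. 1a:

```r
# Predicted probabilities for A and B for the response variable in Fig. 1b
uA  <- c(1.05, -1.65); uB <- c(-1.75, -0.15)   # object coordinates from Fig. 1a
v_r <- c(-0.8, -0.5); m_r <- -1
plogis(m_r + sum(uA * v_r))   # about 0.27, close to the 0.28 read off the axis for A
plogis(m_r + sum(uB * v_r))   # about 0.62, close to the 0.61 read off the axis for B
```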

The two-dimensional space can be partitioned into two parts by a decision line for response variable r. These decision lines are perpendicular to the response variable axis and pass through the \(\pi = 0.5\) marker. As we have a decision line for each response variable, we have R of these lines in total, together partitioning the space into a maximum of \(\sum _{s=0}^S \left( {\begin{array}{c}R\\ s\end{array}}\right)\) regions (Coombs and Kao 1955), each region having a favourite response profile.
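For example, with \(S = 2\) dimensions and \(R = 11\) response variables, the decision lines partition the plane into at most \(\binom{11}{0} + \binom{11}{1} + \binom{11}{2} = 1 + 11 + 55 = 67\) regions; a one-line check in R:

```r
# Maximum number of regions formed by R decision lines in an S-dimensional triplot
regions <- function(R, S) sum(choose(R, 0:S))
regions(R = 11, S = 2)   # 67
```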

3.2 The type D triplot

These types of triplots are described in detail in De Rooij and Groenen (2023). Similar work has been proposed earlier, mainly in the context of political vote casts, by Poole and Rosenthal (1985), Clinton et al. (2004), and Poole et al. (2011).

In Type D triplots, the response variables are represented in a different manner, while the object points and the variable axes for the predictors remain the same (see Fig. 1a). Each response variable is represented by two points, one for the no-category (with coordinates \({\varvec{w}}_{r0}\)) and one for the yes-category (with coordinates \({\varvec{w}}_{r1}\)). The squared distances between an object location and these two points define the probability, that is

$$\begin{aligned} \pi _{ir} = \frac{\exp \left( -\frac{1}{2}d^2({\varvec{u}}_i,{\varvec{w}}_{r1}) \right) }{\exp \left( -\frac{1}{2}d^2({\varvec{u}}_i,{\varvec{w}}_{r0})\right) + \exp \left( -\frac{1}{2}d^2({\varvec{u}}_i,{\varvec{w}}_{r1})\right) }, \end{aligned}$$
(4)

where \(d^2({\varvec{u}}_i,{\varvec{w}}_{r1})\) is the squared Euclidean two-mode distance

$$\begin{aligned} d^2({\varvec{u}}_i,{\varvec{w}}_{r1}) = \sum _{s=1}^S (u_{is} - w_{r1s})^2. \end{aligned}$$
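A direct R transcription of Expression (4) (our own helper, for illustration only):

```r
# Probability of a 'yes' response from the distance representation in Eq. (4)
prob_yes <- function(u, w0, w1) {
  d0 <- sum((u - w0)^2)     # squared distance to the 'no' point
  d1 <- sum((u - w1)^2)     # squared distance to the 'yes' point
  exp(-0.5 * d1) / (exp(-0.5 * d0) + exp(-0.5 * d1))
  # algebraically the same as plogis((d0 - d1) / 2)
}
```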

The coordinates \({\varvec{w}}_{r0}\) and \({\varvec{w}}_{r1}\) for all r can be collected in the \(2R \times S\) matrix \({\varvec{W}}\). This matrix can be reparametrized as

$$\begin{aligned} {\varvec{W}} = {\varvec{A}}_l {\varvec{L}} + {\varvec{A}}_k {\varvec{K}} \end{aligned}$$

with \({\varvec{A}}_l = {\varvec{I}}_R \otimes [1, 1]'\) and \({\varvec{A}}_k = {\varvec{I}}_R \otimes [1, -1]'\), where \(\otimes\) denotes the Kronecker product, and where \({\varvec{L}}\) is the \(R \times S\) matrix with response variable locations and \({\varvec{K}}\) the \(R \times S\) matrix representing the discriminatory power for the response variables. Elements of \({\varvec{K}}\) and \({\varvec{L}}\) are denoted by \(k_{rs}\) and \(l_{rs}\), respectively.
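The indicator matrices \({\varvec{A}}_l\) and \({\varvec{A}}_k\) are easily constructed with a Kronecker product; a small sketch for, say, \(R = 3\) response variables:

```r
# Indicator matrices in the reparametrization W = A_l L + A_k K
R   <- 3
A_l <- diag(R) %x% matrix(c(1, 1), 2, 1)    # 2R x R: both points of response r get location l_r
A_k <- diag(R) %x% matrix(c(1, -1), 2, 1)   # 2R x R: the two points are shifted by +k_r and -k_r
# given L (R x S) and K (R x S), the coordinates are W <- A_l %*% L + A_k %*% K
```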

To obtain the coordinates \({\varvec{W}}\) from the parameters of the logistic reduced rank regression we take the following steps (for details, see De Rooij and Groenen 2022).

An exact comparison is difficult, because the implementation in VGAM starts from a formula, so the design matrices have to be built, whereas the lmap package starts with the two matrices. Furthermore, the VGAM implementation uses several criteria for convergence, while the lmap convergence criterion is solely based on the deviance. In the comparison, we first checked, using the complete data set, which convergence criterion in lmap leads to the same solution. Nevertheless, the comparison is not completely fair. We estimate the same rank 2 model using these two algorithms. The data set has \(N = 1885\), \(P = 9\), and \(R = 11\). To compare the speed of the two algorithms we use the microbenchmark package (Mersmann 2021), where the two algorithms are applied ten times. Results are shown in Table 1, where it can be seen that the MM algorithm is much faster.

Table 1 Timing of rrvglm-algorithm in the VGAM package and the lpca-algorithm in the lmap package for the drug consumption and NCD data sets

For our MM algorithm, \(({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}\) and \(({\varvec{X}}'{\varvec{X}})^{-\frac{1}{2}}{\varvec{X}}'\) only need to be computed once. During the iterations, an SVD of a \(P \times R\) matrix has to be computed, which is a relatively small matrix. Therefore, the computational burden during the iterations is small. As is usual for MM algorithms, convergence is relatively slow in terms of the number of iterations needed (Heiser 1995). For the drug consumption data, 42 iterations were needed.

The IRLS algorithm alternates between an update of \({\varvec{B}}\) and \({\varvec{V}}\) (Yee 2015, Sect. 5.3.1). For both updates, a matrix with working weights needs to be inverted. As this matrix depends on the current estimated values, the inverse needs to be re-evaluated in every iteration. These matrix inversions are computationally heavy. In terms of the number of iterations the IRLS algorithm is faster, that is, only 6 iterations were needed for the drug consumption data.

Yee (2015) (Sect. 3.2.1) points out that, apart from some matrix multiplications, the number of floating point operations (flops) for the IRLS algorithm is \(2NS^3(P^2 + R^2)\) per iteration, which for this application is 6092320. The SVD in the MM algorithm takes \(2PR^2 + 11R^3 = 16819\) flops per iteration (cf., Trefethen and Bau 1997). These computations exclude the intercepts (\({\varvec{m}}\)). Also with respect to storage our algorithm has some advantages, that is, the MM algorithm works with the matrix \({\varvec{X}}\) while the IRLS algorithm works with \({\varvec{X}} \otimes {\varvec{I}}_S\), a much larger matrix.
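The two per-iteration flop counts quoted above can be reproduced directly:

```r
# Per-iteration flop counts for the comparison (N = 1885, S = 2, P = 9, R = 11)
N <- 1885; S <- 2; P <- 9; R <- 11
2 * N * S^3 * (P^2 + R^2)   # 6092320 flops for the IRLS updates
2 * P * R^2 + 11 * R^3      # 16819 flops for the SVD in the MM update
```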

4.1.2 Visualization

De Rooij and Groenen (2022). The algorithm uses the majorization inequality of De Leeuw (2006) to move from the negative log-likelihood to a least squares function, and the generalized singular value decomposition for minimizing the least squares function. Each of these two steps is known, and as such the algorithm is a straightforward combination of the two. The inequality of De Leeuw (2006), which provides a least-squares majorization function for the negative of the binomial or multinomial log-likelihood, can be used more generally to develop logistic models for categorical data.

We compared the new algorithm to the IRLS algorithm (Yee and Hastie 2003; Yee 2015) on two empirical data sets; our new algorithm is about ten times faster but uses more iterations. MM algorithms are known for slow convergence in terms of the number of iterations needed (see Heiser 1995). Yet, the updates within the iterations are computationally cheap: a singular value decomposition of a \(P \times R\) matrix needs to be computed, where usually P and R are relatively small. Our MM algorithm is only applicable to logistic reduced rank models, whereas the IRLS algorithm of Yee (2015) is designed for the whole family of reduced rank generalized linear models.

In the VGAM-package (Yee 2022, 2015), the standard errors of the model parameters can also be obtained in a second step. In an MM algorithm, it is not the negative log-likelihood itself that is minimized, but the majorization function that is minimized iteratively. Therefore, our algorithm does not automatically give an estimate of the Hessian matrix nor of the standard errors. Is this a bad thing? Buja et al. (2019a, 2019b) recently argued that if we know statistical models are approximations, we should also carry the consequences of this knowledge. The computation of standard errors assumes the model to be true, that is, not an approximation. Therefore, assuming the model is true while knowing it is not results in estimated standard errors that are biased. A better approach to obtain standard errors or confidence intervals is to use the so-called pairs bootstrap, where the predictors and responses are jointly re-sampled. For the bootstrap, it is useful that the algorithm is fast.
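As an illustration of what such a pairs bootstrap could look like (a sketch under the assumption that a fitting function such as the lrrr_mm() sketch from Sect. 2.2 is available; this is not an existing lmap or VGAM routine), one can resample the rows of \({\varvec{X}}\) and \({\varvec{Y}}\) jointly and collect the bootstrap replicates of the coefficient matrix \({\varvec{A}} = {\varvec{BV}}'\):

```r
# Pairs bootstrap for the reduced rank coefficients A = B V' (illustrative sketch)
pairs_bootstrap <- function(X, Y, S = 2, nboot = 500, fit = lrrr_mm) {
  N <- nrow(X)
  replicate(nboot, {
    idx  <- sample.int(N, N, replace = TRUE)                 # resample rows jointly
    fitb <- fit(Y[idx, , drop = FALSE], X[idx, , drop = FALSE], S)
    fitb$B %*% t(fitb$V)                                     # bootstrap replicate of A
  }, simplify = FALSE)
}
# percentile intervals per coefficient can then be taken over the list of replicates
```

Collecting \({\varvec{A}} = {\varvec{BV}}'\) rather than \({\varvec{B}}\) and \({\varvec{V}}\) separately avoids the rotational indeterminacy of the decomposition.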

Two types of triplots were discussed and compared, the Type I and Type D triplots. The two types of triplots are identical on the predictor side of the model, but differ in the representation of the response variables. Whereas the Type I triplot uses an inner product relationship, where object points have to be projected onto response variable axes, the Type D triplot uses a distance relationship, where the distances of an object point to the yes and no points of each response variable determine the probabilities. We discussed advantages and disadvantages of both approaches, and were then able to develop a new, hybrid type of triplot by combining the two.

In the Type D triplot we make use of the two-mode Euclidean distance. This distance is often used to model single-peaked response functions and the object points are then called ideal points. The single-peaked response function is usually contrasted with the dominance response function (see, for example, Drasgow et al. 2010, and references therein). Whereas in the latter the probability of answering yes is a monotonic function of the position on the latent trait, in the former this probability is a single-peaked function defined by distances. In the Type D triplot, however, the relationship is still of the dominance type, where the probability of answering yes goes monotonically up or down for objects located on any straight line in the Euclidean representation. The main reason is that the model is in terms of the distance towards the categories of the response variable, not the response variable itself. De Rooij et al. (2022) recently developed a distance model where the distance between the object and the response variable (i.e., not the categories) determines the probability of answering yes. Such a representation warrants an interpretation in terms of single-peaked relationships. Logistic reduced rank regression models assume a monotonic predictor-response relationship, no matter how the model is visualized.