1 Introduction

Suppose that we have a matrix observation \(X \in \mathbb {R}^{n \times p}\) whose entries are independent normal random variables \(X_{ij} \sim \textrm{N} (M_{ij},1)\), where \(M \in \mathbb {R}^{n \times p}\) is an unknown mean matrix. In this setting, we consider estimation of M under the matrix quadratic loss (Abu-Shanab et al. 2012; Matsuda and Strawderman 2022)

$$\begin{aligned} L(M,\hat{M}) = (\hat{M} - M)^{\top } (\hat{M} - M), \end{aligned}$$
(1)

which takes a value in the set of \(p \times p\) positive semidefinite matrices. The risk function of an estimator \(\hat{M}=\hat{M}(X)\) is defined as \(R(M,\hat{M}) = \textrm{E}_M [ L ( M,\hat{M}(X) ) ]\), and an estimator \(\hat{M}_1\) is said to dominate another estimator \(\hat{M}_2\) if \(R(M,\hat{M}_1) \preceq R(M,\hat{M}_2)\) for every M, where \(\preceq \) is the Löwner order: \(A \preceq B\) means that \(B-A\) is positive semidefinite. Thus, if \(\hat{M}_1\) dominates \(\hat{M}_2\) under the matrix quadratic loss, then

$$\begin{aligned} \textrm{E}_M [ \Vert (\hat{M}_1 - M) c \Vert ^2 ] \le \textrm{E}_M [ \Vert (\hat{M}_2 - M) c \Vert ^2 ] \end{aligned}$$

for every M and \(c \in \mathbb {R}^p\). In particular, each column of \(\hat{M}_1\) dominates that of \(\hat{M}_2\) as an estimator of the corresponding column of M under quadratic loss. Recently, Matsuda and Strawderman (2022) investigated shrinkage estimation in this setting by introducing a concept called matrix superharmonicity, which can be viewed as a generalization of the theory by Stein (1974) for a normal mean vector. Note that shrinkage estimation of a normal mean matrix under the Frobenius loss, which is the trace of the matrix quadratic loss, has been well studied (e.g., Matsuda and Komaki 2015; Tsukuma 2008; Tsukuma and Kubokawa 2020; Yuasa and Kubokawa 2023a, b; Zheng 1986).

Many common estimators of a normal mean matrix are orthogonally invariant. Namely, they satisfy \(\hat{M}(PXQ) = P \hat{M}(X) Q\) for any orthogonal matrices \(P \in O(n)\) and \(Q \in O(p)\). This property can be viewed as a generalization of the rotational invariance of estimators of a normal mean vector, which is satisfied by many minimax shrinkage estimators (Fourdrinier et al. 2018). We focus on orthogonally invariant estimators given by

$$\begin{aligned} \hat{M} = X+\widetilde{\nabla }h(X), \end{aligned}$$
(2)

where h(X) is an orthogonally invariant function satisfying \(h(PXQ)=h(X)\) for any orthogonal matrices \(P \in O(n)\) and \(Q \in O(p)\), and \(\widetilde{\nabla }\) is the matrix gradient operator defined by (3). For example, the maximum-likelihood estimator \(\hat{M}=X\) corresponds to \(h(X)=0\). The Efron–Morris estimator (Efron and Morris 1972), defined by \(\hat{M}=X(I-(n-p-1)(X^{\top }X)^{-1})\) when \(n-p-1>0\), corresponds to \(h(X)={-(n-p-1)/2} \cdot \log \det (X^{\top } X)\). This estimator can be viewed as a matrix generalization of the James–Stein estimator, and it is minimax under the Frobenius loss (Efron and Morris 1972) as well as under the matrix quadratic loss (Matsuda and Strawderman 2022). We will provide another example of an orthogonally invariant estimator of the form (2) in Sect. 4. Note that an estimator of the form (2) is called a pseudo-Bayes estimator (Fourdrinier et al. 2018), because it coincides with the (generalized) Bayes estimator when h is the logarithm of the marginal density of X with respect to some prior on M (Tweedie’s formula).
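As a quick numerical check of the representation (2) for the Efron–Morris estimator, its closed form can be compared with \(X+\widetilde{\nabla }h(X)\), with the gradient approximated by central finite differences. The following Python sketch does this; the dimensions, random seed, and step size are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 3
X = rng.normal(size=(n, p))

def h(Y):
    # h(X) = -((n - p - 1) / 2) * log det(X^T X)
    nn, pp = Y.shape
    return -0.5 * (nn - pp - 1) * np.linalg.slogdet(Y.T @ Y)[1]

# Closed-form Efron-Morris estimator
M_em = X @ (np.eye(p) - (n - p - 1) * np.linalg.inv(X.T @ X))

# Pseudo-Bayes form (2): X plus the matrix gradient of h,
# approximated here by central finite differences
eps = 1e-6
grad = np.zeros_like(X)
for a in range(n):
    for i in range(p):
        E = np.zeros_like(X)
        E[a, i] = eps
        grad[a, i] = (h(X + E) - h(X - E)) / (2 * eps)

print(np.max(np.abs(M_em - (X + grad))))  # agreement up to numerical error
```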

In this study, to further the theory of shrinkage estimation under the matrix quadratic loss, we develop a general formula for the matrix quadratic risk of orthogonally invariant estimators of the form (2). First, we prepare several matrix derivative formulas in Sect. 2. Then, we derive the formula for the matrix quadratic risk in Sect. 3. Finally, we present an example in Sect. 4, which is motivated by a proposal that Stein made 50 years ago for improving on the Efron–Morris estimator (Stein 1974).

2 Matrix derivative formulas

Here, we develop matrix derivative formulas based on Stein (1974). Note that, whereas Stein (1974) considered a setting where X is a \(p \times n\) matrix, here we take X to be an \(n \times p\) matrix. In the following, the subscripts a, b, \(\ldots \) run from 1 to n and the subscripts i, j, \(\ldots \) run from 1 to p. We denote the Kronecker delta by \(\delta _{ij}\).

We employ the following notation for matrix derivatives, introduced in Matsuda and Strawderman (2022).

Definition 1

For a function \(f: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}\), its matrix gradient \(\widetilde{\nabla } f: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}^{n \times p}\) is defined as

$$\begin{aligned} (\widetilde{\nabla } f(X))_{ai} = \frac{\partial }{\partial X_{ai}} f(X). \end{aligned}$$
(3)

Definition 2

For a \(C^2\) function \(f: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}\), its matrix Laplacian \(\widetilde{\Delta } f: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}^{p \times p}\) is defined as

$$\begin{aligned} (\widetilde{\Delta } f(X))_{ij} = \sum _{a=1}^n \frac{\partial ^2}{\partial X_{ai} \partial X_{aj}} f(X). \end{aligned}$$
(4)

Let

$$\begin{aligned} X^{\top } X = V \Lambda V^{\top } \end{aligned}$$
(5)

be a spectral decomposition of \(X^{\top } X\), where \(V=(v_1,\dots ,v_p)\) is an orthogonal matrix and \(\Lambda = \textrm{diag}(\lambda _1,\dots ,\lambda _p)\) is a diagonal matrix. In the following, we assume that the eigenvalues \(\lambda _1,\dots ,\lambda _p\) are distinct, which holds almost surely, so that the denominators appearing in Lemmas 2 and 4 are well defined. Then, the derivatives of \(\lambda \) and V are obtained as follows.

Lemma 1

The derivative of \(\lambda _i\) is

$$\begin{aligned} \frac{\partial \lambda _i}{\partial X_{aj}} = 2 V_{ji} \sum _k X_{ak} V_{ki}. \end{aligned}$$
(6)

Thus

$$\begin{aligned} \widetilde{\nabla } \lambda _i = 2 X v_i v_i^{\top }, \end{aligned}$$
(7)

where \(v_i\) is the i-th column vector of V.

Proof

By differentiating \(V^{\top } V = I_p\) and using \((\textrm{d} V)^{\top } V=(V^{\top } \textrm{d} V)^{\top }\), we obtain

$$\begin{aligned} V^{\top } \textrm{d} V + (V^{\top } \textrm{d} V)^{\top } = O, \end{aligned}$$
(8)

which means that \(V^{\top } \textrm{d} V\) is antisymmetric.

Taking the differential of (5), we have

$$\begin{aligned} \textrm{d} (X^{\top } X) = (\textrm{d} V) \Lambda V^{\top } + V (\textrm{d} \Lambda ) V^{\top } + V \Lambda (\textrm{d} V)^{\top }. \end{aligned}$$
(9)

Then, multiplying (9) on the left by \(V^{\top }\) and on the right by V, we obtain

$$\begin{aligned} V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V = V^{\top } (\textrm{d} V) \Lambda + \textrm{d} \Lambda + \Lambda (\textrm{d} V)^{\top } V. \end{aligned}$$
(10)

Since \(\Lambda \) and \(\textrm{d} \Lambda \) are diagonal and \((\textrm{d} V)^{\top } V=(V^{\top } \textrm{d} V)^{\top }\), the (ij)th entry of (10) yields

$$\begin{aligned} (V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V)_{ij}&= (V^{\top } \textrm{d} V)_{ij} \lambda _j + \delta _{ij} \textrm{d} \lambda _i + \lambda _i ((\textrm{d} V)^{\top } V)_{ij} \nonumber \\ {}&= (V^{\top } \textrm{d} V)_{ij} \lambda _j + \delta _{ij} \textrm{d} \lambda _i + \lambda _i (V^{\top } \textrm{d} V)_{ji}. \end{aligned}$$

Since \((V^{\top } \textrm{d} V)_{ji}=-(V^{\top } \textrm{d} V)_{ij}\) from (8), we obtain

$$\begin{aligned} (V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V)_{ij}&= (\lambda _j-\lambda _i) (V^{\top } \textrm{d} V)_{ij} + \delta _{ij} \textrm{d} \lambda _i. \end{aligned}$$
(11)

On the other hand, from \(\textrm{d} (X^{\top } X) = (\textrm{d} X)^{\top } X + X^{\top } \textrm{d} X\)

$$\begin{aligned} \textrm{d} (X^{\top } X)_{ij} = \sum _a ((\textrm{d} X)_{ai} X_{aj} + X_{ai} (\textrm{d} X)_{aj}). \end{aligned}$$
(12)

By taking \(i=j\) in (11)

$$\begin{aligned} \textrm{d} \lambda _i = (V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V)_{ii} = \sum _{j,k} (V^{\top })_{ij} (\textrm{d} (X^{\top } X))_{jk} V_{ki}. \end{aligned}$$

Then, using (12)

$$\begin{aligned} \textrm{d} \lambda _i&= \sum _{j,k} V_{ji} V_{ki} \sum _a ((\textrm{d} X)_{aj} X_{ak} + X_{aj} (\textrm{d} X)_{ak}) \\ {}&= 2 \sum _{j,k} V_{ji} V_{ki} \sum _a X_{ak} (\textrm{d} X)_{aj} \\ {}&= 2 \sum _{a,j} V_{ji} \sum _k X_{ak} V_{ki} (\textrm{d} X)_{aj}. \end{aligned}$$

Thus, we obtain (6), which leads to (7). \(\square \)
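Formula (7) can also be verified numerically by differentiating the eigenvalues of \(X^{\top }X\) with central differences. The following sketch (with arbitrary dimensions, seed, and step size) compares the two sides entrywise.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.normal(size=(n, p))

def eigvals_desc(Y):
    # eigenvalues of Y^T Y in descending order
    return np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]

lam, V = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]          # use the same descending order throughout
lam, V = lam[order], V[:, order]

eps = 1e-6
for i in range(p):
    # central-difference gradient of lambda_i with respect to the entries of X
    num = np.zeros_like(X)
    for a in range(n):
        for j in range(p):
            E = np.zeros_like(X)
            E[a, j] = eps
            num[a, j] = (eigvals_desc(X + E)[i] - eigvals_desc(X - E)[i]) / (2 * eps)
    ana = 2 * np.outer(X @ V[:, i], V[:, i])   # formula (7): 2 X v_i v_i^T
    print(i, np.max(np.abs(num - ana)))        # small for each i
```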

Lemma 2

The derivative of \(V_{ij}\) is

$$\begin{aligned} \frac{\partial V_{ij}}{\partial X_{ak}} = \sum _{l \ne j} \frac{V_{il}}{\lambda _j-\lambda _l} ((XV)_{aj} V_{kl} + (XV)_{al} V_{kj}). \end{aligned}$$
(13)

Proof

From (8), we have \((V^{\top } \textrm{d} V)_{ii}=0\). Also, from (11)

$$\begin{aligned} (V^{\top } \textrm{d} V)_{ij} = \frac{1}{\lambda _j-\lambda _i} (V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V)_{ij} \end{aligned}$$

for \(i \ne j\). Therefore

$$\begin{aligned} \textrm{d} V_{ij}&= (V V^{\top } \textrm{d} V)_{ij} \\ {}&= \sum _k V_{ik} (V^{\top } \textrm{d} V)_{kj} \\ {}&= \sum _{k \ne j} V_{ik} \frac{1}{\lambda _j-\lambda _k} (V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V)_{kj} \\ {}&= \sum _{k \ne j} \frac{1}{\lambda _j-\lambda _k} V_{ik} \sum _{l,m} V_{lk} \textrm{d} (X^{\top } X)_{lm} V_{mj}. \end{aligned}$$

Then, using (12)

$$\begin{aligned} \textrm{d} V_{ij}&= \sum _{k \ne j} \frac{1}{\lambda _j-\lambda _k} V_{ik} \sum _{l,m} V_{lk} \sum _a ((\textrm{d} X)_{al} X_{am} + X_{al} (\textrm{d} X)_{am}) V_{mj} \\ {}&= \sum _{k \ne j} \frac{1}{\lambda _j-\lambda _k} V_{ik} \sum _a \left( (XV)_{aj} \sum _l V_{lk} (\textrm{d} X)_{al} + (XV)_{ak} \sum _m V_{mj} (\textrm{d} X)_{am} \right) \\ {}&= \sum _l \sum _{k \ne j} \frac{V_{ik}}{\lambda _j-\lambda _k} \sum _a ((XV)_{aj} V_{lk}+(XV)_{ak} V_{lj}) (\textrm{d} X)_{al} \\ {}&= \sum _{a,k} \sum _{l \ne j} \frac{V_{il}}{\lambda _j-\lambda _l} ((XV)_{aj} V_{kl}+(XV)_{al} V_{kj}) (\textrm{d} X)_{ak}, \end{aligned}$$

where we switched k and l in the last step. Thus, we obtain (13). \(\square \)
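Formula (13) can be checked in the same way, provided that the eigenvector signs of the perturbed decompositions are aligned with those of the unperturbed one. The following sketch does this for a single entry \(X_{ak}\); the perturbed entry and all numerical parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 3
X = rng.normal(size=(n, p))

def eig_desc(Y, V_ref=None):
    # eigen-decomposition of Y^T Y with eigenvalues in descending order;
    # optionally align eigenvector signs with a reference matrix V_ref
    lam, V = np.linalg.eigh(Y.T @ Y)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    if V_ref is not None:
        V = V * np.sign(np.sum(V * V_ref, axis=0))
    return lam, V

lam, V = eig_desc(X)
XV = X @ V
a, k = 2, 1        # an arbitrary entry X_{ak} to perturb
eps = 1e-6

E = np.zeros_like(X)
E[a, k] = eps
_, Vp = eig_desc(X + E, V)
_, Vm = eig_desc(X - E, V)
num = (Vp - Vm) / (2 * eps)            # numerical dV_{ij}/dX_{ak}

ana = np.zeros((p, p))                 # formula (13)
for i in range(p):
    for j in range(p):
        for l in range(p):
            if l != j:
                ana[i, j] += V[i, l] / (lam[j] - lam[l]) * (
                    XV[a, j] * V[k, l] + XV[a, l] * V[k, j])

print(np.max(np.abs(num - ana)))       # small (numerical agreement)
```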

A function h is said to be orthogonally invariant if it satisfies \(h(PXQ)=h(X)\) for any orthogonal matrices \(P \in O(n)\) and \(Q \in O(p)\). Such a function can be written as \(h(X)=H(\lambda )\), where \(\lambda =(\lambda _1,\dots ,\lambda _p)\) denotes the eigenvalues of \(X^{\top } X\) given by (5), and its derivatives are calculated as follows.

Lemma 3

The matrix gradient (3) of an orthogonally invariant function \(h(X)=H(\lambda )\) is

$$\begin{aligned} \widetilde{\nabla } h = 2 \sum _i \frac{\partial H}{\partial \lambda _i} X v_i v_i^{\top }. \end{aligned}$$
(14)

Thus

$$\begin{aligned} (\widetilde{\nabla } h)^{\top } (\widetilde{\nabla } h) = 4 \sum _i \left( \frac{\partial H}{\partial \lambda _i} \right) ^2 \lambda _i v_i v_i^{\top } = V D V^{\top }, \end{aligned}$$
(15)

where D is the \(p \times p\) diagonal matrix given by

$$\begin{aligned} D_{kk} = 4 \lambda _k \left( \frac{\partial H}{\partial \lambda _k} \right) ^2. \end{aligned}$$

Proof

From (7)

$$\begin{aligned} \widetilde{\nabla } h = \sum _i \frac{\partial H}{\partial \lambda _i} \widetilde{\nabla } \lambda _i = 2 \sum _i \frac{\partial H}{\partial \lambda _i} X v_i v_i^{\top }, \end{aligned}$$

which yields (14). Then, using \(X^{\top }X = V \Lambda V^{\top }\) and \(V^{\top } V = I_p\)

$$\begin{aligned} (\widetilde{\nabla } h)^{\top } (\widetilde{\nabla } h)&= 4 \sum _{i,j} \frac{\partial H}{\partial \lambda _i} \frac{\partial H}{\partial \lambda _j} v_i v_i^{\top } X^{\top } X v_j v_j^{\top } \\ {}&= 4 \sum _{i,j} \frac{\partial H}{\partial \lambda _i} \frac{\partial H}{\partial \lambda _j} v_i \Lambda _{ij} v_j^{\top } \\ {}&= 4 \sum _{i} \left( \frac{\partial H}{\partial \lambda _i} \right) ^2 \lambda _i v_i v_i^{\top }, \end{aligned}$$

which yields (15). \(\square \)
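As an illustration of (14) and (15), one can take the orthogonally invariant test function \(h(X)=\log \det (I_p+X^{\top }X)\), that is, \(H(\lambda )=\sum _i \log (1+\lambda _i)\), for which \(\partial H/\partial \lambda _i = 1/(1+\lambda _i)\). The following sketch compares (14) with a central-difference gradient and checks (15) directly; the test function and numerical parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 3
X = rng.normal(size=(n, p))

# Test function: H(lambda) = sum_i log(1 + lambda_i), i.e. h(X) = log det(I_p + X^T X)
def h(Y):
    q = Y.shape[1]
    return np.linalg.slogdet(np.eye(q) + Y.T @ Y)[1]

lam, V = np.linalg.eigh(X.T @ X)
dH = 1.0 / (1.0 + lam)

# Formula (14): matrix gradient 2 sum_i (dH/dlambda_i) X v_i v_i^T
ana = 2 * (X @ V) @ np.diag(dH) @ V.T

# Central-difference gradient of h for comparison
eps = 1e-6
num = np.zeros_like(X)
for a in range(n):
    for i in range(p):
        E = np.zeros_like(X)
        E[a, i] = eps
        num[a, i] = (h(X + E) - h(X - E)) / (2 * eps)
print(np.max(np.abs(num - ana)))                       # checks (14)

# Formula (15): (grad h)^T (grad h) = V D V^T with D_kk = 4 lambda_k (dH/dlambda_k)^2
print(np.max(np.abs(ana.T @ ana - V @ np.diag(4 * lam * dH**2) @ V.T)))  # checks (15)
```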

Lemma 4

The matrix Laplacian (4) of an orthogonally invariant function \(h(X)=H(\lambda )\) is

$$\begin{aligned} \widetilde{\Delta } h = V D V^{\top }, \end{aligned}$$
(16)

where D is the \(p \times p\) diagonal matrix given by

$$\begin{aligned} D_{kk} = 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} + 2 \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \left( \frac{\partial H}{\partial \lambda _k} - \frac{\partial H}{\partial \lambda _l} \right) . \end{aligned}$$

Proof

From (14)

$$\begin{aligned} \frac{\partial ^2 h}{\partial X_{ai} \partial X_{aj}}&= \frac{\partial }{\partial X_{ai}} \left( 2 \sum _k \frac{\partial H}{\partial \lambda _k} X v_k v_k^{\top } \right) _{aj} \nonumber \\ {}&= 2 \sum _k \left( \left( \widetilde{\nabla } \frac{\partial H}{\partial \lambda _k} \right) _{ai} (X v_k v_k^{\top })_{aj} + \frac{\partial H}{\partial \lambda _k} (v_k v_k^{\top })_{ij} \right. \nonumber \\ {}&\left. + \frac{\partial H}{\partial \lambda _k} \sum _l X_{al} \frac{\partial }{\partial X_{ai}} (v_k v_k^{\top })_{lj} \right) \nonumber \\ {}&= 2 \sum _k \left( 2 \sum _l \frac{\partial ^2 H}{\partial \lambda _k \partial \lambda _l} (X v_l v_l^{\top })_{ai} (X v_k v_k^{\top })_{aj} + \frac{\partial H}{\partial \lambda _k} V_{ik} V_{jk} \right. \nonumber \\ {}&\quad \left. + \frac{\partial H}{\partial \lambda _k} \sum _l X_{al} \left( \frac{\partial V_{lk}}{\partial X_{ai}} V_{jk} + V_{lk} \frac{\partial V_{jk}}{\partial X_{ai}} \right) \right) . \end{aligned}$$
(17)

Also, from (13)

$$\begin{aligned} \frac{\partial V_{lk}}{\partial X_{ai}} V_{jk} + V_{lk} \frac{\partial V_{jk}}{\partial X_{ai}}&= \sum _{m \ne k} \frac{V_{lm}}{\lambda _k-\lambda _m} ((XV)_{ak}V_{im}+(XV)_{am}V_{ik}) V_{jk} \nonumber \\ {}&\qquad + V_{lk} \sum _{m \ne k} \frac{V_{jm}}{\lambda _k-\lambda _m} ((XV)_{ak}V_{im}+(XV)_{am}V_{ik}) \nonumber \\ {}&= \sum _{m \ne k} \frac{(V_{lm}V_{jk}+V_{lk}V_{jm})((XV)_{ak}V_{im}+(XV)_{am}V_{ik})}{\lambda _k-\lambda _m}. \end{aligned}$$
(18)

By substituting (18) into (17) and taking the sum

$$\begin{aligned} (\widetilde{\Delta } h)_{ij}&= \sum _{a=1}^n \frac{\partial ^2 h}{\partial X_{ai} \partial X_{aj}} \\ {}&= 4 \sum _{k,l} \frac{\partial ^2 H}{\partial \lambda _k \partial \lambda _l} (v_l v_l^{\top } X^{\top } X v_k v_k^{\top })_{ij} + 2n \sum _k \frac{\partial H}{\partial \lambda _k} V_{ik} V_{jk} \\ {}&\qquad + 2 \sum _{k,l} \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{(V_{lm}V_{jk}+V_{lk}V_{jm})((X^{\top }XV)_{lk}V_{im}+(X^{\top }XV)_{lm}V_{ik})}{\lambda _k-\lambda _m} \\ {}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} \right) V_{ik} V_{jk} \\ {}&\qquad + 2 \sum _{k,l} \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{(V_{lm}V_{jk}+V_{lk}V_{jm})(\lambda _k V_{lk}V_{im}+\lambda _m V_{lm}V_{ik})}{\lambda _k-\lambda _m} \\ {}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} \right) V_{ik} V_{jk} {+} 2 \sum _k \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{\lambda _k V_{jm} V_{im} {+} \lambda _m V_{jk} V_{ik}}{\lambda _k{-}\lambda _m}, \end{aligned}$$

where we used \(X^{\top }X = V \Lambda V^{\top }\) and

$$\begin{aligned}&\sum _l {(V_{lm}V_{jk}+V_{lk}V_{jm})(\lambda _k V_{lk}V_{im}+\lambda _m V_{lm}V_{ik})} \\ {}&\quad = \lambda _k (\delta _{km} V_{jk} + V_{jm}) V_{im} + \lambda _m (V_{jk} + \delta _{km} V_{jm}) V_{ik}. \end{aligned}$$

Then

$$\begin{aligned} (\widetilde{\Delta } h)_{ij}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} \right. \\&\left. \quad + 2 \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{\lambda _m}{\lambda _k-\lambda _m} \right) V_{ik} V_{jk} + 2 \sum _k \lambda _k \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{V_{im} V_{jm}}{\lambda _k-\lambda _m} \\ {}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} + 2 \sum _{m \ne k} \frac{\lambda _m}{\lambda _k-\lambda _m} \left( \frac{\partial H}{\partial \lambda _k} - \frac{\partial H}{\partial \lambda _m} \right) \right) V_{ik} V_{jk}, \end{aligned}$$

where we used

$$\begin{aligned} \sum _k \lambda _k \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{V_{im} V_{jm}}{\lambda _k-\lambda _m}&= \sum _m V_{im} V_{jm} \sum _{k \ne m} \frac{\lambda _k}{\lambda _k-\lambda _m} \frac{\partial H}{\partial \lambda _k} \\ {}&= \sum _k V_{ik} V_{jk} \sum _{m \ne k} \frac{\lambda _m}{\lambda _m-\lambda _k} \frac{\partial H}{\partial \lambda _m}. \end{aligned}$$

Thus, by relabeling the index m as l, we obtain (16). \(\square \)
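Formula (16) can be checked against a direct evaluation of the matrix Laplacian (4) by mixed central second differences, again with the test function \(H(\lambda )=\sum _i \log (1+\lambda _i)\). The dimensions and step size in the following sketch are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 6, 3
X = rng.normal(size=(n, p))

# Same test function as before: h(X) = log det(I_p + X^T X)
def h(Y):
    q = Y.shape[1]
    return np.linalg.slogdet(np.eye(q) + Y.T @ Y)[1]

lam, V = np.linalg.eigh(X.T @ X)
dH = 1.0 / (1.0 + lam)
d2H = -1.0 / (1.0 + lam) ** 2

# D from Lemma 4
D = 4 * lam * d2H + 2 * n * dH
for k in range(p):
    for l in range(p):
        if l != k:
            D[k] += 2 * lam[l] / (lam[k] - lam[l]) * (dH[k] - dH[l])
ana = V @ np.diag(D) @ V.T

# Matrix Laplacian (4) evaluated directly by mixed central second differences
eps = 1e-4
num = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        for a in range(n):
            Ei = np.zeros_like(X); Ei[a, i] = eps
            Ej = np.zeros_like(X); Ej[a, j] = eps
            num[i, j] += (h(X + Ei + Ej) - h(X + Ei - Ej)
                          - h(X - Ei + Ej) + h(X - Ei - Ej)) / (4 * eps**2)

print(np.max(np.abs(num - ana)))   # small (finite-difference error)
```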

By taking the trace of the matrix Laplacian (16), we have

$$\begin{aligned} {\Delta } h&= \textrm{tr} (\widetilde{\Delta } h) = \textrm{tr} (D) \\ {}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} + 2 \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \left( \frac{\partial H}{\partial \lambda _k} - \frac{\partial H}{\partial \lambda _l} \right) \right) \\ {}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} \right) + 2 \sum _k \frac{\partial H}{\partial \lambda _k} \sum _{l \ne k} \frac{\lambda _k+\lambda _l}{\lambda _k-\lambda _l} \\ {}&= 4 \sum _k \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2 \sum _k \left( n-p+1 + 2 \lambda _k \sum _{l \ne k} \frac{1}{\lambda _k-\lambda _l} \right) \frac{\partial H}{\partial \lambda _k}, \end{aligned}$$

where we used

$$\begin{aligned} \sum _{l \ne k} \frac{\lambda _k+\lambda _l}{\lambda _k-\lambda _l}&= \lambda _k \sum _{l \ne k} \frac{1}{\lambda _k-\lambda _l} + \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \nonumber \\ {}&= \lambda _k \sum _{l \ne k} \frac{1}{\lambda _k-\lambda _l} + \sum _{l \ne k} \left( \frac{\lambda _k}{\lambda _k-\lambda _l} - 1 \right) \nonumber \\ {}&= 2 \lambda _k \sum _{l \ne k} \frac{1}{\lambda _k-\lambda _l} -p+1. \end{aligned}$$
(19)

This coincides with the Laplacian formula in Stein (1974).

3 Risk formula

Now, we derive a general formula for the matrix quadratic risk of orthogonally invariant estimators of the form (2).

Theorem 5

Let \(h(X)=H(\lambda )\) be an orthogonally invariant function. Then, the matrix quadratic risk of an estimator \(\hat{M}=X+\widetilde{\nabla }h(X)\) is given by

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M)&= n I_p + \textrm{E}_M [ V D V^{\top } ], \end{aligned}$$
(20)

where D is the \(p \times p\) diagonal matrix given by

$$\begin{aligned} D_{kk}&= 4 \left( 2 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + n \frac{\partial H}{\partial \lambda _k} + \lambda _k \left( \frac{\partial H}{\partial \lambda _k} \right) ^2 + \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \left( \frac{\partial H}{\partial \lambda _k} - \frac{\partial H}{\partial \lambda _l} \right) \right) . \end{aligned}$$

Proof

From Matsuda and Strawderman (2022), the matrix quadratic risk of an estimator \(\hat{M}=X+g(X)\) with a weakly differentiable function g is

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M) = n I_p + \textrm{E}_M [\widetilde{\textrm{div}} \ g(X) + (\widetilde{\textrm{div}} \ g(X) )^{\top } + g(X)^{\top } g(X)], \end{aligned}$$
(21)

where the matrix divergence \(\widetilde{\textrm{div}} \ g: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}^{p \times p}\) of a function \(g: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}^{n \times p}\) is defined as

$$\begin{aligned} (\widetilde{\textrm{div}} \ g(X))_{ij} = \sum _{a=1}^n \frac{\partial }{\partial X_{ai}} g_{aj}(X). \end{aligned}$$

Therefore, by substituting \(g(X)=\widetilde{\nabla }h(X)\) and using \(\widetilde{\textrm{div}} \circ \widetilde{\nabla }=\widetilde{\Delta }\)

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M) = n I_p + \textrm{E}_M [2 \widetilde{\Delta } h(X) + \widetilde{\nabla }h(X)^{\top } \widetilde{\nabla }h(X)]. \end{aligned}$$

Thus, using (15) and (16), we obtain (20). \(\square \)
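A small Monte Carlo experiment illustrates (20). For the test function \(H(\lambda )=\sum _i \log (1+\lambda _i)\) and an arbitrary mean matrix M, the empirical average of \((\hat{M}-M)^{\top }(\hat{M}-M)\) is compared with the empirical average of \(n I_p + VDV^{\top }\). The test function, dimensions, and replication count in the following sketch are arbitrary choices, so the two matrices agree only up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, reps = 8, 2, 20000
M = rng.normal(size=(n, p))        # an arbitrary true mean matrix

lhs = np.zeros((p, p))             # Monte Carlo average of (Mhat - M)^T (Mhat - M)
rhs = np.zeros((p, p))             # Monte Carlo average of n I_p + V D V^T

for _ in range(reps):
    X = M + rng.normal(size=(n, p))
    lam, V = np.linalg.eigh(X.T @ X)
    dH = 1.0 / (1.0 + lam)                       # H(lambda) = sum_i log(1 + lambda_i)
    d2H = -1.0 / (1.0 + lam) ** 2
    Mhat = X + 2 * (X @ V) @ np.diag(dH) @ V.T   # estimator (2) via (14)

    D = 4 * (2 * lam * d2H + n * dH + lam * dH**2)   # D from Theorem 5
    for k in range(p):
        for l in range(p):
            if l != k:
                D[k] += 4 * lam[l] / (lam[k] - lam[l]) * (dH[k] - dH[l])

    lhs += (Mhat - M).T @ (Mhat - M) / reps
    rhs += (n * np.eye(p) + V @ np.diag(D) @ V.T) / reps

print(lhs)
print(rhs)    # the two p x p matrices agree up to Monte Carlo error
```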

By taking the trace of (20) and using (19), we obtain the following formula for the Frobenius risk of orthogonally invariant estimators, which coincides with the one given by Stein (1974).

Corollary 6

Let \(h(X)=H(\lambda )\) be an orthogonally invariant function. Then, the Frobenius risk of an estimator \(\hat{M}=X+\widetilde{\nabla }h(X)\) is given by

$$\begin{aligned}&\textrm{E}_M \Vert \hat{M}-M \Vert _{\textrm{F}}^2 \\ {}&\quad = np + 4 \textrm{E}_M \left[ \sum _k \left( 2 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + n \frac{\partial H}{\partial \lambda _k} + \lambda _k \left( \frac{\partial H}{\partial \lambda _k} \right) ^2 \right) + \sum _k \frac{\partial H}{\partial \lambda _k} \sum _{l \ne k} \frac{\lambda _k + \lambda _l}{\lambda _k-\lambda _l} \right] \\ {}&\quad = np + 4 \textrm{E}_M \left[ \sum _k \left( 2 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + n \frac{\partial H}{\partial \lambda _k} +\lambda _k \left( \frac{\partial H}{\partial \lambda _k} \right) ^2 \right) \right. \\ {}&\left. + \sum _k \frac{\partial H}{\partial \lambda _k} \left( 2 \lambda _k \sum _{l \ne k} \frac{1}{\lambda _k-\lambda _l} -p+1 \right) \right] . \end{aligned}$$

We derived the risk formula for orthogonally invariant estimators of the form (2), which are called pseudo-Bayes estimators (Fourdrinier et al. 2018). The class of pseudo-Bayes estimators includes all Bayes and generalized Bayes estimators. Extending the current result to general orthogonally invariant estimators is an interesting direction for future work. Extension to the unknown-covariance case is another important problem; note that Section 6.6.2 of Tsukuma and Kubokawa (2020) derived a risk formula for a class of estimators in the unknown-covariance setting.

4 Example

We provide an example application of Theorem 5. Let \(X = U \Sigma V^{\top }\) with \(U \in \mathbb {R}^{n \times p}\), \(\Sigma = \textrm{diag} (\sigma _1, \ldots , \sigma _p)\) and \(V \in \mathbb {R}^{p \times p}\) be a singular value decomposition of X, where \(U^{\top } U = V^{\top } V = I_p\) and \(\sigma _1 \ge \cdots \ge \sigma _p \ge 0\) are the singular values of X. We consider an orthogonally invariant estimator given by

$$\begin{aligned} \hat{M} = U \cdot \textrm{diag} \left( \sigma _1 - \frac{c_1}{\sigma _1}, \dots , \sigma _p - \frac{c_p}{\sigma _p} \right) \cdot V^{\top }, \end{aligned}$$
(22)

where \(c_1,\dots ,c_p \ge 0\).

Lemma 7

The estimator (22) can be written in the form (2) with

$$\begin{aligned} h(X) = -\frac{1}{2} \sum _{k=1}^p {c_k} \log \lambda _k = - \sum _{k=1}^p {c_k} \log \sigma _k, \end{aligned}$$

where \(\lambda _1,\dots ,\lambda _p\) are the eigenvalues of \(X^{\top } X\), as shown in (5).

Proof

From (7)

$$\begin{aligned} \widetilde{\nabla } \log \lambda _k = \frac{1}{\lambda _k} \widetilde{\nabla } \lambda _k = \frac{2}{\lambda _k} X v_k v_k^{\top }. \end{aligned}$$

Thus

$$\begin{aligned} \widetilde{\nabla } h(X) = -X \sum _k \frac{c_k}{\lambda _k} v_k v_k^{\top } = -U \cdot \textrm{diag} \left( \frac{c_1}{\sigma _1}, \dots , \frac{c_p}{\sigma _p} \right) \cdot V^{\top }, \end{aligned}$$

where we used \(X=U \Sigma V^{\top }\) and \(\lambda _k=\sigma _k^2\). Therefore, the estimator (22) is written as \(\hat{M}=X+\widetilde{\nabla } h(X)\). \(\square \)
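Lemma 7 can be confirmed numerically by comparing (22), computed from the singular value decomposition, with \(X+\widetilde{\nabla }h(X)\), where the gradient is approximated by central differences. In the following sketch, the constants \(c_k\) and the numerical parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 9, 3
X = rng.normal(size=(n, p))
c = np.array([3.0, 2.0, 1.0])      # arbitrary nonnegative constants c_1, ..., c_p

# Estimator (22) computed from the singular value decomposition
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # singular values in descending order
M_22 = U @ np.diag(s - c / s) @ Vt

# Pseudo-Bayes form X + grad h with h(X) = -(1/2) sum_k c_k log lambda_k
def h(Y):
    lam = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]   # descending, matching c
    return -0.5 * np.sum(c * np.log(lam))

eps = 1e-6
grad = np.zeros_like(X)
for a in range(n):
    for i in range(p):
        E = np.zeros_like(X)
        E[a, i] = eps
        grad[a, i] = (h(X + E) - h(X - E)) / (2 * eps)

print(np.max(np.abs(M_22 - (X + grad))))    # agreement up to numerical error
```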

Theorem 8

The matrix quadratic risk of the estimator (22) is given by

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M)&= n I_p + \textrm{E}_M [ V D V^{\top } ], \end{aligned}$$
(23)

where D is the \(p \times p\) diagonal matrix given by

$$\begin{aligned} D_{kk} = \frac{1}{\lambda _k} c_k (c_k-2n+4) - \frac{2}{\lambda _k} \sum _{l \ne k} \frac{c_k \lambda _l - c_l \lambda _k}{\lambda _k-\lambda _l}. \end{aligned}$$

Proof

To apply Theorem 5, let

$$\begin{aligned} H(\lambda ) = -\frac{1}{2} \sum _k c_k \log \lambda _k. \end{aligned}$$

We have

$$\begin{aligned} \frac{\partial H}{\partial \lambda _k} = -\frac{c_k}{2 \lambda _k}, \quad \frac{\partial ^2 H}{\partial \lambda _k^2} = \frac{c_k}{2 \lambda _k^2}. \end{aligned}$$

Thus

$$\begin{aligned} D_{kk}&= 4 \left( 2 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + n \frac{\partial H}{\partial \lambda _k} + \lambda _k \left( \frac{\partial H}{\partial \lambda _k} \right) ^2 + \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \left( \frac{\partial H}{\partial \lambda _k} - \frac{\partial H}{\partial \lambda _l} \right) \right) \\ {}&= \frac{1}{\lambda _k} c_k (c_k-2n+4) -2 \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \left( \frac{c_k}{\lambda _k} - \frac{c_l}{\lambda _l} \right) \\ {}&= \frac{1}{\lambda _k} c_k (c_k-2n+4) - \frac{2}{\lambda _k} \sum _{l \ne k} \frac{c_k \lambda _l - c_l \lambda _k}{\lambda _k-\lambda _l}. \end{aligned}$$

Therefore, we obtain (23) from Theorem 5. \(\square \)

The Efron–Morris estimator (Efron and Morris 1972) corresponds to (22) with \(c_k \equiv n-p-1\). In this case

$$\begin{aligned} D_{kk}&= \frac{1}{\lambda _k} (n-p-1) (-n-p+3) - \frac{2}{\lambda _k} (n-p-1) \sum _{l \ne k} \frac{\lambda _l - \lambda _k}{\lambda _k-\lambda _l} \\ {}&= \frac{1}{\lambda _k} (n-p-1) (-n-p+3) + \frac{2}{\lambda _k} (n-p-1) (p-1) \\ {}&= - \frac{1}{\lambda _k} (n-p-1)^2. \end{aligned}$$

Thus, its matrix quadratic risk (23) is

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M)&= n I_p - (n-p-1)^2 \textrm{E}_M \left[ (X^{\top } X)^{-1} \right] . \end{aligned}$$
(24)

This coincides with the result in Matsuda and Strawderman (2022).

Motivated by Stein’s proposal (Stein 1974) for improving on the Efron–Morris estimator, we consider the estimator (22) with \(c_k = n+p-2k-1\). In the following, we call it “Stein’s estimator” for convenience. Stein (1974) stated that the positive part of Stein’s estimator dominates the positive part of the Efron–Morris estimator under the Frobenius loss, where “positive-part” means the modification of (22) given by

$$\begin{aligned} \hat{M} = U \cdot \textrm{diag} \left( \left( \sigma _1 - \frac{c_1}{\sigma _1} \right) _+, \dots , \left( \sigma _p - \frac{c_p}{\sigma _p} \right) _+ \right) \cdot V^{\top }, \end{aligned}$$
(25)

where \((a)_+=\max (0,a)\). It is known that the estimator (22) is dominated by its positive part (25) under the Frobenius loss (Tsukuma 2008).

Proposition 9

The matrix quadratic risk of Stein’s estimator (estimator (22) with \(c_k = n+p-2k-1\)) is given by

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M)&= n I_p + \textrm{E}_M [ V D V^{\top } ], \end{aligned}$$

where D is the \(p \times p\) diagonal matrix given by

$$\begin{aligned} D_{kk}&= -\frac{1}{\lambda _k} (n+p-2k-1) (n-3p+2k-1) + 4 \sum _{l \ne k} \frac{k-l}{\lambda _k-\lambda _l}. \end{aligned}$$

Thus, Stein’s estimator dominates the maximum-likelihood estimator under the matrix quadratic loss when \(n \ge 3p-1\).

Proof

By substituting \(c_k=n+p-2k-1\) into Theorem 8

$$\begin{aligned} D_{kk}&= \frac{1}{\lambda _k} (n+p-2k-1) (-n+p-2k+3) \\&- \frac{2}{\lambda _k} \sum _{l \ne k} \frac{(n+p-2k-1) \lambda _l - (n+p-2l-1) \lambda _k}{\lambda _k-\lambda _l} \\ {}&= \frac{1}{\lambda _k} (n+p-2k-1) (-n+p-2k+3) \\ {}&- \frac{2}{\lambda _k} \sum _{l \ne k} \left( -\frac{2(k-l) \lambda _k}{\lambda _k-\lambda _l} -(n+p-2k-1) \right) \\&= \frac{1}{\lambda _k} (n+p-2k-1) (-n+p-2k+3) \\&+ 4 \sum _{l \ne k} \frac{k-l}{\lambda _k-\lambda _l} +\frac{2}{\lambda _k} (p-1)(n+p-2k-1) \\ {}&= -\frac{1}{\lambda _k} (n+p-2k-1) (n-3p+2k-1) \\&+ 4 \sum _{l \ne k} \frac{k-l}{\lambda _k-\lambda _l}. \end{aligned}$$

The second term is nonpositive, since \(\lambda _1 \ge \lambda _2 \ge \dots \ge \lambda _p\). When \(n \ge 3p-1\), the first term is also nonpositive, and thus

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M) \preceq n I_p. \end{aligned}$$

\(\square \)

Numerical results indicate that the bound on n in Proposition 9 may be relaxed to \(n \ge p+2\), which is the same bound as for the Efron–Morris estimator; see the Appendix.

Finally, we present simulation results to compare Stein’s estimator and the Efron–Morris estimator.

Figure 1 compares the Frobenius risk of Stein’s estimator and the Efron–Morris estimator when \(n=10\) and \(p=3\). It suggests that Stein’s estimator dominates the Efron–Morris estimator under the Frobenius loss. Both estimators attain constant risk reduction when some singular values of M are small, regardless of the magnitude of the other singular values. Thus, both estimators work well for low-rank matrices. See Matsuda and Strawderman (2022) for related discussions.
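The comparison in Fig. 1 can be reproduced along the following lines. The sketch below is a minimal Monte Carlo version of the left-panel setting; the replication count and the grid of \(\sigma _1(M)\) values are arbitrary choices, so the resulting risk estimates are noisy but indicative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, reps = 10, 3, 5000
c_em = np.full(p, n - p - 1.0)                                   # Efron-Morris: c_k = n - p - 1
c_st = np.array([n + p - 2 * k - 1.0 for k in range(1, p + 1)])  # Stein: c_k = n + p - 2k - 1

def estimate(X, c):
    # estimator (22) for given constants c_1, ..., c_p
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(s - c / s) @ Vt

def frobenius_risk(M, c):
    # Monte Carlo estimate of E || Mhat - M ||_F^2
    total = 0.0
    for _ in range(reps):
        X = M + rng.normal(size=(n, p))
        total += np.sum((estimate(X, c) - M) ** 2)
    return total / reps

# Setting of the left panel of Fig. 1: sigma_2(M) = sigma_3(M) = 0, sigma_1(M) varying
for sigma1 in [0.0, 5.0, 10.0, 20.0]:
    M = np.zeros((n, p))
    M[0, 0] = sigma1                 # rank-one mean with a single nonzero singular value
    print(sigma1, frobenius_risk(M, c_em), frobenius_risk(M, c_st))
```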

Fig. 1 Frobenius risk of the Efron–Morris estimator (dashed) and Stein’s estimator (solid) for \(n=10\) and \(p=3\). Left: \(\sigma _2(M)=\sigma _3(M)=0\). Right: \(\sigma _1(M)=20\), \(\sigma _3(M)=0\)

Fig. 2 Eigenvalues of the matrix quadratic risk of the Efron–Morris estimator (dashed) and Stein’s estimator (solid) for \(n=10\) and \(p=3\). Left: \(\sigma _2(M)=\sigma _3(M)=0\). Right: \(\sigma _1(M)=20\), \(\sigma _3(M)=0\). In the left panel, the second and third eigenvalues of each estimator almost overlap. In the right panel, the first eigenvalues of the two estimators almost overlap

Fig. 3 Frobenius risk of the positive-part Efron–Morris estimator (dashed) and positive-part Stein’s estimator (solid) for \(n=10\) and \(p=3\). Left: \(\sigma _2(M)=\sigma _3(M)=0\). Right: \(\sigma _1(M)=20\), \(\sigma _3(M)=0\)

Fig. 4 Eigenvalues of the matrix quadratic risk of the positive-part Efron–Morris estimator (dashed) and positive-part Stein’s estimator (solid) for \(n=10\) and \(p=3\). Left: \(\sigma _2(M)=\sigma _3(M)=0\). Right: \(\sigma _1(M)=20\), \(\sigma _3(M)=0\). In the left panel, the second and third eigenvalues of each estimator almost overlap. In the right panel, the first eigenvalues of the two estimators almost overlap

Figure 2 plots the three eigenvalues \(\lambda _1 \ge \lambda _2 \ge \lambda _3\) of the matrix quadratic risk of Stein’s estimator and the Efron–Morris estimator in the same settings as in Fig. 1. Since all eigenvalues are less than \(n=10\), the matrix quadratic risk \(R(M,\hat{M})\) satisfies \(R(M,\hat{M}) \preceq n I_p\) for every M. Thus, both estimators dominate the maximum-likelihood estimator under the matrix quadratic loss, which is compatible with (24) and Proposition 9. Also, each eigenvalue for Stein’s estimator is smaller than the corresponding one for the Efron–Morris estimator, which suggests that Stein’s estimator dominates the Efron–Morris estimator even under the matrix quadratic loss. Developing a rigorous theory for this observation is an interesting direction for future work.

Figures 3 and 4 present the results for the positive-part estimators in the same settings as Figs. 1 and 2, respectively. They show qualitatively the same behavior.