1 Introduction

Suppose that we have a matrix observation \(X \in \mathbb {R}^{n \times p}\) whose entries are independent normal random variables \(X_{ij} \sim \textrm{N} (M_{ij},1)\), where \(M \in \mathbb {R}^{n \times p}\) is an unknown mean matrix. In this setting, we consider estimation of M under the matrix quadratic loss (Abu-Shanab et al. 2012; Matsuda and Strawderman 2022)

$$\begin{aligned} L(M,\hat{M}) = (\hat{M} - M)^{\top } (\hat{M} - M), \end{aligned}$$
(1)

which takes a value in the set of \(p \times p\) positive semidefinite matrices. The risk function of an estimator \(\hat{M}=\hat{M}(X)\) is defined as \(R(M,\hat{M}) = \textrm{E}_M [ L ( M,\hat{M}(X) ) ]\), and an estimator \(\hat{M}_1\) is said to dominate another estimator \(\hat{M}_2\) if \(R(M,\hat{M}_1) \preceq R(M,\hat{M}_2)\) for every M, where \(\preceq \) is the Löwner order: \(A \preceq B\) means that \(B-A\) is positive semidefinite. Thus, if \(\hat{M}_1\) dominates \(\hat{M}_2\) under the matrix quadratic loss, then

$$\begin{aligned} \textrm{E}_M [ \Vert (\hat{M}_1 - M) c \Vert ^2 ] \le \textrm{E}_M [ \Vert (\hat{M}_2 - M) c \Vert ^2 ] \end{aligned}$$

for every M and \(c \in \mathbb {R}^p\). In particular, each column of \(\hat{M}_1\) dominates that of \(\hat{M}_2\) as an estimator of the corresponding column of M under quadratic loss. Recently, Matsuda and Strawderman (2022) investigated shrinkage estimation in this setting by introducing a concept called matrix superharmonicity, which can be viewed as a generalization of the theory by Stein (1974) for a normal mean vector. Note that shrinkage estimation of a normal mean matrix under the Frobenius loss, which is the trace of the matrix quadratic loss, has been well studied (e.g., Matsuda and Komaki 2015; Tsukuma 2008; Tsukuma and Kubokawa 2020; Yuasa and Kubokawa 2023a, b; Zheng 1986).

Many common estimators of a normal mean matrix are orthogonally invariant. Namely, they satisfy \(\hat{M}(PXQ) = P \hat{M}(X) Q\) for any orthogonal matrices \(P \in O(n)\) and \(Q \in O(p)\). This property can be viewed as a generalization of the rotational invariance of estimators of a normal mean vector, which is satisfied by many minimax shrinkage estimators (Fourdrinier et al. 2018). We focus on orthogonally invariant estimators given by

$$\begin{aligned} \hat{M} = X+\widetilde{\nabla }h(X), \end{aligned}$$
(2)

where h(X) is an orthogonally invariant function satisfying \(h(PXQ)=h(X)\) for any orthogonal matrices \(P \in O(n)\) and \(Q \in O(p)\), and \(\widetilde{\nabla }\) is the matrix gradient operator defined by (3). For example, the maximum-likelihood estimator \(\hat{M}=X\) corresponds to \(h(X)=0\). The Efron–Morris estimator (Efron and Morris 1972), defined by \(\hat{M}=X(I-(n-p-1)(X^{\top }X)^{-1})\) when \(n-p-1>0\), corresponds to \(h(X)={-(n-p-1)/2} \cdot \log \det (X^{\top } X)\). This estimator can be viewed as a matrix generalization of the James–Stein estimator, and it is minimax under the Frobenius loss (Efron and Morris 1972) as well as under the matrix quadratic loss (Matsuda and Strawderman 2022). We will provide another example of an orthogonally invariant estimator of the form (2) in Sect. 4. Note that an estimator of the form (2) is called a pseudo-Bayes estimator (Fourdrinier et al. 2018), because it coincides with the (generalized) Bayes estimator when h is the logarithm of the marginal density of X with respect to some prior on M (Tweedie’s formula).
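As a quick numerical check of the representation (2) for the Efron–Morris estimator, its closed form can be compared with \(X+\widetilde{\nabla }h(X)\), with the gradient approximated by central finite differences. The following Python sketch does this; the dimensions, random seed, and step size are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 3
X = rng.normal(size=(n, p))

def h(Y):
    # h(X) = -((n - p - 1) / 2) * log det(X^T X)
    nn, pp = Y.shape
    return -0.5 * (nn - pp - 1) * np.linalg.slogdet(Y.T @ Y)[1]

# Closed-form Efron-Morris estimator
M_em = X @ (np.eye(p) - (n - p - 1) * np.linalg.inv(X.T @ X))

# Pseudo-Bayes form (2): X plus the matrix gradient of h,
# approximated here by central finite differences
eps = 1e-6
grad = np.zeros_like(X)
for a in range(n):
    for i in range(p):
        E = np.zeros_like(X)
        E[a, i] = eps
        grad[a, i] = (h(X + E) - h(X - E)) / (2 * eps)

print(np.max(np.abs(M_em - (X + grad))))  # agreement up to numerical error
```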

In this study, to further the theory of shrinkage estimation under the matrix quadratic loss, we develop a general formula for the matrix quadratic risk of orthogonally invariant estimators of the form (2). First, we prepare several matrix derivative formulas in Sect. 2. Then, we derive the formula for the matrix quadratic risk in Sect. 3. Finally, we present an example in Sect. 4, which is motivated by a proposal that Stein made 50 years ago for improving on the Efron–Morris estimator (Stein 1974).

2 Matrix derivative formulas

Here, we develop matrix derivative formulas based on Stein (1974). Note that, whereas Stein (1974) considered a setting where X is a \(p \times n\) matrix, here we take X to be an \(n \times p\) matrix. In the following, the subscripts a, b, \(\ldots \) run from 1 to n and the subscripts i, j, \(\ldots \) run from 1 to p. We denote the Kronecker delta by \(\delta _{ij}\).

We employ the following notation for matrix derivatives, introduced in Matsuda and Strawderman (2022).

Definition 1

For a function \(f: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}\), its matrix gradient \(\widetilde{\nabla } f: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}^{n \times p}\) is defined as

$$\begin{aligned} (\widetilde{\nabla } f(X))_{ai} = \frac{\partial }{\partial X_{ai}} f(X). \end{aligned}$$
(3)

Definition 2

For a \(C^2\) function \(f: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}\), its matrix Laplacian \(\widetilde{\Delta } f: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}^{p \times p}\) is defined as

$$\begin{aligned} (\widetilde{\Delta } f(X))_{ij} = \sum _{a=1}^n \frac{\partial ^2}{\partial X_{ai} \partial X_{aj}} f(X). \end{aligned}$$
(4)

Let

$$\begin{aligned} X^{\top } X = V \Lambda V^{\top } \end{aligned}$$
(5)

be a spectral decomposition of \(X^{\top } X\), where \(V=(v_1,\dots ,v_p)\) is an orthogonal matrix and \(\Lambda = \textrm{diag}(\lambda _1,\dots ,\lambda _p)\) is a diagonal matrix. In the following, we assume that the eigenvalues \(\lambda _1,\dots ,\lambda _p\) are distinct, which holds almost surely, so that the denominators appearing in Lemmas 2 and 4 are well defined. Then, the derivatives of \(\lambda \) and V are obtained as follows.

Lemma 1

The derivative of \(\lambda _i\) is

$$\begin{aligned} \frac{\partial \lambda _i}{\partial X_{aj}} = 2 V_{ji} \sum _k X_{ak} V_{ki}. \end{aligned}$$
(6)

Thus

$$\begin{aligned} \widetilde{\nabla } \lambda _i = 2 X v_i v_i^{\top }, \end{aligned}$$
(7)

where \(v_i\) is the i-th column vector of V.

Proof

By differentiating \(V^{\top } V = I_p\) and using \((\textrm{d} V)^{\top } V=(V^{\top } \textrm{d} V)^{\top }\), we obtain

$$\begin{aligned} V^{\top } \textrm{d} V + (V^{\top } \textrm{d} V)^{\top } = O, \end{aligned}$$
(8)

which means that \(V^{\top } \textrm{d} V\) is antisymmetric.

Taking the differential of (5), we have

$$\begin{aligned} \textrm{d} (X^{\top } X) = (\textrm{d} V) \Lambda V^{\top } + V (\textrm{d} \Lambda ) V^{\top } + V \Lambda (\textrm{d} V)^{\top }. \end{aligned}$$
(9)

Then, multiplying (9) on the left by \(V^{\top }\) and on the right by V, we obtain

$$\begin{aligned} V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V = V^{\top } (\textrm{d} V) \Lambda + \textrm{d} \Lambda + \Lambda (\textrm{d} V)^{\top } V. \end{aligned}$$
(10)

Since \(\Lambda \) and \(\textrm{d} \Lambda \) are diagonal and \((\textrm{d} V)^{\top } V=(V^{\top } \textrm{d} V)^{\top }\), the (ij)th entry of (10) yields

$$\begin{aligned} (V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V)_{ij}&= (V^{\top } \textrm{d} V)_{ij} \lambda _j + \delta _{ij} \textrm{d} \lambda _i + \lambda _i ((\textrm{d} V)^{\top } V)_{ij} \nonumber \\ {}&= (V^{\top } \textrm{d} V)_{ij} \lambda _j + \delta _{ij} \textrm{d} \lambda _i + \lambda _i (V^{\top } \textrm{d} V)_{ji}. \end{aligned}$$

Since \((V^{\top } \textrm{d} V)_{ji}=-(V^{\top } \textrm{d} V)_{ij}\) from (8), we obtain

$$\begin{aligned} (V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V)_{ij}&= (\lambda _j-\lambda _i) (V^{\top } \textrm{d} V)_{ij} + \delta _{ij} \textrm{d} \lambda _i. \end{aligned}$$
(11)

On the other hand, from \(\textrm{d} (X^{\top } X) = (\textrm{d} X)^{\top } X + X^{\top } \textrm{d} X\)

$$\begin{aligned} \textrm{d} (X^{\top } X)_{ij} = \sum _a ((\textrm{d} X)_{ai} X_{aj} + X_{ai} (\textrm{d} X)_{aj}). \end{aligned}$$
(12)

By taking \(i=j\) in (11)

$$\begin{aligned} \textrm{d} \lambda _i = (V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V)_{ii} = \sum _{j,k} (V^{\top })_{ij} (\textrm{d} (X^{\top } X))_{jk} V_{ki}. \end{aligned}$$

Then, using (12)

$$\begin{aligned} \textrm{d} \lambda _i&= \sum _{j,k} V_{ji} V_{ki} \sum _a ((\textrm{d} X)_{aj} X_{ak} + X_{aj} (\textrm{d} X)_{ak}) \\ {}&= 2 \sum _{j,k} V_{ji} V_{ki} \sum _a X_{ak} (\textrm{d} X)_{aj} \\ {}&= 2 \sum _{a,j} V_{ji} \sum _k X_{ak} V_{ki} (\textrm{d} X)_{aj}. \end{aligned}$$

Thus, we obtain (6), which leads to (7). \(\square \)
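Formula (7) can also be verified numerically by differentiating the eigenvalues of \(X^{\top }X\) with central differences. The following sketch (with arbitrary dimensions, seed, and step size) compares the two sides entrywise.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.normal(size=(n, p))

def eigvals_desc(Y):
    # eigenvalues of Y^T Y in descending order
    return np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]

lam, V = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]          # use the same descending order throughout
lam, V = lam[order], V[:, order]

eps = 1e-6
for i in range(p):
    # central-difference gradient of lambda_i with respect to the entries of X
    num = np.zeros_like(X)
    for a in range(n):
        for j in range(p):
            E = np.zeros_like(X)
            E[a, j] = eps
            num[a, j] = (eigvals_desc(X + E)[i] - eigvals_desc(X - E)[i]) / (2 * eps)
    ana = 2 * np.outer(X @ V[:, i], V[:, i])   # formula (7): 2 X v_i v_i^T
    print(i, np.max(np.abs(num - ana)))        # small for each i
```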

Lemma 2

The derivative of \(V_{ij}\) is

$$\begin{aligned} \frac{\partial V_{ij}}{\partial X_{ak}} = \sum _{l \ne j} \frac{V_{il}}{\lambda _j-\lambda _l} ((XV)_{aj} V_{kl} + (XV)_{al} V_{kj}). \end{aligned}$$
(13)

Proof

From (8), we have \((V^{\top } \textrm{d} V)_{ii}=0\). Also, from (11)

$$\begin{aligned} (V^{\top } \textrm{d} V)_{ij} = \frac{1}{\lambda _j-\lambda _i} (V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V)_{ij} \end{aligned}$$

for \(i \ne j\). Therefore

$$\begin{aligned} \textrm{d} V_{ij}&= (V V^{\top } \textrm{d} V)_{ij} \\ {}&= \sum _k V_{ik} (V^{\top } \textrm{d} V)_{kj} \\ {}&= \sum _{k \ne j} V_{ik} \frac{1}{\lambda _j-\lambda _k} (V^{\top } \cdot \textrm{d} (X^{\top } X) \cdot V)_{kj} \\ {}&= \sum _{k \ne j} \frac{1}{\lambda _j-\lambda _k} V_{ik} \sum _{l,m} V_{lk} \textrm{d} (X^{\top } X)_{lm} V_{mj}. \end{aligned}$$

Then, using (12)

$$\begin{aligned} \textrm{d} V_{ij}&= \sum _{k \ne j} \frac{1}{\lambda _j-\lambda _k} V_{ik} \sum _{l,m} V_{lk} \sum _a ((\textrm{d} X)_{al} X_{am} + X_{al} (\textrm{d} X)_{am}) V_{mj} \\ {}&= \sum _{k \ne j} \frac{1}{\lambda _j-\lambda _k} V_{ik} \sum _a \left( (XV)_{aj} \sum _l V_{lk} (\textrm{d} X)_{al} + (XV)_{ak} \sum _m V_{mj} (\textrm{d} X)_{am} \right) \\ {}&= \sum _l \sum _{k \ne j} \frac{V_{ik}}{\lambda _j-\lambda _k} \sum _a ((XV)_{aj} V_{lk}+(XV)_{ak} V_{lj}) (\textrm{d} X)_{al} \\ {}&= \sum _{a,k} \sum _{l \ne j} \frac{V_{il}}{\lambda _j-\lambda _l} ((XV)_{aj} V_{kl}+(XV)_{al} V_{kj}) (\textrm{d} X)_{ak}, \end{aligned}$$

where we switched k and l in the last step. Thus, we obtain (13). \(\square \)
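Formula (13) can be checked in the same way, provided that the eigenvector signs of the perturbed decompositions are aligned with those of the unperturbed one. The following sketch does this for a single entry \(X_{ak}\); the perturbed entry and all numerical parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 3
X = rng.normal(size=(n, p))

def eig_desc(Y, V_ref=None):
    # eigen-decomposition of Y^T Y with eigenvalues in descending order;
    # optionally align eigenvector signs with a reference matrix V_ref
    lam, V = np.linalg.eigh(Y.T @ Y)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    if V_ref is not None:
        V = V * np.sign(np.sum(V * V_ref, axis=0))
    return lam, V

lam, V = eig_desc(X)
XV = X @ V
a, k = 2, 1        # an arbitrary entry X_{ak} to perturb
eps = 1e-6

E = np.zeros_like(X)
E[a, k] = eps
_, Vp = eig_desc(X + E, V)
_, Vm = eig_desc(X - E, V)
num = (Vp - Vm) / (2 * eps)            # numerical dV_{ij}/dX_{ak}

ana = np.zeros((p, p))                 # formula (13)
for i in range(p):
    for j in range(p):
        for l in range(p):
            if l != j:
                ana[i, j] += V[i, l] / (lam[j] - lam[l]) * (
                    XV[a, j] * V[k, l] + XV[a, l] * V[k, j])

print(np.max(np.abs(num - ana)))       # small (numerical agreement)
```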

A function h is said to be orthogonally invariant if it satisfies \(h(PXQ)=h(X)\) for any orthogonal matrices \(P \in O(n)\) and \(Q \in O(p)\). Such a function can be written as \(h(X)=H(\lambda )\), where \(\lambda =(\lambda _1,\dots ,\lambda _p)\) denotes the eigenvalues of \(X^{\top } X\) given by (5), and its derivatives are calculated as follows.

Lemma 3

The matrix gradient (3) of an orthogonally invariant function \(h(X)=H(\lambda )\) is

$$\begin{aligned} \widetilde{\nabla } h = 2 \sum _i \frac{\partial H}{\partial \lambda _i} X v_i v_i^{\top }. \end{aligned}$$
(14)

Thus

$$\begin{aligned} (\widetilde{\nabla } h)^{\top } (\widetilde{\nabla } h) = 4 \sum _i \left( \frac{\partial H}{\partial \lambda _i} \right) ^2 \lambda _i v_i v_i^{\top } = V D V^{\top }, \end{aligned}$$
(15)

where D is the \(p \times p\) diagonal matrix given by

$$\begin{aligned} D_{kk} = 4 \lambda _k \left( \frac{\partial H}{\partial \lambda _k} \right) ^2. \end{aligned}$$

Proof

From (7)

$$\begin{aligned} \widetilde{\nabla } h = \sum _i \frac{\partial H}{\partial \lambda _i} \widetilde{\nabla } \lambda _i = 2 \sum _i \frac{\partial H}{\partial \lambda _i} X v_i v_i^{\top }, \end{aligned}$$

which yields (14). Then, using \(X^{\top }X = V \Lambda V^{\top }\) and \(V^{\top } V = I_p\)

$$\begin{aligned} (\widetilde{\nabla } h)^{\top } (\widetilde{\nabla } h)&= 4 \sum _{i,j} \frac{\partial H}{\partial \lambda _i} \frac{\partial H}{\partial \lambda _j} v_i v_i^{\top } X^{\top } X v_j v_j^{\top } \\ {}&= 4 \sum _{i,j} \frac{\partial H}{\partial \lambda _i} \frac{\partial H}{\partial \lambda _j} v_i \Lambda _{ij} v_j^{\top } \\ {}&= 4 \sum _{i} \left( \frac{\partial H}{\partial \lambda _i} \right) ^2 \lambda _i v_i v_i^{\top }, \end{aligned}$$

which yields (15). \(\square \)
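As an illustration of (14) and (15), one can take the orthogonally invariant test function \(h(X)=\log \det (I_p+X^{\top }X)\), that is, \(H(\lambda )=\sum _i \log (1+\lambda _i)\), for which \(\partial H/\partial \lambda _i = 1/(1+\lambda _i)\). The following sketch compares (14) with a central-difference gradient and checks (15) directly; the test function and numerical parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 3
X = rng.normal(size=(n, p))

# Test function: H(lambda) = sum_i log(1 + lambda_i), i.e. h(X) = log det(I_p + X^T X)
def h(Y):
    q = Y.shape[1]
    return np.linalg.slogdet(np.eye(q) + Y.T @ Y)[1]

lam, V = np.linalg.eigh(X.T @ X)
dH = 1.0 / (1.0 + lam)

# Formula (14): matrix gradient 2 sum_i (dH/dlambda_i) X v_i v_i^T
ana = 2 * (X @ V) @ np.diag(dH) @ V.T

# Central-difference gradient of h for comparison
eps = 1e-6
num = np.zeros_like(X)
for a in range(n):
    for i in range(p):
        E = np.zeros_like(X)
        E[a, i] = eps
        num[a, i] = (h(X + E) - h(X - E)) / (2 * eps)
print(np.max(np.abs(num - ana)))                       # checks (14)

# Formula (15): (grad h)^T (grad h) = V D V^T with D_kk = 4 lambda_k (dH/dlambda_k)^2
print(np.max(np.abs(ana.T @ ana - V @ np.diag(4 * lam * dH**2) @ V.T)))  # checks (15)
```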

Lemma 4

The matrix Laplacian (4) of an orthogonally invariant function \(h(X)=H(\lambda )\) is

$$\begin{aligned} \widetilde{\Delta } h = V D V^{\top }, \end{aligned}$$
(16)

where D is the \(p \times p\) diagonal matrix given by

$$\begin{aligned} D_{kk} = 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} + 2 \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \left( \frac{\partial H}{\partial \lambda _k} - \frac{\partial H}{\partial \lambda _l} \right) . \end{aligned}$$

Proof

From (14)

$$\begin{aligned} \frac{\partial ^2 h}{\partial X_{ai} \partial X_{aj}}&= \frac{\partial }{\partial X_{ai}} \left( 2 \sum _k \frac{\partial H}{\partial \lambda _k} X v_k v_k^{\top } \right) _{aj} \nonumber \\ {}&= 2 \sum _k \left( \left( \widetilde{\nabla } \frac{\partial H}{\partial \lambda _k} \right) _{ai} (X v_k v_k^{\top })_{aj} + \frac{\partial H}{\partial \lambda _k} (v_k v_k^{\top })_{ij} \right. \nonumber \\ {}&\left. + \frac{\partial H}{\partial \lambda _k} \sum _l X_{al} \frac{\partial }{\partial X_{ai}} (v_k v_k^{\top })_{lj} \right) \nonumber \\ {}&= 2 \sum _k \left( 2 \sum _l \frac{\partial ^2 H}{\partial \lambda _k \partial \lambda _l} (X v_l v_l^{\top })_{ai} (X v_k v_k^{\top })_{aj} + \frac{\partial H}{\partial \lambda _k} V_{ik} V_{jk} \right. \nonumber \\ {}&\quad \left. + \frac{\partial H}{\partial \lambda _k} \sum _l X_{al} \left( \frac{\partial V_{lk}}{\partial X_{ai}} V_{jk} + V_{lk} \frac{\partial V_{jk}}{\partial X_{ai}} \right) \right) . \end{aligned}$$
(17)

Also, from (13)

$$\begin{aligned} \frac{\partial V_{lk}}{\partial X_{ai}} V_{jk} + V_{lk} \frac{\partial V_{jk}}{\partial X_{ai}}&= \sum _{m \ne k} \frac{V_{lm}}{\lambda _k-\lambda _m} ((XV)_{ak}V_{im}+(XV)_{am}V_{ik}) V_{jk} \nonumber \\ {}&\qquad + V_{lk} \sum _{m \ne k} \frac{V_{jm}}{\lambda _k-\lambda _m} ((XV)_{ak}V_{im}+(XV)_{am}V_{ik}) \nonumber \\ {}&= \sum _{m \ne k} \frac{(V_{lm}V_{jk}+V_{lk}V_{jm})((XV)_{ak}V_{im}+(XV)_{am}V_{ik})}{\lambda _k-\lambda _m}. \end{aligned}$$
(18)

By substituting (18) into (17) and taking the sum

$$\begin{aligned} (\widetilde{\Delta } h)_{ij}&= \sum _{a=1}^n \frac{\partial ^2 h}{\partial X_{ai} \partial X_{aj}} \\ {}&= 4 \sum _{k,l} \frac{\partial ^2 H}{\partial \lambda _k \partial \lambda _l} (v_l v_l^{\top } X^{\top } X v_k v_k^{\top })_{ij} + 2n \sum _k \frac{\partial H}{\partial \lambda _k} V_{ik} V_{jk} \\ {}&\qquad + 2 \sum _{k,l} \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{(V_{lm}V_{jk}+V_{lk}V_{jm})((X^{\top }XV)_{lk}V_{im}+(X^{\top }XV)_{lm}V_{ik})}{\lambda _k-\lambda _m} \\ {}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} \right) V_{ik} V_{jk} \\ {}&\qquad + 2 \sum _{k,l} \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{(V_{lm}V_{jk}+V_{lk}V_{jm})(\lambda _k V_{lk}V_{im}+\lambda _m V_{lm}V_{ik})}{\lambda _k-\lambda _m} \\ {}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} \right) V_{ik} V_{jk} {+} 2 \sum _k \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{\lambda _k V_{jm} V_{im} {+} \lambda _m V_{jk} V_{ik}}{\lambda _k{-}\lambda _m}, \end{aligned}$$

where we used \(X^{\top }X = V \Lambda V^{\top }\) and

$$\begin{aligned}&\sum _l {(V_{lm}V_{jk}+V_{lk}V_{jm})(\lambda _k V_{lk}V_{im}+\lambda _m V_{lm}V_{ik})} \\ {}&\quad = \lambda _k (\delta _{km} V_{jk} + V_{jm}) V_{im} + \lambda _m (V_{jk} + \delta _{km} V_{jm}) V_{ik}. \end{aligned}$$

Then

$$\begin{aligned} (\widetilde{\Delta } h)_{ij}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} \right. \\&\left. \quad + 2 \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{\lambda _m}{\lambda _k-\lambda _m} \right) V_{ik} V_{jk} + 2 \sum _k \lambda _k \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{V_{im} V_{jm}}{\lambda _k-\lambda _m} \\ {}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} + 2 \sum _{m \ne k} \frac{\lambda _m}{\lambda _k-\lambda _m} \left( \frac{\partial H}{\partial \lambda _k} - \frac{\partial H}{\partial \lambda _m} \right) \right) V_{ik} V_{jk}, \end{aligned}$$

where we used

$$\begin{aligned} \sum _k \lambda _k \frac{\partial H}{\partial \lambda _k} \sum _{m \ne k} \frac{V_{im} V_{jm}}{\lambda _k-\lambda _m}&= \sum _m V_{im} V_{jm} \sum _{k \ne m} \frac{\lambda _k}{\lambda _k-\lambda _m} \frac{\partial H}{\partial \lambda _k} \\ {}&= \sum _k V_{ik} V_{jk} \sum _{m \ne k} \frac{\lambda _m}{\lambda _m-\lambda _k} \frac{\partial H}{\partial \lambda _m}. \end{aligned}$$

Thus, by relabeling the index m as l, we obtain (16). \(\square \)
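Formula (16) can be checked against a direct evaluation of the matrix Laplacian (4) by mixed central second differences, again with the test function \(H(\lambda )=\sum _i \log (1+\lambda _i)\). The dimensions and step size in the following sketch are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 6, 3
X = rng.normal(size=(n, p))

# Same test function as before: h(X) = log det(I_p + X^T X)
def h(Y):
    q = Y.shape[1]
    return np.linalg.slogdet(np.eye(q) + Y.T @ Y)[1]

lam, V = np.linalg.eigh(X.T @ X)
dH = 1.0 / (1.0 + lam)
d2H = -1.0 / (1.0 + lam) ** 2

# D from Lemma 4
D = 4 * lam * d2H + 2 * n * dH
for k in range(p):
    for l in range(p):
        if l != k:
            D[k] += 2 * lam[l] / (lam[k] - lam[l]) * (dH[k] - dH[l])
ana = V @ np.diag(D) @ V.T

# Matrix Laplacian (4) evaluated directly by mixed central second differences
eps = 1e-4
num = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        for a in range(n):
            Ei = np.zeros_like(X); Ei[a, i] = eps
            Ej = np.zeros_like(X); Ej[a, j] = eps
            num[i, j] += (h(X + Ei + Ej) - h(X + Ei - Ej)
                          - h(X - Ei + Ej) + h(X - Ei - Ej)) / (4 * eps**2)

print(np.max(np.abs(num - ana)))   # small (finite-difference error)
```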

By taking the trace of the matrix Laplacian (16), we have

$$\begin{aligned} {\Delta } h&= \textrm{tr} (\widetilde{\Delta } h) = \textrm{tr} (D) \\ {}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} + 2 \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \left( \frac{\partial H}{\partial \lambda _k} - \frac{\partial H}{\partial \lambda _l} \right) \right) \\ {}&= \sum _k \left( 4 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2n \frac{\partial H}{\partial \lambda _k} \right) + 2 \sum _k \frac{\partial H}{\partial \lambda _k} \sum _{l \ne k} \frac{\lambda _k+\lambda _l}{\lambda _k-\lambda _l} \\ {}&= 4 \sum _k \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + 2 \sum _k \left( n-p+1 + 2 \lambda _k \sum _{l \ne k} \frac{1}{\lambda _k-\lambda _l} \right) \frac{\partial H}{\partial \lambda _k}, \end{aligned}$$

where we used

$$\begin{aligned} \sum _{l \ne k} \frac{\lambda _k+\lambda _l}{\lambda _k-\lambda _l}&= \lambda _k \sum _{l \ne k} \frac{1}{\lambda _k-\lambda _l} + \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \nonumber \\ {}&= \lambda _k \sum _{l \ne k} \frac{1}{\lambda _k-\lambda _l} + \sum _{l \ne k} \left( \frac{\lambda _k}{\lambda _k-\lambda _l} - 1 \right) \nonumber \\ {}&= 2 \lambda _k \sum _{l \ne k} \frac{1}{\lambda _k-\lambda _l} -p+1. \end{aligned}$$
(19)

This coincides with the Laplacian formula in Stein (1974).

3 Risk formula

Now, we derive a general formula for the matrix quadratic risk of orthogonally invariant estimators of the form (2).

Theorem 5

Let \(h(X)=H(\lambda )\) be an orthogonally invariant function. Then, the matrix quadratic risk of an estimator \(\hat{M}=X+\widetilde{\nabla }h(X)\) is given by

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M)&= n I_p + \textrm{E}_M [ V D V^{\top } ], \end{aligned}$$
(20)

where D is the \(p \times p\) diagonal matrix given by

$$\begin{aligned} D_{kk}&= 4 \left( 2 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + n \frac{\partial H}{\partial \lambda _k} + \lambda _k \left( \frac{\partial H}{\partial \lambda _k} \right) ^2 + \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \left( \frac{\partial H}{\partial \lambda _k} - \frac{\partial H}{\partial \lambda _l} \right) \right) . \end{aligned}$$

Proof

From Matsuda and Strawderman (2022), the matrix quadratic risk of an estimator \(\hat{M}=X+g(X)\) with a weakly differentiable function g is

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M) = n I_p + \textrm{E}_M [\widetilde{\textrm{div}} \ g(X) + (\widetilde{\textrm{div}} \ g(X) )^{\top } + g(X)^{\top } g(X)], \end{aligned}$$
(21)

where the matrix divergence \(\widetilde{\textrm{div}} \ g: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}^{p \times p}\) of a function \(g: \mathbb {R}^{n \times p} \rightarrow \mathbb {R}^{n \times p}\) is defined as

$$\begin{aligned} (\widetilde{\textrm{div}} \ g(X))_{ij} = \sum _{a=1}^n \frac{\partial }{\partial X_{ai}} g_{aj}(X). \end{aligned}$$

Therefore, by substituting \(g(X)=\widetilde{\nabla }h(X)\) and using \(\widetilde{\textrm{div}} \circ \widetilde{\nabla }=\widetilde{\Delta }\)

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M) = n I_p + \textrm{E}_M [2 \widetilde{\Delta } h(X) + \widetilde{\nabla }h(X)^{\top } \widetilde{\nabla }h(X)]. \end{aligned}$$

Thus, using (15) and (16), we obtain (20). \(\square \)
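A small Monte Carlo experiment illustrates (20). For the test function \(H(\lambda )=\sum _i \log (1+\lambda _i)\) and an arbitrary mean matrix M, the empirical average of \((\hat{M}-M)^{\top }(\hat{M}-M)\) is compared with the empirical average of \(n I_p + VDV^{\top }\). The test function, dimensions, and replication count in the following sketch are arbitrary choices, so the two matrices agree only up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, reps = 8, 2, 20000
M = rng.normal(size=(n, p))        # an arbitrary true mean matrix

lhs = np.zeros((p, p))             # Monte Carlo average of (Mhat - M)^T (Mhat - M)
rhs = np.zeros((p, p))             # Monte Carlo average of n I_p + V D V^T

for _ in range(reps):
    X = M + rng.normal(size=(n, p))
    lam, V = np.linalg.eigh(X.T @ X)
    dH = 1.0 / (1.0 + lam)                       # H(lambda) = sum_i log(1 + lambda_i)
    d2H = -1.0 / (1.0 + lam) ** 2
    Mhat = X + 2 * (X @ V) @ np.diag(dH) @ V.T   # estimator (2) via (14)

    D = 4 * (2 * lam * d2H + n * dH + lam * dH**2)   # D from Theorem 5
    for k in range(p):
        for l in range(p):
            if l != k:
                D[k] += 4 * lam[l] / (lam[k] - lam[l]) * (dH[k] - dH[l])

    lhs += (Mhat - M).T @ (Mhat - M) / reps
    rhs += (n * np.eye(p) + V @ np.diag(D) @ V.T) / reps

print(lhs)
print(rhs)    # the two p x p matrices agree up to Monte Carlo error
```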

By taking the trace of (20) and using (19), we obtain the following formula for the Frobenius risk of orthogonally invariant estimators, which coincides with the one given by Stein (1974).

Corollary 6

Let \(h(X)=H(\lambda )\) be an orthogonally invariant function. Then, the Frobenius risk of an estimator \(\hat{M}=X+\widetilde{\nabla }h(X)\) is given by

$$\begin{aligned}&\textrm{E}_M \Vert \hat{M}-M \Vert _{\textrm{F}}^2 \\ {}&\quad = np + 4 \textrm{E}_M \left[ \sum _k \left( 2 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + n \frac{\partial H}{\partial \lambda _k} + \lambda _k \left( \frac{\partial H}{\partial \lambda _k} \right) ^2 \right) + \sum _k \frac{\partial H}{\partial \lambda _k} \sum _{l \ne k} \frac{\lambda _k + \lambda _l}{\lambda _k-\lambda _l} \right] \\ {}&\quad = np + 4 \textrm{E}_M \left[ \sum _k \left( 2 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + n \frac{\partial H}{\partial \lambda _k} +\lambda _k \left( \frac{\partial H}{\partial \lambda _k} \right) ^2 \right) \right. \\ {}&\left. + \sum _k \frac{\partial H}{\partial \lambda _k} \left( 2 \lambda _k \sum _{l \ne k} \frac{1}{\lambda _k-\lambda _l} -p+1 \right) \right] . \end{aligned}$$

We derived the risk formula for orthogonally invariant estimators of the form (2), which are called pseudo-Bayes estimators (Fourdrinier et al. 2018). The class of pseudo-Bayes estimators includes all Bayes and generalized Bayes estimators. Extending the current result to general orthogonally invariant estimators is an interesting direction for future work. Extension to the unknown-covariance case is another important problem; note that Section 6.6.2 of Tsukuma and Kubokawa (2020) derived a risk formula for a class of estimators in the unknown-covariance setting.

4 Example

We provide an example application of Theorem 5. Let \(X = U \Sigma V^{\top }\) with \(U \in \mathbb {R}^{n \times p}\), \(\Sigma = \textrm{diag} (\sigma _1, \ldots , \sigma _p)\) and \(V \in \mathbb {R}^{p \times p}\) be a singular value decomposition of X, where \(U^{\top } U = V^{\top } V = I_p\) and \(\sigma _1 \ge \cdots \ge \sigma _p \ge 0\) are the singular values of X. We consider an orthogonally invariant estimator given by

$$\begin{aligned} \hat{M} = U \cdot \textrm{diag} \left( \sigma _1 - \frac{c_1}{\sigma _1}, \dots , \sigma _p - \frac{c_p}{\sigma _p} \right) \cdot V^{\top }, \end{aligned}$$
(22)

where \(c_1,\dots ,c_p \ge 0\).

Lemma 7

The estimator (22) can be written in the form (2) with

$$\begin{aligned} h(X) = -\frac{1}{2} \sum _{k=1}^p {c_k} \log \lambda _k = - \sum _{k=1}^p {c_k} \log \sigma _k, \end{aligned}$$

where \(\lambda _1,\dots ,\lambda _p\) are the eigenvalues of \(X^{\top } X\), as shown in (5).

Proof

From (7)

$$\begin{aligned} \widetilde{\nabla } \log \lambda _k = \frac{1}{\lambda _k} \widetilde{\nabla } \lambda _k = \frac{2}{\lambda _k} X v_k v_k^{\top }. \end{aligned}$$

Thus

$$\begin{aligned} \widetilde{\nabla } h(X) = -X \sum _k \frac{c_k}{\lambda _k} v_k v_k^{\top } = -U \cdot \textrm{diag} \left( \frac{c_1}{\sigma _1}, \dots , \frac{c_p}{\sigma _p} \right) \cdot V^{\top }, \end{aligned}$$

where we used \(X=U \Sigma V^{\top }\) and \(\lambda _k=\sigma _k^2\). Therefore, the estimator (22) is written as \(\hat{M}=X+\widetilde{\nabla } h(X)\). \(\square \)
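Lemma 7 can be confirmed numerically by comparing (22), computed from the singular value decomposition, with \(X+\widetilde{\nabla }h(X)\), where the gradient is approximated by central differences. In the following sketch, the constants \(c_k\) and the numerical parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 9, 3
X = rng.normal(size=(n, p))
c = np.array([3.0, 2.0, 1.0])      # arbitrary nonnegative constants c_1, ..., c_p

# Estimator (22) computed from the singular value decomposition
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # singular values in descending order
M_22 = U @ np.diag(s - c / s) @ Vt

# Pseudo-Bayes form X + grad h with h(X) = -(1/2) sum_k c_k log lambda_k
def h(Y):
    lam = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]   # descending, matching c
    return -0.5 * np.sum(c * np.log(lam))

eps = 1e-6
grad = np.zeros_like(X)
for a in range(n):
    for i in range(p):
        E = np.zeros_like(X)
        E[a, i] = eps
        grad[a, i] = (h(X + E) - h(X - E)) / (2 * eps)

print(np.max(np.abs(M_22 - (X + grad))))    # agreement up to numerical error
```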

Theorem 8

The matrix quadratic risk of the estimator (22) is given by

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M)&= n I_p + \textrm{E}_M [ V D V^{\top } ], \end{aligned}$$
(23)

where D is the \(p \times p\) diagonal matrix given by

$$\begin{aligned} D_{kk} = \frac{1}{\lambda _k} c_k (c_k-2n+4) - \frac{2}{\lambda _k} \sum _{l \ne k} \frac{c_k \lambda _l - c_l \lambda _k}{\lambda _k-\lambda _l}. \end{aligned}$$

Proof

To apply Theorem 5, let

$$\begin{aligned} H(\lambda ) = -\frac{1}{2} \sum _k c_k \log \lambda _k. \end{aligned}$$

We have

$$\begin{aligned} \frac{\partial H}{\partial \lambda _k} = -\frac{c_k}{2 \lambda _k}, \quad \frac{\partial ^2 H}{\partial \lambda _k^2} = \frac{c_k}{2 \lambda _k^2}. \end{aligned}$$

Thus

$$\begin{aligned} D_{kk}&= 4 \left( 2 \lambda _k \frac{\partial ^2 H}{\partial \lambda _k^2} + n \frac{\partial H}{\partial \lambda _k} + \lambda _k \left( \frac{\partial H}{\partial \lambda _k} \right) ^2 + \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \left( \frac{\partial H}{\partial \lambda _k} - \frac{\partial H}{\partial \lambda _l} \right) \right) \\ {}&= \frac{1}{\lambda _k} c_k (c_k-2n+4) -2 \sum _{l \ne k} \frac{\lambda _l}{\lambda _k-\lambda _l} \left( \frac{c_k}{\lambda _k} - \frac{c_l}{\lambda _l} \right) \\ {}&= \frac{1}{\lambda _k} c_k (c_k-2n+4) - \frac{2}{\lambda _k} \sum _{l \ne k} \frac{c_k \lambda _l - c_l \lambda _k}{\lambda _k-\lambda _l}. \end{aligned}$$

Therefore, we obtain (23) from Theorem 5. \(\square \)

The Efron–Morris estimator (Efron and Morris 1972) corresponds to (22) with \(c_k \equiv n-p-1\). In this case

$$\begin{aligned} D_{kk}&= \frac{1}{\lambda _k} (n-p-1) (-n-p+3) - \frac{2}{\lambda _k} (n-p-1) \sum _{l \ne k} \frac{\lambda _l - \lambda _k}{\lambda _k-\lambda _l} \\ {}&= \frac{1}{\lambda _k} (n-p-1) (-n-p+3) + \frac{2}{\lambda _k} (n-p-1) (p-1) \\ {}&= - \frac{1}{\lambda _k} (n-p-1)^2. \end{aligned}$$

Thus, its matrix quadratic risk (23) is

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M)&= n I_p - (n-p-1)^2 \textrm{E}_M \left[ (X^{\top } X)^{-1} \right] . \end{aligned}$$
(24)

This coincides with the result in Matsuda and Strawderman (2022).

Motivated by Stein’s proposal (Stein 1974) for improving on the Efron–Morris estimator, we consider the estimator (22) with \(c_k = n+p-2k-1\). In the following, we call it “Stein’s estimator” for convenience. Stein (1974) stated that the positive part of Stein’s estimator dominates the positive part of the Efron–Morris estimator under the Frobenius loss, where “positive-part” means the modification of (22) given by

$$\begin{aligned} \hat{M} = U \cdot \textrm{diag} \left( \left( \sigma _1 - \frac{c_1}{\sigma _1} \right) _+, \dots , \left( \sigma _p - \frac{c_p}{\sigma _p} \right) _+ \right) \cdot V^{\top }, \end{aligned}$$
(25)

where \((a)_+=\max (0,a)\). It is known that the estimator (22) is dominated by its positive part (25) under the Frobenius loss (Tsukuma 2008).

Proposition 9

The matrix quadratic risk of Stein’s estimator (estimator (22) with \(c_k = n+p-2k-1\)) is given by

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M)&= n I_p + \textrm{E}_M [ V D V^{\top } ], \end{aligned}$$

where D is the \(p \times p\) diagonal matrix given by

$$\begin{aligned} D_{kk}&= -\frac{1}{\lambda _k} (n+p-2k-1) (n-3p+2k-1) + 4 \sum _{l \ne k} \frac{k-l}{\lambda _k-\lambda _l}. \end{aligned}$$

Thus, Stein’s estimator dominates the maximum-likelihood estimator under the matrix quadratic loss when \(n \ge 3p-1\).

Proof

By substituting \(c_k=n+p-2k-1\) into Theorem 8

$$\begin{aligned} D_{kk}&= \frac{1}{\lambda _k} (n+p-2k-1) (-n+p-2k+3) \\&- \frac{2}{\lambda _k} \sum _{l \ne k} \frac{(n+p-2k-1) \lambda _l - (n+p-2l-1) \lambda _k}{\lambda _k-\lambda _l} \\ {}&= \frac{1}{\lambda _k} (n+p-2k-1) (-n+p-2k+3) \\ {}&- \frac{2}{\lambda _k} \sum _{l \ne k} \left( -\frac{2(k-l) \lambda _k}{\lambda _k-\lambda _l} -(n+p-2k-1) \right) \\&= \frac{1}{\lambda _k} (n+p-2k-1) (-n+p-2k+3) \\&+ 4 \sum _{l \ne k} \frac{k-l}{\lambda _k-\lambda _l} +\frac{2}{\lambda _k} (p-1)(n+p-2k-1) \\ {}&= -\frac{1}{\lambda _k} (n+p-2k-1) (n-3p+2k-1) \\&+ 4 \sum _{l \ne k} \frac{k-l}{\lambda _k-\lambda _l}. \end{aligned}$$

The second term is nonpositive, since \(\lambda _1 \ge \lambda _2 \ge \dots \ge \lambda _p\). When \(n \ge 3p-1\), the first term is also nonpositive, and thus

$$\begin{aligned} \textrm{E}_M (\hat{M}-M)^{\top } (\hat{M}-M) \preceq n I_p. \end{aligned}$$

\(\square \)

Numerical results indicate that the bound on n in Proposition 9 may be relaxed to \(n \ge p+2\), which is the same bound as for the Efron–Morris estimator; see the Appendix.

Finally, we present simulation results to compare Stein’s estimator and the Efron–Morris estimator.

Figure 1 compares the Frobenius risk of Stein’s estimator and the Efron–Morris estimator when \(n=10\) and \(p=3\). It suggests that Stein’s estimator dominates the Efron–Morris estimator under the Frobenius loss. Both estimators attain constant risk reduction when some singular values of M are small, regardless of the magnitude of the other singular values. Thus, both estimators work well for low-rank matrices. See Matsuda and Strawderman (2022) for related discussions.
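The comparison in Fig. 1 can be reproduced along the following lines. The sketch below is a minimal Monte Carlo version of the left-panel setting; the replication count and the grid of \(\sigma _1(M)\) values are arbitrary choices, so the resulting risk estimates are noisy but indicative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, reps = 10, 3, 5000
c_em = np.full(p, n - p - 1.0)                                   # Efron-Morris: c_k = n - p - 1
c_st = np.array([n + p - 2 * k - 1.0 for k in range(1, p + 1)])  # Stein: c_k = n + p - 2k - 1

def estimate(X, c):
    # estimator (22) for given constants c_1, ..., c_p
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(s - c / s) @ Vt

def frobenius_risk(M, c):
    # Monte Carlo estimate of E || Mhat - M ||_F^2
    total = 0.0
    for _ in range(reps):
        X = M + rng.normal(size=(n, p))
        total += np.sum((estimate(X, c) - M) ** 2)
    return total / reps

# Setting of the left panel of Fig. 1: sigma_2(M) = sigma_3(M) = 0, sigma_1(M) varying
for sigma1 in [0.0, 5.0, 10.0, 20.0]:
    M = np.zeros((n, p))
    M[0, 0] = sigma1                 # rank-one mean with a single nonzero singular value
    print(sigma1, frobenius_risk(M, c_em), frobenius_risk(M, c_st))
```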

Fig. 1 Frobenius risk of the Efron–Morris estimator (dashed) and Stein’s estimator (solid) for \(n=10\) and \(p=3\). Left: \(\sigma _2(M)=\sigma _3(M)=0\). Right: \(\sigma _1(M)=20\), \(\sigma _3(M)=0\)

Fig. 2 Eigenvalues of the matrix quadratic risk of the Efron–Morris estimator (dashed) and Stein’s estimator (solid) for \(n=10\) and \(p=3\). Left: \(\sigma _2(M)=\sigma _3(M)=0\). Right: \(\sigma _1(M)=20\), \(\sigma _3(M)=0\). In the left panel, the second and third eigenvalues of each estimator almost overlap. In the right panel, the first eigenvalues of the two estimators almost overlap

Fig. 3 Frobenius risk of the positive-part Efron–Morris estimator (dashed) and positive-part Stein’s estimator (solid) for \(n=10\) and \(p=3\). Left: \(\sigma _2(M)=\sigma _3(M)=0\). Right: \(\sigma _1(M)=20\), \(\sigma _3(M)=0\)

Fig. 4 Eigenvalues of the matrix quadratic risk of the positive-part Efron–Morris estimator (dashed) and positive-part Stein’s estimator (solid) for \(n=10\) and \(p=3\). Left: \(\sigma _2(M)=\sigma _3(M)=0\). Right: \(\sigma _1(M)=20\), \(\sigma _3(M)=0\). In the left panel, the second and third eigenvalues of each estimator almost overlap. In the right panel, the first eigenvalues of the two estimators almost overlap

Figure 2 plots the three eigenvalues \(\lambda _1 \ge \lambda _2 \ge \lambda _3\) of the matrix quadratic risk of Stein’s estimator and the Efron–Morris estimator in the same settings as in Fig. 1. Since all eigenvalues are less than \(n=10\), the matrix quadratic risk \(R(M,\hat{M})\) satisfies \(R(M,\hat{M}) \preceq n I_p\) for every M. Thus, both estimators dominate the maximum-likelihood estimator under the matrix quadratic loss, which is compatible with (24) and Proposition 9. Also, each eigenvalue for Stein’s estimator is smaller than the corresponding one for the Efron–Morris estimator, which suggests that Stein’s estimator dominates the Efron–Morris estimator even under the matrix quadratic loss. Developing a rigorous theory for this observation is an interesting direction for future work.

Figures 3 and 4 present the results for the positive-part estimators in the same settings as Figs. 1 and 2, respectively. They show qualitatively the same behavior.