1 Introduction

The problem of estimating the probability density \( f_X: \mathcal {X} \rightarrow \mathbb {R}\) of a random variable \( X \) in \( \mathcal {X} \), given a set of data samples \( \{ x_i \}_{i=1}^N \) drawn from \( f_X \), has received significant attention in recent years [1,2,3,4,5]. The applications range from purely generative purposes [5, 6] through classical image restoration problems [7,8,9] to medical image reconstruction [10,11,12]. This is a challenging problem in high dimensions (e.g., for images of size \( M \times N \), i.e., \( \mathcal {X} = \mathbb {R}^{M \times N} \)), due to extremely sparsely populated regions [13]. A fruitful approach is to estimate the density at different times when undergoing a diffusion process [3, 5]. Intuitively, the diffusion equilibrates high- and low-density regions over time, thus easing the estimation problem.

Let \( Y_t \) (carelessly) denote the random variable whose distribution is defined by diffusing \( f_X \) for some time \( t \). We denote the density of \( Y_t \) by \( f_Y(\,\cdot \,, t) \), which fulfills the diffusion partial differential equation (PDE) \( (\partial _t - \Delta _1)f_Y(\,\cdot \,, t) = 0 \) with initial condition \( f_Y(\,\cdot \,, 0) = f_X \). The empirical Bayes theory [14] provides a machinery for reversing the diffusion PDE: Given an instantiation of the random variable \( Y_t \), the Bayesian least squares estimate of \( X \) can be expressed solely using \( f_Y(\,\cdot \,, t) \). Importantly, this holds for all positive \( t \), as long as \( f_Y \) is properly constructed.

In practice, we wish to have a parametrized, trainable model of \( f_Y \), say \( f_\theta \) where \( \theta \) is a parameter vector, such that \( f_Y(x, t) \approx f_\theta (x, t) \) for all \( x \in \mathcal {X} \) and all \( t \in [0, \infty ) \). Recent choices [3, 15] for the family of functions \( f_\theta (\,\cdot \,, t) \) were of a practical nature: Instead of an analytic expression for \( f_\theta \) at any time \( t \), authors proposed a time-conditioned network in the hope that it can learn to behave as if it had undergone the diffusion PDE. Further, instead of worrying about the normalization \( \int _\mathcal {X} f_Y(\,\cdot \,, t) = 1 \) for all \( t \in [0, \infty ) \), they usually directly estimate the score \( -\nabla _{\hspace{-2.22214pt}1} \log f_Y(\,\cdot \,, t): \mathcal {X} \rightarrow \mathcal {X} \) with some network \( s_\theta (\,\cdot \,, t): \mathcal {X} \rightarrow \mathcal {X} \). This has the advantage that normalization constants vanish, but usually the constraint \( \partial _j (s_\theta (\,\cdot \,, t))_i = \partial _i (s_\theta (\,\cdot \,, t))_j \) is not enforced in the architecture of \( s_\theta \). Thus, \( s_\theta (\,\cdot \,, t) \) is in general not the gradient of a scalar function (the negative log-density it claims to model).

In contrast to this line of work, in this paper we pursue a more principled approach. Specifically, we leverage products of Gaussian mixture model (GMM) experts to model the distribution of responses of transformations acting on natural images. Here, an expert is a one-dimensional distribution modeling certain characteristics of the random variable \( Y_t \) (the terminology is borrowed from [1]). In particular, we derive conditions under which \( f_Y(\,\cdot \,, t) \) can be expressed analytically from \( f_Y(\,\cdot \,, 0 ) \). We call our model product of Gaussian mixture diffusion model (PoGMDM) to reflect the building blocks: products of GMM experts and diffusion. The conditions arising for a model acting on filter responses naturally lead to the consideration of the wavelet transformation. Analyzing the conditions for a convolutional model naturally leads to the shearlet transformation. Thus, we present three models that utilize transformations that are extremely popular in image processing: linear filters, the wavelet transformation and the shearlet transformation. To the best of our knowledge, this paper is the first to propose strategies to learn patch-based and convolutional priors in a unified framework.

1.1 Contributions

This paper constitutes an extension to our previous conference publication [7], in which we introduced the idea of explicit diffusion models and showed preliminary numerical results. In this work, we present two additional explicit diffusion models and derive conditions under which they fulfill the diffusion PDE. In particular, the derived conditions naturally lead to models that can leverage transformations that are well known and popular in image processing: wavelets and shearlets. For all models, we show how the associated transformation, along with the diffusion model for the density, can be learned. We provide numerical results for denoising and extend our analysis regarding noise level estimation and blind heteroscedastic denoising. Our contributions can be summarized as follows:

  • We derive conditions under which products of GMM experts acting on filter, wavelet, and shearlet responses obey the diffusion PDE.

  • We show how the GMM experts, along with the corresponding transformations, can be learned and provide algorithms to solve the arising sub-problems.

  • We evaluate the learned models on the prototypical image restoration problem: denoising.

Code for training, validation, and visualization, along with pre-trained models is available at https://github.com/VLOGroup/PoGMDM.

This paper is organized as follows: In Sect. 2, we give background information on diffusion and how it can be used for parameter estimation of learned densities. This section also encompasses an overview of related work. In Sect. 3, we introduce the backbone of our models and derive conditions under which they obey the diffusion PDE. We demonstrate the practical applicability of our models in Sect. 4 with numerical experiments. We explore alternative parametrizations and possible extensions of our models in Sect. 5 and finally conclude the paper, providing future research directions, in Sect. 6.

1.2 Notation and Preliminaries

For the sake of simplicity, throughout this article, we assume that all distributions admit a density with respect to the Lebesgue measure, although the numerical experiments only assume access to an empirical distribution. Thus, we use the terms distribution and density interchangeably. In Sect. 3, we define normal distributions that are supported on a subspace (e.g., the zero-mean subspace \(\{ x \in \mathbb {R}^n: \langle \mathbbm {1}_{\mathbb {R}^n}, x \rangle _{\mathbb {R}^n} = 0 \}\)). In this case, we restrict our analysis to the support, which is theoretically supported by the disintegration theorem [16]. We use the symbols \( \mathbb {R}_{+} \) and \( \mathbb {R}_{++} \) to denote the non-negative real numbers \( \{ x \in \mathbb {R}: x \ge 0 \} \) and positive real numbers \( \{ x \in \mathbb {R}: x > 0 \} \), respectively. We denote with \( \langle \,\cdot \,, \,\cdot \,\rangle _{\mathbb {R}^n} {: \mathbb {R}^n \times \mathbb {R}^n \rightarrow \mathbb {R}: (x, y) \mapsto \sum _{i=1}^n x_i y_i} \) the standard inner product in the Euclidean space \( \mathbb {R}^n \), and with \( \Vert \,\cdot \,\Vert ^2: \mathbb {R}^n \rightarrow \mathbb {R}_+ \) the map \( x \mapsto \langle x, x\rangle _{\mathbb {R}^n}\). In addition, \( (\,\cdot \,\otimes \,\cdot \,): \mathbb {R}^n \times \mathbb {R}^n \rightarrow \mathbb {R}^{n \times n} \) is the standard outer product in \( \mathbb {R}^n \): \( (x \otimes y)_{ij} = x_i y_j \). \({\text {conj}}\) denotes element-wise complex conjugation. Let \( \mathcal {Q} \subset \mathcal {H} \) be a (not necessarily convex) subset of a Hilbert space \( \mathcal {H} \). We define by \( {{\,\textrm{proj}\,}}_\mathcal {Q}: \mathcal {H} \rightarrow {\mathcal {H}} \) the orthogonal projection onto the set \( \mathcal {Q} \). With slight abuse of notation, we ignore that this is a multivalued map in general. \( L^2(\Omega ) \) denotes the standard Lebesgue space on a domain \( \Omega \subset \mathbb {R}^n \). Lastly, \( \textrm{Id}_{\mathcal {H}} \) and \( \mathbbm {1}_\mathcal {H} \) denote the identity map and the one-vector in \( \mathcal {H} \), respectively.

2 Background

In this section, we first emphasize the importance of diffusion in density estimation (and sampling) in high dimensions. Then, we detail the relationship between diffusing the density function, empirical Bayes, and denoising score matching [17].

2.1 Diffusion Eases Density Estimation and Sampling

Let \( f_X \) be a density on \( \mathcal {X} \subset \mathbb {R}^d\). A major difficulty in estimating \( f_X \) with parametric models is that \( f_X \) is extremely sparsely populated in high-dimensional spaces, i.e., \( d\gg 1 \). This phenomenon has many names, e.g., the curse of dimensionality or the manifold hypothesis [13]. Thus, the learning problem is difficult, since meaningful gradients are rare. Conversely, let us for the moment assume access to a model \( \tilde{f}_X \) that approximates \( f_X \) well. In general, it is still very challenging to generate a set of points \( \{ x_i \}_{i=1}^I \) such that we can confidently say that the associated empirical density \( \frac{1}{I} \sum _{i=1}^I \delta _{x_i} \) approximates \( \tilde{f}_X \) (let alone \(f_X\)) well. This is because, in general, there does not exist a procedure to directly draw samples from \( \tilde{f}_X \), and Markov chain Monte Carlo methods are prohibitively slow in practice, especially for multimodal distributions in high dimensions [3].

The diffusion PDE or heat equation

$$\begin{aligned} (\partial _t - \Delta _1) f(\,\cdot \,, t) = 0\ \text {with initial condition}\ f(\,\cdot \,, 0) = f_X \nonumber \\ \end{aligned}$$
(1)

equilibrates the density \( f_X \), thus mitigating the challenges outlined above. Here, \( \partial _t \) denotes the standard partial derivative with respect to time \( \frac{\partial }{\partial t} \) and \( \Delta _1 = {{\,\mathrm{\textrm{Tr}}\,}}\circ \nabla _{\hspace{-2.22214pt}1}^2 \) is the Laplace operator, where the \( 1 \) indicates its application to the first argument. We detail the evolution of \( f_X \) under this diffusion PDE and relations to empirical Bayes in Sect. 2.2.

Learning \( f(\,\cdot \,, t) \) for \( t \ge 0 \) is more stable since the diffusion “fills the space” with meaningful gradients [3]. Of course, this assumes that for different times \( t_1 \) and \( t_2 \), the models of \( f(\,\cdot \,, t_1) \) and \( f(\,\cdot \,, t_2) \) are somehow related to each other. As an example of this relation, the recently popularized noise-conditional score-network [3] shares convolution filters over time, but their input is transformed through a time-conditional instance normalization. In this work, we make this relation explicit by considering a family of functions \( f(\,\cdot \,, 0) \) for which \( f(\,\cdot \,, t) \) can be expressed analytically.

For sampling, \( f(\,\cdot \,, t) \) for \( t > 0 \) can help by gradually moving samples toward high-density regions of \( f_X \), regardless of initialization. To utilize this, a very simple idea with relations to simulated annealing [18] is to have a pre-defined time schedule \( t_T> t_{T-1}> \cdots> t_{{0}} > 0 \) and sample \( f(\,\cdot \,, t_i) \), \( i = T, \dotsc , 0 \) (e.g. with Langevin Monte Carlo [19]) successively [3]. In [15], instead of considering discrete time steps, the authors propose to model the sampling procedure as a continuous-time stochastic differential equation. We note that the diffusion PDE (1) on the densities corresponds to the stochastic differential equation

$$\begin{aligned} \textrm{d}X = \sqrt{2} \textrm{d}W \end{aligned}$$
(2)

on the random variables, where \( W \) is the standard Wiener process. This is known as the variance exploding stochastic differential equation in the literature.
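The annealed sampling scheme sketched above can be made concrete in a few lines. The following is a minimal sketch (not the tuned procedure of [3]): it assumes access to some function score(y, t) approximating \( \nabla _{\hspace{-2.22214pt}1} \log f(\,\cdot \,, t) \); the step size rule and the time schedule are illustrative choices.

```python
import numpy as np

def annealed_langevin(score, y0, ts, n_inner=50, eps=1e-3, rng=None):
    """Langevin Monte Carlo at a decreasing sequence of diffusion times.

    score(y, t) is assumed to approximate grad_y log f(y, t); ts is a
    decreasing schedule t_T > t_{T-1} > ... > t_0 > 0 as in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.array(y0, dtype=float)
    for t in ts:                          # anneal from large to small t
        step = eps * 2.0 * t              # scale the step with the noise level 2t
        for _ in range(n_inner):
            noise = rng.standard_normal(y.shape)
            y = y + 0.5 * step * score(y, t) + np.sqrt(step) * noise
    return y
```

Initialized from pure noise, the iterates are first attracted by the strongly smoothed density at large \( t \) and are then refined as \( t \) decreases.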

2.2 Diffusion, Empirical Bayes, and Denoising Score Matching

In this section, similar to the introduction, we again adopt the interpretation that the evolution in (1) defines the density of a random variable \( Y_t \). That is, \( Y_t \) is a random variable with probability density \( f_Y(\,\cdot \,, t) \), which fulfills \( (\partial _t - \Delta _1) f_Y(\,\cdot \,, t) = 0 \) and \( f_Y(\,\cdot \,, 0) = f_X \). It is well known that Green’s function of (1) is a Gaussian (see, e.g., [20]) with zero mean and covariance \( 2t\textrm{Id}\). In other words, for \( t > 0 \) we can write \( f_Y(\,\cdot \,, t) = G_{0,2t\textrm{Id}_{{\mathcal {X}}}} * f_X \), where

$$\begin{aligned} G_{\mu ,\Sigma }(x) = |2 \pi \Sigma |^{-1/2} \exp \bigl ( - \Vert x - \mu \Vert _{\Sigma ^{-1}}^2 / 2 \bigr ). \end{aligned}$$
(3)

Thus, the diffusion PDE constructs a (linear) scale space in the space of probability densities and we refer to \( Y_t \) (respectively, \( f_{Y_t} \)) as the smoothed random variable (respectively, density). Equivalently, in terms of the random variables, we can write \( Y_t = X + \sqrt{2t}N \) where \( N \) is a random variable with normal distribution \( \mathcal {N}(0, \textrm{Id}_{\mathcal {X}}) \). Next, we show how to estimate the corresponding instantiation of \( X \) which has “most likely” spawned an instantiation of \( Y_t \) using empirical Bayes.

In the school of empirical Bayes [14], we try to estimate a clean random variable given a corrupted instantiation, using only knowledge about the density of the random variable corresponding to the corrupted instance. In particular, for our setup we have a corruption model

$$\begin{aligned} y_t = x + \sqrt{2t} \eta , \end{aligned}$$
(4)

with \( x \sim f_X \) and \( \eta \sim \mathcal {N}(0, \textrm{Id}_{\mathcal {X}})\). It is well known that the Bayesian minimum mean-squared-error (MMSE) estimate is the conditional mean, i.e., the map \( y_t \mapsto \int x f_{X \mid Y_t}(x \mid y_t)\,\textrm{d}x \). In classical Bayes theory, such a map would be constructed by utilizing Bayes theorem, i.e., writing \( f_{X \mid Y_t} = \frac{f_{Y_t\mid X}f_{X}}{f_{Y_t}} \) and choosing an appropriate prior \( f_{X} \). However, a classical result from empirical Bayes estimation reveals that a map \( y_t \mapsto \int x f_{X \mid Y_t}(x \mid y_t)\,\textrm{d}x \) can be constructed only assuming access to the smoothed density \( f_{Y_t} \) without any reference to the prior \( f_X \). This result is known as the Miyasawa estimate [21] or Tweedie’s formula [22, 23] and we derive it here for completeness.

First, by the corruption model (4) we can write

$$\begin{aligned} f_{Y_t \mid X}(y \mid x) = |2\pi \sigma ^2\textrm{Id}|^{-\frac{1}{2}} \exp \Bigl ( -\frac{\Vert y - x\Vert ^2}{2\sigma ^2} \Bigr ), \end{aligned}$$
(5)

where we use the relation \( \sigma = \sqrt{2t} \), and thus by Bayes theorem it follows that

$$\begin{aligned} \begin{aligned} f_{Y_t}(y)&= \int f_{Y_t\mid X}(y\mid x) f_{X}(x)\,\textrm{d}x \\&= \int |2\pi \sigma ^2\textrm{Id}|^{-\frac{1}{2}} \exp \Bigl ( -\frac{\Vert y - x\Vert ^2}{2\sigma ^2} \Bigr ) f_{X}(x)\,\textrm{d}x. \end{aligned} \end{aligned}$$
(6)

Taking the gradient w.r.t. \( y \) and multiplying by \( \sigma ^2 \) yields

$$\begin{aligned} \begin{aligned}&\sigma ^2 \nabla f_{Y_t}(y) \\&\quad = \int (x - y)|2\pi \sigma ^2\textrm{Id}|^{-\frac{1}{2}} \exp \Bigl ( -\frac{\Vert y - x\Vert ^2}{2\sigma ^2} \Bigr ) f_{X}(x)\,\textrm{d}x \\&\quad = \int (x - y) f_{X, Y_t}(x, y)\,\textrm{d}x\\&\quad = \int x f_{X, Y_t}(x, y)\,\textrm{d}x - yf_{Y_t}(y) \end{aligned} \end{aligned}$$
(7)

and after dividing by \( f_{Y_t} \) it follows that

$$\begin{aligned} y + \sigma ^2 \frac{\nabla f_{Y_t}(y)}{f_{Y_t}(y)} = \int xf_{X\mid Y_t}(x\mid y)\,\textrm{d}x, \end{aligned}$$
(8)

where we used the definition of conditional densities, i.e., that \( f_{X \mid Y_t} = \frac{f_{X, Y_t}}{f_{Y_t}} \). Finally, by noting that \( \frac{\nabla f_{Y_t}}{f_{Y_t}} = \nabla \log f_{Y_t} \), the above can be rewritten as

$$\begin{aligned} y + \sigma ^2 \nabla \log f_{Y_t}(y) = \int xf_{X\mid Y_t}(x\mid y)\,\textrm{d}x. \end{aligned}$$
(9)

We refer to the work of Raphan and Simoncelli [23] for an empirical Bayes theory encompassing a more general family of corruptions. They refer to this type of estimator more generally as nonparametric empirical Bayes least squares (NEBLS).

We illustrate the idea of empirical MMSE estimation on a toy example in Fig. 1, where the data distribution consists of Dirac measures \( f_X = \sum _{i=1}^6 w_i \delta _{x_i} \) with

$$\begin{aligned} \begin{pmatrix} x_1^\top \\ \vdots \\ x_6^\top \end{pmatrix} = \begin{pmatrix} 0.588 & 0.966 \\ 0.289 & 0.112 \\ -0.313 & -0.924 \\ -0.696 & 0.990 \\ -0.906 & 0.030 \\ -0.516 & 0.039 \end{pmatrix} \text { and } w = \begin{pmatrix} 0.23 \\ 0.1 \\ 0.09 \\ 0.19 \\ 0.29 \\ 0.08 \end{pmatrix}. \end{aligned}$$
(10)

The figure illustrates that \( f_{Y_t} \) approaches a simple form as \( t \) approaches infinity. Indeed, it has been shown [24] that \( f_{Y_t} \) is log-concave for large enough \( t \), and \(-\log f_{Y_t}\) approaches a quadratic function.
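For this toy example, \( f_{Y_t} \) is available in closed form: diffusing the weighted Dirac measures for time \( t \) yields a GMM with means \( x_i \), weights \( w_i \), and isotropic covariance \( 2t\textrm{Id} \). The following minimal sketch evaluates the score of \( f_{Y_t} \) and the empirical Bayes map from (9) for this example; all function names are ours.

```python
import numpy as np

xs = np.array([[0.588, 0.966], [0.289, 0.112], [-0.313, -0.924],
               [-0.696, 0.990], [-0.906, 0.030], [-0.516, 0.039]])
ws = np.array([0.23, 0.1, 0.09, 0.19, 0.29, 0.08])

def score(y, t):
    """grad_y log f_{Y_t}(y) for f_{Y_t} = sum_i w_i N(x_i, 2t Id), cf. (10)."""
    var = 2.0 * t
    d = y[None, :] - xs                              # differences y - x_i
    log_w = np.log(ws) - np.sum(d**2, axis=1) / (2 * var)
    resp = np.exp(log_w - log_w.max())
    resp /= resp.sum()                               # posterior component weights
    return (resp[:, None] * (-d / var)).sum(axis=0)

def empirical_bayes(y, t):
    """Tweedie/Miyasawa estimate y + 2t grad log f_{Y_t}(y), cf. (9)."""
    return y + 2.0 * t * score(y, t)

print(empirical_bayes(np.array([0.0, 0.0]), t=0.078))
```

This is exactly the vector field visualized by the arrows in Fig. 1.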

Fig. 1 Diffusion of an empirical density consisting of the weighted Dirac measures \( \sum _{i=1}^6 w_i \delta _{x_i} \) specified in (10) at times \( t \in \{ 0.008, 0.078, 0.3 \} \). The arrows show the empirical Bayes estimate \( y \mapsto y + 2t \nabla \log {f}_{Y_t}(y) \). As \( t \) approaches infinity, \( {f}_{Y_t} \) becomes log-concave and \(-\log f_{Y_t}\) approaches a quadratic function

Recently, (9) has been used for parameter estimation [15, 17]: Let \( \{ x_i \}_{i=1}^I \) be a dataset of \( I \) samples drawn from \( f_X \) and let \( Y_t \) be governed by diffusion. Additionally, let \( f_\theta : \mathcal {X} \times [0, \infty ) \rightarrow \mathbb {R}_+ \) denote a parametrized model for which we wish that \( f_\theta (\,\cdot \,, t) \approx f_{Y_t} \), for all \( t > 0 \). Then, both the left- and right-hand side of (9) are known—in expectation. This naturally leads to the loss function

$$\begin{aligned} \min _{\theta \in \Theta } \int _{(0, \infty )} \mathbb {E}_{(x, y_t) \sim f_{X, Y_t}} \Vert x - y_t - \sigma ^2(t) \nabla _{\hspace{-2.22214pt}1} \log f_\theta (y_t, t)\Vert ^2 \,\textrm{d}t \nonumber \\ \end{aligned}$$
(11)

for estimating \( \theta \) such that \( f_\theta (\,\cdot \,, t) \approx f_{Y_t} \) for all \( t > 0 \). Here, \( f_{X, Y_t} \) denotes the joint distribution of the clean and smoothed random variables and \( \Theta \) describes the set of feasible parameters. This learning problem is known as denoising score matching [15, 17, 25].
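To make the learning problem concrete, the following minimal sketch evaluates one stochastic estimate of (11). It assumes some callable score_model(y, t) returning \( \nabla _{\hspace{-2.22214pt}1} \log f_\theta (y, t) \) (e.g., one of the PoGMDM variants introduced in Sect. 3); drawing \( \sigma = \sqrt{2t} \) uniformly mirrors the training setup described in Sect. 4.1.

```python
import numpy as np

def dsm_loss(score_model, x_batch, rng, sigma_max=0.4):
    """One Monte Carlo estimate of the denoising score matching loss (11)."""
    sigma = rng.uniform(0.0, sigma_max)                         # sigma = sqrt(2t)
    t = sigma**2 / 2.0
    y = x_batch + sigma * rng.standard_normal(x_batch.shape)    # corruption (4)
    residual = x_batch - y - sigma**2 * score_model(y, t)       # cf. (9)
    return np.mean(np.sum(residual.reshape(len(x_batch), -1)**2, axis=1))
```

Averaging this estimate over many batches and noise draws approximates the integral in (11).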

3 Methods

In this section, we first introduce one-dimensional GMMs as the backbone of our model and recall some properties that are needed for the analysis in the following subsections. Then, we detail how we can utilize PoGMDMs based on filter, wavelet, and shearlet responses to model the distribution of natural images. For all models, we present assumptions under which they obey the diffusion PDE.

The backbone of our models is the one-dimensional GMM expert \( \psi _j: \mathbb {R}\times \triangle ^L \times [0, \infty ) \rightarrow \mathbb {R}_+ \) with \( L \) components of the form

$$\begin{aligned} \psi _j(x,w_j,t) = \sum _{l=1}^{L} w_{jl} G_{\mu _l,\sigma _j^2(t)}(x). \nonumber \\ \end{aligned}$$
(12)

The weights of each expert \( w_j = (w_{j1}, \dotsc , w_{jL})^\top \) must satisfy the unit simplex constraint, i.e., \( w_j \in \triangle ^L \), \( \triangle ^L = \{ x \in \mathbb {R}^L: x \ge 0, \langle \mathbbm {1}_{\mathbb {R}^L}, x \rangle _{\mathbb {R}^L} = 1 \} \). Although not necessary, we assume for simplicity that all experts \( \psi _j \) have the same number of components and the discretization of their means \( \mu _l \) over the real line is shared and fixed a priori (for details see Sect. 4.1).

The main contribution of our work is to show that, under certain assumptions, it suffices to adapt the variances of the individual experts to implement the diffusion of a model built through multiplying experts of the form (12). In detail, we show that the variance \( \sigma _j^2: [0, \infty ) \rightarrow \mathbb {R}_+ \) of the \( j \)-th expert can be modeled as

$$\begin{aligned} \sigma _j^2(t) = \sigma _0^2 + c_j 2t, \end{aligned}$$
(13)

where \( \sigma _0 {> 0} \) is chosen a priori to support the uniform discretization of the means \( \mu _l \) and \( c_j \in \mathbb {R}_{++} \) are derived from properties of the product model such that it obeys the diffusion PDE (1).
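To fix ideas, the following minimal sketch evaluates one expert of the form (12) with the time-dependent variance (13); the equidistant means and the choice \( \sigma _0 = 2\eta _j / (L-1) \) follow Sect. 4.1, and the uniform weights are only a placeholder for learned ones.

```python
import numpy as np

def expert(x, w, t, mus, sigma0, c):
    """One-dimensional GMM expert psi_j(x, w_j, t) from (12) with
    variance sigma_j^2(t) = sigma0^2 + c_j * 2t as in (13)."""
    var = sigma0**2 + c * 2.0 * t
    gauss = np.exp(-(x[..., None] - mus)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return gauss @ w                                   # mix the L components

L = 125
mus = np.linspace(-1.0, 1.0, L)                        # shared, fixed means
sigma0 = 2.0 / (L - 1)                                 # supports the discretization
w = np.ones(L) / L                                     # placeholder simplex weights
vals = expert(np.linspace(-1.0, 1.0, 5), w, t=0.05, mus=mus, sigma0=sigma0, c=1.0)
```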

In the following subsections we exploit two well-known properties of GMMs to derive how to adapt the variance \( \sigma _j^2(t) \) of each expert with diffusion time \( t \), such that the product model obeys the diffusion PDE: First, up to normalization, the product of GMMs is again a GMM (see, e.g., [26]). This allows us to work on highly expressive models that enable efficient evaluations due to factorization. Second, we use the fact that there exists an analytical solution to the diffusion PDE if \( f_X \) is a GMM: Green’s function associated with the linear isotropic diffusion PDE (1) is a Gaussian with isotropic covariance \( 2t \textrm{Id} \). Due to the linearity of the convolution, it suffices to analyze the convolution of individual components of the product model, which is just the convolution of two Gaussians. Using previous notation, if \( X \) is a random variable with normal distribution \( \mathcal {N}(\mu _X, \Sigma _X) \), then \( Y_t \) follows the distribution \( \mathcal {N}(\mu _X, \Sigma _X + 2t\textrm{Id}) \). In particular, the mean remains unchanged and it suffices to adapt the covariance matrix with the diffusion time.

In what follows, we discuss three product models whose one-dimensional GMM experts act on filter, wavelet, and shearlet responses. In particular, we present conditions under which the diffusion of the product model can be implemented by adapting the variances of the one-dimensional GMM experts. For all three models, we give an analytic expression for the constants \( c_j \) in (13).

3.1 Patch Model

In this section, we approximate the distribution of image patches \( p \in \mathbb {R}^a \) of size \( a = b\times b \) by a product of \( J \in \mathbb {N} \) GMM experts acting on filter responses. In detail, the model is of the form

$$\begin{aligned} f^{\textrm{filt}}_\theta (p, t) = Z(\{ k_j \}_{j=1}^J{, \sigma _0, t})^{-1}\prod _{j=1}^J \psi _j(\langle k_j, p \rangle _{\mathbb {R}^a}, w_j, t). \nonumber \\ \end{aligned}$$
(14)

Each GMM expert \( \psi _j: \mathbb {R}\times \triangle ^L \times [0, \infty ) \rightarrow \mathbb {R}_+ \) for \( j=1,\ldots ,J \) models the distribution of filter responses \( \mathbb {E}_{p_t\sim f_{Y_t}} {\bigl [} \delta (\,\cdot \,- \langle k_j, p_t \rangle _{\mathbb {R}^a}) {\bigr ]} \) of the associated filters \( k_j \in \mathbb {R}^a \) for all \( t > 0 \). We denote with \( Z(\{ k_j \}_{j=1}^J{,\sigma _0, t}) \) the partition function such that \( f^{\textrm{filt}}_\theta \) is properly normalized. In this model, the learnable parameters can be summarized as \(\theta = \{ (k_j, w_j) \}_{j=1}^J \).

First, the following theorem establishes the exact form of (14) as a GMM on \( \mathbb {R}^a \). The covariance matrix and the means are endowed with the subscript \( \mathbb {R}^a \) to emphasize that the resulting GMM models patches of this size; the models based on wavelet and shearlet responses discussed later (Sects. 3.2 and 3.3, respectively) can be applied to images of arbitrary size, which we emphasize by using the subscript \( \mathbb {R}^n \). We denote with \( \hat{l}: \{ 1, \dotsc , J \} \rightarrow \{ 1, \dotsc , L \} \) a fixed but arbitrary selection from the index set \( \{ 1, \dotsc , L \} \).

Theorem 1

\( f^{\textrm{filt}}_\theta (\,\cdot \,, 0) \) is a homoscedastic GMM on \( \mathbb {R}^a \) with \( L^J \) components and precision matrix

$$\begin{aligned} (\Sigma _{\mathbb {R}^a})^{-1} = \frac{1}{\sigma _0^2} \sum _{j=1}^J (k_j \otimes k_j). \end{aligned}$$
(15)

The mean of the component identified by the choice of the index map \( \hat{l} \) has the form

$$\begin{aligned} \mu _{\mathbb {R}^a,\hat{l}} = \Sigma _{\mathbb {R}^a} \sum _{j=1}^J k_j\mu _{\hat{l}(j)}. \end{aligned}$$
(16)

Proof

By definition,

$$\begin{aligned} \prod _{j=1}^J \psi _j(\langle k_j, p \rangle _{\mathbb {R}^a}, w_{j}, 0) = \prod _{j=1}^J \sum _{l=1}^{L} \frac{w_{jl}}{\sqrt{2\pi \sigma _0^2}} \exp \left( -\frac{1}{2\sigma _0^2}{(\langle k_j, p \rangle _{\mathbb {R}^a} - \mu _l)}^2 \right) . \end{aligned}$$
(17)

The general component of the above is uniquely identified by the choice of the map \( \hat{l} \) as

$$\begin{aligned} \begin{aligned}&(2\pi \sigma _0^2)^{-\frac{J}{2}} \biggl ( \prod _{j=1}^J w_{j\hat{l}(j)} \biggr ) \\ {}&\times \exp \Biggl ( -\frac{1}{2\sigma _0^2} \sum _{j=1}^J (\langle k_j, p \rangle _{\mathbb {R}^a} - \mu _{\hat{l}(j)})^2 \Biggr ). \end{aligned} \end{aligned}$$
(18)

To find \( (\Sigma _{\mathbb {R}^a})^{-1} \), we match the gradient of the familiar quadratic form: Motivated by \( \nabla _{\hspace{-2.22214pt}p} \Vert p - \mu _{\mathbb {R}^a,\hat{l}}\Vert ^2_{\Sigma _{\mathbb {R}^a}^{-1}} / 2 = \Sigma _{\mathbb {R}^a}^{-1} (p - \mu _{\mathbb {R}^a,\hat{l}}) \) we find that \( \nabla _{\hspace{-2.22214pt}p} \bigl ( \frac{1}{2\sigma _0^2} \sum _{j=1}^J (\langle k_j, p \rangle _{\mathbb {R}^a} - \mu _{\hat{l}(j)})^2 \bigr ) = \frac{1}{\sigma _0^2} \sum _{j=1}^J \bigl ( (k_j \otimes k_j) p - k_j \mu _{\hat{l}(j)}\bigr ) \). From the first term, we immediately identify \( (\Sigma _{\mathbb {R}^a})^{-1} = \frac{1}{\sigma _0^2} \sum _{j=1}^J (k_j \otimes k_j) \), and we find \( \mu _{\mathbb {R}^a, \hat{l}} \) by left-multiplying \( \Sigma _{\mathbb {R}^a} \) onto \( \sum _{j=1}^J k_j \mu _{\hat{l}(j)} \). \(\square \)

The next theorem establishes a tractable analytical expression for the diffusion process under the assumption of pair-wise orthogonal filters, that is

$$\begin{aligned} \langle k_j, k_i \rangle = {\left\{ \begin{array}{ll} 0 & \text {if}\ i \ne j, \\ \Vert k_j\Vert ^2 & \text {else}, \end{array}\right. } \text { for all } i, j \in \{ 1,\dotsc ,J \}. \end{aligned}$$
(19)

Theorem 2

(Patch diffusion) Under assumption (19), \( f^{\textrm{filt}}_\theta (\,\cdot \,, t) \) satisfies the diffusion PDE \( (\partial _t - \Delta _1) f^{\textrm{filt}}_\theta (\,\cdot \,, t) = 0 \) if \( \sigma _j^2(t) = \sigma _0^2 + \Vert k_j\Vert ^2 2t \).

Proof

Assuming (19), the Eigendecomposition of the precision matrix can be trivially constructed. In particular, \( (\Sigma _{\mathbb {R}^a})^{-1} = \sum _{j=1}^J \frac{\Vert k_j\Vert ^{2}}{\sigma _0^2} (\frac{k_j}{\Vert k_j\Vert } \otimes \frac{k_j}{\Vert k_j\Vert }) \), hence \( \Sigma _{\mathbb {R}^a} = \sum _{j=1}^J \frac{\sigma _0^2}{\Vert k_j\Vert ^{2}} (\frac{k_j}{\Vert k_j\Vert } \otimes \frac{k_j}{\Vert k_j\Vert }) \). As discussed in Sect. 2.2, \( \Sigma _{\mathbb {R}^a} \) evolves as \( \Sigma _{\mathbb {R}^a} \mapsto \Sigma _{\mathbb {R}^a} + 2t\textrm{Id}_{\mathbb {R}^a} \) under diffusion. Equivalently, on the level of Eigenvalues, \( \frac{\sigma _0^2}{\Vert k_j\Vert ^{2}} \mapsto \frac{\sigma _0^2 + 2t\Vert k_j\Vert ^{2}}{\Vert k_j\Vert ^{2}} \) for all \( j = 1, \ldots , J \). Recall that \( \sigma _0^2 \) is just \( \sigma _j^2(0) \). Thus, \( f^{\textrm{filt}}_\theta (\,\cdot \,,t) \) satisfies the diffusion PDE if \( \sigma _j^2(t) = \sigma _0^2 + \Vert k_j\Vert ^2 2t \). \(\square \)

Corollary 1

With assumption (19), the experts \( \psi _j(\,\cdot \,, w_j, t) \) in (14) model the marginal distribution of the random variable \( U_{j, t} = \langle k_j, Y_t \rangle \). In addition, \({f^{\text {filt}}_{\theta }} \) is normalized when \( Z(\{ k_j \}_{j=1}^J{, \sigma _0, t}) = \Bigl ((2\pi )^a \prod _{j=1}^J \frac{\sigma _0^2 + 2t\Vert k_j\Vert ^2}{\Vert k_j\Vert ^2}\Bigr )^{\frac{1}{2}} \).

Proof

We first show that \( \psi _j(\,\cdot \,, w_j, t) \) model the marginal distribution of the random variable \( U_{j, t} = \langle k_j, Y_t \rangle \). Consider one component of the resulting homoscedastic GMM: \( \hat{Y}_t \sim \mathcal {N}(\mu _{\mathbb {R}^a,\hat{l}}, \Sigma _{\mathbb {R}^a} + 2t\textrm{Id}_{\mathbb {R}^a}) \). The distribution of \( \hat{U}_{j, t} = \langle k_j, \hat{Y}_t \rangle \) is \( \hat{U}_{j, t} \sim \mathcal {N}(k_j^\top \mu _{\mathbb {R}^a,\hat{l}}, k_j^\top (\Sigma _{\mathbb {R}^a} + 2t\textrm{Id}_{\mathbb {R}^a}) k_j) \) (see e.g. [27] for a proof). Under our orthogonality assumption, this simplifies to \( \mathcal {N}(\mu _{\hat{l}(j)}, \sigma _0^2 + 2t\Vert k_j\Vert ^2) \). The claim follows from the linear combination of the different components.

The normalization constant is the classical normalization of a Gaussian, which requires the pseudo-determinant of \( \Sigma _{\mathbb {R}^a} \) [16]. The pseudo-determinant is easily calculated by the product of the Eigenvalues outlined in Theorem 2. \(\square \)
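The score of (14) decomposes over the experts, \( \nabla _{\hspace{-2.22214pt}1} \log f^{\textrm{filt}}_\theta (p, t) = \sum _{j=1}^J k_j\, (\log \psi _j)'(\langle k_j, p \rangle _{\mathbb {R}^a}, w_j, t) \), since the partition function does not depend on \( p \). The following minimal sketch uses this, together with \( c_j = \Vert k_j\Vert ^2 \) from Theorem 2, for one empirical Bayes denoising step (9) on a patch; the filters, weights, and means are assumed to be given (e.g., learned as in Sect. 4), and the function names are ours.

```python
import numpy as np

def dlog_expert(u, w, t, mus, sigma0, c):
    """Derivative of log psi_j from (12) w.r.t. its scalar argument u."""
    var = sigma0**2 + c * 2.0 * t
    g = np.exp(-(u - mus)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return (g * (mus - u) / var) @ w / (g @ w)

def denoise_patch(y, t, ks, weights, mus, sigma0):
    """One empirical Bayes step y + 2t grad log f_filt(y, t), cf. (9) and (14).

    ks: list of pairwise orthogonal filters k_j (flattened patches),
    weights: list of simplex weight vectors w_j."""
    grad = np.zeros_like(y)
    for k, w in zip(ks, weights):
        u = k @ y                           # filter response <k_j, y>
        c = k @ k                           # c_j = ||k_j||^2 (Theorem 2)
        grad += k * dlog_expert(u, w, t, mus, sigma0, c)
    return y + 2.0 * t * grad
```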

3.2 Wavelet Model

The key ingredient in the previous section was the orthogonality of the filters. In other words, the filter bank \( \{ k_j \}_{j=1}^J \) forms an orthogonal (not necessarily orthonormal) basis for (a subspace of) \( \mathbb {R}^a \). In this section, we discuss the application of explicit diffusion models in another well-known orthogonal basis: wavelets. In what follows, we briefly discuss the main concepts of the discrete wavelet transformation needed for our purposes. For the sake of simplicity, we stick to the one-dimensional case but note that the extension to two dimensions is straightforward, see e.g. [28, Chapter 4.4]. The following is largely adapted from [28]; we refer the reader to this and to [29, 30] for information on the extension to two-dimensional signals as well as efficient implementations using the fast wavelet transformation.

3.2.1 The Discrete Wavelet Transformation

Let \( \omega \in L^2(\mathbb {R}) \) be a wavelet satisfying the admissibility condition

$$\begin{aligned} 0< \int _0^\infty \frac{|(\mathcal {F}\omega )(\zeta )|^2}{\zeta }\,\textrm{d}\zeta < \infty . \end{aligned}$$
(20)

The set of functions

$$\begin{aligned} \left\{ \omega _{j, k} = 2^{-j/2} \omega (2^{-j}\,\cdot \,- k): j, k \in \mathbb {Z} \right\} \end{aligned}$$
(21)

forms an orthonormal basis of \( L^2(\mathbb {R}) \) under certain conditions that we now recall. Let \( (V_j)_{j\in \mathbb {Z}} \) be a multiscale analysis with generator or scaling function \( \phi \in V_0 \), i.e., \( \{ T_k \phi : k \in \mathbb {Z} \} \) forms an orthonormal basis of \( V_0 \) (\( T_k \) is a translation operator \( (T_k \phi )(x) = \phi (x + k) \)). The scaling property

$$\begin{aligned} u \in V_j \iff D_{1/2}u\in V_{j+1} \quad ((D_s u)(x) = u(sx)) \end{aligned}$$
(22)

of the multiscale analysis \( (V_j)_{j\in \mathbb {Z}} \) implies that the functions \( \phi _{{j}, k} = 2^{-j/2}\phi (2^{-j}\,\cdot \,- k) \), \( k \in \mathbb {Z} \) form an orthonormal basis of \( V_j \). Further, the scaling property implies that \( \phi \in V_{-1} \) and since \( \phi _{-1, k} \) form an orthonormal basis of \( V_{-1} \), we have that

$$\begin{aligned} \phi (x) = \sqrt{2}\sum _{k\in \mathbb {Z}} h_k \phi (2x - k) \end{aligned}$$
(23)

with \( h_k = \langle \phi , \phi _{-1, k} \rangle _{L^2(\mathbb {R})} \). We define the detail or wavelet spaces \( W_j \) as the orthogonal complements of the approximation spaces \( V_j \) in \( V_{j-1} \), i.e.,

$$\begin{aligned} V_{j-1} = V_j \oplus W_j, \quad V_j \perp W_j. \end{aligned}$$
(24)

From this it follows that \( V_j = \bigoplus \limits _{m\ge j+1} W_m \) and, due to the completeness of \( V_j \), that \( \bigoplus \limits _{m \in \mathbb {Z}} W_m = L^2(\mathbb {R}) \). By the orthogonality, we have that \( {{\,\textrm{proj}\,}}_{V_{j-1}} = {{\,\textrm{proj}\,}}_{V_j} + {{\,\textrm{proj}\,}}_{W_j} \) and hence \( {{\,\textrm{proj}\,}}_{W_j} = {{\,\textrm{proj}\,}}_{V_{j-1}} - {{\,\textrm{proj}\,}}_{V_j} \). Thus, any \( u \in L^2(\mathbb {R}) \) can be represented as

$$\begin{aligned} u = \sum _{j\in \mathbb {Z}} {{\,\textrm{proj}\,}}_{W_j} u = {{\,\textrm{proj}\,}}_{V_m} u + \sum _{j\le m}{{\,\textrm{proj}\,}}_{W_j} u \end{aligned}$$
(25)

justifying the name multiscale analysis. Then (see [28, Theorem 4.67] for details) \( \omega \in V_{-1} \) defined by

$$\begin{aligned} \omega (x) = \sqrt{2}\sum _{k\in \mathbb {Z}} (-1)^{k} h_{1-k} \phi (2x-k) \end{aligned}$$
(26)

is a wavelet, \( \{ \omega _{j,k}: k \in \mathbb {Z} \} \) is an orthonormal basis of \( W_j \) and in particular the construction (21) is an orthonormal basis of \( L^2(\mathbb {R}) \).

3.2.2 Modeling Wavelet Coefficients

In this section, we describe how we can utilize a product of GMM experts to model the distribution of wavelet responses. For the subsequent analysis, first observe that by (24) the detail spaces (and the approximation spaces) are orthogonal. Utilizing the shorthand notation

$$\begin{aligned} \mathcal {W}_j = {{\,\textrm{proj}\,}}_{W_j}, \end{aligned}$$
(27)

since \( \mathcal {W}_j \) is an orthogonal projection, it satisfies the properties

$$\begin{aligned}&\text {(self-adjoint)}&(\mathcal {W}_j)^*&= \mathcal {W}_j, \\&\text {(idempotency)}&\mathcal {W}_j \circ \mathcal {W}_j&= \mathcal {W}_j,\ \text {and} \\&\text {(identity on subspace)}&\mathcal {W}_j|_{W_j}&= \textrm{Id}_{W_j} \end{aligned}$$
(28)

where \( \mathcal {W}_j|_{W_j} \) denotes the restriction of \( \mathcal {W}_j \) to \( W_j \).

As in the previous section, we model the wavelet responses with Gaussian mixture experts. In detail, let \( x \) be a signal in \( \mathbb {R}^n \) and thus \( \mathcal {W}_j: \mathbb {R}^n \rightarrow \mathbb {R}^n \). Then, the model reads

$$\begin{aligned} f^{\textrm{wave}}_\theta (x, t) \propto \prod _{j=1}^{J} \prod _{i=1}^{n} \psi _j ((\mathcal {W}_j x)_{i}, w_j, t). \end{aligned}$$
(29)

Following the approach utilized in Theorem 2, we first describe the exact form of (29) as a GMM on \( \mathbb {R}^n \). We denote with \( \hat{l}: \{1, \dotsc , n \} \rightarrow \{ 1, \dotsc , L \} \) a fixed but arbitrary selection from the index set \( \{ 1, \dotsc , L \} \). In addition, the notation \( \sum _{\hat{l} = 1}^{L^n} \) indicates the summation over all \( L^n \) possible selections and for the following proof we define \( \mu _{\mathbb {R}^n}(\hat{l})\, {{:}{=}}\, (\mu _{\hat{l}(1)}, \mu _{\hat{l}(2)},\dotsc ,\mu _{\hat{l}(n)})^\top \in \mathbb {R}^n \).

Theorem 3

\( f^{\textrm{wave}}_\theta (\,\cdot \,, t) \) is a homoscedastic GMM on \( \mathbb {R}^n \) with precision matrix

$$\begin{aligned} (\Sigma _{\mathbb {R}^n})^{-1} = \frac{1}{\sigma _0^2} \sum _{j=1}^J \mathcal {W}_j. \end{aligned}$$
(30)

Proof

By definition for \( t = 0\), we have

$$\begin{aligned} f^{\textrm{wave}}_\theta (x, 0) \propto \prod _{i=1}^n \prod _{j=1}^J \sum _{l=1}^{L} \frac{w_{jl}}{\sqrt{2\pi \sigma _0^2}} \exp \left( -\frac{((\mathcal {W}_j x)_i - \mu _l)^2}{2\sigma _0^2}\right) . \end{aligned}$$
(31)

First, we expand the product over the pixels

$$\begin{aligned} \begin{aligned}&f^{\textrm{wave}}_\theta (x, 0) \propto \prod _{j=1}^J \sum _{{\hat{l}} = 1}^{L^n} (2\pi \sigma _0^2)^{-\frac{n}{2}} \overline{w}_{j{\hat{l}}} \\&\quad \times \exp \left( -\frac{\Vert (\mathcal {W}_j x) - \mu _{\mathbb {R}^n}(\hat{l})\Vert ^2}{2\sigma _0^2}\right) \end{aligned} \nonumber \\ \end{aligned}$$
(32)

using the index map \( \hat{l} \) and \(\overline{w}_{j\hat{l}} = \prod _{i=1}^n w_{j\hat{l}(i)} \). Further, expanding over the features results in

$$\begin{aligned} \begin{aligned}&f^{\textrm{wave}}_\theta (x, 0) \propto \sum _{\hat{\imath }= 1}^{(L^n)^J}(2\pi \sigma _0^2)^{-\frac{nJ}{2}} \overline{\overline{w}}_{\hat{\imath }(i,j)} \\&\quad \times \exp \left( -\frac{1}{2\sigma _0^2}\sum _{j=1}^J \Vert (\mathcal {W}_j x) - \mu _{\mathbb {R}^n,\hat{\imath }(i, j)}\Vert ^2\right) , \end{aligned}\nonumber \\ \end{aligned}$$
(33)

where \( \overline{\overline{w}}_{\hat{\imath }(i,j)}=\prod _{j=1}^{J}\prod _{i=1}^{n} w_{\hat{\imath }(i,j)} \). Notice that (33) describes a homoscedastic GMM on \( \mathbb {R}^n \) with precision matrix

$$\begin{aligned} (\Sigma _{\mathbb {R}^n})^{-1} = \frac{1}{\sigma _0^2} \sum _{j=1}^J \bigl (\mathcal {W}_j\bigr )^*\mathcal {W}_j. \end{aligned}$$
(34)

Using the properties of a projection (28) and the orthogonality property \( W_i \perp W_j \) for \( i \ne j \), this simplifies to

$$\begin{aligned} (\Sigma _{\mathbb {R}^n})^{-1} = \frac{1}{\sigma _0^2} \sum _{j=1}^J \mathcal {W}_j. \end{aligned}$$
(35)

\(\square \)

Theorem 4

(Wavelet diffusion) \( f^{\textrm{wave}}_\theta (\,\cdot \,, t) \) satisfies the diffusion PDE \( (\partial _t - \Delta _1) f^{\textrm{wave}}_\theta (\,\cdot \,, t) = 0 \) if \( \sigma _j^2(t) = \sigma _0^2 + 2t \).

Proof

Notice that, using the properties of a projection (28), \( (\Sigma _{\mathbb {R}^n})^{-1}|_{\oplus _{j=1}^J W_j} = \frac{1}{\sigma _0^2}\textrm{Id}_{\oplus _{j=1}^J W_j} \). Thus, on \( \bigoplus \limits _{j=1}^J W_j \), in analogy to the model based on filter responses, it suffices to adapt the variance of the one-dimensional GMMs \( \psi _j \) with \( \sigma _0^2 \mapsto \sigma _0^2 + 2t \). \(\square \)

We can endow the different sub-bands of the wavelet transformation with scalars to weight their influence as follows: Replacing \( \mathcal {W}_j \) with \( \lambda _j \mathcal {W}_j \) in (34) (the derivation does not change up to this point), we find that \( (\Sigma _{\mathbb {R}^n})^{-1} = \sum _{j=1}^J \frac{\lambda _j^2}{\sigma _0^2} \mathcal {W}_j \). Then, the diffusion PDE is satisfied when \( \sigma _j^2(t) = \sigma _0^2 + 2t \lambda _j^2 \). Thus, the scaling parameters \( \lambda _{{j}} \) are analogous to the filter-norm \( \Vert k_j\Vert \) in Theorem 2.

We briefly discuss the extension to two-dimensional signals: Let \( x \in \mathbb {R}^{n \times n} \) be a two-dimensional signal. \( \mathcal {W}_j^{d}: \mathbb {R}^{n \times n} \rightarrow \mathbb {R}^{n \times n} \) is a linear operator corresponding to the \( j \)-th detail level (\( j \in \{ 1, \dotsc , J \} \) where \( J \in \mathbb {N} \) is the coarsest scale in the decomposition) in the wavelet decomposition in direction \( d \). We denote the (now three) detail spaces at scale \( j \) as \( W_{j}^{d} \), where \( d \in \{ \textbf{v}, \textbf{h}, \textbf{d} \} \) indexes the direction (vertical, horizontal, and diagonal). Our model accounts for the directional sub-bands with individual GMM experts, i.e., every \( \psi _j \) is replaced by a triplet \( \psi _j^{d} \) endowed with weights \( w_j^d \in \triangle ^{n_w} \) for \( d \in \{ \textbf{v}, \textbf{h}, \textbf{d} \} \). Then, the entire previous discussion holds, where in particular \( W_i^{d} \perp W_j^{\tilde{d}} \) for all \( d, \tilde{d} \in \{ \textbf{v}, \textbf{h}, \textbf{d} \} \) and all \( i \ne j \). Since the operators \( \mathcal {W}_j^d \) are derived from the generating sequence \( h \in \mathbb {R}^K \) (see Sect. 4.1.2), the learnable parameters are summarized as \( \theta = \{ h, \{\lambda _j^d\}_{j,d}, \{ w_j^d \}_{j,d} \} \).
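As an illustration of the two-dimensional model, the following minimal sketch evaluates an unnormalized negative log-density of the form (29) on the detail sub-bands, with the variance \( \sigma _0^2 + 2t\lambda ^2 \) from Theorem 4 and the sub-band weighting discussed above. It assumes the PyWavelets package; the Daubechies-2 filter is merely a stand-in for the learned generating sequence \( h \), and the dictionaries holding weights and scales are our own bookkeeping.

```python
import numpy as np
import pywt

def wavelet_neg_log_density(x, t, weights, lambdas, mus, sigma0, levels=3):
    """Unnormalized -log f_wave(x, t), cf. (29), with per-sub-band experts.

    weights[(j, d)]: simplex weight vector of psi_j^d,
    lambdas[(j, d)]: sub-band scale lambda_j^d (expert variance is
    sigma0^2 + 2t * lambda^2, cf. Theorem 4 and the remark following it)."""
    coeffs = pywt.wavedec2(x, 'db2', level=levels)     # 'db2' stands in for h
    nll = 0.0
    for j, details in enumerate(coeffs[1:], start=1):  # detail sub-bands only
        for d, band in zip(('h', 'v', 'd'), details):  # horizontal/vertical/diagonal
            lam = lambdas[(j, d)]
            var = sigma0**2 + 2.0 * t * lam**2
            g = np.exp(-(lam * band[..., None] - mus)**2 / (2 * var))
            g /= np.sqrt(2 * np.pi * var)
            nll -= np.sum(np.log(g @ weights[(j, d)]))
    return nll
```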

3.2.3 Interpretation as Diffusion Wavelet Shrinkage

Wavelet shrinkage is a popular class of denoising algorithms. Starting from the seminal work of [31,32,33], a vast literature is dedicated to finding optimal shrinkage parameters for wavelet-based denoising (see, e.g., [34,35,36,37,38,39] and the references therein). In what follows, we briefly describe historical approaches to estimating shrinkage parameters.

The key motivation behind wavelet shrinkage denoising algorithms is the observation that wavelet coefficients of natural images are sparse, whereas the wavelet coefficients of noisy images are densely filled with “small” values. Thus, a straightforward denoising algorithm might be to calculate the wavelet coefficients, “shrink” small coefficients toward zero, and calculate the inverse wavelet transform of the shrunken coefficients. Popular shrinkage operators include the soft-shrinkage \( x \mapsto {\text {sgn}}(x) \max \{ |x| - \tau , 0 \} \) and the hard-shrinkage \( x \mapsto x {\chi }_{\{|x| > \tau \}} \). It is easy to see that these operators promote sparsity in the wavelet coefficients, as they correspond to the proximal maps w.r.t. \( \tau \Vert \,\cdot \,\Vert _1 \) and \( \tau \Vert \,\cdot \,\Vert _{0} \), respectively. Here, \( \tau > 0 \) is a thresholding parameter that has to be chosen depending on the noise level.
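For reference, the two shrinkage operators mentioned above read as follows in a minimal sketch; \( \tau \) is the noise-level-dependent threshold.

```python
import numpy as np

def soft_shrink(x, tau):
    """Soft-shrinkage, the proximal map of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def hard_shrink(x, tau):
    """Hard-shrinkage, keeping only coefficients with magnitude above tau."""
    return x * (np.abs(x) > tau)
```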

Historically, research for wavelet shrinkage models has focused on finding the optimal shrinkage parameter \( \tau \) (w.r.t. some risk, e.g. the squared error), assuming a particular choice of the shrinkage operator (e.g. the soft-shrinkage). Popular selection methods include VisuShrink [31] and SureShrink [32]. The former is signal independent and the threshold is essentially determined by the dimensionality of the signal as well as the (assumed known) noise level. In contrast, the latter chooses the thresholding parameter depending on the energy in a particular sub-band and does not depend on the dimensionality of the signal explicitly. The BayesShrink [40] method is also sub-band adaptive, and the authors provide expressions (or at least good approximations) for the optimal thresholding parameter under a generalized Gaussian prior on the wavelet coefficients. In particular, they rely on classical noise level estimation techniques to fit the generalized Gaussian to the wavelet coefficients (of the noisy image) and arrive at a simple expression for a sub-band dependent threshold.

The general methodology outlined in the previous section allows us to take a different approach: Instead of fixing the thresholding function and estimating the threshold solely on the corrupted image, we instead propose to learn the distribution of wavelet coefficients in different sub-bands for all noise levels \( \sigma > 0 \). Notice that an empirical Bayes step on the wavelet coefficients under our model corresponds to applying a point-wise nonlinearity.

In contrast to traditional wavelet shrinkage, our model does not prescribe a shrinkage function for which an optimal parameter has to be estimated for different noise levels. Rather, by learning the distribution of the wavelet coefficients at “all” noise levels, we obtain an MMSE-optimal “shrinkage” function through the empirical Bayes step on the experts. In addition, our wavelet prior can be used in more general inverse problems, whereas classical shrinkage methods are only applicable to denoising (although the denoising engine could be used in regularization by denoising [41] or plug-and-play [42] approaches).

3.3 Convolutional Model

The model based on filter responses discussed in Sect. 3.1 cannot account for the correlation of overlapping patches when used for whole image restoration [43, 44]. Similarly, the model based on wavelet responses is limited in expressiveness since it only models the distribution of a scalar random variable per sub-band. In what follows, we describe a convolutional PoGMDM that avoids the extraction and combination of patches in patch-based image priors and can account for the local nature of low-level image features. The following analysis assumes vectorized images \( x \in \mathbb {R}^n \) with \( n \) pixels; the generalization to higher dimensions is straightforward.

In analogy to the product-of-experts-type model acting on filter responses, here we extend the fields-of-experts model [44] to our considered diffusion setting by accounting for the diffusion time \( t \) and obtain

$$\begin{aligned} f_\theta ^{\textrm{conv}}(x, t) \propto \prod _{i=1}^n \prod _{j=1}^J \psi _j((K_j x)_i, w_{j}, t). \end{aligned}$$
(36)

Here, each expert \( \psi _j \) models the density of convolution features extracted by convolution kernels \( {\{ k_j \}}_{j=1}^J \) of size \( a = b \times b \), where \( {\{ K_j \}}_{j=1}^J \subset \mathbb {R}^{n \times n}\) are the corresponding matrix representations. Further, \( w_j \in \triangle ^L \) are the weights of the components of the \( j \)-th expert \( \psi _j \) [see (12)]. As with the models based on filter and wavelet responses, it is sufficient to adapt the variances \(\sigma _j^2(t)\) by the diffusion time as the following analysis shows.

Equation (36) again describes a homoscedastic GMM on \( \mathbb {R}^n \) with precision \( (\Sigma _{\mathbb {R}^n})^{-1} = \frac{1}{\sigma _0^2} \sum _{j=1}^J K_j^\top K_j \). This can be seen by essentially following the derivation of (34) in Theorem 3, at which point we did not yet exploit the special structure of \( \mathcal {W}_j: \mathbb {R}^n \rightarrow \mathbb {R}^n \) (i.e., it may be an arbitrary linear operator).

In order to derive conditions under which (36) fulfills the diffusion PDE, we begin by fixing the convolutions as cyclic, i.e., \( K_j x \equiv k_j *_n x \), where \( *_n \) denotes a 2-dimensional convolution with cyclic boundary conditions. Due to the assumed boundary conditions, the Fourier transformation \( \mathcal {F} \) diagonalizes the convolution matrices: \( K_j = \mathcal {F}^* {{\,\textrm{diag}\,}}(\mathcal {F}k_j) \mathcal {F} \). Thus, the precision matrix can be expressed as

$$\begin{aligned} (\Sigma _{\mathbb {R}^n})^{-1} = \mathcal {F}^*{{\,\textrm{diag}\,}}\biggl (\sum _{j=1}^J \frac{|\mathcal {F}k_j|^2}{\sigma _0^2}\biggr ) \mathcal {F} \end{aligned}$$
(37)

where we used the fact that \( \mathcal {F}\mathcal {F}^* = \textrm{Id}_{\mathbb {R}^n} \) and \( {{\text {conj}}(z)}z = |z|^2 \) (here \( |\,\cdot \,| \) denotes the complex modulus acting element-wise on its argument). To get a tractable analytic expression for the variances \( \sigma _j^2(t) \), we further assume that the spectra of \( k_j \) have disjoint support, i.e.,

$$\begin{aligned} \Gamma _i \cap \Gamma _j = \emptyset \ \text { if }\ i\ne j, \end{aligned}$$
(38)

where \( \Gamma _j = {\text {supp}} \mathcal {F}k_j \). In addition, we assume that the magnitude is constant over the support, i.e.,

$$\begin{aligned} |\mathcal {F} k_j| = \xi _j {\chi _{\Gamma _j}}, \end{aligned}$$
(39)

where \(\xi _j \in \mathbb {R}\) is the magnitude and \( {\chi _A} \) is the characteristic function of the set \( A \)

$$\begin{aligned} \chi _A(x) = {\left\{ \begin{array}{ll} 1 &{} \text {if}\ x \in A,\\ 0 &{} \text {else}. \end{array}\right. } \end{aligned}$$
(40)

Theorem 5

(Convolutional Diffusion) Under assumptions (38) and (39), \( f^{\textrm{conv}}_\theta (\,\cdot \,, t) \) satisfies the diffusion PDE \( (\partial _t - \Delta _1) f^{\textrm{conv}}_\theta (\,\cdot \,, t) = 0 \) if \( \sigma _j^2(t) = \sigma _0^2 + \xi _j^2 2t \).

Proof

In analogy to Theorem 2, with (37) \( \mathcal {F}^*{{\,\textrm{diag}\,}}\bigg (\sum _{j=1}^J \frac{\sigma _0^2}{|\mathcal {F}k_j|^2} \bigg ) \mathcal {F} \mapsto \mathcal {F}^*{{\,\textrm{diag}\,}}\left( \frac{\sigma _0^2 + 2t\sum _{j=1}^J|\mathcal {F}k_j|^2 }{\sum _{j=1}^J |\mathcal {F}k_j|^2}\right) \mathcal {F} \) under diffusion. The inner sum decomposes as

$$\begin{aligned} \frac{\sigma _0^2 + 2t \sum _{j=1}^J |\mathcal {F}k_j|^2}{\sum _{j=1}^J |\mathcal {F}k_j|^2} = \sum _{j=1}^J \frac{\sigma _0^2 + 2t |\mathcal {F}k_j|^2}{|\mathcal {F}k_j|^2} \end{aligned}$$
(41)

using (38), and with (39) the numerator reduces to \( \sigma _0^2 + 2t\xi _j^2 \). \(\square \)
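Conditions (38) and (39) can be satisfied by construction, e.g., by taking the inverse Fourier transform of indicator functions of disjoint frequency sets. The following minimal sketch builds such an (illustrative, annular) filter bank; the shearlet system of Sect. 3.3.1 realizes the same idea with a directional tiling.

```python
import numpy as np

def ideal_filter_bank(n, n_rings=4, xi=1.0):
    """Cyclic filters k_j with |F k_j| = xi * chi_{Gamma_j} for disjoint
    annular frequency sets Gamma_j, so (38) and (39) hold by construction."""
    fy, fx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing='ij')
    radius = np.sqrt(fx**2 + fy**2)
    edges = np.linspace(0.0, radius.max() + 1e-9, n_rings + 1)
    filters = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        spectrum = xi * ((radius >= lo) & (radius < hi))  # constant on Gamma_j
        filters.append(np.real(np.fft.ifft2(spectrum)))
    return filters

filters = ideal_filter_bank(64)
# per Theorem 5, the j-th expert then uses sigma_j^2(t) = sigma0^2 + xi_j^2 * 2t
```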

We emphasize that the convolutional model (36) is distinctly different from the model based on filter responses discussed in Sect. 3.1. In particular, the one-dimensional GMM experts \( \psi _j(\,\cdot \,, w_j, t) \) do not model the distribution of the filter responses of their corresponding filter kernels \( k_j \), but account for the non-trivial correlation of overlapping patches. We discuss this in more detail in Sect. 5.3.

3.3.1 Shearlets

In the previous section, we derived abstract conditions under which a product of one-dimensional GMM experts, with each expert modeling the distribution of convolutional features, can obey the diffusion PDE. In particular, we derived that the spectra of the corresponding convolution filters must be non-overlapping and constant on their support. Naturally, the question arises of how such a filter bank can be constructed. Luckily, the shearlet transformation [45] (and in particular the non-separable version of [46]) fulfills these conditions. As an extension of the wavelet transformation, the shearlet transformation [45] can represent directional information in multidimensional signals via shearing. Here, we consider the non-separable digital shearlet transformation [46], whose induced frequency tiling is shown schematically in Fig. 2. In particular, the frequency plane is partitioned into non-overlapping cones indexed by the scaling and shearing parameters described in the next paragraph.

Fig. 2 Schematic illustration of the cone-like frequency tiling of the non-separable shearlet transformation [46]. The shearlets are constructed such that their spectra are non-overlapping

We briefly describe our setup but refer the interested reader to [46, 47] for more details. We construct a digital shearlet system, specified by the positive scaling integer \( j = 0,\dotsc ,J \), the translations \( m \in \mathbb {Z}^2 \), and the shearing \( |k| \le \lceil 2^{\lfloor \frac{j}{2} \rfloor }\rceil \). The system is constructed by a one-dimensional low-pass filter \( h_1 \) and a two-dimensional directional filter \( P \). Given one-dimensional filters \( h_{J-j/2} \) and \( g_{J-j} \) derived from \( h_1 \) in a wavelet multiresolution analysis, let \( W_j = g_{J-j} \otimes h_{J-j/2} \) and let \( p_j \) be the Fourier coefficients of \( P \) at scaling level \( j \). Then, the system is constructed by

$$\begin{aligned} \gamma _{j,k} = \Bigl [ \Bigl ( S_k\bigl ( (p_j * W_j)_{\uparrow _{2^{j/2}}} *_1 h_{j/2} \bigr ) \Bigr ) *_1 {\overleftarrow{h}_{j/2}} \Bigr ]_{\downarrow _{2^{j/2}}}. \nonumber \\ \end{aligned}$$
(42)

Here, \( \uparrow _{a} \) and \( \downarrow _{a} \) are \( a \)-fold up- and down-sampling operators, and \( {\overleftarrow{(\cdot )}} \) indicates sequence reversal \( {\overleftarrow{(\cdot )}}(n) = (\cdot )(-n) \), and \( S_k \) is a shearing operator. The digital shearlet transformation of an image \( x \in \mathbb {R}^{n \times n} \) is then given by

$$\begin{aligned} \lambda _{j,k} {\text {conj}} (\gamma _{j, k}) * x \end{aligned}$$
(43)

where \( \lambda _{j,k} > 0 \) are learnable weights that reflect the importance of the respective scale and shear level. These weights are easily accounted for in the diffusion model by adapting \( \xi _{j, k} \) in (39) (where we have swapped the index \( j \) for a two-index \( j, k \) to account for the scales and shearing levels). Thus, we can summarize the learnable parameters for the model based on shearlet responses as \( \theta = \{ h_1, P, \{\lambda _{j,k}\}_{j,k}, \{w_{j,k}\}_{j,k} \} \).
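Given precomputed digital shearlet filters \( \gamma _{j,k} \) (e.g., from a shearlet toolbox), applying (43) amounts to weighted circular convolutions, which the following minimal sketch carries out in the Fourier domain; the array layout of gammas and lambdas is an assumption of ours.

```python
import numpy as np

def shearlet_responses(x, gammas, lambdas):
    """Responses lambda_{j,k} * (conj(gamma_{j,k}) * x), cf. (43).

    gammas: array of shape (S, n, n) with precomputed shearlet filters,
    lambdas: array of shape (S,) with the learnable sub-band weights."""
    Fx = np.fft.fft2(x)
    out = np.empty((len(gammas),) + x.shape)
    for s, (gamma, lam) in enumerate(zip(gammas, lambdas)):
        Fg = np.fft.fft2(np.conj(gamma))                # spectrum of conj(gamma)
        out[s] = lam * np.real(np.fft.ifft2(Fg * Fx))   # circular convolution
    return out
```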

4 Numerical Results

In this section, we first detail the setup for numerical optimization. In particular, we discuss how we can learn the one-dimensional GMM experts along with the corresponding transformation (filters, wavelets, and shearlets) jointly. Then, we show results for denoising utilizing a simple one-step empirical Bayes scheme as well as denoising algorithms derived for classical diffusion models. In addition, we show that we can use our models for noise level estimation and blind heteroscedastic denoising, and exploit Corollary  1 to derive a direct sampling scheme.

4.1 Numerical Optimization

For the numerical experiments, \( f_X \) reflects the distribution of rotated and flipped \( b \times b \) patches from the 400 gray-scale images in the BSDS 500 [48] training and test set, with each pixel value in the interval \( [0, 1] \). We optimize the score matching objective function (11) using projected AdaBelief [49] for \( {100\,000} \) steps. We approximate the infinite-time diffusion PDE by uniformly drawing \( \sqrt{2t} \) from the interval \( [{0}, {0.4}] \). For the denoising experiments, we utilize the validation images from [44] (also known as “Set68”). Due to computational constraints, we utilize only the first 15 images of the dataset according to a lexicographic ordering of the filenames. In addition, our wavelet and shearlet toolboxes only allow the processing of square images. To avoid boundary artifacts arising through padding images to a square, we only utilize the central region of size \( {320} \times {320} \).

For all experiments, \( \psi _j \) is an \( L = {125} \) component GMM, with equidistant means \( \mu _l\) in the interval \( [-\eta _j, \eta _j] \) (we discuss the choice of \( \eta _j \) for the different models in their respective sections). To support the uniform discretization of the means, the standard deviation of the \( j \)-th expert is \( \sigma _{0} = \frac{2\eta _j}{L - 1} \). In the one-dimensional GMM backbone of all models, we have to project a weight vector \( w \in \mathbb {R}^L \) onto the unit simplex \( \triangle ^L \). We realize the projection \( {{\,\textrm{proj}\,}}_{\triangle ^{L}}: \mathbb {R}^L \rightarrow \mathbb {R}^L \) with the sorting-based method proposed by [50], which is summarized in Algorithm 2. In addition, we further assume that the one-dimensional GMM experts are symmetric around \( 0 \), i.e., that the weights \( w_j \) are in the set \( \{ x \in \mathbb {R}^L: (x \in \triangle ^L) \wedge (x = {\overleftarrow{x}}) \} \). We implement this by storing only \( \lceil L / 2 \rceil \) weights, and mirroring the tail of \( \lceil L / 2 \rceil - 1 \) elements prior to the projection algorithm and function evaluations. To ensure that the one-dimensional GMM experts are sufficiently peaky around zero, we always choose \( L \) to be odd.
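For completeness, the following minimal sketch implements the sorting-based simplex projection of [50] and the symmetric weight construction described above; the layout of the stored half (center weight first) is our own convention.

```python
import numpy as np

def proj_simplex(w):
    """Euclidean projection onto the unit simplex (sorting-based, cf. [50])."""
    u = np.sort(w)[::-1]                                     # descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(w) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(w - theta, 0.0)

def symmetrize(half):
    """Mirror the stored ceil(L/2) weights (center weight first) to obtain a
    symmetric weight vector of odd length L."""
    return np.concatenate([half[:0:-1], half])
```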

In the next sections, we detail the constraints the building blocks of the learned transformations have to fulfill and how to satisfy them in practice.

4.1.1 Learning Orthogonal Filters

Let \( K = [k_1, k_2, \dotsc , k_J] \in \mathbb {R}^{a \times J} \) denote the matrix obtained by horizontally stacking the filters. We are interested in finding

$$\begin{aligned} {\text {proj}}_{\mathcal {O}}(K) {=} \mathop {\mathrm {arg\,min}}\limits _{M \in \mathcal {O}} \Vert M - K\Vert _F^2 \end{aligned}$$
(44)

where \( \mathcal {O} = \{ X \in \mathbb {R}^{a \times J}: X^\top X = D^2 \} \), \( D = {{\,\textrm{diag}\,}}(\lambda _1,\lambda _2,\dotsc ,\lambda _J) \) is diagonal, and \( \Vert \,\cdot \,\Vert _F \) is the Frobenius norm. Since \( {\text {proj}}_{\mathcal {O}}(K)^\top {\text {proj}}_{\mathcal {O}}(K) = D^2 \) we can represent it as \( {\text {proj}}_{\mathcal {O}}(K) = OD \) with \( O \) semi-unitary (\( O^\top O = \textrm{Id}_{\mathbb {R}^J} \)). Other than positivity, we do not place any restrictions on \( \lambda _1, \dotsc , \lambda _J \), as these are related to the precision in our model. Thus, we rewrite the objective

$$\begin{aligned} {{\,\textrm{proj}\,}}_{\mathcal {O}}(K) {=} \mathop {\mathrm {arg\,min}}\limits _{\begin{array}{c} O^\top O = \textrm{Id}_J \\ D = {{\,\textrm{diag}\,}}(\lambda _1,\dotsc ,\lambda _J) \end{array}} \mathcal {E}(O, D) \end{aligned}$$
(45)

where

$$\begin{aligned} \mathcal {E}(O, D) {:}{=}\Vert OD - K\Vert _F^2 = \Vert K\Vert _F^2 - 2 \langle K, OD \rangle _F + \Vert D\Vert _F^2, \nonumber \\ \end{aligned}$$
(46)

with \( \langle \,\cdot \,, \,\cdot \,\rangle _F \) denoting the Frobenius inner product.

We propose the following alternating minimization scheme for finding \( O \) and \( D \). The solution for the reduced sub-problem in \( O \) can be computed by setting \( O = U^\top \), using the polar decomposition of \( DK^\top = UP \), where \( U \in \mathbb {R}^{J \times a}\) is semi-unitary (\( U U^\top = \textrm{Id}_{\mathbb {R}^J} \)) and \( P = P^\top \succeq 0 \). The sub-problem in \( D \) is solved by setting \( D_{i,i} = \bigl ((O^\top K)_{i,i}\bigr )_{+} \). The algorithm is summarized in Algorithm 1, where we have empirically observed fast convergence; \( B = 3 \) steps already yielded satisfactory results. A preliminary theoretical analysis of the algorithm is presented in the supplementary material of our conference paper [7].

Algorithm 1 Algorithm for orthogonalizing a set of filters \( K \).
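Since the algorithm listing is not reproduced here, the following is a minimal sketch of the alternating scheme described above, with the semi-unitary factor of the polar decomposition obtained from an SVD; \( B = 3 \) outer steps as reported in the text.

```python
import numpy as np

def proj_orth(K, n_steps=3):
    """Alternating scheme for proj_O(K) = argmin ||OD - K||_F, cf. (44)-(46).

    K holds the filters k_j as columns (shape a x J, a >= J); O is
    semi-unitary and D diagonal with non-negative entries."""
    D = np.ones(K.shape[1])
    for _ in range(n_steps):
        # O-step: semi-unitary polar factor of K diag(D) via the SVD
        U, _, Vt = np.linalg.svd(K * D[None, :], full_matrices=False)
        O = U @ Vt
        # D-step: D_jj = max((O^T K)_jj, 0)
        D = np.maximum(np.einsum('ij,ij->j', O, K), 0.0)
    return O * D[None, :]        # orthogonal (not necessarily orthonormal) filters
```

In practice, this projection is composed with the projection onto the zero-mean subspace, as discussed next.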

Assuming a patch size of \( a = b \times b \), we use \( J = b^2 - 1 \) filters spanning the space of zero-mean patches \( \mathfrak {Z} = \{ x \in \mathbb {R}^a: \langle \mathbbm {1}_{\mathbb {R}^a}, x \rangle _{\mathbb {R}^a} = 0 \} \). We found that, when implementing \( {{\,\textrm{proj}\,}}_{\mathcal {O} \cap \mathfrak {Z}} \) as \( {{\,\textrm{proj}\,}}_{\mathfrak {Z}} \circ {{\,\textrm{proj}\,}}_{\mathcal {O}} \), both constraints were always almost exactly fulfilled. To ensure the correct projection, an alternative would be to utilize Dykstra’s projection algorithm [51]. The filters are initialized by independently drawing their entries from a zero-mean Gaussian distribution with standard deviation \( b^{-1} \). Since the filters can be freely scaled, we simply choose \( \eta _j = 1 \) for all \( j = 1, \dotsc , J \).

To visually evaluate whether our learned model matches the empirical marginal densities for any diffusion time \( t \), we plot them in Fig. 3. At the top, the learned \( 7 \times 7 \) orthogonal filters \( k_j \) are depicted. The filters bear a striking similarity to the eigenimages of the covariance matrices of [43, Fig. 6], who learn a GMM directly on the space of image patches (i.e., without any factorizing structure). This comes as no surprise, since the construction of the patch model (14) can be interpreted as “learning the eigendecomposition” (see Theorem 1 and the proof of Theorem 2). The learned potential functions \( -\log \psi _j(\,\cdot \,, w_j, t) \) and activation functions \( -\nabla \log \psi _j(\,\cdot \,, w_j, t) \) associated with the \( j \)-th filter are shown below the filters in Fig. 3. Indeed, the learned potential functions match the negative-log empirical marginal response histograms

$$\begin{aligned} -\log \mathbb {E}_{p \sim {f_{Y_t}}} {\bigl [} \delta (z - \langle k_j, p \rangle ) {\bigr ]} \end{aligned}$$
(47)

visualized at the bottom almost perfectly, even in the extremely low-density tails. This supports the theoretical argument, outlined in the introductory sections, that diffusion eases the problem of density estimation.
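The empirical marginals in (47) can be estimated with a simple histogram of filter responses over noisy patches. A small NumPy sketch, assuming a patch matrix \( P \) of shape (N, a) and a filter \( k \) (names are ours):

```python
import numpy as np

def neg_log_marginal_histogram(P, k, t, bins=200, rng=None):
    # Empirical estimate of (47): histogram of responses <k, p> of patches
    # diffused to time t, i.e., perturbed with N(0, 2t Id) noise.
    rng = np.random.default_rng() if rng is None else rng
    P_t = P + np.sqrt(2.0 * t) * rng.standard_normal(P.shape)
    responses = P_t @ k
    hist, edges = np.histogram(responses, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    with np.errstate(divide="ignore"):
        return centers, -np.log(hist)
```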

Fig. 3
figure 3

Learned filters \( k_j \) (top, the intervals show the values of black and white, respectively, amplified by a factor of 10), potential functions \( -\log \psi _j(\,\cdot \,, w_j, t) \) and activation functions \( -\nabla \log \psi _j(\,\cdot \,, w_j, t) \). On the bottom, the negative-log empirical marginal response histograms are drawn

4.1.2 Learning Wavelets

The discrete wavelet transformation is characterized by the sequence \( h \in \mathbb {R}^K \). In addition to learning the parameters of the one-dimensional GMM, we follow [52] and also learn \( h \). From the sequence \( h \), the scaling function \( \phi \) and wavelet function \( \omega \) are defined by

$$\begin{aligned} \phi (x) = \sum _{k=1}^K h_k \sqrt{2} \phi (2x - k) \end{aligned}$$
(48)

and

$$\begin{aligned} \omega (x) = \sum _{k=1}^K{\bigl (g(h)\bigr )_k}\sqrt{2}\phi (2x - k) \end{aligned}$$
(49)

where \( \bigl (g(h)\bigr )_k = (-1)^k h_{K + 1 - k} \) is the alternating flip of \( h \). For \( \omega \) to be a wavelet, it must satisfy the admissibility criterion

$$\begin{aligned} \int _0^\infty \frac{|(\mathcal {F}\omega )(\zeta )|^2}{\zeta }\,\textrm{d}\zeta < \infty , \end{aligned}$$
(50)

cf. [29], from which it immediately follows that \( (\mathcal {F}\omega )(0) = \int _\mathbb {R}\omega = 0 \). For the transformation to be unitary, we require \( \int _\mathbb {R}\phi = 1 \) and

$$\begin{aligned} \int _\mathbb {R}\phi (x)\phi (x-n)\,\textrm{d} x = \delta _{n} \text { for all } n \in \mathbb {Z}. \end{aligned}$$
(51)

From these constraints, the feasible set of wavelet-generating sequences is described by

$$\begin{aligned} \begin{aligned}&\Omega = \{ h \in \mathbb {R}^K: \langle \mathbbm {1}_{\mathbb {R}^K}, g(h) \rangle _{\mathbb {R}^K} = 0,\\&\quad \langle \mathbbm {1}_{\mathbb {R}^K}, h \rangle _{\mathbb {R}^K} = \sqrt{2},\\&\quad \langle h, \circlearrowleft _{2n} h \rangle _{\mathbb {R}^K} = \delta _n\ \text { for all } n \in \mathbb {Z} \}. \end{aligned} \nonumber \\ \end{aligned}$$
(52)

Here \( \circlearrowleft _{n}: \mathbb {R}^K \rightarrow \mathbb {R}^K \) rolls its argument by \( n \) entries, i.e., \( \circlearrowleft _{n} x = (x_{K-n+1}, x_{K-n+2},\dotsc ,x_{K},x_1,x_2,\dotsc ,x_{K-n})^\top \). Observe that the orthogonality condition encodes \( K / 2 \) constraints (we assume that \( K \) is even), since \( \circlearrowleft _{0} = \circlearrowleft _{K} = \textrm{Id}_{\mathbb {R}^K} \). To project onto \( \Omega \), we write the projection problem

$$\begin{aligned} {{\,\textrm{proj}\,}}_{\Omega } (\bar{x}) {=} \mathop {\mathrm {arg\,min}}\limits _{x \in \Omega } \frac{1}{2} \Vert x - \bar{x}\Vert _2^2 \end{aligned}$$
(53)

in its Lagrangian form using \( \mathcal {L}: \mathbb {R}^K \times \mathbb {R}\times \mathbb {R}\times \mathbb {R}^{K/2} \rightarrow \mathbb {R}: \)

$$\begin{aligned} \begin{aligned}&(x, \Lambda _{\textrm{scal}}, \Lambda _{\textrm{adm}}, \Lambda )\\&\mapsto \frac{1}{2} \Vert x - \bar{x}\Vert _2^2\\&\quad + \Lambda _{\textrm{scal}} \bigl ( \langle \mathbbm {1}_{\mathbb {R}^K}, x \rangle _{\mathbb {R}^K} - \sqrt{2} \bigr ) + \Lambda _{\textrm{adm}} \bigl ( \langle \mathbbm {1}_{\mathbb {R}^K}, g(x) \rangle _{\mathbb {R}^K} \bigr )\\&\quad + \sum _{n = 0}^{\frac{K}{2}-1} \Lambda _{n+1} \bigl ( \langle x, \circlearrowleft _{2n} x \rangle _{\mathbb {R}^K} - {\chi _{\{0\}}(n)} \bigr ) \end{aligned}\nonumber \\ \end{aligned}$$
(54)

and find stationary points by solving the first-order optimality system \( \nabla \mathcal {L} = 0 \) as a nonlinear least-squares problem, i.e., by minimizing

$$\begin{aligned} \frac{1}{2}\Vert \nabla \mathcal {L}(x, \Lambda _{\textrm{scal}}, \Lambda _{\textrm{adm}}, \Lambda )\Vert _2^2 \end{aligned}$$
(55)

using 10 iterations of Gauss-Newton. To facilitate convergence, we warm start the Lagrange multipliers \( \Lambda _{\textrm{scal}}, \Lambda _{\textrm{adm}}, \Lambda \) with the solution from the previous outer iteration. We initialize the sequence \( h \) with the generating sequences of the db2- (\( K = 4 \)) and db4-wavelet (\( K = 8 \)). For both, we utilize \( J = 2 \) levels. We use the pytorch_wavelets [53] implementation of the discrete wavelet transformation.
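To illustrate the projection (53)–(55), the following PyTorch sketch builds the constraint residuals of (52), assembles the Lagrangian (54), and performs Newton-type steps on the stationarity system \( \nabla \mathcal {L} = 0 \) via automatic differentiation. All helper names are ours, and the exact Gauss–Newton variant used in our implementation may differ.

```python
import torch

def g(h):
    # Alternating flip generating the wavelet sequence from h, cf. (49).
    K = h.numel()
    signs = torch.tensor([(-1.0) ** k for k in range(1, K + 1)])
    return signs * h.flip(0)

def constraints(h):
    # Residuals of the constraints defining Omega in (52).
    K = h.numel()
    c_scal = h.sum() - 2.0 ** 0.5
    c_adm = g(h).sum()
    c_orth = torch.stack([h @ torch.roll(h, 2 * n) - (1.0 if n == 0 else 0.0)
                          for n in range(K // 2)])
    return torch.cat([c_scal.reshape(1), c_adm.reshape(1), c_orth])

def lagrangian(z, h_bar):
    # L(x, Lambda) from (54), with all multipliers stacked into one vector.
    K = h_bar.numel()
    x, lam = z[:K], z[K:]
    return 0.5 * torch.sum((x - h_bar) ** 2) + lam @ constraints(x)

def proj_omega(h_bar, lam0=None, iters=10):
    # Stationary point of (54) via Newton-type steps on grad L = 0.
    K = h_bar.numel()
    lam0 = torch.zeros(K // 2 + 2) if lam0 is None else lam0
    z = torch.cat([h_bar.clone(), lam0])
    f = lambda zz: lagrangian(zz, h_bar)
    for _ in range(iters):
        r = torch.autograd.functional.jacobian(f, z)   # grad L, the residual
        H = torch.autograd.functional.hessian(f, z)    # Jacobian of the residual
        z = z + torch.linalg.lstsq(H, -r.unsqueeze(1)).solution.squeeze(1)
    return z[:K], z[K:]                                # projected h, multipliers

# Example: project a perturbed db2 sequence (K = 4) back onto Omega,
# warm-starting the multipliers with zeros.
db2 = torch.tensor([0.48296, 0.83652, 0.22414, -0.12941])
h, lam = proj_omega(db2 + 0.05 * torch.randn(4))
```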

In contrast to the model based on filter responses, the model based on wavelet responses does not have the freedom to adapt the scaling of the filters. To overcome this, we discretize the means over the real line individually for each sub-band. In detail, for the \( j \)-th level and \( d \)-th direction, \( d \in \{ \textbf{h}, \textbf{v}, \textbf{d} \} \), we choose \( \eta ^d_j = 1.1 q^{d}_j \), where \( q^d_j \) is the \( 0.999 \)-quantile of the corresponding responses.

The initial and learned generating sequences, their corresponding scaling and wavelet functions, and the learned potential functions and MMSE shrinkage functions are shown in Fig. 4. These figures make apparent that our chosen parametrization is sub-optimal. In particular, in order to represent the heavy tails (especially for level \( j = 1\)), many intermediate weights are set to \( 0 \). This leads to the MMSE shrinkage functions becoming step-like. We emphasize that this is a practical problem of choosing an appropriate parametrization; we discuss alternatives to our equispaced GMM in Sect. 5.

Fig. 4
figure 4

Left: db2 initial generating sequence (\( K = 4\)). Right: db4 initial generating sequence (\( K = 8\)). From top to bottom: Initial generating sequence \( h \) along with the corresponding scaling and wavelet functions \( \phi \) and \( {\omega } \), learned generating sequence \( h \) along with the corresponding scaling and wavelet functions \( \phi \) and \( {\omega } \), learned potential functions \( -\log \psi (\,\cdot \,, w, t) \) and the learned MMSE shrinkage functions \( y_t \mapsto y_t {+} 2t \nabla \log \psi (y_t, w, t) \)

4.1.3 Learning Shearlets

We initialize the one-dimensional low-pass filter \( h_1 \) and the two-dimensional directional filter \( P \) with the standard choices from [47]: \( h_1 \) is initialized as a maximally flat 9-tap symmetric low-pass filter, and \( P \) as the maximally flat fan filter described in [54]. Furthermore, \( \lambda _{j,k} \) is initialized as \( 1 \) for all scale levels \( j \) and shearings \( k \), and we set \( \eta _{j, k} = 0.5 \).

We enforce the following constraints on the parameter blocks: The weighting parameters \( \lambda _{j,k} \) must satisfy non-negativity \( \lambda _{j, k} \in \mathbb {R}_{\ge 0} \). The parameters \( h_1 \) and \( P \) specifying the shearlet system must satisfy \( h_1 \in \mathcal {H} {:}{=}\{ x \in \mathbb {R}^9: \langle \mathbbm {1}_{{\mathbb {R}^9}}, x \rangle _{\mathbb {R}^9} = 1 \} \) and \( P \in \mathcal {P} {:}{=}\{ x \in \mathbb {R}^{17 \times 17}: \Vert x\Vert _1 = 1 \}\).

The projection operators can be realized as follows: The projection onto the non-negative real line is simply \( {{\,\textrm{proj}\,}}_{\mathbb {R}_{\ge 0}}(x) = \max \{ x, 0 \} \). The map \( {{\,\textrm{proj}\,}}_{\mathcal {H}}(x) = x - \frac{\langle \mathbbm {1}_{{\mathbb {R}^9}}, x \rangle _{\mathbb {R}^9} - 1}{9} \mathbbm {1}_{\mathbb {R}^9} \) realizes the projection onto the linear constraint encoded in \( \mathcal {H} \). The projection onto the unit one-norm sphere is \( {{\,\textrm{proj}\,}}_{\mathcal {P}}(x) = {\text {sgn}}(x) \odot {{\,\textrm{proj}\,}}_{\triangle ^m}(|x|) \) (see, e.g., [55, 56]), where we ignore the degenerate case of projecting the origin, for which \( {{\,\textrm{proj}\,}}_{\mathcal {P}} \) is not well defined. Our implementation of the shearlet transformation is based on the ShearLab 3D [47] toolbox.
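A minimal NumPy sketch of these three projections (helper names are ours; proj_simplex refers to the sorting-based simplex projection of Algorithm 2, sketched earlier):

```python
import numpy as np

def proj_nonneg(x):
    # Projection onto the non-negative reals, applied element-wise.
    return np.maximum(x, 0.0)

def proj_H(x):
    # Projection onto {x in R^9 : <1, x> = 1}: subtract the scaled constraint violation.
    return x - (x.sum() - 1.0) / x.size

def proj_P(x):
    # Projection onto the unit one-norm sphere: project |x| onto the simplex
    # and restore the signs (the degenerate case x = 0 is ignored).
    flat = x.ravel()
    return (np.sign(flat) * proj_simplex(np.abs(flat))).reshape(x.shape)
```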

Algorithm 2
figure b

Simplex projection from [50].

For the numerical experiments, we chose \( J = 2 \) scales and \( 5 \) shearings (\( k \in \{ -2, \dotsc , 2 \} \)). We show the initial and learned filter weights \( \lambda _{j,k} \), the one-dimensional low-pass filter \( h_1 \), and the two-dimensional directional filter \( P \) in Fig. 5. The resulting shearlet system in the frequency- and time-domain, along with the learned potential functions, is shown in Fig. 6. We again emphasize that the learned one-dimensional potential functions \( \psi _{j,k}(\,\cdot \,, w_{j,k}, t) \) are distinctly different from those of the other models. In particular, they exhibit multiple local minima, some of which are located away from \( 0 \), such that certain image structures can be enhanced under this prior. This is in stark contrast to the learned potentials on filter and wavelet responses, which show a single minimum at \( 0 \) and the classical heavy-tailed shape.

Fig. 5
figure 5

Initial (a) and learned (b) building blocks for the shearlet system: One-dimensional low-pass filter \( h_1 \) and the two-dimensional directional filter \( P \). The corresponding shearlet filters along with their frequency response, learned weights and learned potential functions are shown in Fig. 6

Fig. 6
figure 6

The five left columns show the frequency response, time-domain filters and corresponding potential functions of the first cone of the learned shearlet system. The inset numbers show the values of the corresponding weighting factor \( \lambda _{j, k} \). The five unlabeled right columns show the second cone

Figure 6 also shows that the shearlet system only approximately fulfills the assumptions (38) and (39). We analyze the shearlet system with respect to the assumption of disjoint support (38) by visualizing the pair-wise cosine similarity of the magnitudes of the spectra in Fig. 7. In detail, the figure shows \( \langle \frac{|\gamma _{\tilde{\jmath },\tilde{k}}|}{\Vert |\gamma _{\tilde{\jmath },\tilde{k}}|\Vert }, \frac{|\gamma _{j,k}|}{\Vert |\gamma _{j,k}|\Vert } \rangle _{\mathbb {R}^{n}} \) for \( \tilde{\jmath }, j \in \{ 1, 2 \} \), \( \tilde{k}, k \in \{ -2,\dotsc , 2 \} \), and both cones. Although less pronounced for the learned shearlet system, the plot is dominated by the main diagonal, indicating that the corresponding spectra are almost non-overlapping. To meet the theoretical assumptions, it would be possible to penalize \( \langle \frac{|\gamma _{\tilde{\jmath },\tilde{k}}|}{\Vert |\gamma _{\tilde{\jmath },\tilde{k}}|\Vert }, \frac{|\gamma _{j,k}|}{\Vert |\gamma _{j,k}|\Vert } \rangle _{\mathbb {R}^n} \) for \( (\tilde{\jmath }, \tilde{k}) \ne (j, k) \) during training.
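The similarity matrix of Fig. 7 can be computed directly from the discrete shearlet spectra. A small NumPy sketch, assuming the spectra are stacked row-wise into an array gamma_hat of shape (n_filters, n):

```python
import numpy as np

def spectral_cosine_similarity(gamma_hat):
    # Pairwise cosine similarity of the magnitudes of the filter spectra,
    # cf. Fig. 7; off-diagonal entries measure the violation of (38).
    mags = np.abs(gamma_hat)
    mags = mags / np.linalg.norm(mags, axis=1, keepdims=True)
    return mags @ mags.T
```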

The fact that the spectra are not constant over their support raises the question of how to choose \( \xi _a \) such that (39) is best approximated. During training and evaluation, we simply chose \( \xi _a = \Vert |\gamma _a|\Vert _\infty \), where \( a \) is a two-index ranging over the chosen scale-shearing grid. It remains an open question how the violation of the constraints (38) and (39) influences the diffusion, and whether there exists a better choice for \( \xi _a \).

Fig. 7
figure 7

Cosine similarity between the magnitudes of the spectra of different levels, shearings and cones of the initial (a) and learned (b) shearlet system. A system exactly fulfilling the assumption (38) would be unpopulated off the main diagonal

4.2 Image Denoising

To exploit our patch-based prior for whole-image denoising, following [43], we define the expected patch log-likelihood of a noisy image \( y \in \mathbb {R}^n \) with noise variance \( \sigma ^2(t) = 2t \) as

$$\begin{aligned} {\text {epll}}_\theta ^{\textrm{filt}}(y, t) = \sum _{j=1}^{\tilde{n}} p_j^{-1} \log f^{\textrm{filt}}_\theta (P_j y, t). \end{aligned}$$
(56)

Here, \( \tilde{n} \) denotes the total number of overlapping patches (e.g., for \( n = 4 \times 4 \) and \( a = 3 \times 3 \), \( \tilde{n} = 4 \)), \( P_i: \mathbb {R}^n \rightarrow \mathbb {R}^a \) denotes the patch-extraction matrix for the \( i \)-th patch, and \( p_i = \bigl ( \sum _{j=1}^{\tilde{n}} P_j^\top P_j \bigr )_{i, i} \) counts the number of overlapping patches covering the \( i \)-th pixel to compensate for boundary effects (see [57, Appendix B] for a more rigorous discussion). The wavelet- and shearlet-based priors can act on images of arbitrary size.

Let \( \log f_\theta \) be either \( {\text {epll}}_\theta ^{\textrm{filt}}, \log f_\theta ^{\textrm{wave}}\), or \( \log f_\theta ^{\textrm{conv}} \). We consider two inference methods: The one-step empirical Bayes estimate

$$\begin{aligned} \hat{x}_{\textrm{EB}}(y, t) = y + \sigma ^2(t) \nabla _{\hspace{-2.22214pt}1} \log f_\theta (y, t) \end{aligned}$$
(57)

corresponds to the Bayesian MMSE estimator. Notice that, in the case of \( \log f_\theta = {\text {epll}}_\theta ^{\textrm{filt}} \), the estimator

$$\begin{aligned} \begin{aligned} \hat{x}_{\textrm{EB}}(y, t)&= y + \sigma ^2(t) \nabla _{\hspace{-2.22214pt}1} {\text {epll}}_\theta ^{\textrm{filt}}(y, t) \\&= y + 2t \sum _{j=1}^{\tilde{n}} p_j^{-1} P_j^\top \nabla _{\hspace{-2.22214pt}1} \log f^{\textrm{filt}}_\theta (P_j y, t) \end{aligned} \end{aligned}$$
(58)

computes patch-wise MMSE estimates and combines them by averaging. This is known to be a sub-optimal inference strategy, since the averaged patches are not necessarily likely under the model [43]. A discussion of algorithms utilizing patch-based priors for whole-image restoration is beyond the scope of this article; we refer the interested reader to the works of [43, 57] for a detailed treatment of this topic. In addition, we refer to our previous conference publication [7], in which we present a proximal gradient continuation scheme that slightly improves over the empirical Bayes estimate by allowing patch-crosstalk.
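The following NumPy sketch illustrates the patch bookkeeping behind (56) and the patch averaging in (58). The callable score_patch is a stand-in for \( \nabla _{\hspace{-2.22214pt}1} \log f^{\textrm{filt}}_\theta \) acting on a single flattened patch, the helper names are ours, and the normalization is applied per pixel, which is the natural reading of (58).

```python
import numpy as np

def empirical_bayes_denoise(y, t, score_patch, b):
    # One-step empirical Bayes estimate (58): patch-wise MMSE corrections,
    # accumulated over all overlapping b x b patches and normalized by the
    # per-pixel patch count (the diagonal of sum_j P_j^T P_j).
    H, W = y.shape
    acc = np.zeros_like(y)
    counts = np.zeros_like(y)
    for i in range(H - b + 1):
        for j in range(W - b + 1):
            patch = y[i:i + b, j:j + b].ravel()
            acc[i:i + b, j:j + b] += score_patch(patch, t).reshape(b, b)
            counts[i:i + b, j:j + b] += 1.0
    return y + 2.0 * t * acc / counts
```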

The second inference method we consider is the stochastic denoising algorithm proposed by [58] and summarized in Algorithm 3. In detail, this algorithm proposes a sampling scheme to approximately sample from the posterior of a denoising problem when utilizing diffusion priors. This is achieved by properly weighting the score \( \nabla \log f_\theta \) with the gradient of the data term while annealing the noise level. Sampling from the posterior, as opposed to directly computing MMSE estimates with an empirical Bayes step, is known to produce sharper results when utilizing modern highly expressive diffusion models [58, 59]. We chose \( \epsilon = 5 \times 10^{-6} \), \( \sigma _{{C}} = 0.01 \) and the exponential schedule \( \sigma _i = \sqrt{2t} \bigl (\frac{\sigma _{{C}}}{\sqrt{2t}}\bigr )^{i/{C}} \), using \( B = 3 \) inner loops and \( C = {100} \) diffusion steps.

Algorithm 3
figure c

Stochastic image denoising algorithm from [58].
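For orientation only, the following is a generic annealed Langevin-type posterior sampler in the spirit of [58, 59]. It is not a transcription of Algorithm 3; the step sizes and, in particular, the weighting of the data term should be taken from [58]. The callable score is a stand-in for the prior score \( \nabla _{\hspace{-2.22214pt}1} \log f_\theta \) at a given noise level.

```python
import numpy as np

def stochastic_denoise(y, t, score, eps=5e-6, sigma_last=0.01, C=100, B=3, rng=None):
    # Generic annealed Langevin-type posterior sampling for y = x + sqrt(2 t) eta.
    # The prior score is annealed over an exponential noise schedule; the data
    # term weighting below is a common heuristic and may differ from [58].
    rng = np.random.default_rng() if rng is None else rng
    sigma0 = np.sqrt(2.0 * t)
    sigmas = sigma0 * (sigma_last / sigma0) ** (np.arange(1, C + 1) / C)
    x = y.copy()
    for sigma in sigmas:
        alpha = eps * sigma ** 2 / sigma_last ** 2      # step size per noise level
        for _ in range(B):
            grad_prior = score(x, sigma)
            grad_data = (y - x) / (sigma0 ** 2 - sigma ** 2 + 1e-12)
            x = (x + alpha * (grad_prior + grad_data)
                 + np.sqrt(2.0 * alpha) * rng.standard_normal(x.shape))
    return x
```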

Let \( x \in \mathbb {R}^n \) denote a test sample from the distribution \( f_X \), and let \( \hat{x} \) denote the estimate of \( x \) given \( y_t = x + \sqrt{2t} \eta \), where \( \eta \sim \mathcal {N}(0, \textrm{Id}_{\mathbb {R}^n}) \), obtained through either of the discussed inference methods. In Table 1, we show a quantitative evaluation utilizing the standard metrics peak signal-to-noise ratio (PSNR) \( 10\log _{10}\frac{n}{\Vert \hat{x} - x\Vert _2^2} \) and structural similarity (SSIM) [60] with a window size of 7 and the standard parameters \( K_1 = {0.01} \) and \( K_2 = {0.03} \). The column with the heading “Patch-GSM” uses the Gaussian scale mixture parametrization discussed in Sect. 5.1. The results are obtained from one run of the algorithms, i.e., we did not compute the expectation over the noise (neither in the construction of \( y_t \) nor during the iterations of the stochastic denoising algorithm). However, we did not observe any noteworthy deviation across different runs of the experiments.

The quantitative evaluation shows impressive results for the model based on shearlet responses, despite it having very few trainable parameters. In particular, it performs best across all noise levels and inference methods, with the exception of one-step empirical Bayes denoising at \( \sigma = 0.2 \). There, the patch-based model with \( a = 15 \times 15 \) performs best, but notably has about 50 times as many trainable parameters. By leveraging symmetries between the cones in the shearlet system, the number of trainable parameters could even be approximately halved. These symmetries are strongly apparent in Fig. 6, where the potential functions of the second cone (rightmost five potential functions) are almost a perfect mirror image of the potential functions of the first cone (leftmost five potential functions).

Additionally, the table reveals that the empirical Bayes estimator beats the stochastic denoising algorithm in every quality metric. This is not surprising, as it is, in expectation, the optimal estimator in the MMSE sense, which directly corresponds to PSNR. Comparing the qualitative evaluation in Fig. 8 (empirical Bayes) to Fig. 9 (stochastic denoising), we do not observe sharper images when using the stochastic denoising algorithm; we are unsure why.

Table 1 Quantitative denoising results in terms of PSNR and SSIM using one-step empirical Bayes denoising and the stochastic denoising algorithm from [58]. The intervals indicate the 0.95 confidence region; bold typeface indicates the best method
Fig. 8
figure 8

Qualitative denoising results for one-step empirical Bayes denoising. In the rows, the noise standard deviation ranges in \( \sigma \in \{ 0.025, 0.05, 0.1, 0.2 \} \). The inlays show a zoomed region (magnifying factor 3), and the absolute difference of the reconstruction to the ground truth image (0  \(\frac{1}{3}\)). The accompanying quantitative results are shown in Table 1

Fig. 9
figure 9

Qualitative denoising results using the stochastic denoising algorithm from [58]. In the rows, the noise standard deviation ranges in \( \sigma \in \{ 0.025, 0.05, 0.1, 0.2 \} \). The inlays show a zoomed region (magnifying factor 3), and the absolute difference of the reconstruction to the reference image (0  \(\frac{1}{3}\)). The accompanying quantitative results are shown in Table 1

The analysis of the posterior variance is beyond the scope of this paper. However, techniques for analyzing the posterior induced by diffusion models are also readily applicable to our models. In particular, we refer to [58] or related papers such as [61] for an in-depth discussion of these techniques.

4.3 Noise Estimation and Blind Image Denoising

Within this and the following subsection, we describe two applications that arise as a by-product of our principled approach: Noise estimation (and, consequently, blind denoising) and analytic sampling. For both, we utilize the model based on filter responses as a stand-in but emphasize that similar results hold also for the models based on wavelet and shearlet responses.

The construction of our model allows us to interpret \( f^{\textrm{filt}}_\theta (\,\cdot \,, t) \) as a time-conditional likelihood density. Thus, it can naturally be used for noise level estimation: We assume a noisy patch \( y \) constructed by \( y = x + \sigma \eta \), where \( x \sim f_X \), \( \eta \sim \mathcal {N}(0, \textrm{Id}_{\mathbb {R}^n}) \) and \( \sigma \) is unknown. We can estimate the noise level \( \sigma \) by maximizing the likelihood of \( y \) w.r.t. the diffusion time \( t \), i.e., \( \hat{t} = \mathop {\mathrm {arg\,max}}\limits _t f^{\text {filt}}_\theta (y, t) \), and recover the noise level via \( \hat{\sigma } = \sqrt{2\hat{t}} \).

To demonstrate the feasibility of this approach, Fig. 10 shows the expected negative-log density \( \mathbb {E}_{p \sim f_X, \eta \sim \mathcal {N}(0, \textrm{Id})} {\bigl [} l_\theta (p+\sigma \eta , t) {\bigr ]} \) over a range of \( \sigma \) and \( t \). The noise level estimate \( \sigma \mapsto \hat{t}(\sigma ) = \mathop {\mathrm {arg\,min}}\limits _t \mathbb {E}_{p \sim f_X, \eta \sim \mathcal {N}(0, \textrm{Id})} {\bigl [} l_\theta (p + \sigma \eta , t) {\bigr ]} \) matches the relation \( \sigma = \sqrt{2\hat{t}} \) perfectly.
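A sketch of this maximum-likelihood noise-level estimation as a simple grid search over the diffusion time; the callable log_density stands in for \( \log f^{\textrm{filt}}_\theta \), and the grid is an assumption:

```python
import numpy as np

def estimate_noise_level(y, log_density, t_grid):
    # Maximum-likelihood estimate of the diffusion time: evaluate the learned
    # log-density of the noisy patch y on a grid of times and pick the best.
    values = np.array([log_density(y, t) for t in t_grid])
    t_hat = t_grid[np.argmax(values)]
    return np.sqrt(2.0 * t_hat), t_hat      # estimated sigma and diffusion time

# Example grid: noise levels sigma in [0.01, 0.5] on a log scale, t = sigma^2 / 2.
t_grid = 0.5 * np.logspace(np.log10(0.01), np.log10(0.5), 100) ** 2
```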

In addition, we can leverage this noise level estimation procedure to perform blind heteroscedastic denoising with the same model as follows: First, for all \( \tilde{n} \) overlapping patches \( P_j y \) in the corrupted image, we estimate the noise level through \( \hat{t}_j = \mathop {\mathrm {arg\,max}}\limits _t f^{\textrm{filt}}_\theta (P_j y, t) \). Given the noise levels \( \hat{t}_{{j}} \), we can estimate the clean image with an empirical Bayes step of the form

$$\begin{aligned} \hat{x}_{\textrm{blind}}(y) = y + 2 \sum _{j=1}^{\tilde{n}} \hat{t}_j p_j^{-1} P_j^\top \nabla _{\hspace{-2.22214pt}1} \log f^{\textrm{filt}}_\theta (P_j y, \hat{t}_j), \end{aligned}$$
(59)

where for each patch \( P_j y \) we utilize the estimated noise level \( \hat{t}_j \).

In Fig. 11, the original image is corrupted by heteroscedastic Gaussian noise with standard deviations 0.1 and 0.2 in a checkerboard pattern, which is clearly visible in the noise level map. In the restored image and the absolute difference to the reference, the checkerboard pattern is hardly visible, indicating that the noise level estimation is robust even when confronted with little data.

Fig. 10
figure 10

Expected normalized negative-log density along with the noise estimate \( \sigma \mapsto \mathop {\mathrm {arg\,min}}\limits _t \mathbb {E}_{p \sim f_X, \eta \sim \mathcal {N}(0, \textrm{Id})}{\bigl [}l_\theta (p+\sigma \eta , t) {\bigr ]} \), \( \sigma \mapsto \sqrt{2t} \); (left) and the slices at \( \sigma \in \{ 0.1, 0.2, 0.3, 0.4 \} \) (right)

Fig. 11
figure 11

Noise estimation and blind denoising. Top left: Image \( y \) corrupted with heteroscedastic Gaussian noise in a checkerboard pattern with standard deviation 0.1 and 0.2. Top right: Noise level map (0  0.5). Bottom left: One-step empirical Bayes denoising result using (59). Bottom right: Absolute difference to the reference image (0  \(\frac{1}{3}\))

4.4 Sampling

A direct consequence of Corollary 1 is that our models admit a simple sampling procedure: The statistical independence of the components allows drawing random patches by

$$\begin{aligned} Y_t = \sum _{j=1}^J \frac{k_j}{\Vert k_j\Vert ^2} U_{j, t}, \end{aligned}$$
(60)

where \( U_{j, t} \) is a random variable on \( \mathbb {R}\) sampled from the one-dimensional GMM \( \psi _j(\,\cdot \,, w_j, t) \). The samples in Fig. 12 indicate a good match over a wide range of \( t \). However, for small \( t \) the generated patches appear slightly noisy, which is due to an over-smooth approximation of the sharply peaked marginals around \( 0 \). This indicates that the (easily adapted) discretization of \( \mu _l \) equidistant over the real line is not optimal. We discuss alternative parametrizations in Sect. 5.

Fig. 12
figure 12

Ground-truth samples from the random variable \( Y_t \) (top) and samples generated by the analytic sampling procedure (60) (bottom)

5 Discussion

5.1 Alternative Parametrizations

The potential functions of the models based on filter and wavelet responses shown in Figs. 3 and 4 exhibit leptokurtic behavior, which has been noticed quite early in the literature [44, 62,63,64,65,66]. To model these leptokurtic potential functions, our parametrization relies on one-dimensional GMMs with a priori chosen equidistant means on the real line. The GMM is a very natural choice in our framework, as the Gaussian family is the only function family closed under diffusion (i.e., convolution with a Gaussian, cf. the central limit theorem). However, as a consequence, the discretization of the means over the real line has to be fine enough to allow proper modeling of the leptokurtic marginals. Thus, the majority of the learnable parameters are actually the weights of the one-dimensional Gaussian mixtures. This motivates the consideration of other expert functions \( \psi \).

An extremely popular choice for modeling the distribution of filter responses on natural images is the Student-t expert [44, 63]

$$\begin{aligned} x \mapsto \biggl ( 1 + \frac{x^2}{2} \biggr )^{-\alpha }. \end{aligned}$$
(61)

As outlined above, the convolution of this function with a Gaussian cannot be expressed in closed form. However, there exist approximations, such as the ones shown in [67] or [68, Theorem 1], which we recall here for completeness: Let \( X \) be a random variable on \( \mathbb {R}\) with density

$$\begin{aligned} f_X(x) = \frac{\Gamma (\frac{\nu + 1}{2})}{\sqrt{\nu \pi }\Gamma (\frac{\nu }{2})} \bigl ( 1 + \frac{x^2}{\nu } \bigr )^{-\frac{\nu + 1}{2}}, \end{aligned}$$
(62)

where \( \Gamma (z) = \int _0^\infty t^{z-1}\exp (-t)\,\textrm{d}t \) is the Gamma function, and let \( Y_t \) be a random variable defined as previously. Then, \( f_{Y_t} = \lim _{N\rightarrow \infty } f_{Y_t}^{(N)} \) where

$$\begin{aligned} \begin{aligned}&f_{Y_t}^{(N)}(y) = \frac{\exp \bigl ( -\frac{y^2}{4t} \bigr )\Gamma \bigl ( \frac{\nu + 1}{2} \bigr )}{\sqrt{4t\pi }\Gamma \bigl (\frac{\nu }{2}\bigr ) \bigl ( \frac{4t}{\nu } \bigr )^{\frac{\nu }{2}}} \times \\&\quad \sum _{n=0}^{N}\biggl ( \frac{1}{n!} \Bigl ( \frac{y^2}{4t} \Bigr )^n \Psi \Bigl ( \frac{\nu + 1}{2}, \frac{\nu }{2} + 1 - n, \frac{\nu }{4t} \Bigr ) \biggr ) \end{aligned} \end{aligned}$$
(63)

where \( \Psi \) denotes the confluent hypergeometric function of the second kind (also known as Tricomi’s function, or the hypergeometric \( U \) function) [69].
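A SciPy sketch evaluating the truncated approximation (63); the function name is ours.

```python
import numpy as np
from math import factorial
from scipy.special import gamma, hyperu

def f_yt_approx(y, t, nu=1.0, N=20):
    # Truncated series approximation (63) of the density of Y_t when X is
    # Student-t distributed with nu degrees of freedom.
    y = np.asarray(y, dtype=float)
    s = y ** 2 / (4.0 * t)
    prefac = (np.exp(-s) * gamma((nu + 1.0) / 2.0)
              / (np.sqrt(4.0 * np.pi * t) * gamma(nu / 2.0) * (4.0 * t / nu) ** (nu / 2.0)))
    series = sum(s ** n / factorial(n)
                 * hyperu((nu + 1.0) / 2.0, nu / 2.0 + 1.0 - n, nu / (4.0 * t))
                 for n in range(N + 1))
    return prefac * series
```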

We show \( - \log f_{Y_t}^{(N)} \) for different \( N \) and \( t > 0 \) in Fig. 13, along with \( - \log f_{Y_t} \), which we computed numerically. Notice that (63) is composed of two factors: a Gaussian with variance \( 2t \), and a series in the even powers of \( y \) filling up the tails of the distribution. Thus, it is not surprising that the approximation fails to model the tails of the distribution when \( t \) is small, and becomes better as \( t \) increases and the distribution approaches a Gaussian.

Fig. 13
figure 13

Forchini’s [67] approximation \( -\log f_{Y_t}^{(N)} \) (see (63)) of the density of the sum of a t- and a normally distributed random variable with standard deviation \( \sqrt{2t} \)

Another popular expert function is the Gaussian scale mixture (GSM)

$$\begin{aligned} x \mapsto \int _{-\infty }^{\infty } (2\pi z^2\sigma ^2)^{-\frac{1}{2}} \exp \biggl ( -\frac{x^2}{2z^2\sigma ^2} \biggr ) f_Z(z)\, \textrm{d}z \end{aligned}$$
(64)

which has been used in the context of modeling the distributions of both filter responses [70, 71] and wavelet responses [72, 73]. Here, \( f_Z \) is the mixing density of the multiplier \( Z \). Thus, GSMs can represent densities of random variables that follow

$$\begin{aligned} X = ZU \end{aligned}$$
(65)

where \( Z \) is a scalar random variable and \( U \) is a zero mean Gaussian (see [74] for conditions under which a random variable can be represented with a GSM). In practice, for our purposes we model the mixing density as a Dirac mixture \( f_Z = \sum _{i=1}^I w_i \delta _{z_i} \) with \( (w_1,\dotsc ,w_I)^\top \in \triangle ^I \) and \( z_i \) a priori fixed. Then, the GSM expert reads

$$\begin{aligned} \psi _j^{\text {GSM}}(x, w_j, t) = \sum _{i=1}^I w_{ji} (2\pi z_i^2(t))^{-\frac{1}{2}} \exp \biggl ( -\frac{x^2}{2z_i^2(t)} \biggr ), \nonumber \\ \end{aligned}$$
(66)

where without loss of generality we set \( \sigma = 1 \).

To show the practical merit of this parametrization in our context, we train a patch model using \( b = 7 \) with the following choice of \( z_i \): As the GMM experiments indicated that the discretization of the means was a bit too coarse, we chose \( z_i = 0.01 \times 1.4^{i - 1} \), such that \( z_1 = 0.01 < 0.016 = \sigma _0 \). The idea outlined in Sect. 3.1 naturally extends to such models: Diffusion (for the \( j \)-th feature channel) amounts to rescaling \( z_i^2 \mapsto z_i^2 + 2t\Vert k_j\Vert ^2 \).
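A small NumPy sketch of the GSM expert (66) together with the diffusion rescaling; the scale grid follows the choice stated above, and the function name is ours.

```python
import numpy as np

def psi_gsm(x, w, z, t=0.0, filt_norm_sq=1.0):
    # Gaussian scale mixture expert (66); diffusion to time t amounts to
    # rescaling the squared scales z_i^2 -> z_i^2 + 2 t ||k_j||^2.
    var = z ** 2 + 2.0 * t * filt_norm_sq                     # shape (I,)
    gauss = np.exp(-x[..., None] ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return gauss @ w

# Scales as in the experiments: z_i = 0.01 * 1.4**(i-1), I = 20 components.
z = 0.01 * 1.4 ** np.arange(20)
w = np.full(20, 1.0 / 20.0)
vals = psi_gsm(np.linspace(-1.0, 1.0, 5), w, z, t=0.01)
```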

We show the learned filters and their corresponding potential and activation functions when utilizing a GSM in Fig. 14, where we used \( I = {20} \) scales. The number of learnable parameters is \( (a - 1)(a + I) \), which is 3312 when modeling \( a = 7 \times 7 \) patches with our choice of \( I \). This is considerably fewer than the 5376 parameters of the GMM, which, as discussed in Sect. 4.4, still seems to be discretized too coarsely. This might indicate that a GSM parametrization is better suited for this purpose. Indeed, the quantitative analysis presented in Table 1 shows the superiority of the patch-based GSM model over the patch-based GMM. However, note that the GMM parametrization is strictly more versatile, as it does not assume a maximum at 0. For instance, GSMs cannot model the potential functions of the model based on shearlet responses (Fig. 6).

Fig. 14
figure 14

Learned filters \( k_j \) (top, the intervals show the values of black and white, respectively, amplified by a factor of 10), potential functions \( -\log \psi _j^{\textrm{GSM}}(\,\cdot \,, w_j, t) \) and activation functions \( -\nabla \log \psi _j^{\textrm{GSM}}(\,\cdot \,, w_j, t) \)

5.2 Designing More Expressive Models

All architectures discussed until now are shallow in the sense that they model the distribution of filter responses (either directly, or through wavelets or shearlets). A possible extension of our work would be to consider deep networks, i.e., networks with more than one layer. Indeed, many popular image restoration frameworks, such as trainable nonlinear reaction diffusion [75] or the cascade of shrinkage fields [76], employ trainable Gaussian mixture potential functions (often referred to more generally as radial basis splines). However, they are typically trained as point estimators in a classic discriminative (task-specific) framework, and have not been studied in the context of diffusion priors. We briefly note that the diffusion in trainable nonlinear reaction diffusion acts in image space, whereas our framework considers diffusion in probability space. Extending the idea of diffusion in probability space to deep networks is non-trivial. We believe that such models can only be tackled by approximating the diffusion PDE.

5.2.1 Wavelets: Modeling Neighborhoods

In essence, the model based on wavelet responses described in Sect. 3.2 models the histogram of wavelet coefficients in different sub-bands. However, it does not take the spatial neighborhood into account (neither within its own sub-band nor across siblings or parents). There have been many attempts at making these types of models more powerful: Guerrero-Colon et al. [77] introduce mixtures of GSMs to model the spatial distribution of wavelet coefficients in and across sub-bands. The authors of [78] extend this idea to mixtures of generalized Gaussian scale mixtures. We believe that these extensions can also be used in our work. In particular, modeling disjoint neighborhoods leads to a block-diagonal structure in the product GMM, which can be efficiently inverted. However, modeling disjoint neighborhoods is known to introduce artifacts [73]. Still, such models can be globalized, e.g., by utilizing ideas similar to the expected patch log-likelihood [43], which amounts to applying a local model to overlapping local neighborhoods individually and averaging the results.

Another interesting research direction with applications to generative modeling would be to condition the distribution of the wavelet coefficients on their parent sub-bands. Notice that when utilizing such conditioning and modeling local neighborhoods, we essentially recover the wavelet score-based generative model of Guth et al. [79]. Their model uses the score network architecture proposed in [80]; we believe that explicitly modeling local neighborhoods within our framework could yield results close to theirs.

5.3 Patch Versus Convolutional Model

One of the major open questions in this work is the relationship between the models based on filter responses and shearlet responses.

We again want to emphasize that they are distinctly different: The former “only” models the distribution of filter responses, essentially forming a histogram. In particular, the distribution of responses of natural images to arbitrary filters will always exhibit leptokurtic behavior [62, 65], with sharp peaks at \( 0 \) (see our learned potential functions of the model based on filter responses in Fig. 3). The experts in the model based on shearlet responses do not model the marginal distribution of filter responses, but take into account the non-trivial correlation of overlapping patches. This leads to significantly more complex expert functions with multiple minima, sometimes away from zero (see our learned potential functions of the model based on shearlet responses in Fig. 6). Although quite well known in the literature [43, 57, 75, 81], this distinction is sometimes overlooked (e.g., when [44] chose the restrictive Student-t potential functions in their convolutional fields-of-experts model). To the best of our knowledge, this paper is the first to propose strategies to learn patch-based and convolutional priors in a unified framework.

The assumption of non-overlapping spectra of the filters in the convolutional model (38) is analogous to the assumption of pair-wise orthogonality of the filters in the patch model (19): From (38) it immediately follows that \( \langle \mathcal {F}k_j, \mathcal {F}k_i \rangle _{\mathbb {C}^n} = 0 \) for \( i \ne j \). Thus, in some sense, the convolutional model becomes a patch-based model in Fourier space. However, the relationship remains unclear and deserves further investigation.

The second assumption, that the spectra are constant over their support (39), restricts the space of admissible filters quite heavily. Unfortunately, we did not find a way to relax this constraint, and we believe that it cannot be relaxed without losing exact diffusion. However, we think that it can be relaxed if the diffusion PDE only needs to be fulfilled within some error bounds.

6 Conclusion

In this paper, we introduced PoGMDMs as products of Gaussian mixture experts that allow for an explicit solution of the diffusion PDE of the associated density. For models acting on filter, wavelet, and shearlet responses, we derive conditions for the associated filters and potential functions such that the diffusion PDE is exactly fulfilled. Our explicit formulation enables learning of image priors simultaneously for all diffusion times using denoising score matching. Numerical results demonstrated that PoGMDMs capture the statistics of the underlying distribution well for any diffusion time. As a by-product, our models can naturally be used for noise estimation and blind heteroscedastic denoising.

Future work will include the design of multi-layer architectures for which the diffusion can be expressed analytically, or approximated within some error bounds. In addition, the learned models could be evaluated on more involved inverse problems such as deblurring or even medical imaging. Further, the extensive evaluation of the model based on filter responses in terms of sampling the distribution and performing heteroscedastic blind denoising can also be applied to the models based on wavelet and shearlet responses. Finally, the connection between the models based on filter and shearlet responses deserves further investigation.