1 Introduction

Reconstructing complex 3D scenes from 2D input images and subsequently synthesizing novel views is a long-standing research problem in computer vision. Recently, the introduction of neural radiance fields (NeRFs) [30] has led to enormous progress regarding the rendering of photorealistic views from multi-view images. NeRFs predict radiance and volume density from a 3D spatial location and viewing direction using a multilayer perceptron (MLP). The outputs of this MLP can be rendered by established volume rendering techniques [15].

Most NeRF variants are restricted to well-defined capturing environments, with noise-free images of a static scene with known camera parameters. Therefore, most NeRF variants produce inferior rendering results when the input images suffer from different types of blur. The most prominent examples are motion blur, where the camera changes position during exposure time, leading to a superposition of multiple views, and defocus blur, which is caused by a suboptimal choice of camera aperture. Multiple approaches address motion blur [8, 21, 40, 53], defocus blur [59] or both [20, 26, 37] in the context of NeRFs. However, while previous work has shown great progress in accelerating NeRF training and rendering times [4, 10, 31, 48], most of the existing blur-resistant contributions do not leverage the potential of these accelerations. This can be attributed to the fact that most blur-resistant strategies rely on dynamically generated rays, while the accelerated NeRF methods leverage explicit density volume representations to prune low-density voxels, a strategy that assumes static rays.

In this paper, we propose NeRF-FF, a novel plug-in method that can be combined with a multitude of NeRF and NeRF-like novel view synthesis strategies and significantly improves the rendering quality obtained from blurred input images. NeRF-FF calculates image masks that retain the in-focus or only slightly out-of-focus regions of the input images, yielding a beneficial trade-off between image quality and scene coverage. To achieve this, NeRF-FF first identifies in-focus regions by applying the discrete Laplace operator, which detects sharp edges that predominantly occur in in-focus regions. These feature points are leveraged to estimate, for each image, the visible and in-focus volume in 3D scene space using depth maps from a previously trained NeRF. This volume has the shape of a pyramid cropped by a near and a far plane. We refer to it as Focus Frustum.

These Focus Frustums are then refined by expanding their near and far planes such that each point of the relevant scene geometry is contained in at least a certain percentage of the Focus Frustums. Projecting the dense scene points encompassed by the Focus Frustums into the respective image space yields image masks indicating the pixels on which a subsequent NeRF model can be trained to obtain unblurred novel views. Thus, we provide a trade-off between reconstructing the complete scene geometry and using optimal, i.e., in-focus or only slightly out-of-focus, views.

Our experiments show that the resulting image quality of NeRF-FF is comparable to state-of-the-art strategies. At the same time, NeRF-FF can be used in combination with accelerated NeRF variants like Instant-NGP [31], which is not easily integrable with state-of-the-art methods like PDRF [37]. The proposed combination of NeRF-FF and Instant-NGP [31] outperforms the state of the art by two orders of magnitude regarding training time, bringing it down to under 1 min (see Fig. 1).

In summary, this paper offers the following contributions:

  • We propose NeRF-FF, a plug-in strategy that estimates image masks based on Focus Frustums (FFs), i.e., the visible in-focus volume in scene space, which allow a subsequent NeRF model to omit out-of-focus image regions during training. We additionally identify a reasonable choice of hyperparameters.

  • We provide a quantitative analysis showing that the combination of NeRF-FF with Instant-NGP [31] yields novel views comparable in quality to the state of the art while reducing the required computation time by two orders of magnitude. This results in end-to-end runtimes under 1 min on end-consumer hardware.

Fig. 1

NeRF-FF in combination with Instant-NGP (NeRF-FF + iNGP) accelerates training times on blurry input images by two orders of magnitude compared with the state of the art, while significantly improving Instant-NGP's novel view synthesis capabilities on these inputs. Results obtained on the real defocus dataset by Ma et al. [26] using an RTX 3090

2 Related work

Neural Radiance Fields (NeRFs). NeRFs [30] model the radiance of a static 3D scene as an implicit representation, which can be used by classical volume rendering techniques [15] to perform photorealistic novel view synthesis—the rendering of an unknown viewing direction based on a set of input images. This approach has gained popularity in recent years, leading to many follow-up works that have extended NeRF capturing capabilities, including NeRF variants that can handle dynamic scenes [22, 35].

3 Method

NeRF-FF estimates per-image training masks in five stages: initial depth estimation (Sect. 3.1), detection of in-focus indicator pixels (Sect. 3.2), initial Focus Frustum estimation (Sect. 3.3), Focus Frustum refinement (Sect. 3.4) and mask generation (Sect. 3.5). The pipeline of NeRF-FF is illustrated in Fig. 2.

3.1 Initial depth estimates

Defocus blur manifests depending on the distance of surfaces from the camera plane. In order to distinguish in-focus regions from out-of-focus regions, our approach leverages spatial information from the scene space. Therefore, a mapping between points in 2D image space and their corresponding points in scene space is required. We train a preliminary Instant-NGP [31] instance on the blurry input images i with color information \(I_i \in [0, 255]^{3 \times W \times H}\) to obtain per-view depth maps \(D_i \in \mathbb {R}^{W \times H}\). We denote the color of the pixel at position (x, y) in image i as \(I_i(x,y)\) and its depth as \(D_i(x,y)\). The depth maps \(D_i\), which are later used for the initial Focus Frustum estimation (Fig. 2b) and mask generation (Fig. 2d), are derived from the rays deployed during the volume rendering process of Instant-NGP inference. Even though the reconstructed appearance of the preliminary Instant-NGP model suffers from the blurred image inputs, the estimated scene geometry is adequate for the purpose of generating sufficiently accurate depth maps. Note that any approach which reliably estimates a dense depth map from the blurry input images can be applied in this stage.
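To make the depth-map derivation concrete, the following sketch accumulates a per-pixel expected depth from the standard volume rendering weights along a ray. It is a minimal illustration with our own variable names (sigmas, ts, deltas); an actual Instant-NGP integration would extract these quantities from its ray-marching code, and other depth definitions (e.g., the median depth along the ray) are equally possible.

```python
import numpy as np

def expected_depth(sigmas: np.ndarray, ts: np.ndarray, deltas: np.ndarray) -> float:
    """Expected termination depth of a single ray.

    sigmas: predicted volume densities at the samples along the ray
    ts:     distances of the samples from the camera origin
    deltas: spacing between consecutive samples
    """
    # Standard volume rendering weights:
    # w_k = T_k * (1 - exp(-sigma_k * delta_k)), T_k = exp(-sum_{j<k} sigma_j * delta_j)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = transmittance * alphas
    # Weighted average of sample depths; rays with little accumulated density
    # yield unreliable estimates and can be excluded from the depth map.
    return float(np.sum(weights * ts) / np.maximum(np.sum(weights), 1e-8))
```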

3.2 Discrete Laplace operator as DoF indicator

In-focus image regions generally contain sharp edges, be it through textured surfaces or object boundaries, which are greatly reduced by defocus blur [3, 16]. These edges can be detected using the discrete Laplace operator \(\Delta \). Pixels \(I_i(x,y)\) that exhibit high values after applying the discrete Laplace operator to the grayscaled input image indicate sharp edges and are therefore used as indicators for in-focus image regions—modeled as a set of indicator pixels \(P_i\) per image \(I_i\) with

$$\begin{aligned} P_i = \left\{ (x,y) | \Delta (I_i)(x,y) > \tau \right\} , \end{aligned}$$
(1)

where \(\Delta (I_i)(x,y)\) denotes the value of a pixel \(I_i(x,y)\) after applying the discrete Laplace operator to the grayscaled version of \(I_i\), and \(\tau \) denotes a threshold on the results of the discrete Laplace operator that functions as a hyperparameter of NeRF-FF. A suitable value range for \(\tau \) is experimentally identified in Sect. 4.2.
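A minimal sketch of this indicator detection using OpenCV is given below; the function name and the use of the absolute Laplacian response are our own choices, and Eq. (1) can equivalently be applied to the signed response.

```python
import cv2
import numpy as np

def focus_indicator_pixels(image_bgr: np.ndarray, tau: float = 80.0) -> np.ndarray:
    """Return an (N, 2) array of (x, y) indicator pixel positions P_i (Eq. 1).

    A pixel counts as an in-focus indicator if the magnitude of the discrete
    Laplace response on the grayscaled image exceeds the threshold tau.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    laplacian = cv2.Laplacian(gray, cv2.CV_64F)
    ys, xs = np.nonzero(np.abs(laplacian) > tau)
    return np.stack([xs, ys], axis=1)
```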

3.3 Initial Focus Frustum estimation

Based on the focus indicator pixels \(P_i\) and the corresponding depth maps \(D_i\), we can compute the near and far planes that are parallel to the image plane and between which the in-focus regions are located. We call the volume of the respective camera view frustum that lies between these planes the Focus Frustum. The Focus Frustum represents the visible in-focus volume in scene space for the corresponding image. Sharp edges often occur on object boundaries, whose associated depth is unstable, depending on whether the depth of the foreground or the background object is considered. These regions are vulnerable to even slight inaccuracies in the geometry estimation of the preliminary NeRF model. Therefore, it is likely that the sets of focus indicators \(P_i\) contain pixels that correspond to out-of-focus regions.

To mitigate the influence of these erroneous focus indicators, we compute the depths of the near and far planes \(n_i, f_i\) of the Focus Frustum \( FF _i\) based on the depth distribution of the focus indicator pixels:

$$\begin{aligned} n_i&= M - \sigma _{M}(P_i^<) \end{aligned}$$
(2)
$$\begin{aligned} f_i&= M + \sigma _{M}(P_i^>) \end{aligned}$$
(3)

where M is the median of the depth values of the focus indicator points in \(P_i\) according to the depth map \(D_i\), \(\sigma _\mu (X)\) denotes the standard deviation of the depth values of the pixels in set X with respect to \(\mu \), and \(P_i^<\) (\(P_i^>\)) is the subset of elements in \(P_i\) with a corresponding depth smaller (greater) than M. This follows the intuition that the depth of in-focus points in images with defocus blur follows a normal distribution: the probability of a point being in-focus is highest in the center of the distribution and decreases for closer and farther points.
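The following sketch computes \(n_i\) and \(f_i\) from the depths of the indicator pixels according to Eqs. (2) and (3); looking up the indicator depths in \(D_i\) is assumed to have happened beforehand, and the function name is ours.

```python
import numpy as np

def focus_planes(indicator_depths: np.ndarray) -> tuple[float, float]:
    """Near/far plane depths (n_i, f_i) of an initial Focus Frustum (Eqs. 2-3).

    indicator_depths: depth values D_i(x, y) of the indicator pixels in P_i.
    """
    m = float(np.median(indicator_depths))
    below = indicator_depths[indicator_depths < m]   # depths of P_i^<
    above = indicator_depths[indicator_depths > m]   # depths of P_i^>
    # Standard deviation taken with respect to the median m instead of the mean.
    sigma_below = float(np.sqrt(np.mean((below - m) ** 2))) if below.size else 0.0
    sigma_above = float(np.sqrt(np.mean((above - m) ** 2))) if above.size else 0.0
    return m - sigma_below, m + sigma_above
```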

Based on these near and far planes, our initial Focus Frustums \( FF _i\) are defined as

$$\begin{aligned} FF _i = \{ \textbf{p}\ \in \mathbb {R}^3 |\ \textbf{p} \cdot \textbf{v}_i \in [n_i, f_i] \ \wedge \ K_i [R_i|t_i] \textbf{p} \in I_i\}, \end{aligned}$$
(4)

where \(\textbf{v}_i\) denotes the forward-facing vector of the camera pose that captured image i in world coordinates, \(K_i\) denotes the internal and \([R_i|t_i]\) the external camera parameters of that camera. The first condition in Eq. 4 restricts the depth of a point to lie between the near and far planes defined by \(n_i, f_i\), while the second condition requires the point to project into image i, i.e., to lie within the view frustum (the visible volume in scene space) of the corresponding camera.

Figure 2b illustrates the process of estimating Focus Frustums based on the focus indicator points \(P_i\), the depth maps \(D_i\) and the corresponding camera views.
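A point-in-Focus-Frustum test in the spirit of Eq. (4) might look as follows. We implement the depth condition via the z-coordinate in camera space, which is equivalent to the dot product with the forward vector for world-to-camera extrinsics; the exact axis conventions depend on the dataset, and all names are our own.

```python
import numpy as np

def in_focus_frustum(p: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray,
                     n: float, f: float, width: int, height: int) -> bool:
    """Check whether the world-space point p lies inside a Focus Frustum (Eq. 4).

    K: 3x3 intrinsics, [R|t]: world-to-camera extrinsics, [n, f]: focus range.
    """
    p_cam = R @ p + t                       # point in camera coordinates
    depth = p_cam[2]                        # depth along the camera's viewing direction
    if not (n <= depth <= f):
        return False                        # outside the focus range
    uv = K @ p_cam                          # perspective projection
    if uv[2] <= 0:
        return False                        # behind the camera
    u, v = uv[0] / uv[2], uv[1] / uv[2]
    return 0 <= u < width and 0 <= v < height  # inside the image, i.e., the view frustum
```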

3.4 Focus Frustum refinement

Our approach provides image masks that regulate which image regions a subsequent NeRF model is trained on. The Focus Frustums dictate which depth ranges are considered for the generation of these masks by omitting pixels that get their color mainly from spatial positions outside of the Focus Frustum. Therefore, a low coverage of the scene geometry by the Focus Frustums leads to sparse inputs for these regions. Most NeRF variants suffer from significant quality loss for such sparse inputs. A low coverage of the scene volume can be caused by an insufficient amount of Laplacian features in in-focus image regions, e.g., due to textureless surfaces, or by spatial positions that are in-focus in only a few or no images. To mitigate this issue, we iteratively refine the Focus Frustums by expanding their DoF, i.e., the distance between their near and far plane. For a set \( S \) of randomly sampled points in scene space with high optical density in the preliminary NeRF—indicating that this scene space is occupied by relevant scene geometry—we examine the number of Focus Frustums they are encompassed by (see Eq. 4). We refine the depths \(n_i, f_i\) of the near and far plane for each Focus Frustum \( FF _i\) and denote the updated near and far planes of the resulting Focus Frustums \( RFF _{i}\) as \(\hat{ n }_i, \hat{ f }_i\). The refined near and far planes' depths are chosen such that the accumulated change in depth R (Eq. 5) is minimal, while any spatial position \(\textbf{p} \in S \) is part of at least a certain fraction \(\varrho \) of the refined Focus Frustums.

\(\varrho \) is a hyperparameter of NeRF-FF that trades off scene coverage against reconstruction quality of regions that are depicted in the Focus Frustums. We show the influence of different values of \(\varrho \) in Sect. 4.2.

For the calculation of R, we only consider Focus Frustums with a corresponding view frustum that encompasses the respective spatial position \(\textbf{p}\), i.e., only Focus Frustums are considered that could potentially encompass \(\textbf{p}\) if an arbitrarily large change of the near or far plane is performed.

$$\begin{aligned} R = \sum _i \left( (n_i - \hat{n}_i) + (\hat{f}_i - f_i) \right) \end{aligned}$$
(5)
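The paper specifies the refinement as a constrained minimization of R; the exact solver is not prescribed here, so the following is one possible greedy heuristic under our own assumptions: per-point depths and view-frustum visibility have been precomputed, and for each under-covered point the frustums requiring the smallest expansion are widened first.

```python
import numpy as np

def refine_focus_frustums(depths: np.ndarray, visible: np.ndarray,
                          near: np.ndarray, far: np.ndarray,
                          rho: float = 0.4) -> tuple[np.ndarray, np.ndarray]:
    """Greedy sketch of the Focus Frustum refinement (Sect. 3.4).

    depths:  (P, C) depth of each sampled scene point in each camera view
    visible: (P, C) True if the point lies inside the camera's view frustum
    near, far: (C,) initial focus range per Focus Frustum
    Returns the refined near and far plane depths (n_hat, f_hat).
    """
    n_hat, f_hat = near.astype(float).copy(), far.astype(float).copy()
    num_points, _ = depths.shape
    for p in range(num_points):
        cams = np.nonzero(visible[p])[0]          # frustums that could cover p at all
        if cams.size == 0:
            continue
        needed = int(np.ceil(rho * cams.size))    # coverage target for this point
        # Expansion cost per camera so that its focus range includes depths[p, c].
        cost = (np.maximum(n_hat[cams] - depths[p, cams], 0.0)
                + np.maximum(depths[p, cams] - f_hat[cams], 0.0))
        covered = int(np.sum(cost == 0.0))
        if covered >= needed:
            continue
        # Expand the cheapest frustums first, approximating a minimal R (Eq. 5).
        for c in cams[np.argsort(cost)][covered:needed]:
            d = depths[p, c]
            n_hat[c] = min(n_hat[c], d)
            f_hat[c] = max(f_hat[c], d)
    return n_hat, f_hat
```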

Figure 2c illustrates the refinement of the Focus Frustums obtained in the previous pipeline step: they were expanded to encompass relevant scene geometry that was not sufficiently captured by the initial estimates.

3.5 Mask generation

Given the refined Focus Frustums \( RFF _i\), we calculate corresponding image masks \(\textit{Mask}_i\) with addressable pixels \(\textit{Mask}_i(x,y)\) at positions (x, y) that discriminate image positions depicting structures within the Focus Frustum from ones depicting structures outside of it. To generate these masks, we leverage the depth values from the depth maps \(D_i\) (Sect. 3.1). A pixel is activated in an image mask \( Mask _i\) if its corresponding depth value lies within the refined Focus Frustum's focus range \([\hat{n}_i, \hat{f}_i]\):

$$\begin{aligned} Mask _i(x,y) = {\left\{ \begin{array}{ll} 1, &{} \quad \text {if }D_i(x,y) \in [\hat{n}_i, \hat{f}_i]\\ 0, &{} \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$
(6)

Note that since all pixels in the depth map are visible in their corresponding image, it is not necessary to consider the other boundaries of the Focus Frustum. A subsequent NeRF model is trained on the blurry input images considering only pixels \(I_i(x,y)\) with \(\textit{Mask}_i(x,y)=1\). This is illustrated in Fig. 2e. Since the output of NeRF-FF takes the general form of per-image masks, NeRF-FF is compatible with most NeRF variants.
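Eq. (6) translates directly into a per-image mask; a minimal sketch is shown below, with the resulting array restricting the photometric loss of the subsequent NeRF to the activated pixels.

```python
import numpy as np

def generate_mask(depth_map: np.ndarray, n_hat: float, f_hat: float) -> np.ndarray:
    """Binary training mask for one image according to Eq. (6).

    depth_map: (H, W) depth estimates D_i from the preliminary NeRF
    n_hat, f_hat: refined focus range of the corresponding Focus Frustum
    """
    return ((depth_map >= n_hat) & (depth_map <= f_hat)).astype(np.uint8)
```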

4 Experiments

4.1 Implementation details

Training. NeRF-FF provides image masks for a subsequent training process with a NeRF variant. In our experiments, we evaluate NeRF-FF in combination with Instant-NGP (NeRF-FF + iNGP). Therefore, we employ Instant-NGP [31] as both the preliminary and the subsequent NeRF model. Note that current state-of-the-art methods are not easily usable in combination with Instant-NGP because Instant-NGP employs custom CUDA kernels, which are hard to integrate with existing strategies [37]. The preliminary NeRF is trained for 2,000 iterations and the subsequent NeRF for 4,000 iterations on a single NVIDIA RTX 3090. We use the standard Instant-NGP training parameters.

Table 1 Quantitative results for real-world scenes from the dataset by Ma et al. [26]
Table 2 Quantitative results for real-world scenes from the dataset by Wu et al. [59]

Datasets and evaluation.

We perform our experiments on the dataset introduced by Ma et al. [26], which provides 5 synthetic and 10 real-world scenes containing images suffering from defocus blur. Furthermore, we demonstrate the capabilities of NeRF-FF on the dataset presented by Wu et al. [59], which contains 6 real-world scenes.

For the synthetic scenes of Ma et al. [26], we train one model per scene on the training split. We compare the novel views synthesized with that model for the views in the evaluation split with the respective ground-truth images. For our analysis, we consider the median of the results per scene. Each real-world scene of Ma et al. [26] contains a set of in-focus reference images. For each provided reference image, we train a model on all images of the respective scene dataset except the examined reference image and compare the synthesized image to the reference. For each scene, we report the median of these results in Table 1. The average results are reported in Table 3.

For the real-world scenes of Wu et al. [59], a triplet of images consisting of two out-of-focus and one in-focus image is provided per view. Analogous to the evaluation method of Wu et al., we train a model on the out-of-focus images from \(\frac{8}{9}\) of the views and compare the synthesized images of the remaining \(\frac{1}{9}\) of views to their respective in-focus reference images. For each scene, we evaluate different splits until the union of these splits' evaluation sets contains at least 20 images. Per scene, we report the median of these results in Table 2.

4.2 Hyperparameter optimization and ablation

Our approach relies on a set of hyperparameters (\(\tau \), \(\varrho \)). The threshold \(\tau \) determines which values resulting from the application of the discrete Laplace operator are considered indicators for in-focus regions (see Sect. 3.2). \(\varrho \) indicates the minimum fraction of Focus Frustums each randomly sampled scene point should be encompassed by after the FF refinement step (Sect. 3.4). This hyperparameter defines the trade-off between the sharpness of the image regions used for training and the coverage of the scene.

In this section, we identify reasonable values for these parameters by comparing resulting images regarding their visual quality measured by the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [66]. The experiments are conducted on the blurred scenes from the real dataset by Ma et al. [26].
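For reference, PSNR and SSIM between a rendered view and its reference image can be computed, e.g., with scikit-image; this is an illustrative snippet assuming a recent library version and uint8 RGB inputs, not the exact evaluation code used in our experiments.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_quality(rendered: np.ndarray, reference: np.ndarray) -> tuple[float, float]:
    """PSNR and SSIM of a rendered view w.r.t. its in-focus reference image.

    Both images are expected as uint8 arrays of shape (H, W, 3).
    """
    psnr = peak_signal_noise_ratio(reference, rendered, data_range=255)
    ssim = structural_similarity(reference, rendered, channel_axis=-1, data_range=255)
    return float(psnr), float(ssim)
```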

To identify a suitable value for \(\varrho \), we consider values \(\varrho \in \{0, 0.1, \ldots , 1.0\}\). Note that \(\varrho =0\) corresponds to omitting the Focus Frustum refinement step, while \(\varrho =1\) is equivalent to training the subsequent NeRF model without applying NeRF-FF beforehand. For this experiment, we set \(\tau =80\), which constitutes an educated guess that turns out to be optimal for the dataset at hand. Figure 3 shows the results of this evaluation, which indicate that \(\varrho =0.4\) yields the best performance in terms of PSNR and SSIM, while significantly lower or higher values of \(\varrho \) lead to dramatically decreased performance. Therefore, we chose \(\varrho =0.4\) for our subsequent experiments.

Fig. 3

Comparison of PSNR and SSIM of NeRF-FF + iNGP for different values of the hyperparameters \(\varrho \) and \(\tau \)

Table 3 Average results for synthetic and real-world scenes from the dataset by Ma et al. [26]
Fig. 4

Qualitative comparison of NeRF-FF + iNGP (ours) with the current state of the art on the real-world defocus dataset of Ma et al. [26]. The depicted scenes are cake, caps, cupcake, cups and daisy (left to right)

Figure 3 also shows the results for different values of \(\tau \), which determines the threshold for the results of the discrete Laplace operator employed for identifying DoF indicator pixels. The experiment is conducted with \(\varrho = 0.4\), since we have established it as the optimal parameter. The results indicate that low values of \(\tau \) lead to poor performance. In these cases, even weak signals are considered indicators for in-focus regions, leading to feature points in regions that are essentially out-of-focus but exhibit strong texture. On the other hand, high values of \(\tau \) lead to slowly decreasing performance, because even some sharp edges in in-focus regions are no longer considered relevant. The decline for high values of \(\tau \), however, is small in comparison with the decline for low values. This is most likely attributable to the fact that the indicator points are only used for the initial Focus Frustum estimates, which are then further refined: smaller initial Focus Frustums, which occur for high values of \(\tau \), are compensated by the subsequent Focus Frustum refinement. Following the results of this experiment, \(\tau =80\) is identified as the optimal choice for subsequent experiments, as it maximizes PSNR and SSIM values.

4.3 Novel view synthesis from blurry inputs

Evaluation Methodology. In this section, we summarize the results of our proposed approach in combination with Instant-NGP (NeRF-FF + iNGP) on the synthetic and real-world dataset of Ma et al. [26] and the dataset proposed by Wu et al. [59]. Similar to previous works, we conduct our quantitative analysis of the examined approaches by comparing the rendered images to their respective reference image regarding their PSNR, SSIM, and LPIPS.

Real-World Comparison.

Table 3 and Figs. 4 and 5 show that our model produces results that are comparable to state-of-the-art methods. We compare our approach to Deblur-NeRF [26], DP-NeRF [20] and PDRF [37], which leverage 3D scene information for the estimation of a blur kernel while simultaneously incorporating dynamic ray generation. Ma et al. [26] show that these methods significantly outperform deblurring the individual images beforehand with strategies like KPAC [45]. Our results show that NeRF-FF + iNGP leads to images with PSNR values that are on par with state-of-the-art work. Regarding SSIM, our approach consistently outperforms these methods by a fair margin. The results indicate that Instant-NGP outperforms these methods regarding SSIM as well, suggesting that these improvements in structural similarity can be attributed to the usage of that NeRF variant. However, our approach slightly trails the reference deblurring strategies regarding its LPIPS score. While this indicates inferior quality regarding human perception, we consider this difference in rendering quality negligible in comparison with the achieved acceleration. Our approach only requires 45 s of training time on average, which constitutes an acceleration of two orders of magnitude in comparison with state-of-the-art methods. This training time encompasses both the execution of NeRF-FF, including the preliminary NeRF training, and the training of the subsequent Instant-NGP model.

The results of our experiment on the dataset by Wu et al. [59] are illustrated in Table 2 and Fig. 6. These results indicate that we consistently produce images of better quality than DoF-NeRF [59], which physically models defocus blur by estimating the camera optics—especially the lens—as a multilayer perceptron.

Synthetic Comparison. Table 3 shows the results of NeRF-FF + iNGP on the synthetic dataset of Ma et al. [26]. On this dataset, our approach underperforms in comparison with the state-of-the-art approaches. An analysis of the intermediary pipeline stages shows that this is caused by a low-quality dense reconstruction of the synthetic scenes, leading to depth maps of inferior quality, which in turn is caused by fully blurred images in the dataset. We discuss this issue further in Sect. 5.

Fig. 5

Comparison of average PSNR, SSIM, LPIPS and training times between NeRF-FF + iNGP and state-of-the-art methods on the real defocus dataset by Ma et al. [26]. The results indicate that NeRF-FF + iNGP achieves comparable visual quality while decreasing the required training time by at least two orders of magnitude. For \(\uparrow \) (\(\downarrow \)), higher (lower) values indicate better results

Fig. 6

Qualitative comparison of DoF-NeRF [59] and NeRF-FF + iNGP (ours) on the real-world dataset provided by Wu et al. [59] on the scenes kendo (top), camera (middle) and amiya (bottom)

4.4 Compatibility with other technologies

Table 4 illustrates the results of NeRF-FF when applied in combination with different volumetric novel view synthesis strategies. For these experiments, we employed NeRF-FF in combination with nerfacto—a NeRF variant—and splatfacto, which is based on 3D Gaussian Splatting [18]. These strategies are both implemented in Nerfstudio [50].

The applied evaluation strategy is analogous to the one presented in Sect. 4.1. The results show that applying the masks generated by NeRF-FF to the input images of nerfacto and splatfacto produces higher-quality rendering results, manifesting in improvements in the observed metrics PSNR, SSIM and LPIPS. The impact of NeRF-FF on the training duration is negligible. These results further underline that NeRF-FF is compatible with a multitude of other training time optimized NeRF or NeRF-like strategies.

Table 4 Average results for real-world scenes from the dataset by Wu et al. [59] for other runtime-optimized methods in combination with NeRF-FF. For each scene, nerfacto [50] is trained for 6,000 iterations, splatfacto [18, 50] for 20,000 iterations

5 Limitations and future work

Dense Reconstruction Quality. Low quality of the preliminary NeRF's dense reconstruction significantly reduces the quality of the image masks and therefore the overall result of NeRF-FF and the subsequent NeRF model. This inferior performance is caused by mismatches between in-focus indicator pixels and the spatial positions of the scene geometry that mainly contribute to their rendered color. These mismatches lead to erroneous depth estimates for the in-focus indicators, negatively influencing the initial Focus Frustum estimates. An analysis of the failure cases occurring during our experiments shows that low-quality dense reconstructions often occur when the inputs contain images without in-focus regions, which leads to floater artifacts—scene volume with high estimated optical density that has no counterpart in the ground-truth scene—in the reconstructed scene. Therefore, NeRF-FF is not suited for datasets containing fully blurred images. Future work could enhance the robustness of NeRF-FF by automatically pruning such images from the training dataset.

Fig. 7

Floater artifacts manifesting in the volumetric scene representation of Instant-NGP (a). They result in erroneous depth maps, which corrupt the generated in-focus masks. As a result, the floater remains in the model trained with NeRF-FF (b)

Floater artifacts. Some results of NeRF-FF + iNGP suffer from floater artifacts in the subsequent reconstruction of the scene based on the masked images (Fig. 7). These floater artifacts are also prevalent in unmasked Instant-NGP models and reduce image quality depending on the camera position. Recently, approaches to mitigate the influence of floater artifacts either by changing the applied training process [2] or by removing them after the training process [12, 57, 58] have been discussed. Integrating these strategies into the training process could further improve the visual quality of the results.

6 Conclusion

In this paper, we propose NeRF-FF, a plug-in method that enables the processing of partially defocus-blurred input images for a multitude of NeRF variants. Our method leverages the discrete Laplace operator to detect in-focus regions in the input images and estimates in-focus volumes—Focus Frustums (FF)—on a per-image basis. By iteratively expanding these Focus Frustums, our approach reaches full scene coverage while maintaining high visual quality. NeRF-FF is compatible with accelerated NeRF variants like Instant-NGP, offering qualitative results comparable to SoTA methods like Deblur-NeRF [26], DP-NeRF [20] and PDRF [37] while reducing training times to under 1 min on end-consumer hardware. This corresponds to a relative speed-up of two orders of magnitude.