Abstract
In recent years, the rapid development of deep learning technologies has advanced the field of Face Super-Resolution (FSR), particularly with the introduction of numerous methods based on CNNs and Transformers. However, employing these architectures individually often overlooks the relationship between global and local information, resulting in a lack of detail and clarity in FSR images. To address this issue, we propose a novel pixel-enhanced CNN-Transformer hybrid network for FSR, called PEFormer. For the CNN branch, we design a DetailCNN that uses Central Cross Differential Convolution (CCDC) to extract gradient features from adjacent local regions of the face; subsequent feature enhancement yields detail-rich local features. For the Transformer branch, we design a Pixel-level Triple-path Transformer (PTT), which comprises a global feature-extracting Transformer, a Pixel-Feature Attention Block (PFAB) that supplements high-frequency features and pixel information, and residual connections. Together, these three paths capture pixel-level long-range visual dependencies. Cascading DetailCNN and PTT enables frequent interaction between local and global information, improving the extraction of deep features. Additionally, we employ a progressive image reconstruction structure tailored to different upscaling factors, reinforcing multi-scale feature information for high-quality reconstructed images. Moreover, considering the strong identity attributes of face images, an identity loss function is incorporated during training to refine the details of facial components, bringing them closer to the target image and more in line with human perception. Experimental results indicate that our proposed method effectively enhances facial feature details and improves FSR performance, yielding gains in both qualitative and quantitative comparisons against existing methods.
Data availability
CelebA: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
FFHQ: https://github.com/NVlabs/ffhq-dataset
Helen: http://www.ifp.illinois.edu/~vuongle2/helen/
Author information
Contributions
XL contributed to conceptualization and methodology. XG was involved in methodology and manuscript writing and provided software. YC contributed to data interpretation and manuscript editing. GC was involved in validation and manuscript editing. TY and YC contributed to supervision and manuscript editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g., a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lu, X., Gao, X., Chen, Y. et al. PEFormer: a pixel-level enhanced CNN-transformer hybrid network for face image super-resolution. SIViP (2024). https://doi.org/10.1007/s11760-024-03395-8