1 Introduction

The rapid and extensive progress of internet and networking technologies has simplified the replication, alteration, reproduction, and distribution of multimedia content through physical transmission media. This occurs during communication, information processing, and data storage, all at a low cost and without compromising the quality of the content. Therefore, securing data and protecting digital information from emerging hacker threats is essential. Different data hiding techniques have been proposed to address this problem, such as cryptography, steganography, and digital watermarking. Watermarking consists of embedding a signature into the original content and then trying to detect it after different manipulations are applied to the marked content. It is used for several applications, such as content protection, copyright management, content authentication, and tamper detection. Figure 1 illustrates several widely recognized applications of watermarking.

Fig. 1 Watermarking applications

In the last two decades, many traditional watermarking approaches have been proposed to secure different types of media, such as images [84] and 2D video [17]. Although many traditional watermarking algorithms have been proposed for video, 3D models, and audio, deep learning models have yet to focus on these areas. Indeed, to our knowledge, there is only one paper on 3D models [93], two papers on video, and no paper exploring deep learning models for audio watermarking.

As deep learning-based watermarking is a relatively recent area of research, current surveys concentrate on traditional algorithms. Since 2020, a few survey papers have been published on deep learning image watermarking, but no survey paper has focused on deep learning-based video watermarking techniques. Byrnes et al. [17] proposed a comprehensive survey of deep data hiding models unifying digital watermarking and steganography. Zhang et al. [95] also present a brief survey on deep learning-based data hiding, steganography, and watermarking for images. Li et al. [51] provide an overview of watermarking of deep learning models, and [29] gives a brief survey of image watermarking based on deep neural network architectures.

As deep learning-based watermarking continues to expand, and several works have recently been proposed for video, it is important to summarize and compare the current methods proposed for image and video. This survey aims to briefly classify traditional watermarking techniques for image and video and discuss existing deep learning models for image and video watermarking. We also present future directions that research on deep learning-based video watermarking may take. The key contributions of the survey are as follows:

  • This survey briefly classifies and compares the existing traditional watermarking techniques proposed for image and video.

  • We provide a classification of deep learning-based watermarking techniques based on the network architecture and the embedding domain.

  • We discuss and compare the most popular deep learning-based image and video watermarking techniques to give researchers a clear understanding of the practical challenges of deep learning-based watermarking.

  • We also present some future directions for deep learning-based video watermarking.

The rest of the paper is organized as follows. In Sect. 2, we review the traditional image and video watermarking techniques by classifying them based on the embedding target and the domain used. Section 3 introduces the basic concepts of deep learning-based watermarking. Section 4 presents a comparison of image watermarking techniques utilizing deep learning methods, focusing on their network architecture. Section 5 details the small number of existing deep learning-based video watermarking techniques and shows their advantages. Section 6 discusses the challenges of deep learning-based video watermarking and gives some suggestions to researchers in this domain. Finally, in Sect. 7, we draw some conclusions and highlight some directions for future work.

2 Traditional Image and Video Watermarking Techniques Review

Watermarking is a branch of data hiding technology that hides information in digital content so that it can be transmitted securely over a network. Information-hiding technology mainly includes steganography, covert communication, and watermarking. Watermarking protects digital content against several security problems, such as illegal data distribution, usage, duplication, manipulation, and storage. Indeed, it embeds a signature into the original content and then tries to detect it after different manipulations are applied to the marked content. Usually, a robust watermarking technique should be invisible. Watermarking is an important research area, thanks to its use in several media applications such as copyright protection and owner identification, copy control and fingerprinting, content authentication and integrity verification, broadcast monitoring, indexing, and medical applications.

2.1 Watermarking Terminology

The watermarking process comprises two main stages: signature embedding and signature detection. Embedding is the stage in which a signature containing the author’s information or copyright information is embedded within the host multimedia content through a specific embedding method, as shown in Fig. 2. First, the host content (an image, a video, or a 3D model) is optionally transformed depending on the chosen embedding target (DWT, DCT, FFT, etc.). Then, the signature is generated by scrambling the watermark information randomly using a secret key to enhance the security of the embedding method. The watermark can also be generated by applying encryption algorithms, as proposed in [14, 19, 53, 69]. The obtained mark is embedded within the selected coefficients, which are then brought back into the original domain to obtain the marked content.
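
The key-based scrambling step can be sketched as follows. This is a minimal illustration only, assuming the signature is a list of bits and that a seeded pseudo-random permutation stands in for the scrambling; the function names `scramble` and `unscramble` are hypothetical, the approach is not taken from the cited works, and Python's `random` module is not cryptographically secure.

```python
import random

def _permutation(n, key):
    """Derive a pseudo-random permutation of range(n) from the secret key."""
    order = list(range(n))
    random.Random(key).shuffle(order)
    return order

def scramble(bits, key):
    """Reorder the watermark bits with the key-dependent permutation."""
    order = _permutation(len(bits), key)
    return [bits[i] for i in order]

def unscramble(scrambled, key):
    """Invert the permutation to recover the original bit order."""
    order = _permutation(len(scrambled), key)
    restored = [0] * len(scrambled)
    for position, source in enumerate(order):
        restored[source] = scrambled[position]
    return restored

watermark = [1, 0, 1, 1, 0, 0, 1, 0]
signature = scramble(watermark, key=1234)
recovered = unscramble(signature, key=1234)  # same key restores the watermark
```

Only a detector that knows the key can rebuild the permutation, which is the property the scrambling step relies on.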

Fig. 2 Embedding stage

The signature detection stage tries to extract the embedded watermark, and it usually follows the same steps as the embedding stage, as shown in Fig. 3. Given a marked media, the same transformation used at embedding is applied, and the detection algorithm is then run on the obtained coefficients. The signature detection stage may require knowledge of the original content. In this case, the watermarking algorithm is said to be non-blind. Conversely, if the watermark is recovered without comparing the original media with the marked one, the watermarking algorithm is blind.

Fig. 3 Detection stage

If the signature is a sequence of N bits that can be read from the marked media, the watermarking algorithm is called multi-bit watermarking. In 0-bit watermarking, by contrast, the detector only decides whether a known signature is present in the given media. Several applications require both types: the detector must first verify the presence of the signature and, if it is present, identify which message is encoded.
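
The difference between the two detector types can be illustrated with a small sketch. It assumes, purely for illustration, that detection works on real-valued samples already extracted from the media, that the known signature is a sequence of +1/-1 values, and that a correlation threshold of 0.5 separates marked from unmarked content; all names and the threshold are hypothetical.

```python
def is_present(pilot_samples, known_signature, threshold=0.5):
    """0-bit detection: decide whether the known +/-1 signature is present."""
    correlation = sum(r * s for r, s in zip(pilot_samples, known_signature))
    return correlation / len(known_signature) > threshold

def read_bits(payload_samples):
    """Multi-bit reading: hard-decide each extracted sample into one bit."""
    return [1 if value > 0 else 0 for value in payload_samples]

def detect_then_decode(pilot_samples, payload_samples, known_signature):
    """Combined use: verify presence first, then identify the encoded message."""
    if not is_present(pilot_samples, known_signature):
        return None  # no watermark detected, nothing to decode
    return read_bits(payload_samples)

known = [1, -1, 1, 1, -1, -1, 1, -1]
marked_pilot = [0.9, -1.2, 0.7, 1.1, -0.8, -0.6, 1.3, -0.9]
message = detect_then_decode(marked_pilot, [0.8, -0.5, 1.1, -0.9], known)
```

Here the presence test and the message reading use separate sample sets, which is one possible design among several.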

Any watermarking technique must satisfy three main requirements: invisibility, capacity, and robustness. Depending on the application, these requirements are used to evaluate the performance of watermarking systems. In the case of invisible watermarking, the marked and the original content should be perceptually indistinguishable to humans. This fidelity can be evaluated qualitatively, by asking a group of people to judge the visual quality of the marked content, or quantitatively, by calculating several criteria. The standard criteria used to evaluate invisibility quantitatively are the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM), together with their means over video frames (MPSNR and MSSIM). When the marked content is an image, PSNR is calculated as shown in the following equations:

$$\begin{aligned}&\displaystyle PSNR = 10\log _{10} \frac{255^{2}}{MSE} (dB) \end{aligned}$$
$$\begin{aligned}&\displaystyle MSE = \frac{1}{M \times N} \sum _{m=1}^{M} \sum _{n=1}^{N} [f(m,n) - f_{w}(m,n)]^{2} \end{aligned}$$

where M \(\times \) N is the size of the image, f and \(f_{w}\) are the original and marked images, and MSE is the mean square error between f and \(f_{w}\). In the case of a video and if the number of marked frames is K, we calculate the Mean PSNR as follows:

$$\begin{aligned} MPSNR = \frac{1}{K}\sum _{k=1}^{K} PSNR_{k} \end{aligned}$$
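
The PSNR and MPSNR definitions above can be computed as in the following sketch. It assumes 8-bit grayscale images stored as lists of rows and uses only the standard library for self-containment; the function names are ours, and an array library would normally be used in practice.

```python
import math

def mse(original, marked):
    """Mean square error between two equally sized grayscale images."""
    rows, cols = len(original), len(original[0])
    total = sum(
        (original[m][n] - marked[m][n]) ** 2
        for m in range(rows)
        for n in range(cols)
    )
    return total / (rows * cols)

def psnr(original, marked, peak=255):
    """PSNR in dB; returns infinity when the two images are identical."""
    error = mse(original, marked)
    if error == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / error)

def mpsnr(original_frames, marked_frames):
    """Mean PSNR over the K marked frames of a video."""
    values = [psnr(f, fw) for f, fw in zip(original_frames, marked_frames)]
    return sum(values) / len(values)

frame = [[100, 100], [100, 100]]
marked_frame = [[101, 100], [100, 99]]
quality = psnr(frame, marked_frame)  # MSE = 0.5, about 51.1 dB
```

Identical frames give an infinite PSNR, so averaging over a video is only meaningful when every frame considered has actually been modified.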

Despite its simplicity, PSNR (or MPSNR) does not always agree with subjective evaluation results, so SSIM (or MSSIM) is introduced to evaluate the visual quality of the marked image or video. The MSSIM is defined as follows:

$$\begin{aligned}&\displaystyle MSSIM = \frac{1}{K}\sum _{k=1}^{K} SSIM(f_{k},f_{kw}) \end{aligned}$$
$$\begin{aligned}&\displaystyle SSIM(f_{k},f_{kw})=\frac{(2 \mu _{f_{k}} \mu _{f_{kw}} + C_{1}) (2 \sigma _{f_{k}f_{kw}} + C_{2})}{(\mu _{f_{k}}^{2} + \mu _{f_{kw}}^{2} + C_{1}) (\sigma _{f_{k}}^{2} + \sigma _{f_{kw}}^{2} + C_{2})} \end{aligned}$$

where \(\mu _{f_{k}}\) and \(\mu _{f_{kw}}\) are the mean values of the original image and the marked one, respectively; \(\sigma _{f_{k}}^{2}\) and \(\sigma _{f_{kw}}^{2}\) are the variances of the original image and the marked one; \(\sigma _{f_{k}f_{kw}}\) denotes the covariance of the original image and the marked one; and \(C_{1}\) and \(C_{2}\) are two stability constants. We note that there exist watermarking techniques that are visible, but their use is limited to specific applications.
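
A direct transcription of the SSIM and MSSIM formulas is sketched below. It evaluates the statistics once over the whole frame, exactly as the equation is written, whereas practical implementations usually average SSIM over local windows. The default constants \(C_{1} = (0.01 \times 255)^{2}\) and \(C_{2} = (0.03 \times 255)^{2}\) are commonly used values for 8-bit images and are an assumption here, as are the function names.

```python
def ssim(original, marked, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM between two equally sized grayscale frames."""
    x = [p for row in original for p in row]
    y = [p for row in marked for p in row]
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def mssim(original_frames, marked_frames):
    """Mean SSIM over the K frames of a video."""
    values = [ssim(f, fw) for f, fw in zip(original_frames, marked_frames)]
    return sum(values) / len(values)
```

Two identical frames give a value of 1, and the value decreases as the marked frame departs from the original in mean, contrast, or structure.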

The second requirement is capacity (also called payload), which is the quantity of information embedded in the host media. For several applications, if the watermarking technique needs high invisibility, the signature capacity must be reduced to avoid modifying the host media too much.

The last requirement is robustness, which is the ability to extract the embedded signature even when the marked media undergoes several signal processing manipulations. These manipulations include non-malicious attacks, which are unintentional processing operations that may perturb the embedded signature, such as geometric operations (translation, rotation, scaling), noise addition, and filtering, all of which can be applied to image or video content. They also include malicious attacks, which try to damage or remove the embedded signature. Among these attacks, we distinguish compression attacks and collusion, which are specific to video content. Note that, depending on the application, not all watermarking techniques are robust against the same manipulations.

Referring to the robustness level, techniques can be classified into robust, fragile, and semi-fragile watermarking. Robust watermarking requires the watermark to resist noisy operations, as well as geometric or non-geometric manipulations. This class of watermarking is used in different applications such as copyright protection, broadcast monitoring, copy control, and fingerprinting. If the embedded signature is lost or altered as soon as the host content is modified, the watermarking is fragile. This class of watermarking is usually used for integrity verification and content authentication applications. The last type of watermarking is the semi-fragile class, which is robust against some attacks but fails after malicious manipulations. This class can be used for image authentication applications.

Bit error rate (BER) and normalized correlation (NC) are used to evaluate the robustness of a given watermarking technique. These two metrics compare the embedded signature with the one extracted after different attacks are applied to the marked content. The BER gives the proportion of bits received in error, and it is given by the following equation, where S is the original signature, S’ is the extracted one, \(\sum _{i} Ber_{i}\) is the number of bits in error, and \(\sum _{i} Btrans_{i}\) is the total number of transmitted bits:

$$\begin{aligned} BER(S,S^{'}) = \frac{\sum _{i} Ber_{i}}{\sum _{i} Btrans_{i}} \end{aligned}$$
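
As a minimal sketch, assuming both signatures are equal-length lists of bits (the function name is ours), the BER can be computed as:

```python
def bit_error_rate(original_bits, extracted_bits):
    """Fraction of extracted bits that differ from the embedded ones."""
    if len(original_bits) != len(extracted_bits):
        raise ValueError("signatures must have the same length")
    errors = sum(s != s_ext for s, s_ext in zip(original_bits, extracted_bits))
    return errors / len(original_bits)

embedded = [1, 0, 1, 1, 0, 0, 1, 0]
extracted = [1, 0, 0, 1, 0, 1, 1, 0]
ber = bit_error_rate(embedded, extracted)  # 2 wrong bits out of 8 -> 0.25
```

A BER of 0 means perfect extraction, while a BER near 0.5 means the extracted bits are no better than random guesses.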

The NC measures the similarity between two signatures. It is a value in the range [0,1], where a higher value indicates greater similarity. Given an original signature S and an extracted signature \(S^{'}\), each of size W \(\times \) H, the NC metric is calculated as follows:

$$\begin{aligned} NC(S,S^{'})= \frac{1}{WH} \sum _{i=0}^{W-1} \sum _{j=0}^{H-1} \delta (S_{i,j}, S^{'}_{i,j}) \end{aligned}$$


$$\begin{aligned} \delta (S_{i,j}, S^{'}_{i,j})= {\left\{ \begin{array}{ll} 1, &{} \text {if } S_{i,j}= S^{'}_{i,j} \\ 0, &{} \text {otherwise } \end{array}\right. } \end{aligned}$$
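
The NC as defined here simply counts matching positions. The sketch below assumes the two signatures are binary images of identical size stored as lists of rows; the function name is ours.

```python
def normalized_correlation(original, extracted):
    """Fraction of positions where the two binary signatures agree."""
    height, width = len(original), len(original[0])
    matches = sum(
        1
        for i in range(height)
        for j in range(width)
        if original[i][j] == extracted[i][j]
    )
    return matches / (width * height)

logo = [[1, 0], [0, 1]]
recovered_logo = [[1, 0], [1, 1]]
nc = normalized_correlation(logo, recovered_logo)  # 3 of 4 positions match
```

With this definition, NC equals one minus the BER computed on the same pair of binary signatures.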

Signature capacity, invisibility, and robustness constrain one another. Indeed, the most difficult challenge in image and video watermarking research is choosing an embedding target that minimizes the visual impact while achieving high robustness and an acceptable capacity in the same technique.

2.2 Robust Traditional Image and Video Watermarking Techniques Classification

The main criterion used for image and video watermarking techniques classification is the embedding domain which can be spatial, frequency or hybrid domain.

Spatial watermarking embeds the signature by directly modifying the luminance or the chrominance of the original image or video frame pixels. Spatial techniques are characterized by their low complexity and high invisibility. However, they lack robustness against several attacks. The main spatial domain techniques proposed for image and video watermarking include least significant bit (LSB) modification and spread spectrum modulation.

Concerning image content, LSB modification is the most widely used spatial-domain technique: the least significant bit of several selected pixels is modified to embed the signature [42]. Despite the simplicity of the LSB technique, its robustness is very poor. Spread spectrum techniques were proposed as an effective form of spatial watermarking in which the original video frames are scanned to obtain a one-dimensional signal, and the signature is modulated by spread spectrum technology and inserted in the video [60]. Other spatial video watermarking techniques were also proposed in [8, 48, 82] to improve robustness against attacks. However, the application of these techniques remains limited due to their poor robustness, especially with the development of video coding technology.
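
LSB embedding and its fragility can be illustrated with a short sketch. It assumes 8-bit pixel values in a flat list and, for simplicity, uses the first pixels as the selected ones (real schemes typically choose positions with a key); the function names are ours.

```python
def embed_lsb(pixels, bits):
    """Write each signature bit into the least significant bit of one pixel."""
    if len(bits) > len(pixels):
        raise ValueError("signature is longer than the number of pixels")
    marked = list(pixels)
    for index, bit in enumerate(bits):
        marked[index] = (marked[index] & ~1) | bit
    return marked

def extract_lsb(marked, length):
    """Read the signature back from the least significant bits."""
    return [pixel & 1 for pixel in marked[:length]]

pixels = [154, 27, 200, 91, 64, 255, 18, 133]
bits = [1, 0, 1, 1]
marked = embed_lsb(pixels, bits)      # each pixel changes by at most 1
attacked = [p & ~1 for p in marked]   # clearing the LSBs erases the signature
```

Each pixel changes by at most one gray level, which explains the high invisibility, while any operation that alters the lowest bit plane removes the signature, which explains the poor robustness.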

Frequency domain-based watermarking converts the original content (image or video frames) using a chosen transform and then modifies the obtained coefficients to embed the signature. After that, the coefficients are converted back to the spatial domain to obtain the marked content. The most used frequency domain transforms for image watermarking are the discrete cosine transform (DCT) [38, 54, 75, 80, 83], discrete Fourier transform (DFT) [23, 37, 64, 70], discrete wavelet transform (DWT) [5, 32, 46, 85], and singular value decomposition (SVD) [4, 81]. Every transform has its own advantages and disadvantages: some are robust against several attacks while they fail against others. For example, the spatial domain usually ensures robustness against translation and noise but does not resist compression and filtering, whereas DCT is robust against rotation, filtering, and JPEG compression but fails to resist noise. To resolve this problem, several image watermarking algorithms are based on the hybrid domain, which combines different transforms with the spatial domain to benefit from the advantages of each [2, 13, 76]. Note that these algorithms offer the best trade-off between robustness, capacity, and invisibility.
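
The transform, modify, and inverse-transform pipeline can be sketched with a one-dimensional DCT on a single row of pixels. This is a simplified, non-blind illustration under our own assumptions: image schemes normally use a 2-D block DCT, and the coefficient index, the embedding strength, and all function names are hypothetical.

```python
import math

def dct(signal):
    """Unnormalized 1-D DCT-II of a list of samples."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]

def idct(coefficients):
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(coefficients)
    return [
        coefficients[0] / n
        + 2 / n * sum(
            coefficients[k] * math.cos(math.pi / n * (i + 0.5) * k)
            for k in range(1, n)
        )
        for i in range(n)
    ]

def embed_bit(block, bit, index=3, strength=8.0):
    """Shift one mid-frequency coefficient up (bit 1) or down (bit 0)."""
    coefficients = dct(block)
    coefficients[index] += strength if bit else -strength
    return idct(coefficients)

def extract_bit(marked_block, original_block, index=3):
    """Non-blind extraction: compare marked and original coefficients."""
    difference = dct(marked_block)[index] - dct(original_block)[index]
    return 1 if difference > 0 else 0

row = [52, 55, 61, 66, 70, 61, 64, 73]
marked_row = embed_bit(row, 1)
```

A larger strength makes the bit easier to recover after distortion but changes the pixels more, which is the robustness and invisibility trade-off in miniature.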

Concerning video content, as for images, the common frequency domain transforms include DCT [18, 34, 49, 89], DWT [15, 30, 72, 79], and SVD, which is usually combined with another transform such as DWT [33, 77] or DCT [61]. As concluded for images, the robustness of video watermarking techniques depends on the characteristics of the chosen transform. To further improve performance, many watermarking algorithms use the hybrid domain, which combines the advantages of the different transforms. Accordingly, different techniques were proposed combining DCT and DWT [39, 73] or combining different transforms with the spatial domain, as suggested in [44].

Since video content can be considered as a set of frames, any image watermarking technique can be adopted for video watermarking by embedding the signature into the spatial redundancy of all or some selected frames. However, image-based techniques cannot resist video-specific attacks. In fact, video is also defined by temporal information, which makes its processing more sensitive, and the temporal redundancy in a video gives hackers more chances to estimate signatures by using malicious attacks such as collusion. Collusion and frame-based attacks such as compression, frame dropping, and frame swapping should be considered by researchers when developing watermarking techniques for video. To resist these attacks, different techniques based on temporal information, such as mosaic [45], multi-sprites [11], and Krawtchouk moments [10], have been proposed, and they have shown good robustness against malicious attacks, especially against the collusion attack.

As video data is nowadays frequently used and transmitted on the Internet, compression is usually applied to reduce video size. However, watermarking techniques that operate on the uncompressed video must decode the video during the signature embedding and detection stages, and the subsequent re-encoding can destroy the signature and degrade the visual quality. To resolve this problem, a new class of video watermarking algorithms has emerged in which the compressed domain is used. These algorithms embed signatures into compressed videos and combine the embedding stage with the corresponding video coding standards, which include MPEG [21, 22, 90], H.264 [27, 98], and H.265 [24, 55, 71]. Compressed domain-based watermarking is robust against several attacks such as filtering, noise, and compression.

In summary, the classification of the traditional robust image and video watermarking techniques is illustrated in Fig. 4.

Fig. 4 Classification of the traditional robust image and video watermarking techniques

3 Basic Concepts of Deep Learning-Based Watermarking

With the success of deep learning in computer vision and image processing domains, it has been adopted for various tasks. Recently, deep learning models have attracted the attention of researchers in data hiding techniques including steganography [9, 25] and watermarking.

3.1 General Framework of Deep Learning-Based Watermarking Schemes

Deep learning-based watermarking usually uses an encoder–decoder structure based on convolutional neural networks (CNNs), which is trained to embed the signature in a robust and invisible way. It is more efficient than traditional watermarking because it can be retrained to resist several attacks. In addition, it does not need an expert to design the embedding method. Finally, the black-box nature of deep learning models helps improve security.

The deep learning-based watermarking scheme consists of three main stages, as shown in Fig. 5. The first stage is the encoder, which embeds the signature in the original content. The second stage is attack simulation, and finally, the signature is extracted by the decoder network. Thanks to the iterative learning process, the embedding becomes more robust against the attacks applied during the second stage, and the extraction network improves the integrity of the extracted signature. The main advantage of deep learning-based watermarking over traditional watermarking is that it can be easily retrained for various applications and different attacks instead of being designed from scratch.

Fig. 5 Encoder–decoder architecture stages for digital watermarking

An image or video watermarking scheme based on deep learning works as follows:

  1. Training the encoder network to embed input messages into the original content, where the main goal is to minimize an objective function. This function measures both the difference between the original content and the marked content and the difference between the embedded and extracted signatures.

  2. Applying different attacks to the marked content through distortion layers. These attacks can include different forms of manipulation, such as cropping and compression.

  3. Extracting the embedded message from the distorted content using the decoder network.
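
One common way to write the objective function mentioned in the first step is sketched below. The notation is ours and not taken from a specific surveyed paper: \(\mathrm {Enc}_{\theta }\) and \(\mathrm {Dec}_{\phi }\) denote the encoder and decoder networks, A is the attack applied by the distortion layers, and \(\lambda _{1}\) and \(\lambda _{2}\) are assumed weights that balance invisibility against robustness.

$$\begin{aligned} \mathcal {L}(\theta ,\phi ) = \lambda _{1} \Vert f - f_{w}\Vert _{2}^{2} + \lambda _{2} \Vert S - S^{'}\Vert _{2}^{2}, \quad f_{w} = \mathrm {Enc}_{\theta }(f,S), \quad S^{'} = \mathrm {Dec}_{\phi }(A(f_{w})) \end{aligned}$$

The first term penalizes visible changes to the content and the second penalizes extraction errors after the attack, so training both networks on this sum drives the embedding toward patterns that survive A.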

3.2 Neural Network Architectures Used in Watermarking

Deep learning frameworks utilize automatic learning to capture hierarchical information directly from training data, eliminating the need for manual feature representations. Specifically, a deep network takes raw input data, such as an image or audio signal, and performs a mapping operation. Due to their impressive capability to imitate human brain learning abilities and engage in more natural interactions, deep learning techniques have gained widespread usage in data hiding and image processing applications.

Two deep learning models are widely used in watermarking techniques: convolutional neural network (CNN) and generative adversarial network (GAN).

CNNs are well suited for different applications such as classification and recognition, thanks to their efficiency in data representation with a limited number of parameters [50]. The CNN is a specialized multilayer perceptron primarily developed for extracting and recognizing two-dimensional image details. The CNN architecture typically consists of multiple layers, including an input layer, convolutional layers, pooling layers, and an output layer, as shown in Fig. 6. The CNN starts by taking an input image and subjecting it to a series of convolution and subsampling operations. Each convolution layer comprises a collection of filter matrices, which are convolved with the preceding image matrix to extract significant features referred to as output channel maps. Subsequently, pooling layers are employed to decrease the dimensions of the input map while preserving crucial information. Max pooling, a subsampling technique, selects the maximum value within each block. Nonlinearity is introduced into the network through activation functions like the rectified linear unit (ReLU), which sets negative values to zero. To mitigate overfitting and speed up learning, batch normalization can be employed during network training.
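
The three basic operations described above (convolution, ReLU, and max pooling) can be sketched in plain Python for a single channel and a single filter. This is an illustrative toy, not an efficient implementation; the function names are ours, and deep learning frameworks provide optimized, multi-channel versions of each operation.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution as used in CNNs (no padding, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(feature_map):
    """Set negative activations to zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; incomplete trailing blocks are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]

image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
kernel = [[1, 0], [0, -1]]  # responds to diagonal intensity differences
features = max_pool(relu(conv2d(image, kernel)))
```

A 5 × 5 input and a 2 × 2 filter give a 4 × 4 feature map, which 2 × 2 pooling reduces to 2 × 2, showing how the spatial dimensions shrink from layer to layer.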

Fig. 6 CNN architecture

A GAN is a type of neural network widely employed in unsupervised learning. It consists of two neural network models that engage in a competitive process, enabling them to examine, grasp, and replicate the diverse patterns present in a given dataset. In fact, a GAN is composed of two models: a generative model and a discriminative model. It follows the same principle as the encoder–decoder described in Fig. 5, with the addition of a discriminator network that classifies the mixture of encoded and unaltered images given to it (Fig. 7). The use of such discriminative networks can greatly improve data imperceptibility.
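
The competition between the two models is usually expressed as a minimax objective. As a hedged sketch in our own notation, adapting the standard GAN objective to the watermarking setting, let G be the encoder acting as the generator, Dis the discriminator, f an unaltered image, and S the signature:

$$\begin{aligned} \min _{G}\max _{Dis}\; \mathbb {E}_{f}\left[ \log Dis(f)\right] + \mathbb {E}_{f,S}\left[ \log \left( 1 - Dis(G(f,S))\right) \right] \end{aligned}$$

The discriminator is rewarded for telling unaltered images from encoded ones, while the encoder is rewarded for producing encoded images that the discriminator cannot distinguish from unaltered ones, which is what pushes the embedding toward imperceptibility. Individual schemes may use different adversarial losses.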

Fig. 7 GAN architecture

3.3 Examples of Datasets for Watermarking

To assess the performance of a deep learning-based watermarking scheme, different datasets have been used in the literature. Among these datasets, we mention:

  • ImageNet: ImageNet is a widely used dataset in computer vision research, consisting of millions of labeled images across thousands of categories. While not specifically designed for watermarking, it can be used to evaluate the effectiveness of watermarking techniques on various types of images.

  • MS COCO (Microsoft Common Objects in Context): MS COCO is another popular dataset used for object detection and image segmentation tasks. It contains a large collection of images with diverse content, making it suitable for watermarking research.

  • BOSSbase (BOWS-2): BOSSbase is a benchmark dataset for digital image watermarking. It contains 10,000 grayscale images with a resolution of 512x512 pixels. The dataset includes both the original images and the corresponding watermarked versions, making it suitable for evaluating the robustness and imperceptibility of watermarking algorithms.

  • UCF101: UCF101 is a dataset commonly used for action recognition in videos. It consists of 13,320 videos covering 101 action categories. While primarily used for action recognition, it can be employed to evaluate video watermarking techniques on action-based content.

  • Kinetics: Kinetics is a large-scale video dataset commonly used for action recognition tasks. It consists of approximately 650,000 video clips covering 700 action categories. The dataset is diverse and includes a wide range of human actions captured from YouTube videos. While the Kinetics dataset is not designed specifically for watermarking research, it can still be useful for evaluating certain aspects of watermarking techniques on action-based video content.

4 Deep Learning-Based Image Watermarking Review

While the current research on deep learning-based watermarking predominantly revolves around image watermarking, other forms of media are still in an early stage of development. Only a limited number of works have been proposed for text [1] and 3D images [92]. These approaches offer improved efficiency compared to traditional techniques by leveraging their ability to learn complex insertion patterns that are resilient against various attacks. This robustness is obtained because deep learning networks can be easily retrained to become robust to different types of attacks. Moreover, they can target capacity (payload) or imperceptibility optimization without developing new algorithms for each different application. Deep learning models are characterized by their high nonlinearity, which makes it very difficult for an adversary to retrieve the embedded signature.

4.1 Classification of Deep Learning-Based Image Watermarking Schemes

Current deep learning-based image watermarking techniques can be categorized into two classes based on the chosen network architecture. The first class uses the encoder–decoder framework including CNNs, where we distinguish techniques based on a CNN encoder–decoder (Fig. 5) from those based on convolutional auto-encoders, which are a special case of the encoder–decoder used in unsupervised learning scenarios.

Two traditional convolutional auto-encoders for watermark embedding and extraction were proposed in [41]. These auto-encoder CNN models achieve high invisibility of the embedded signature. Moreover, the watermarking proposed in [41] proved efficient in terms of robustness and outperformed the traditional watermarking techniques. Another convolutional auto-encoder-based robust and blind watermarking technique was proposed in [63]. This approach is decomposed into three steps: embedding, attack simulation, and updating. In the second step, the CNN simulates the various attacks, while in the updating step, the loss function is minimized by updating the model’s weights.

In [78], the authors present a method of watermarking digital images using CNNs. First, an encoder network is used to extract latent features from the cover and secret images. These features are then concatenated to create a marked image. On the receiving end, a CNN is used to retrieve the secret marked image after removing noise variations from the received image using a denoising auto-encoder network.

Ahmadi et al. [3] present a new approach called ReDMark, which uses two fully convolutional neural networks (FCNNs) for embedding and extraction. It contains a differentiable attack layer to simulate different distortions. This technique improves robustness against attacks and improves the trade-off between robustness and imperceptibility. Zhong et al. [99] propose a CNN-based watermarking technique that is robust and blind and can be used for several applications. This technique generalizes the watermarking process by training a deep neural network to learn the general rules of watermark embedding and extraction. It outperforms the two auto-encoder CNN methods proposed in [41, 63] and achieves greater robustness. Another watermarking model developed in [47] uses a simple CNN for both embedding and extraction. It contains an image pre-processing network that can adapt images of any resolution for the watermarking process, a watermark pre-processing network, and a strength scaling factor to control the trade-off between robustness and imperceptibility.

Luo et al. [57] improve the CNN-based encoder–decoder framework by adopting trained CNNs for attack simulation instead of using a differentiable attack layer. The addition of adversarial components to model training can improve the robustness of the embedded mark. In fact, in [57], the distortions are generated via adversarial training by a trained CNN.

In [68], an optimized deep fusion convolutional neural network (FCNN)-based digital color image watermarking scheme was proposed for copyright protection. It suggests a deep fusion CNN that uses an optimization method as its basis. The octave convolutional module added by the embedding network reduces spatial redundancy and increases the receptive field. The ECO method can help choose a suitable strength factor with great exploration capabilities.

The second class of deep learning-based image watermarking is based on generative adversarial networks (GANs) [28]. Several variants of the GAN exist, including Wasserstein GANs (WGANs) and CycleGANs, which are used for image watermarking and provide good results in terms of invisibility and robustness. HiDDeN [100] is the first scheme that uses an adversarial discriminator to improve the performance of the watermarking process. It is composed of an encoder network that is trained to embed an encoded bit string, a decoder network that tries to extract the information from the encoded image, and an adversary network that predicts whether the image was encoded or not.

ROMark [

Table 1 Comparison of the existing deep learning-based image watermarking techniques