1 Introduction

Digital image/video coding has developed rapidly with the digitization of information since the late 1950s, as the size of raw digitized image and video data grows dramatically and exceeds the capacity of storage and transmission. During the early stages of image coding, removing spatial statistical redundancy was the main means of image compression, such as Huffman coding [1] and run-length coding [2]. The concept of transform coding, which transforms the spatial domain into the frequency domain for compression, was first proposed in the late 1960s, including the Fourier transform [3] and Hadamard transform [4]. Later, the discrete cosine transform (DCT) was designed for image coding by Ahmed et al. in 1974 [5]. In the case of video, there is significant temporal redundancy in addition to spatial redundancy, which can be reduced by applying temporal prediction. Several early prediction-based coding techniques were introduced during the 1970s, including differential pulse-code modulation (DPCM) [6], frame difference coding [7], and block-based motion prediction [8]. A prototype of a hybrid prediction/transform coding scheme [9] was first proposed in 1979 by Netravali and Stuller, who combined motion compensation with transform coding techniques; such schemes are commonly referred to as “the first generation” of coding. An overview of the historical development of the first-generation methods is provided in [10].

After several decades of development, hybrid prediction/transform coding methods have achieved great success. Various coding standards have been developed and are widely used in a variety of applications, such as MPEG-1/2/4 (Moving Picture Experts Group), H.261/2/3, and H.264/AVC (Advanced Video Coding) [11], as well as AVS (Audio and Video Coding Standard in China) [12–15], H.265/HEVC (High Efficiency Video Coding) [16], and H.266/VVC (Versatile Video Coding) [17]. In [11, 16–26], the traditional hybrid coding methods have been well reviewed, from historical pulse code modulation (PCM) and DPCM coding to HEVC, three-dimensional video (3DV) coding, and VVC.

With the huge number of mobile devices, surveillance cameras, and other video capture devices, the volume of video data is increasing significantly. In the coming era of big data, image and video processing will require more efficient and effective coding techniques. Nevertheless, researchers in this field have also acknowledged the difficulty of further improving performance under the traditional hybrid coding framework. One reason for this limitation is that traditional coding methods consider only the signal properties of images and videos, and the room left for improvement is increasingly squeezed under the constraint of objective quality measures, e.g. the peak signal-to-noise ratio (PSNR). As such, many novel coding methods that incorporate the properties of the human visual system (HVS), referred to as the second-generation coding methods [27–30], have demonstrated a higher compression ratio than traditional coding methods while maintaining comparable subjective image quality. Compared to the first-generation coding methods, these methods depend more on structural, object-related models than on the source signal. From Musmann’s viewpoint [31], model-based coding (MBC) comprises the first-generation and second-generation methods, which are based on signal source or structural object-related models. MBC has attracted considerable research interest, the field has advanced greatly, and some exciting results have been achieved. For example, in [32], a background picture model-based surveillance video coding method achieves at least twice the compression ratio of the AVC high profile on surveillance videos. Moreover, other model-based coding methods show great potential and achieve clear improvements over the traditional hybrid coding methods, such as geometric partition video coding [33] and segmentation-based coding [34]. Some MBC methods have also been introduced into various coding standards, such as MPEG-4/7, AVS2, HEVC, and VVC. The developments in MBC have been well reviewed in [35–42].

Although MBC aims to improve coding efficiency, many challenging problems still limit the effectiveness of the coding process, such as manually designed coding paradigms based on expert knowledge. During the last few years, neural networks, such as convolutional neural networks (CNNs), have demonstrated considerable potential in a variety of fields, including image and video understanding, processing, and compression. In terms of the compression task, neural networks perform transform coding by mapping pixel data into quantized latent representations and then converting them back into pixels. Such a nonlinear transform holds the potential to map pixels to a more compact latent representation than the transforms of preceding codecs. Moreover, the parameters of neural networks can be trained on massive image and video samples, which helps the model reduce its reliance on manually designed modules. Considering these excellent characteristics, learning-based coding (LBC) has been recognized as a promising solution for image and video coding.
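To make this mechanism concrete, the following minimal PyTorch sketch (illustrative only, not any specific published codec; all module and parameter names are ours) shows the basic pattern: an analysis transform maps pixels to a latent, quantization is approximated by uniform noise during training, and a synthesis transform maps the quantized latent back to pixels under a rate-distortion objective.

    import torch
    import torch.nn as nn

    class TinyLearnedCodec(nn.Module):
        """Minimal nonlinear transform coding sketch: analysis, quantization, synthesis."""
        def __init__(self, channels=64):
            super().__init__()
            self.encoder = nn.Sequential(                      # analysis transform (pixels -> latent)
                nn.Conv2d(3, channels, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(channels, channels, 5, stride=2, padding=2),
            )
            self.decoder = nn.Sequential(                      # synthesis transform (latent -> pixels)
                nn.ConvTranspose2d(channels, channels, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
                nn.ConvTranspose2d(channels, 3, 5, stride=2, padding=2, output_padding=1),
            )

        def forward(self, x):
            y = self.encoder(x)
            # Additive uniform noise stands in for rounding during training; real rounding at test time.
            y_hat = y + torch.empty_like(y).uniform_(-0.5, 0.5) if self.training else torch.round(y)
            return self.decoder(y_hat), y_hat

    def rate_distortion_loss(x, x_hat, y_hat, lam=0.01):
        """Distortion (MSE) plus a crude rate proxy; real learned codecs use a trained entropy model."""
        distortion = torch.mean((x - x_hat) ** 2)
        rate_proxy = torch.mean(torch.abs(y_hat))              # placeholder for -log p(y_hat)
        return distortion + lam * rate_proxy

    model = TinyLearnedCodec().train()
    x = torch.rand(1, 3, 64, 64)                               # dummy image batch
    x_hat, y_hat = model(x)
    rate_distortion_loss(x, x_hat, y_hat).backward()

In practice, the noise-based quantization proxy and the rate term are replaced by learned entropy models, which is what allows the transform, quantizer, and entropy model to be optimized jointly from data.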

In this paper, we present an overview of intelligent video coding (IVC) development from MBC to LBC, two families of technologies that encode videos by leveraging knowledge in different manners. The technical roadmap of IVC methods is summarized in Fig. 1. The similarity between MBC and LBC is that similar components, such as transform, quantization, and entropy coding, are adopted to construct the framework, exploit the correlation of textural content, and remove redundancy. The difference lies in that the former relies on manually designed modules, while the latter relies on data-driven strategies and machine-learned components. The rest of the paper is organized as follows. In Sect. 2, a brief introduction to the history of MBC is provided. Section 3 provides an overview of recent advancements in learning-based approaches for visual signal compression, including learned image compression and learning-based video coding. Section 4 introduces our previous attempts and understanding of IVC. In Sect. 5, we discuss the future directions of IVC, specifically from the perspectives of standardization potential, data security, and generalization. Section 6 concludes this paper.

Figure 1

The technical roadmap of intelligent video coding methods including model-based and learning-based compression algorithms

2 Model-based coding

MBC focuses on modeling and coding the structural visual information in images and videos. The history of MBC can be traced back to the 1950s [43]. In [43], Schreiber et al. proposed a Synthetic Highs coding scheme, in which the image content is divided into textures and edges that are coded by different approaches, e.g. statistical coding methods for textures and visual model-based coding methods for edges; this scheme was the predecessor of current HVS model-based perceptual coding methods. In [38], Pearson clarified the term “model” in MBC, explaining it as object-related models developed from the source model in signal processing, as shown in Fig. 2. A video sequence containing one or more moving objects is analyzed to yield information about the size, location, and motion of the objects, which is employed to synthesize a model of each object as animation data. The animation data are coded and transmitted to the decoder. Moreover, the residual pixel data, comprising the difference between the original video sequence and the sequence derived from the animated model, are also transmitted to the decoder. The decoder adopts the animation data to synthesize the model, which is subsequently combined with the residual pixel data to reconstruct the image sequence. From Musmann’s viewpoint [31], MBC includes pixel MBC, block motion MBC, and object MBC, i.e. the first-generation and second-generation methods. In this paper, we follow Musmann’s viewpoint and provide an MBC classification to present the historical development of the model, from the signal source to the object and to the content understanding of the objects, as summarized in Table 1. From Table 1, it can be observed that MBC has evolved from statistical pixel and block models to geometric partitions and structural segmentation, and from content-aware objects to the understanding of content, including knowledge, semantics, and models of the HVS. Moreover, many coding standards based on MBC have been developed, such as MPEG-4/7. In this section, we will give a brief introduction to the methods and standards based on MBC.
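The encode/decode flow of Fig. 2 can be expressed as a conceptual sketch in plain Python. This is a toy stand-in: here the “object model” is simply a global translation of the previous frame, and the analysis/synthesis functions are placeholders for whatever object model a particular MBC scheme actually employs.

    import numpy as np

    def estimate_global_motion(ref, cur, search=4):
        """Toy 'analysis': find the integer (dy, dx) shift of ref that best matches cur."""
        best, best_err = (0, 0), np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                err = np.abs(np.roll(ref, (dy, dx), axis=(0, 1)) - cur).sum()
                if err < best_err:
                    best, best_err = (dy, dx), err
        return best

    def synthesize(ref, motion):
        """Toy 'synthesis': animate the reference frame with the transmitted motion."""
        return np.roll(ref, motion, axis=(0, 1))

    def mbc_encode(frames):
        """Transmit per-frame animation data (here, a global motion vector) plus a pixel residual."""
        stream = [((0, 0), frames[0].copy())]                  # first frame: no motion, full residual
        ref = frames[0]
        for cur in frames[1:]:
            motion = estimate_global_motion(ref, cur)
            residual = cur - synthesize(ref, motion)
            stream.append((motion, residual))
            ref = cur                                          # residuals are lossless here, so encoder
        return stream                                          # and decoder references stay identical

    def mbc_decode(stream):
        recon, ref = [], np.zeros_like(stream[0][1])
        for motion, residual in stream:
            cur = synthesize(ref, motion) + residual
            recon.append(cur)
            ref = cur
        return recon

    frames = [np.random.rand(32, 32) for _ in range(3)]
    decoded = mbc_decode(mbc_encode(frames))
    assert all(np.allclose(a, b) for a, b in zip(frames, decoded))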

Figure 2

Principle of model-based coding (MBC) from [38]

Table 1 Classification of MBC approaches

2.1 Model-based coding methods

In the historical evolution of MBC, pixel model-based video coding, e.g. PCM [44], was used in early applications with limited memory and computation resources, and it was later replaced by block-based motion model coding [45–48]. However, the rectangular partition of block-based coding is rigid and inefficient for modeling irregular visual signals. As variations of the block-based motion model, more flexible geometric partitions were proposed for motion compensation, including deformable blocks [49], meshes [50, 51], and triangles [52], and they were also studied for H.264/AVC [53, 54], HEVC [55], and VVC [33]. Although geometric partitions are flexible, they are still constrained by their fixed patterns. Therefore, a more flexible and finer-grained partition can be derived from the input signal itself, such as contours and segmentation, rather than from pre-defined geometric patterns. Graham proposed a two-dimensional contour coding scheme in [56], which can be viewed as a predecessor of segmentation coding, and Biggar [57] first formally utilized a segmented image coder with better performance than the transform coder. Since then, a variety of studies on segmentation-based coding have been performed, including segmentation-based coding schemes [34, 58–62] and segmentation methods [63, 64].
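As a concrete illustration of the block-based motion model that these more flexible partitions generalize, the toy full-search block-matching routine below estimates one integer motion vector per block. It is an illustrative sketch, not the search used by any particular standard.

    import numpy as np

    def block_motion_search(ref, cur, block=8, search=4):
        """Full-search block matching: one integer motion vector per block (toy sketch)."""
        h, w = cur.shape
        mvs = {}
        for by in range(0, h - block + 1, block):
            for bx in range(0, w - block + 1, block):
                target = cur[by:by + block, bx:bx + block]
                best, best_sad = (0, 0), np.inf
                for dy in range(-search, search + 1):
                    for dx in range(-search, search + 1):
                        y, x = by + dy, bx + dx
                        if 0 <= y <= h - block and 0 <= x <= w - block:
                            sad = np.abs(ref[y:y + block, x:x + block] - target).sum()
                            if sad < best_sad:
                                best, best_sad = (dy, dx), sad
                mvs[(by, bx)] = best
        return mvs

    ref = np.random.rand(32, 32)
    cur = np.roll(ref, (2, -1), axis=(0, 1))      # current frame is a shifted copy of the reference
    # The block at (8, 8) in cur came from (6, 9) in ref, so the printed vector is (-2, 1).
    print(block_motion_search(ref, cur)[(8, 8)])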

The MBC methods mentioned above explore flexible and fine-grained partitions without considering knowledge of the objects or scenes in the world. Since different classes of objects or scenes exhibit different kinds of appearance and motion patterns, modeling such patterns as knowledge and incorporating them into coding can further improve the compression ratio for particular image classes. The higher performance also comes at a cost: modeling and incorporating knowledge requires considerable manpower for manual design, and the knowledge of one object or scene cannot always be transferred to others, which limits applicability in unconstrained scenarios. In the following, we review the development of MBC methods that use knowledge. Accompanying the emergence of segmentation-based coding, object-based coding arose as a further extension of segmentation coding, where a segment may represent one identified object [65–67]. In [65], three parameter sets were used to define the motion, shape, and color of an object, from which an image can be reconstructed by model-based image synthesis. In [66], a generic object-based coding algorithm was proposed, relying on the definition of a spatial and temporal segmentation of the sequences. Moreover, object-based coding has been further applied to special classes of video, such as surveillance video and 3D video [68, 69], and to motion compensation in codecs [70]. Based on the knowledge of known objects, knowledge-based and semantic-based coding methods were developed, such as parameterized modeling of facial animation [71–76]. Modeling the scene or image content directly is difficult and restricted in wild scenarios; in contrast, perceptual coding methods exploit the properties of the HVS [77–126].

In the context of learned image compression, generative approaches have produced integrated and well-optimized GAN-based compression frameworks. Inspired by the advances in GAN-based view synthesis, light field (LF) image compression can achieve significant coding gains by generating the missing views from the sampled context views in the LF [119]. In addition, Gregor et al. [148] introduced a homogeneous deep generative model, DRAW, into their coding framework. Different from previous works, Gregor et al. aimed at conceptual compression by generating as much of the image semantic information as possible [128]. Agustsson et al. [129] built an extreme image compression system using unconditional and conditional GANs, outperforming all other codecs under low bit-rate conditions. Agustsson et al. [149] further proposed using the learned perceptual image patch similarity (LPIPS) [150] as the metric for generator training, which further improves the subjective quality of reconstructed images.

3.2 Learning-based video coding

In this section, we review the development of learning-based video coding. First, we introduce pure learning-based video coding methods. Second, a combination of deep learning and the hybrid video coding framework is presented. Third, we compare these two coding architectures.

Similar to learning-based image coding frameworks, many novel video coding frameworks are built on neural network models to reduce temporal redundancy. As a natural extension of learning-based image coding methods, 3D auto-encoders have been proposed to encode quantized spatiotemporal features with an embedded temporal conditional entropy model. Chen et al. [130] proposed DeepCoder, which combines several CNN networks with a low-profile x264 encoder for video compression. Wu et al. [151] later applied an RNN-based video interpolation module and combined it with a residual coding module for inter-frame coding. Inspired by the prediction of future frames in generative models, the authors of [133] utilized a rate-distortion auto-encoder to directly exploit spatiotemporal redundancy in a group of pictures (GoP) with a temporal conditional entropy model. Lombardo et al. [134] followed the VAE-based image compression framework and encoded the latent representation according to the predictions of a sequential network. With the emergence of GANs, combining an auto-encoder with adversarial training has been regarded as a promising approach. Wang et al. [138] demonstrated a novel subject-agnostic face reenactment method for video conferencing, achieving an order of magnitude in bandwidth savings over the H.264 standard. With the advantage of adversarial training, GAN-based coding models reconstruct videos with pleasing perceptual quality at low bitrates, unlike VAE-based video coding methods, which tend to produce blurry reconstructions.
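The common structure behind many of these learned video codecs, a learned temporal prediction followed by learned residual coding, can be sketched as follows. This PyTorch sketch is purely illustrative: a single convolution stands in for the motion/flow network, quantization is plain rounding, and no entropy model is included.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyInterFrameCodec(nn.Module):
        """Sketch of learned inter coding: flow-based prediction plus residual auto-encoder."""
        def __init__(self, ch=32):
            super().__init__()
            self.flow_net = nn.Conv2d(6, 2, 3, padding=1)       # predicts a dense 2-channel flow
            self.res_enc = nn.Conv2d(3, ch, 5, stride=2, padding=2)
            self.res_dec = nn.ConvTranspose2d(ch, 3, 5, stride=2, padding=2, output_padding=1)

        def warp(self, ref, flow):
            n, _, h, w = ref.shape
            ys, xs = torch.meshgrid(torch.arange(h, dtype=ref.dtype),
                                    torch.arange(w, dtype=ref.dtype), indexing="ij")
            # Sampling positions = base grid + predicted displacement, normalized to [-1, 1]
            # as required by grid_sample (x coordinate first, then y).
            x_pos = (xs.unsqueeze(0) + flow[:, 0]) * 2 / (w - 1) - 1
            y_pos = (ys.unsqueeze(0) + flow[:, 1]) * 2 / (h - 1) - 1
            grid = torch.stack((x_pos, y_pos), dim=-1)
            return F.grid_sample(ref, grid, align_corners=True)

        def forward(self, ref, cur):
            flow = self.flow_net(torch.cat([ref, cur], dim=1))
            prediction = self.warp(ref, flow)                            # temporal prediction
            residual_latent = torch.round(self.res_enc(cur - prediction))  # quantized residual latent
            recon = prediction + self.res_dec(residual_latent)
            return recon, flow, residual_latent

    codec = TinyInterFrameCodec()
    ref, cur = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    recon, flow, latent = codec(ref, cur)
    print(recon.shape, flow.shape, latent.shape)

Published methods replace each placeholder with a learned, jointly optimized module (optical-flow networks, motion compression, and conditional entropy models), but the prediction-plus-residual decomposition is the shared backbone.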

Following hybrid video coding systems, recent studies have demonstrated the effectiveness of deep learning models in five main modules, i.e. intra-prediction, inter-prediction, quantization, entropy coding, and loop filtering. For intra-prediction, Cui et al. [154] proposed an intra-prediction convolutional neural network (IPCNN) to improve intra-prediction efficiency. Instead of using a CNN, Li et al. proposed a fully connected network-based intra-prediction approach.

3.3 Learning-based coding standards

To enable interoperability between devices manufactured and services provided by different companies, a series of standards targeting intelligent visual data coding have been investigated in the past several years. Several standardization organizations, including ISO/IEC (International Organization for Standardization/International Electrotechnical Commission) JPEG (Joint Photographic Experts Group) and MPEG, ITU-T (International Telecommunication Union Telecommunication Standardization Sector) VCEG (Video Coding Experts Group), JVET (Joint Video Experts Team), AVS, IEEE DCSC (Data Compression Standard Committee), MPAI (Moving Picture, Audio and Data Coding by Artificial Intelligence), and others, have been creating these standards with many contributions from academia and industry. While most of these visual coding standards have been successfully deployed in many applications, many challenges remain, especially in accommodating the large volume of visual data within limited storage and limited-bandwidth transmission links. Compression efficiency improvements are still needed, especially for emerging data representation formats, from 8K/HDR (high dynamic range) image/video to rich plenoptic formats.

To improve compression efficiency, machine learning technologies, such as deep neural networks, have shown great potential for many types of visual data. Thus, new standardization activities that exploit this potential are ongoing, some more mature than others, including learning-based image and video coding, learning-based point cloud coding, and learning-based light-field coding. These standardization efforts have attracted significant attention in the aforementioned standardization organizations. The IEEE 1857.11 working group and the JPEG AI group have been preparing neural image coding standards in recent years. The MPAI end-to-end video project and enhanced video coding project are also exploring neural network-based video coding solutions. The JVET NNVC (neural network-based video coding) activity and the AVS intelligent coding ad-hoc group have released reference models that integrate neural networks into the conventional hybrid framework. All of the above-mentioned efforts are advancing neural network-based video coding for future use cases.

4 Our attempts at intelligent coding

LBC compresses signal data into a compact latent representation that carries non-interpretable knowledge. Moreover, such a mechanism is not sufficiently analysis-friendly to assist downstream machine analysis tasks. A novel LBC paradigm that incorporates more interpretable representations with powerful neural networks may achieve better coding performance, and the interpretable representations may also benefit machine analysis. In this section, we introduce our attempts at such a paradigm, including conceptual image coding, generative video coding, and cross-modal coding.

Inspired by the human visual system (HVS) [177], which perceives visual content by processing and integrating manifold information into abstract high-level concepts (e.g., structure, texture, and semantics) that form the basis of subsequent cognitive processes [178], conceptual compression has been an active research area in recent years [128, 179–182], following the insights of Marr [183] and Guo et al. [184]. Conceptual coding aims to encode images into compact, high-level interpretable representations for high visual quality reconstruction, allowing a more efficient and analysis-friendly compression architecture. At present, multi-layer decoded representations are integrated to synthesize target images in a deep generative fashion. Herein, the main challenges for conceptual coding include how to achieve efficient representation disentanglement and how to devise effective generative models for high visual-quality reconstruction. Gregor et al. [128] introduced the convolutional deep recurrent attentive writer (DRAW) [148], which extends the VAE [147] by using RNNs as the encoder and decoder, to transform an image into a series of increasingly detailed representations. However, the interpretability of the learned image representations is still insufficient, and the models in [128] only worked on small-resolution datasets. Neural video compression suffers from similar constraints. Typical video compression methods [134] share the same VAE architecture as image compression methods [128] and transform the original sequence into a lower-dimensional representation. However, the interpretability of the learned video representations still lacks exploration. Therefore, building on the conventional neural network-based image/video compression in Sect. 3, in this section we introduce interpretable representations, such as structure information or high-level semantic information, into the compression process to enhance the interpretability of the representations for both images and videos.

4.1 Conceptual image coding

We propose encoding images into two complementary visual components [179, 180] as a milestone for conceptual coding of images. The structure and texture representations are disentangled, as demonstrated in Fig. 4 (b), where a typical texture modeling process is illustrated in Fig. 4 (c) and the typical image synthesis process is depicted in Fig. 4 (d). A stylized illustration of the disentangled structure and texture representations in their domain spaces, proposed in our earlier study, is shown in Fig. 5. In the dual-layered model of [179, 180], the structure layer is represented by edge maps, and the texture layer is extracted with a variational auto-encoder in the form of low-dimensional latent variables. To reconstruct the original image from the compressed layered features, our further attempt integrates the texture layer and the structure layer with adaptive instance normalization in a hierarchical fusion GAN [180]. Extensive experiments have demonstrated the benefits of the proposed conceptual compression framework in [180], including extremely low bitrates (<0.1 bpp), high visual reconstruction quality, and support for content manipulation and analysis tasks. Nevertheless, it is very challenging to model the complex textures of a whole image using only a small set of variables. In addition, how to build effective entropy models for visual representations had not been explored for joint rate-distortion optimization. In our recent study [181], semantic prior modeling for conceptual coding was proposed, exploring effective texture representation modeling and compression at semantic granularity for high-quality image synthesis and promising coding efficiency. Moreover, we developed a cross-channel entropy model in [181] for joint texture representation compression and reconstruction optimization. Structural modeling was further introduced in our work [185], which proposed a consistency-contrast learning method that optimizes the texture representation space by aligning it with the source pixel space, resulting in higher compression performance. Our proposed models in [181, 185] achieve superior visual reconstruction quality at ultra-low bitrates (<0.1 bpp) compared to the state-of-the-art VVC in specific application domains.
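The fusion step of the dual-layered model can be illustrated with a minimal adaptive instance normalization (AdaIN) sketch, in which a low-dimensional texture code modulates feature maps computed from the structure layer. All names and sizes here are illustrative; the actual decoder in [180] is a full hierarchical fusion GAN.

    import torch
    import torch.nn as nn

    class AdaINFusion(nn.Module):
        """Inject a global texture code into structure-derived feature maps via AdaIN."""
        def __init__(self, feat_ch=64, tex_dim=8):
            super().__init__()
            self.struct_conv = nn.Conv2d(1, feat_ch, 3, padding=1)   # edge map -> features
            self.to_scale = nn.Linear(tex_dim, feat_ch)
            self.to_shift = nn.Linear(tex_dim, feat_ch)
            self.to_rgb = nn.Conv2d(feat_ch, 3, 3, padding=1)

        def forward(self, edge_map, texture_code):
            feat = self.struct_conv(edge_map)
            mean = feat.mean(dim=(2, 3), keepdim=True)
            std = feat.std(dim=(2, 3), keepdim=True) + 1e-6
            normalized = (feat - mean) / std                          # instance normalization
            scale = self.to_scale(texture_code)[:, :, None, None]     # per-channel statistics
            shift = self.to_shift(texture_code)[:, :, None, None]     # predicted from the texture code
            return self.to_rgb(normalized * scale + shift)

    decoder = AdaINFusion()
    edge_map = torch.rand(1, 1, 64, 64)       # structure layer (e.g., an edge map)
    texture_code = torch.randn(1, 8)          # low-dimensional texture latent
    image = decoder(edge_map, texture_code)
    print(image.shape)                        # torch.Size([1, 3, 64, 64])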

Figure 5

Stylized illustration of the typical conceptual coding

Since conceptual coding methods pursue visually convincing reconstruction results with minimal bitrate consumption, the LPIPS metric [150] is usually selected as the quantitative perceptual distortion measure, in addition to user studies. In our previously established benchmark [186], this metric has been shown to correlate strongly with human visual perception rather than signal fidelity. For performance comparison, the rate-distortion performance in terms of LPIPS of VVC, a typical end-to-end learned image coding method [123] (E2E), and our proposed conceptual coding methods LCIC [180] and SPM [181] over the low bit-rate range on the FFHQ [187] and ADE20K [188] outdoor testing sets is displayed in Table 3. The results demonstrate that, at extremely low bitrates, conceptual coding methods are capable of achieving higher visual reconstruction quality in specific domains than signal-based compression methods. Moreover, LCIC is less effective on the more challenging content of ADE20K than on FFHQ, which consists of regular facial semantic regions. In contrast, SPM achieves remarkable improvements in reconstruction quality on challenging scenes with diverse semantic regions and textures, verifying the effectiveness of the proposed semantic prior modeling mechanism. Furthermore, in terms of LPIPS over the ADE20K outdoor testing set, the rate-distortion curves of VVC, SPM [181], and the most recent work CCL [185] are shown in Fig. 6. The comparison results verify the improvement in reconstruction quality brought by the proposed consistency-contrast learning method. Compared to previous works, the proposed conceptual image coding demonstrates superiority in efficient visual representation learning, high-efficiency image compression (<0.1 bpp), visual reconstruction quality, and intelligent visual applications (e.g., manipulation and analysis).

Figure 6

The rate-distortion curves of SPM [181], CCL [185] and VVC. A lower LPIPS indicates better quality

Table 3 The quantitative results of VVC, E2E [123], and our proposed conceptual coding methods LCIC [180] and SPM [181] on the FFHQ and ADE20K outdoor testing sets. LPIPS is selected as the distortion metric

4.2 Generative video coding

Due to the powerful capability of deep generative models, many approaches [134] map video sequences into latent representations and formulate the framework with generative networks to achieve low-bitrate compression. Based on image animation models such as FOMM [189], Konuko et al. [190] developed a generative compression framework for video conferencing. Wang et al. [138] also proposed a neural talking-head video synthesis model for video conferencing that adaptively extracts 3D keypoints from the input videos, achieving the same visual quality as H.264/AVC [191] with only one-tenth of the bandwidth. Nevertheless, designing a video compression framework targeting high visual quality under extreme compression ratios (e.g., 1000 times) remains an open problem.

Motivated by recent attempts at layered conceptual image compression, we made the first attempt to utilize disentangled visual representations for extreme human body video compression, DHVC [139]. On the encoder side, the input video sequence is disentangled into structure and texture representations for further efficient compression. A pre-trained structure encoder is adopted to estimate the human pose keypoints of each frame. Similar to motion vectors in traditional video codecs, the displacements of each keypoint coordinate are computed as a feature to represent the motion information between two frames. For bitrate saving, only the structure code of the first frame and the motion codes of subsequent frames are transmitted during encoding. On the other hand, a texture encoder extracts the first frame into a semantic-level texture code that represents the texture information of the input video sequence. To ensure texture consistency across all frames, we introduce contrastive learning [192] for the alignment of texture representations. On the decoder side, the structure codes are reconstructed iteratively while the generator restores the video from texture codes and structure codes. Finally, entropy estimation of texture codes is introduced to establish rate-distortion optimization together with contrastive learning for end-to-end training of the framework, promoting bitrate saving and better reconstruction.

As depicted in Fig. 7, the main structure information of the human body can be efficiently represented by human pose keypoints. A pre-trained pose estimator [193] is employed as the structure encoder \(E_{s}\) to extract the structure information of each frame as the compact structure code. The texture encoder \(E_{t}\) aims to extract image frames into texture representations. To better capture the texture details of each frame, we adopt the decomposed component encoding (DCE) module [194] for semantic-aware texture code embedding.

Figure 7

The proposed pipeline using disentangled visual representation for video compression

To ensure texture consistency across all frames in the same video, contrastive learning [192] is introduced for training the texture encoder \(E_{t}\). Instead of using augmentations to build positive samples, frames from the same video naturally serve as positive samples, while frames from different videos are regarded as negative samples. Moreover, the framework applies contrastive learning at the semantic level and computes the semantic-wise InfoNCE loss [195] with Eq. (1),

$$ \mathcal{L}_{cst} = -\sum_{i=1}^{L}{ \log} \frac{\exp(t_{i}\cdot{t_{i}^{+}}/\tau )}{\sum_{j=1}^{Q}{\exp(t_{i}\cdot{t_{ij}^{-}}/\tau )}} , $$
(1)

where \(t_{i}\), \(t_{i}^{+}\), \(t_{ij}^{-}\), \(\tau\), \(L\), and \(Q\) denote the semantic-wise texture part of an input frame, the corresponding part of another frame from the same video, the corresponding parts of frames from different videos, a temperature parameter, the number of semantic regions of the image, and the length of the negative set, respectively. This technique enables the encoder to exploit both the similarity of the positive pair \((t_{i}, t_{i}^{+})\) and the dissimilarity of the negative pairs \((t_{i}, t_{ij}^{-})\). Following MoCo [192], a queue is used to store the negative samples \(t_{ij}^{-}\) from previous input frames. In this way, the module conducts contrastive learning efficiently with small batch sizes.
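Eq. (1) can be computed as in the following PyTorch sketch. Tensor names follow the equation, the MoCo-style queue of negatives is simplified to a fixed tensor, and the codes are l2-normalized before the dot product (a common choice, not specified in Eq. (1)). Note that the denominator contains only the negative pairs, exactly as written in Eq. (1).

    import torch
    import torch.nn.functional as F

    def semantic_infonce(t, t_pos, t_neg, tau=0.07):
        """Semantic-wise InfoNCE loss of Eq. (1).

        t, t_pos: (L, D) texture codes of the L semantic regions in two frames of the same video.
        t_neg:    (L, Q, D) texture codes of the same regions from Q frames of other videos.
        """
        t, t_pos, t_neg = (F.normalize(v, dim=-1) for v in (t, t_pos, t_neg))
        pos_logit = (t * t_pos).sum(dim=-1) / tau                         # (L,)
        neg_logits = torch.einsum("ld,lqd->lq", t, t_neg) / tau           # (L, Q)
        # -log( exp(pos_i) / sum_j exp(neg_ij) ), summed over the L semantic regions.
        return (torch.logsumexp(neg_logits, dim=1) - pos_logit).sum()

    L_regions, Q, D = 8, 128, 256
    loss = semantic_infonce(torch.randn(L_regions, D),
                            torch.randn(L_regions, D),
                            torch.randn(L_regions, Q, D))
    print(loss.item())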

For compression comparisons, the average LPIPS and DISTS results on the Fashion and TaichiHD datasets are shown in Table 4. Notably, the bitrates of the compared methods are adjusted to be slightly higher than that of our method. Nevertheless, the proposed framework outperforms all other compression frameworks, achieving the lowest LPIPS and DISTS scores at ultra-low bitrates. Moreover, the quantitative results in Table 4 further validate that integrating contrastive learning facilitates better visual quality. In general, our method achieves superior visual quality compared to previous methods owing to its disentangled texture and structure representations, producing sharper results that retain more details, such as facial features and intricate backgrounds.

Table 4 Comparisons with state-of-the-art video compression methods. Lower scores represent better visual quality. “w/o c.” denotes the proposed model without the proposed contrastive learning techniques

4.3 Cross-modal coding

Conceptual compression frameworks encode images into representations, such as latent variables extracted from deep neural networks, which are not human-comprehensible. Human-comprehensible representations, such as text, sketches, semantic maps, and attributes, are significant for various applications, such as semantic monitoring and human-centered applications. Semantic monitoring aims to monitor semantic information, such as identification, human traffic, or car traffic, rather than the raw signal or latent variables. Human-centered applications aim to directly convey the human-comprehensible information of visual data to human users. Therefore, we proposed cross-modal compression (CMC) [197], which takes a step forward by transforming highly redundant visual data into a compact, human-comprehensible representation with ultra-high compression ratios.

We proposed a CMC framework, as illustrated in Fig. 8, which consists of four submodules: the CMC encoder, the CMC decoder, the compression domain encoder, and the compression domain decoder. The compression procedure consists of four steps. First, the CMC encoder compresses the raw signal into a compact and human-comprehensible representation. Second, the compression domain encoder encodes the representation into a bitstream losslessly. Third, the compression domain decoder reconstructs the representation from the bitstream losslessly. Finally, the CMC decoder reconstructs the signal from the representation with semantic consistency. The bitrate is optimized by finding a compact compression domain, while the distortion is optimized by preserving semantics in the CMC encoder and decoder.
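The four-step procedure can be written as a compact sketch. The CMC encoder and decoder below are trivial placeholders for a captioning model and a text-to-image generator, and zlib stands in for the lossless compression-domain coder; all names and the caption string are illustrative.

    import zlib

    def cmc_encoder(image):
        """Placeholder CMC encoder: maps raw pixels to a human-comprehensible description."""
        return "a person riding a bicycle on a city street"     # e.g., output of a captioning model

    def cmc_decoder(text):
        """Placeholder CMC decoder: a text-to-image generator producing a semantically consistent image."""
        return {"generated_from": text}

    def compress(image):
        text = cmc_encoder(image)                   # step 1: signal -> compact representation
        return zlib.compress(text.encode())         # step 2: lossless compression-domain encoding

    def decompress(bitstream):
        text = zlib.decompress(bitstream).decode()  # step 3: lossless compression-domain decoding
        return cmc_decoder(text)                    # step 4: semantically consistent reconstruction

    bitstream = compress(image="<raw pixels>")
    print(len(bitstream), decompress(bitstream))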

Figure 8

Illustration of the cross-modal compression (CMC) framework

Under such a framework, we further introduce a concrete paradigm. With recent advances in image captioning [198] and text-guided image generation [199], generating high-quality text from images and high-quality images from text has become feasible. Therefore, we built an efficient image-text-image CMC paradigm, in which images are compressed into the text domain, which is compact, common, and human-comprehensible. Specifically, a classical CNN-RNN model [198] is adopted as the CMC encoder to compress the image into text, where an image feature is extracted by a CNN and fed to an RNN to generate the text autoregressively. Huffman coding [1] can be used as the compression domain encoder/decoder to remove the statistical redundancy of the text losslessly. AttnGAN [199] is used as the CMC decoder to reconstruct images from the text owing to its promising performance on text-to-image generation. The effectiveness of CMC has been verified via various experiments on several datasets, and the model achieves encouraging reconstruction results with an ultra-high compression ratio (4,000–7,000 times), showing better compression performance than the widely used JPEG baseline [200].
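As a small illustration of the lossless compression-domain step, the sketch below builds a character-level Huffman code for a generated caption. The CNN-RNN captioner and the AttnGAN generator are assumed to be available separately and are not shown; the caption string is a made-up example, and only the encoding direction is sketched.

    import heapq
    from collections import Counter

    def huffman_code(text):
        """Character-level Huffman code table: symbol -> bitstring."""
        heap = [[w, i, [(ch, "")]] for i, (ch, w) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        if len(heap) == 1:                                   # degenerate one-symbol caption
            return {heap[0][2][0][0]: "0"}
        tie = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            lo[2] = [(ch, "0" + code) for ch, code in lo[2]]  # left branch prepends 0
            hi[2] = [(ch, "1" + code) for ch, code in hi[2]]  # right branch prepends 1
            heapq.heappush(heap, [lo[0] + hi[0], tie, lo[2] + hi[2]])
            tie += 1
        return dict(heap[0][2])

    caption = "a person riding a bicycle on a city street"    # hypothetical captioner output
    table = huffman_code(caption)
    bits = "".join(table[ch] for ch in caption)
    print(len(bits), "bits vs", 8 * len(caption), "bits uncompressed")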

5 Open discussion

Considering the rapid growth of intelligent video coding, it is expected that a more advanced and insightful model will be developed in the near future, further facilitating the coding and representation efficiency of visual signals. Nevertheless, the field of intelligent video coding poses many new research challenges. Below are a few evolving and significant challenges that need to be addressed.

Domain and profiling

There is considerable discussion in the video coding standards community regarding the definition of interoperability and conformance testing. To enable intelligent-video-coding-compliant terminals and systems to decode latent representations without ambiguity, it is necessary to standardize them by defining the appropriate rules and assigning them to syntax elements. At the system level, structural, semantic, and textural representations should be parsed correctly by compatible structural, semantic, or textural decoders. Meanwhile, at the intelligent model level, intelligent-video-coding-compliant networks should be able to understand and process the meanings of the latent representations. However, visualizing or analyzing bitstreams of highly compact latent representations poses a considerable challenge in assessing the semantic conformance of existing intelligent video codecs. As such, the introduction of profiles may contribute to defining unambiguous conformance procedures and ensuring interoperability for intelligent video coding. Video coding standards have used profiles and levels to define tool sets with a restricted level of complexity suitable for specific applications. Similarly, intelligent video coding requires different subsets of latent representations for different applications, and some specialized applications may also need restrictions or extensions of the latents. In this regard, how to support extensions and specialization in specific domains while ensuring unambiguous conformance validation is a critical issue that will require nontrivial effort.

Data security

In the context of intelligent video coding, latent representations derived from networks involving signal information can be used to reconstruct the entire video stream. Such representations, however, are not encrypted, and therefore pose the risk of sensitive information leakage. As such, trustworthy and robust coding network design plays a central role in real-world applications.

Representation interpretability

To enhance the supporting ability for downstream tasks using the compressed data, it is important to develop latent representations that are highly interpretable. By using such representations, it becomes possible to apply interactive coding techniques, which can enable a range of novel applications such as content editing and immersive interaction. This opens up new opportunities for compression-based approaches to provide versatile features and functionalities beyond traditional video compression methods.

Generalization ability

When standardized coding methods and technologies are ready for implementation and deployment, it becomes crucial to identify the path that intelligent video coding should follow to enter practical application domains while ensuring that such codecs can satisfy versatile requirements. For example, an intelligent video codec trained for outdoor scenes might not be an ideal choice for coding facial images, and it is not practical to employ multiple models for scene adaptation. Furthermore, active efforts to harmonize the intelligent video coding standard with other media data standards will facilitate and expedite its adoption in practical domains (e.g., short video on mobile devices and immersive media applications).

6 Conclusion

Intelligent video compression provides a comprehensive suite of techniques for compactly representing visual media with the capability of describing intrinsic semantics, and it has the potential to revolutionize current and future multimedia coding applications. In particular, such methods include latent codes describing the structure, semantics, or motion of the visual data, which facilitate efficient editing, analysis, reconstruction of, and access to the decoded data. In addition, the extracted latent codes can describe content preferences and support on-the-fly manipulation and transfer of customized content and styles. In this review, the development roadmap of intelligent video coding has been revisited, along with methodologies for describing the structure and semantics of video data. Furthermore, the paper has presented three potential research directions, namely conceptual coding, generative coding, and cross-modal coding, that could provide promising solutions for future visual media coding utilities and application scenarios. Finally, a few evolving and significant challenges regarding the deployment of intelligent video coding in practical real-world scenarios have been discussed.