Abstract
Modern 3D human pose estimation techniques rely on deep networks, which require large amounts of training data. While weakly-supervised methods reduce the required supervision by exploiting 2D poses or multi-view imagery without annotations, they still need a sufficiently large set of samples with 3D annotations for learning to succeed.
In this paper, we propose to overcome this problem by learning a geometry-aware body representation from multi-view images without annotations. To this end, we use an encoder-decoder that predicts an image from one viewpoint given an image from another viewpoint. Because this representation encodes 3D geometry, using it in a semi-supervised setting makes it easier to learn a mapping from it to 3D human pose. As evidenced by our experiments, our approach significantly outperforms fully-supervised methods given the same amount of labeled data, and improves over other semi-supervised methods while using as little as 1% of the labeled data.
1 Introduction
Most current monocular solutions to 3D human pose estimation rely on methods based on convolutional neural networks (CNNs). With networks becoming ever more sophisticated, the main bottleneck now is the availability of sufficiently large training datasets, which typically require a large annotation effort. While such an effort might be practical for a handful of subjects and specific motions such as walking or running, covering the whole range of human body shapes, appearances, and poses is infeasible.
Weakly-supervised methods that reduce the amount of annotation required to achieve a desired level of performance are therefore valuable. For example, methods based on articulated 3D skeletons can be trained not only with actual 3D annotations but also using 2D annotations [21].
In this paper, we propose to use images of the same person taken from multiple views to learn a latent representation that, as shown on the left side of Fig. 1(a), captures the 3D geometry of the human body. Learning this representation does not require any 2D or 3D pose annotation. Instead, we train an encoder-decoder to predict an image seen from one viewpoint from an image captured from a different one. As sketched on the right side of Fig. 1(a), we can then learn to predict a 3D pose from this latent representation in a supervised manner. The crux of our approach, however, is that because our latent representation already captures 3D geometry, the mapping to 3D pose is much simpler and can be learned using far fewer examples than with existing methods.

Methods that rely on multi-view supervision [31, 55] exploit multi-view geometry in sequences acquired by synchronized cameras, thus removing the need for 2D annotations. However, in practice, they still require a large enough 3D training set to initialize and constrain the learning process. We will show that our geometry-aware latent representation, learned from multi-view imagery but without annotations, allows us to train a 3D pose estimation network using much less labeled data.
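The geometric intuition behind such a representation can be sketched in a few lines: if the latent code is treated as a set of 3D points expressed in the source camera frame, then switching viewpoints amounts to rotating the code before decoding. The snippet below is a toy numpy illustration of that idea only, not the paper's actual architecture; the function names and the 17-point latent are our own assumptions.

```python
import numpy as np

def rotation_z(theta):
    """Rotation about the vertical axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def view_transfer(latent_points, R_src_to_tgt):
    """Re-express a latent set of 3D points in the target camera frame;
    a decoder would then render the image from this rotated code."""
    return latent_points @ R_src_to_tgt.T

# Toy latent: 17 "points" in 3D, standing in for the encoder's output.
rng = np.random.default_rng(0)
z = rng.normal(size=(17, 3))

R = rotation_z(np.pi / 2)        # 90-degree viewpoint change
z_tgt = view_transfer(z, R)      # latent code seen from the target view

# Rotating back recovers the original latent, since rotations are invertible.
z_back = view_transfer(z_tgt, R.T)
assert np.allclose(z_back, z)
```

Because the view change acts as a simple linear map on such a latent code, the decoder never has to learn the geometry of rotations itself; it only needs to render from a code already expressed in the target camera frame.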
Geometry-Aware Representations. Multi-view imagery has long been used to derive volumetric representations of 3D human pose from silhouettes, for example by carving out the empty space. This approach can be used in conjunction with learning-based methods [44], by defining constraints based on perspective view rays [15, 45], orthographic projections [50], or learned projections [29]. It can even be extended to the single-view training scenario if the distribution of the observed shape can be inferred prior to reconstruction [8, 56]. The main drawback of these methods, however, is that accurate silhouettes are difficult to extract automatically in natural scenes, which limits their applicability.
Another approach to encoding geometry relies on a renderer that generates images from a 3D representation [9, 16, 35, 52] and can function as a decoder in an autoencoder setup [1, 39]. For simple renderers, the rendering function can even be learned [5, 6] and act as an encoder. When put together, such learned encoders and decoders have been used for unsupervised learning, both with GANs [3, 43, 46] and without them [17]. In [40, 41], a CNN was trained to map to and from spherical mesh representations without supervision. While these methods also effectively learn a geometry-aware representation based on images, they have only been applied to well-constrained problems, such as face modeling. As such, it is unclear how they would generalize to the much larger degree of variability of 3D human poses.
Novel View Synthesis. Our approach borrows ideas from the novel view synthesis literature, which is devoted to the task of creating realistic images from previously unseen viewpoints. Most recent techniques rely on encoder-decoder architectures, where the latent code is augmented with view-change information, such as a yaw angle, and the decoder learns to reconstruct the encoded image from the new perspective [36, 37]. Large view changes are harder to handle and have been achieved by relying on a recurrent network that performs incremental rotation steps [51]. Optical flow information [23, 53] and depth maps [7] have been used to further improve the results. While the above-mentioned techniques were demonstrated on simple objects, dedicated methods have been proposed for generating images of humans. However, most of these methods use additional information as input, such as part segmentations [18] and 2D poses [19]. Here, we build on the approaches of [4, 49], which were designed to handle large viewpoint changes. We describe these methods and our extensions in more detail in Sect. 3.
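The latent-code augmentation used by such view-synthesis architectures can be illustrated concretely: the desired view change is encoded and concatenated to the image code before decoding. The sketch below is a hypothetical minimal version of this step; the latent size and the (sin, cos) yaw encoding are our own assumptions, not taken from the cited methods.

```python
import numpy as np

def encode_view_change(yaw):
    """Represent the yaw angle by (sin, cos), so the encoding is
    continuous and unambiguous over a full turn."""
    return np.array([np.sin(yaw), np.cos(yaw)])

def augment_latent(z, yaw):
    """Concatenate the view-change encoding to the latent image code z;
    the decoder would consume the augmented code."""
    return np.concatenate([z, encode_view_change(yaw)])

z = np.zeros(128)                     # toy 128-dim latent image code
z_aug = augment_latent(z, np.pi / 4)  # request a 45-degree yaw change
assert z_aug.shape == (130,)
```

Encoding the angle as (sin, cos) rather than a raw scalar avoids the discontinuity at a full rotation, which is one common design choice in conditioning decoders on viewpoint.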