1 Introduction

Most current monocular approaches to 3D human pose estimation rely on convolutional neural networks (CNNs). With networks becoming ever more sophisticated, the main bottleneck is now the availability of sufficiently large training datasets, which typically require a substantial annotation effort. While such an effort might be practical for a handful of subjects and specific motions such as walking or running, covering the whole range of human body shapes, appearances, and poses is infeasible.

Weakly-supervised methods that reduce the amount of annotation required to achieve a desired level of performance are therefore valuable. For example, methods based on articulated 3D skeletons can be trained not only with actual 3D annotations but also using 2D annotations [21].

In this paper, we propose to use images of the same person taken from multiple views to learn a latent representation that, as shown on the left side of Fig. 1(a), captures the 3D geometry of the human body. Learning this representation does not require any 2D or 3D pose annotation. Instead, we train an encoder-decoder to predict an image seen from one viewpoint given an image captured from a different one. As sketched on the right side of Fig. 1(a), we can then learn to predict a 3D pose from this latent representation in a supervised manner. The crux of our approach is that, because our latent representation already captures 3D geometry, the mapping to 3D pose is much simpler and can be learned from far fewer examples than existing methods require.

Existing methods that rely on multi-view supervision [31, 55] exploit multi-view geometry in sequences acquired by synchronized cameras, thus removing the need for 2D annotations. However, in practice, they still require a large enough 3D training set to initialize and constrain the learning process. We will show that our geometry-aware latent representation, learned from multi-view imagery but without annotations, allows us to train a 3D pose estimation network using much less labeled data.
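To make the two-stage idea concrete, the following is a minimal PyTorch sketch, not the exact architecture used in the paper: it assumes the latent code is a set of 3D points that can be rotated by the known relative camera rotation R_ab, so that predicting view b from view a forces the code to encode geometry; a small regressor is then fit from the latent code to 3D joints using the few available labels. All layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a 64x64 image to a latent code shaped as N 3D points, so that a camera
    rotation can be applied to it directly (an illustrative design choice)."""
    def __init__(self, num_points=32):
        super().__init__()
        self.num_points = num_points
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_points * 3),
        )

    def forward(self, img):
        return self.backbone(img).view(-1, self.num_points, 3)

class Decoder(nn.Module):
    """Reconstructs a 64x64 image from the (rotated) latent code."""
    def __init__(self, num_points=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_points * 3, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, latent):
        return self.net(latent.flatten(1))

def view_synthesis_loss(encoder, decoder, img_a, img_b, R_ab):
    """Stage 1, no pose labels: predict the image seen from camera b given the image
    from camera a and the relative rotation R_ab between the two calibrated cameras."""
    latent_a = encoder(img_a)                                 # (B, N, 3)
    latent_b = torch.einsum("bij,bnj->bni", R_ab, latent_a)   # rotate latent points into view b
    return nn.functional.mse_loss(decoder(latent_b), img_b)

class PoseRegressor(nn.Module):
    """Stage 2, few pose labels: map the geometry-aware latent code to 3D joints."""
    def __init__(self, num_points=32, num_joints=17):
        super().__init__()
        self.num_joints = num_joints
        self.net = nn.Sequential(
            nn.Linear(num_points * 3, 256), nn.ReLU(),
            nn.Linear(256, num_joints * 3),
        )

    def forward(self, latent):
        return self.net(latent.flatten(1)).view(-1, self.num_joints, 3)
```

In such a setup, the encoder and decoder are trained first on unlabeled multi-view pairs; the pose regressor is then trained on the frozen latent codes with the small labeled set.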

Geometry-Aware Representations. Multi-view imagery has long been used to derive volumetric representations of 3D human pose from silhouettes, for example by carving out the empty space. This approach can be used in conjunction with learning-based methods [44] by defining constraints based on perspective view rays [15, 45], orthographic projections [50], or learned projections [29]. It can even be extended to the single-view training scenario if the distribution of the observed shape can be inferred prior to reconstruction [8, 56]. The main drawback of these methods, however, is that accurate silhouettes are difficult to extract automatically in natural scenes, which limits their applicability.
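As a reference point for the carving idea mentioned above, here is a minimal, self-contained sketch of visual-hull carving (not any of the cited methods): a voxel is kept only if it projects inside the silhouette in every calibrated view; the binary silhouette masks and 3x4 projection matrices are assumed to be given.

```python
import numpy as np

def carve_visual_hull(silhouettes, projections, grid_min, grid_max, resolution=64):
    """silhouettes: list of (H, W) binary masks; projections: list of 3x4 camera matrices;
    grid_min/grid_max: 3D corners of the bounding volume to voxelize."""
    axes = [np.linspace(lo, hi, resolution) for lo, hi in zip(grid_min, grid_max)]
    X, Y, Z = np.meshgrid(*axes, indexing="ij")
    pts = np.stack([X, Y, Z, np.ones_like(X)], axis=-1).reshape(-1, 4)  # homogeneous voxel centres
    occupied = np.ones(len(pts), dtype=bool)
    for mask, P in zip(silhouettes, projections):
        uvw = pts @ P.T                                    # project voxel centres into this view
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        inside = (u >= 0) & (u < mask.shape[1]) & (v >= 0) & (v < mask.shape[0])
        hit = np.zeros(len(pts), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]] > 0
        occupied &= hit                                    # carve away voxels outside any silhouette
    return occupied.reshape(resolution, resolution, resolution)
```

The sensitivity of this construction to silhouette errors is precisely the limitation noted above: a single missing pixel in one mask carves away the corresponding voxels in the final volume.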

Another approach to encoding geometry relies on a renderer that generates images from a 3D representation [9, 16, 35, 52] and can function as a decoder in an autoencoder setup [1, 39]. For simple renderers, the rendering function can even be learned [5, 6] and act as an encoder. When put together, such learned encoders and decoders have been used for unsupervised learning, both with GANs [3, 43, 46] and without them [17]. In [40, 41], a CNN was trained to map to and from spherical mesh representations without supervision. While these methods also effectively learn a geometry-aware representation from images, they have only been applied to well-constrained problems, such as face modeling. As such, it is unclear how they would generalize to the much greater variability of 3D human poses.
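To illustrate the renderer-as-decoder idea in its simplest form, the sketch below (a toy construction, not one of the cited systems) pairs a learned encoder that maps an image to a handful of 3D points with a fixed, differentiable renderer that splats their orthographic projections as Gaussian blobs; training minimizes the reconstruction error, so only the encoder is learned.

```python
import torch
import torch.nn as nn

def splat_points(points_xy, size=64, sigma=0.05):
    """Fixed, differentiable 'renderer': draw an isotropic Gaussian blob at each 2D point
    (coordinates in [-1, 1]) and composite them with a per-pixel maximum."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, size), torch.linspace(-1.0, 1.0, size), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).to(points_xy)         # (H, W, 2)
    diff = grid - points_xy[:, :, None, None, :]               # (B, N, H, W, 2)
    d2 = (diff ** 2).sum(-1)
    return torch.exp(-d2 / (2 * sigma ** 2)).amax(dim=1, keepdim=True)  # (B, 1, H, W)

class PointEncoder(nn.Module):
    """Maps a grayscale image to a small set of 3D points in [-1, 1]."""
    def __init__(self, num_points=8):
        super().__init__()
        self.num_points = num_points
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_points * 3), nn.Tanh(),
        )

    def forward(self, img):
        return self.net(img).view(-1, self.num_points, 3)

def autoencoder_loss(encoder, img):
    points = encoder(img)                        # (B, N, 3)
    rendered = splat_points(points[..., :2])     # toy orthographic projection: drop z
    return nn.functional.mse_loss(rendered, img)
```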

Novel View Synthesis. Our approach borrows ideas from the novel view synthesis literature, which is devoted to creating realistic images from previously unseen viewpoints. Most recent techniques rely on encoder-decoder architectures, where the latent code is augmented with view change information, such as a yaw angle, and the decoder learns to reconstruct the encoded image from the new perspective [36, 37]. Large view changes remain difficult and have been handled with a recurrent network that performs incremental rotation steps [51]. Optical flow information [23, 53] and depth maps [7] have been used to further improve the results. While the above-mentioned techniques were demonstrated on simple objects, methods dedicated to generating images of humans have also been proposed. However, most of them require additional information as input, such as part segmentations [18] and 2D poses [19]. Here, we build on the approaches of [4, 49], which were designed to handle large viewpoint changes. We describe these methods and our extensions in more detail in Sect. 3.
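For completeness, here is a minimal sketch of the view-conditioned encoder-decoder pattern described above (an illustrative toy model, not the architecture of [36, 37] or of our method): the desired view change, a single yaw angle here, is appended to the latent code before decoding, and the network is trained to reconstruct the image of the same scene seen from the new viewpoint.

```python
import torch
import torch.nn as nn

class ViewConditionedAE(nn.Module):
    """Encoder-decoder in which the latent code is augmented with the desired view change."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )
        # +2 inputs for (sin, cos) of the yaw change appended to the latent code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 2, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, img, delta_yaw):
        z = self.encoder(img)                                                   # (B, latent_dim)
        view = torch.stack([torch.sin(delta_yaw), torch.cos(delta_yaw)], dim=1) # (B, 2)
        return self.decoder(torch.cat([z, view], dim=1))

# Training pairs are (image from view a, yaw change a -> b, image from view b);
# the loss is the reconstruction error against the target view:
#   loss = nn.functional.mse_loss(model(img_a, delta_yaw_ab), img_b)
```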