1 Introduction

Biometrics have attracted much interest in the last decade, owing to the large number of applications they enable in both industrial and academic fields [1].

Recently, soft biometrics have been clearly defined in [2] as a set of traits providing information about an individual, even though their lack of distinctiveness and permanence does not allow individuals to be authenticated. These traits can be either continuous (e.g., height and weight) or discrete (e.g., gender, eye color, ethnicity).

Soft biometrics are widely used in security, digital signage, domotics (the realization of intelligent home environments), home rehabilitation, artificial intelligence, and so on. Socially assistive technologies are also a new and emerging field where these solutions could considerably improve the overall level of human-machine interaction (for example with autistic individuals [3], with people with dementia [4], and, more generally, in elderly care [5]).

In the last few decades, computer vision, as well as other information science fields, has extensively investigated the automatic estimation of the main soft biometric traits by means of mathematical models and ad hoc coding of visual images. In particular, the automatic estimation of gender, race, and age from facial images is among the most investigated issues, but many challenges remain open, especially for race and age. The extraction of this kind of information is not trivial due to ambiguities related to the anatomy and lifestyle of each individual. In race recognition, for instance, the somatic traits of some populations may not be well defined, since one person may exhibit certain features more markedly than another. Similar considerations apply to age estimation, where the apparent biological age can differ considerably from the chronological one.

Recent efforts have led to a shared and effective framework consisting of an initial face detection step, a subsequent description of facial traits, an optional data reduction step, and a final classification step. The effectiveness of this framework was experimentally proved in [6], where some key aspects (such as the data reduction strategy) were clarified. Nevertheless, the literature offers no comprehensive study of which methods and techniques are best suited to the involved algorithmic steps, i.e., of the best configuration of the aforementioned framework in terms of facial descriptors, learning strategies, and numerical scale of the data fed to the classifier. Moreover, the performance of the framework under real-world conditions (including scaling and rotation of the head) has not been fully addressed. To fill this gap, this paper presents a detailed study of different algorithmic configurations for estimating soft biometric traits. First, some of the most widely used facial descriptors for gender, race, and age estimation were selected, and their performances within the considered framework were compared. This experimental phase was carried out on publicly available face datasets, considering different working strategies (i.e., the balancing/unbalancing of the training dataset and the numerical range of the input measurements). Then, the best configuration was thoroughly tested, for all the involved descriptors, on ad hoc image sequences acquired in a real environment, also considering changes of the head in both position and scale.

The considered feature descriptors were those that, according to a detailed study of the literature (reported in the next section), proved best suited for encoding the biometric traits in terms of both accuracy and computational efficiency. In particular, this paper investigates the local binary pattern (LBP) [7, 8], the compound local binary pattern (CLBP) [9] (both oriented to texture analysis), the histogram of oriented gradients (HOG) [10] (designed for shape analysis), and a descriptor that represents a trade-off between the two aspects: the Weber local descriptor (WLD) [11]. They were evaluated in a consolidated framework where data reduction is achieved by Fisher linear discriminant analysis (LDA) [12] and support vector machines (SVMs) are then used for both classification (race and gender) and regression (age estimation). SVM [13, 14] is one of the most exploited methods in discrete biometric estimation problems, and its superiority has been proved in preliminary studies comparing the performance of different classifiers [15, 16].

Briefly, the main contributions of this paper are as follows:

  1. A survey of existing methods and techniques addressing soft biometric issues, in order to identify the most suitable framework;

  2. An exploration of various configurations of the framework components, in terms of learning strategies and numerical scale of the data fed to the classifier, in order to determine how best to exploit it in soft biometric contexts;

  3. An evaluation of the performance of the aforementioned framework with the different descriptors, also under real-world conditions (including scaling and head rotation).

The remainder of the paper is organized as follows: Section 2 discusses the existing literature for the considered biometric issues; in Section 3, the framework for age, gender, and race classification is explained in detail while in Section 4, the experimental setup and results are detailed and discussed. Finally, conclusions and future developments are discussed in Section 5.

2 Related works

The study of algorithms investigating soft biometric traits has attracted considerable interest in the last two decades. In the following subsections, the most relevant solutions for each specific problem addressed (gender, race, and age) are reported.

2.1 Gender

Automatic gender recognition is the most debated soft biometric issue and, over the years, many approaches have been proposed in the literature. Three exhaustive surveys can be found in [17, 18] and [19]. Early studies mainly exploited geometry-based methods: Brunelli and Poggio [20] computed 18 distances between face key points and used them to train a hyper basis function network classifier, whereas Fellous [21] used 22 normalized vertical and horizontal fiducial distances computed from 40 manually extracted facial points. This approach has recently been reassessed by Mozzafari et al. [22], who used the aspect ratio of the face ellipse as a geometric feature to reinforce appearance information.

In the early 2000s, appearance-based methods were introduced in their basic form, i.e., by evaluating pixel intensity values. With these approaches, the amount of useless data is a crucial issue, which Abdi et al. [23] addressed with principal component analysis (PCA), whereas Tamura et al. [24] investigated gender estimation from pixel intensities on small-sized facial images. The most interesting progress, however, came with the introduction of the LBP descriptor [7] to encode information about facial appearance. LBP, in combination with SVM, was used by Lian and Lu [25] for multi-view gender classification. It was also exploited by Yang and Ai [26] to classify age, gender, and ethnicity. The LBP descriptor was employed in [27] in combination with intensity and shape features to obtain a multi-scale fusion approach, while Ylioinas et al. [28] combined it with contrast information to achieve a more robust classification. To extract the most discriminative LBP features, Shan [29] proposed an AdaBoost selection method, and many other variants have been proposed to obtain more informative features, e.g., the local Gabor binary mapping pattern [30, 31] and the local directional pattern [32].

Besides the LBP descriptor, other appearance-based solutions have also been investigated. The use of HOG was proposed in [33] and [34], the spatial Weber local descriptor (SWLD) was proposed in [35], and active shape models (ASM) were exploited by Gao and Ai in [36]; Saatci and Town [37] used features extracted by trained active appearance models (AAM) [38] and supplied them to SVM classifiers arranged in a cascade structure in order to optimize overall recognition performance.

The scale invariant feature transform (SIFT) offers invariance to image scaling, translation, and rotation; accordingly, its use does not require a preprocessing stage such as accurate face alignment. Demirkus et al. [39] exploited these characteristics using a Markovian classification model, and Wang et al. [40] extracted SIFT descriptors at regular image grid points and combined them with global shape contexts of the face, adopting AdaBoost for classification.

The desire to imitate nature recently led researchers to adopt Gabor filters to mimic the visual cortex. In [41], 2-D Gabor wavelets at different scales and eight orientations were extracted, selected by AdaBoost, and classified by a fuzzy SVM indicating the degree to which a person’s face belonged to the female/male class. Scalzo et al. [42] extracted a large set of features using Gabor and Laplace filters, which were fused by genetic algorithms. Gabor filters also inspired the biologically inspired features (BIF) employed by Meyers and Wolf [43] for face processing. Finally, Guo et al. [34] demonstrated that gender recognition performance is significantly affected by human age.

2.2 Race

In the last few years, the race estimation problem has been addressed through several approaches. In [44], the skin color, the forehead area, and the lip color were used to estimate ethnicity.

2.3 Age

Complete overviews of age estimation are given in [62] and [63], which first describe the problem from a generic point of view and then present the most significant works in a well-structured way. The earliest works on age estimation were based on the cranio-facial development theory [64]. A mathematical model describing the shape of the human cranium was the key of this theory, first exploited by Kwon and da Vitoria Lobo [65]. Subsequently, Farkas [66] exploited face anthropometry, the science of measuring the size and proportions of the human face: 57 facial key points were used to measure distances of interest, and a model for age evaluation was provided. Unfortunately, these approaches proved unreliable for the evaluation of age in adults, being sufficiently discriminative only during the growing years.

A more interesting approach was proposed by Lanitis et al. [67], who introduced an aging function (valid for younger ages) related to the feature vector of raw AAM parameters [38]. In a similar way, Liu et al. [68] recently proposed the use of AAM in combination with PCA and an SVM classifier, obtaining promising results.

An innovative perspective was instead offered by Geng et al. [69], who used an EM-like iterative learning algorithm to model the incomplete aging pattern extracted by observing the same individual at different ages. A similar approach, but oriented to finding a common pattern among different individuals, was proposed in [70] and [71].

Another major trend in aging-related facial feature extraction focuses on appearance models, handled as a global or local feature extraction issue. The first attempts along this line were [72, 73], where texture and shape features were jointly exploited to increase descriptor robustness and human ages were estimated through a multiple-group classification scheme with 5-year intervals, also taking advantage of gender knowledge (since aging patterns differ between males and females). Subsequently, interest in appearance-based descriptors grew rapidly: LBP was used for appearance feature extraction in the automatic age estimation system proposed in [74], and some variants were proposed and tested in [75, 76]. Gabor features were also applied to age estimation [77], demonstrating their discriminative power. BIF [78, 79] was explored for aging problems by Guo et al. in [80] and Lian et al. in [81].

Recently, many other approaches have been investigated. Lou et al. [82] tested Haar features and SVM for a human-robot interaction application based on age estimation. A learning-based encoding was proposed by Alnajar et al. [83], showing that an image encoding based on learning feedback can improve the overall estimation accuracy. One of the most recent works exploited information derived from the dynamics of facial landmarks in video sequences [84].

An approach to gender and age classification from video sequences, which encodes and exploits the correlation between face images through manifold learning, was proposed in [85]. Finally, [86] studied the influence of demographic appearance on face recognition: gender (male and female), race/ethnicity (Black, White, and Hispanic), and age group (18–30, 30–50, and 50–70 years old) were taken into account while comparing six different solutions.

3 System overview

In this section, the common framework used to address the different soft biometric issues (gender, age, and ethnicity) is introduced: it automatically extracts soft biometric information from generic video sequences through a pipeline of operational blocks. The pipeline is schematically depicted in Fig. 1: the first step is devoted to the detection and normalization of faces in the current image. Its outcomes feed the following step, which extracts a new representation of the facial regions depending on the selected descriptor. In particular, LBP, HOG, SWLD, and CLBP are considered in this paper.

Fig. 1 Block diagram of the estimation framework: the raw frames are processed in order to obtain a reliable class estimation of the people in the scene

The set of extracted features is then projected (through LDA) onto a better-suited subspace, where most of the initial discriminative information is retained in a reduced number of variables. Finally, the retained variables are fed to the trained SVM classifier, which is efficient in terms of both accuracy and computational load.

The following subsections detail each step, also giving information about the families of compared descriptors.

3.1 Face detection and normalization

In this step, human faces are detected in the input images and a normalization operation is then performed. Normalization is a fundamental preprocessing step, since the subsequent algorithms work better on input faces of predefined size and pose. Face detection is performed by means of the well-known Viola-Jones approach [87] (which combines increasingly complex classifiers in a cascade), with detectors for both the face and the head-shoulder region (the latter helps to handle low-resolution images). Whenever a face is detected, face normalization is carried out as follows: the system first fits an ellipse to the face blob (exploiting facial feature color models) in order to rotate it to a vertical position, and a Viola-Jones-based eye detector then searches for the eyes. Finally, the eye positions, if detected, provide a measure to crop and scale the frontal face candidate to a standard size of 65×59 pixels. The above face registration procedure is schematized in Fig. 2.

Fig. 2 The face detection and normalization step: the face is cropped and aligned in order to guarantee a standard pose to the features extraction step

The registered face is then modeled using different features (average color in the normalized red-green color space, considering only the center of the estimated face container; eye patterns; whole face pattern) in order to re-detect it, for tracking purposes, in the subsequent frames [88]. Finally, it is passed to the next step for the features extraction.
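As a rough illustration of this step, the sketch below uses OpenCV's stock Haar cascades for face and eye detection and aligns the face from the detected eye positions. The ellipse fitting, the facial-feature color models, and the head-shoulder detector of the original system are omitted, and all parameter values are illustrative.

```python
import cv2
import numpy as np

# Stock OpenCV cascades stand in for the paper's Viola-Jones detectors.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_and_normalize(frame, out_size=(59, 65)):
    """Return an eye-aligned face patch of 65x59 pixels, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.1, 5):
        roi = gray[y:y + h, x:x + w]
        eyes = eye_cascade.detectMultiScale(roi, 1.1, 5)
        if len(eyes) < 2:
            continue
        (ex1, ey1, ew1, eh1), (ex2, ey2, ew2, eh2) = eyes[:2]
        p1 = (ex1 + ew1 / 2.0, ey1 + eh1 / 2.0)   # eye centers
        p2 = (ex2 + ew2 / 2.0, ey2 + eh2 / 2.0)
        if p1[0] > p2[0]:
            p1, p2 = p2, p1
        # Rotate the face so that the inter-ocular axis is horizontal.
        angle = np.degrees(np.arctan2(p2[1] - p1[1], p2[0] - p1[0]))
        M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
        aligned = cv2.warpAffine(roi, M, (w, h))
        # cv2.resize takes (width, height): 59x65 -> a 65x59-pixel patch.
        return cv2.resize(aligned, out_size)
    return None
```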

3.2 Features extraction

After the face detection and normalization step, a suitable family of descriptors has to be chosen in order to build a feature vector able to encode distinctive local structures of the facial region. The selected descriptors should be highly distinctive, i.e., with a low probability of mismatch, but, at the same time, their computational load and their sensitivity to noise (changes in illumination, scaling, rotation, skew, etc.) have to be kept low. The choice of the best family of descriptors is not straightforward: many different approaches have been introduced and effectively used in different fields, but the best strategy for soft biometric estimation does not emerge from the literature. For this reason, four different descriptors have been considered and experimentally compared in this paper. A short description of each is given in the following subsections.

3.2.1 LBP

LBP is a local feature widely used for texture description in pattern recognition. The original LBP assigns a label to each pixel: the value of the central pixel is used as a threshold for each of its neighbors, which takes value 1 if it is greater than or equal to the center and 0 otherwise. The thresholded neighbor values are then concatenated and read as a binary number, which becomes the label of the central pixel. A graphical representation is shown in Fig. 3 a, and the operator is defined as follows:

$$ \,\text{LBP}_{P,R}(x_{c})=\sum\limits_{p=0}^{P-1} u(x_{p}-x_{c})2^{p} $$
(1)
Fig. 3 LBP descriptor: labeling procedure (a) and spatial histogram concatenation (b)

where u(y) is the step function, x_p the neighbor pixels, x_c the central pixel, P the number of neighbors, and R the radius. To account for spatial information, the image is then divided into sub-regions. LBP is applied to each sub-region, and a histogram of L bins is generated from the pixel labels. The histograms of the different regions are then concatenated into a single higher-dimensional histogram, as represented in Fig. 3 b.
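As a minimal sketch of the operator and of the spatial histogram construction (assuming the common u(y)=1 for y≥0 convention; grid and bin settings are illustrative):

```python
import numpy as np

def lbp_spatial_histogram(img, grid=(5, 5), n_bins=256):
    """LBP_{8,1} labels (Eq. (1)) plus per-region histogram concatenation."""
    img = img.astype(np.int32)
    # Eight neighbors at radius 1, in a fixed order (one bit each).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:-1, 1:-1]
    labels = np.zeros_like(center)
    for p, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:img.shape[0] - 1 + dy,
                       1 + dx:img.shape[1] - 1 + dx]
        labels += (neighbor >= center).astype(np.int32) << p  # u(x_p - x_c) 2^p
    # Divide into sub-regions and concatenate the per-region histograms.
    hists = []
    for rows in np.array_split(labels, grid[0], axis=0):
        for cell in np.array_split(rows, grid[1], axis=1):
            h, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins))
            hists.append(h)
    return np.concatenate(hists)
```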

3.2.2 CLBP

CLBP is an extension of the LBP descriptor designed essentially for facial expression recognition [9]. Compared to LBP, CLBP encodes both the sign and the magnitude of the differences between the center pixel and its P neighbors. This yields a code length of 2P bits: P for the signs and P for the magnitudes (Fig. 4). That is,

$$ \text{CLBP}_{P,R}(x_{c})=\sum_{p=0}^{P-1} u(x_{p},x_{c})2^{p} $$
(2)
Fig. 4 CLBP labeling: the labeling is done for both intensity and magnitude; finally, the code is split

$$ u(x_{p},x_{c})= \left\{\begin{array}{ll} 00 & x_{p}-x_{c}<0, |x_{p}-x_{c}|\le M_{\text{avg}} \\ 01 & x_{p}-x_{c}<0, |x_{p}-x_{c}|> M_{\text{avg}}\\ 10 & x_{p}-x_{c}\ge0, |x_{p}-x_{c}|\le M_{\text{avg}}\\ 11 & \text{otherwise} \end{array}\right. $$
(3)

where M_avg is the average magnitude of the differences between the center and the neighbor values.

As an example, considering a 3×3 neighborhood, CLBP produces a 16-bit code: 8 bits for the magnitude and 8 bits for the sign patterns. In particular, the two patterns are built by concatenating the bits at positions (1, 2, 5, 6, ..., 2P−3, 2P−2) and (3, 4, 7, 8, ..., 2P−1, 2P) of the original CLBP code. In this way, two images are obtained from the original one by replacing each pixel with the two sub-CLBP patterns. A feature representation of the original image is then obtained by concatenating the histograms of the two images. To account for spatial information, the image is first divided into sub-regions and CLBP is applied to each of them, as with the spatial LBP operator.
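The coding of Eq. (3) can be sketched as follows; here M_avg is taken as the per-pixel average magnitude of the neighbor differences (an assumption consistent with the description above), and the sign and magnitude bits are kept directly as two separate 8-bit sub-pattern images whose spatial histograms are concatenated:

```python
import numpy as np

def clbp_histograms(img, grid=(5, 5)):
    """CLBP: separate 8-bit sign and magnitude sub-patterns (Eq. (3))."""
    img = img.astype(np.int32)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:-1, 1:-1]
    diffs = [img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
             - center for dy, dx in offsets]
    m_avg = np.mean([np.abs(d) for d in diffs], axis=0)  # per-pixel M_avg
    sign_img = np.zeros_like(center)
    magn_img = np.zeros_like(center)
    for p, d in enumerate(diffs):
        sign_img += (d >= 0).astype(np.int32) << p              # sign bit
        magn_img += (np.abs(d) > m_avg).astype(np.int32) << p   # magnitude bit
    # Spatial histograms of the two sub-pattern images, concatenated.
    hists = []
    for sub in (sign_img, magn_img):
        for rows in np.array_split(sub, grid[0], axis=0):
            for cell in np.array_split(rows, grid[1], axis=1):
                h, _ = np.histogram(cell, bins=256, range=(0, 256))
                hists.append(h)
    return np.concatenate(hists)
```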

3.2.3 HOG

Local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions. HOG is a well-known feature descriptor based on accumulating gradient directions over the pixels of a small spatial region, referred to as a "cell", and building a 1-D histogram from them. Although HOG has many precursors, it was used in its mature form in the scale invariant feature transform [89] and widely analyzed for human detection by Dalal and Triggs [10]. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Let L be the image to analyze. The image is divided into cells (Fig. 5 a) of size N×N pixels, and the orientation θ of each pixel x=(x_x, x_y) is computed (Fig. 5 b) by means of the following relationship:

$$ \theta(x)=\tan^{-1}\frac{L(x_{x},x_{y}+1)-L(x_{x},x_{y}-1)}{L(x_{x}+1,x_{y})-L(x_{x}-1,x_{y})} $$
(4)
Fig. 5 HOG features extraction: the image is divided into cells of N×N pixels each (a). The orientation of all pixels is computed (b), and the histogram of the orientations of each cell is built (c, d). Finally, all the orientation histograms are concatenated to construct the final features vector (e)

The orientations are accumulated into a histogram with a predetermined number of bins (Fig. 5 c, d). Finally, the histograms of all cells are concatenated into a single spatial HOG histogram (Fig. 5 e). To achieve better invariance to noise, it is also useful to contrast-normalize the local responses before using them. This can be done by accumulating a measure of local histogram energy over larger spatial regions, called blocks, and using the result to normalize all the cells in the block. The normalized descriptor blocks constitute the HOG descriptor.
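For illustration, a HOG descriptor in this spirit can be computed with an off-the-shelf implementation; the parameters below are hypothetical choices in the style of [10] and do not reproduce the VLFeat configuration used in the experiments (which yields a 2016-element vector):

```python
import numpy as np
from skimage.feature import hog

face_patch = np.random.rand(65, 59)   # stand-in for a normalized face

# 9 orientation bins, 8x8-pixel cells, 2x2-cell blocks, L2-Hys block
# normalization: illustrative values in the style of Dalal and Triggs [10].
descriptor = hog(face_patch,
                 orientations=9,
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2),
                 block_norm='L2-Hys')
```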

3.2.4 Spatial WLD

WLD [11] is a robust and powerful descriptor inspired by Weber’s law. It is based on the fact that human perception of a pattern depends not only on the amount of change in the stimulus intensity but also on the original intensity of the stimulus. The descriptor consists of two components: differential excitation (DE) and gradient orientation (OR).

Differential excitation detects locally salient patterns by means of the ratio between the intensity differences of the current pixel against its neighbors and the intensity of the pixel itself; the computational approach is similar to that used for LBP (a complete description is given in [11]). In addition, the OR of each pixel is considered.

Once DE and OR have been computed for each pixel in the image, a 2-D histogram of T columns and M×S rows is constructed, where T is the number of orientations and M×S the number of bins for the DE quantization, with the meaning given in [11]. Figure 6 shows how the 2-D histogram is rearranged to obtain a 1-D histogram that is more suitable for the subsequent operations.

Fig. 6 WLD histogram construction process: the algorithm computes the DE and OR values for each pixel and constructs the 2-D histogram. The 2-D histogram is split into an M×T matrix, where each element is a histogram of S bins. Finally, the whole 1-D histogram is composed by concatenating the rows of this matrix

Furthermore, since spatial information also had to be accounted for, the SWLD approach [35] was chosen: as with spatial LBP, it splits the image into sub-regions and computes a histogram for each of them. Finally, the histograms are concatenated in an ordered fashion.
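A simplified SWLD sketch follows; the exact DE/OR quantization is defined in [11] and [35], so the uniform binning below is an assumption. With T=8, M=4, S=4, and a 4×4 grid, it yields a 2048-element vector, consistent with the configuration used in Section 4.

```python
import numpy as np

def swld_histogram(img, T=8, M=4, S=4, grid=(4, 4)):
    """Spatial WLD: per-region 2-D (excitation x orientation) histograms."""
    img = img.astype(np.float64) + 1e-6          # avoid division by zero
    center = img[1:-1, 1:-1]
    # Differential excitation: arctan of the summed neighbor differences
    # over the center intensity (Weber ratio).
    neigh_sum = sum(img[1 + dy:img.shape[0] - 1 + dy,
                        1 + dx:img.shape[1] - 1 + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
    de = np.arctan((neigh_sum - 8 * center) / center)
    # Gradient orientation, quantized into T bins.
    ori = np.arctan2(img[2:, 1:-1] - img[:-2, 1:-1],
                     img[1:-1, 2:] - img[1:-1, :-2])
    de_bins = np.clip(((de + np.pi / 2) / np.pi * (M * S)).astype(int),
                      0, M * S - 1)
    or_bins = np.clip(((ori + np.pi) / (2 * np.pi) * T).astype(int), 0, T - 1)
    hists = []
    row_parts = np.array_split(np.arange(de.shape[0]), grid[0])
    col_parts = np.array_split(np.arange(de.shape[1]), grid[1])
    for rows in row_parts:
        for cols in col_parts:
            h2d = np.zeros((M * S, T))       # M*S excitation x T orientation
            np.add.at(h2d, (de_bins[np.ix_(rows, cols)].ravel(),
                            or_bins[np.ix_(rows, cols)].ravel()), 1)
            hists.append(h2d.ravel())        # row-wise flatten to 1-D
    return np.concatenate(hists)             # 4*4 * (16*8) = 2048 values
```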

3.3 Features subspace projection

The number of features used for face description strongly affects both the computational complexity and the classification accuracy. Indeed, a reduced number of features allows the SVM to use simpler functions and to achieve a better separation of the clusters. In this work, LDA was adopted to obtain a better data representation: this choice resulted from its high suitability in traditional classification problems and from the preliminary experiments reported in [6].

3.3.1 LDA

LDA [12] is a subspace projection approach. In LDA, within-class and between-class scatters are used to formulate criteria for class separability: the optimized criterion is the ratio of the between-class scatter to the within-class scatter, and the solution obtained by maximizing it defines the axes of the transformed space. Moreover, in LDA the number of non-zero generalized eigenvalues, and thus the upper bound on the number of eigenvectors, is c−1, where c is the number of classes. Belhumeur et al. [12] demonstrated the robustness of this approach for face recognition and showed how LDA can sometimes be more discriminative than PCA-based projections. It is worth remarking that the upper bound on the number of eigenvectors is also a key factor in terms of computational complexity, as it lightens the workload of the SVM.
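As an example of the c−1 bound, using scikit-learn as a stand-in implementation (the data below are dummies; in the paper the input vectors are the facial descriptors):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2048))   # e.g., SWLD-sized descriptor vectors
y = rng.integers(0, 4, size=200)   # four race labels

# With c = 4 classes, LDA retains at most c - 1 = 3 components
# (a single component in the two-class gender case).
lda = LinearDiscriminantAnalysis(n_components=3)
X_reduced = lda.fit_transform(X, y)   # shape: (200, 3)
```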

3.4 SVM classification

After data projection, gender, race, and age estimation were performed using SVMs. In particular, three different SVM learning tasks were used: two-class support vector classification (SVC) for the gender problem, multi-class SVC for the race problem, and support vector regression (SVR) for age estimation. All of these tasks can be biased by the data range (the j-th feature of the data vector may take values in a different range than the i-th one), so the opportunity of a scaling operation has to be analyzed in terms of classification performance. To analyze the scaling effect on soft biometric estimations, the following scaling approach was evaluated. Let x_{i,MIN} and x_{i,MAX} be the minimum and maximum values of the i-th feature in the training data (a list of labeled feature vectors), [y_{i,MIN}, y_{i,MAX}] the desired range after scaling, and N_f the total number of features. The scaling equation is

$$ y_{i}=\frac{(x_{i}-x_{i,\text{MIN}})\cdot(y_{i,\text{MAX}}-y_{i,\text{MIN}})}{x_{i,\text{MAX}}-x_{i,\text{MIN}}}+y_{i,\text{MIN}} $$
(5)

where x_i is the initial value of the i-th feature and y_i the scaled one. The scaling model thus consists of the N_f pairs {x_{i,MIN}, x_{i,MAX}} and the pair {y_{i,MIN}, y_{i,MAX}}.
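A direct implementation of Eq. (5), equivalent to a min-max scaler fitted on the training set; the guard against constant features is an added assumption:

```python
import numpy as np

def fit_scaling_model(X_train, y_min=0.0, y_max=1.0):
    """Store the per-feature {x_MIN, x_MAX} pairs and the target range."""
    return X_train.min(axis=0), X_train.max(axis=0), y_min, y_max

def apply_scaling(X, model):
    """Apply Eq. (5) feature-wise using the stored scaling model."""
    x_min, x_max, y_min, y_max = model
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constants
    return (X - x_min) * (y_max - y_min) / span + y_min
```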

With regard to SVC, the C-support vector classification (C-SVC) learning task implemented in the well-known LIBSVM [90] was used. Given training vectors \(\boldsymbol {x}_{i} \in \mathbb {R}^{n}, i=1,\cdots,l\), in two classes, and a label vector \(\boldsymbol {y} \in \mathbb {R}^{l}\) such that y i ∈{1,−1}, C-SVC [13, 14] solves the following primal optimization problem:

$$\begin{aligned} \min_{\boldsymbol{w}, b, \xi} & \quad \frac{1}{2}\boldsymbol{w}^{T}\boldsymbol{w}+C\sum_{i=1}^{l} \xi_{i}\\ \text{subject to} & \quad y_{i}(\boldsymbol{w}^{T} \phi(\boldsymbol{x}_{i})+b)\geq 1-\xi_{i},\\ &\quad \xi_{i}\geq 0, i=1,\cdots,l, \end{aligned} $$

where ϕ(x i ) maps x i into a higher-dimensional space and C>0 is the regularization parameter. Due to the possible high dimensionality of the vector variable w, usually, the following dual problem is solved:

$$\begin{aligned} \min_{\boldsymbol{\alpha}} & \quad \frac{1}{2}\boldsymbol{\alpha}^{T} Q \boldsymbol{\alpha} - \boldsymbol{e}^{T}\boldsymbol{\alpha} \\ \text{subject to} & \quad \boldsymbol{y}^{T}\boldsymbol{\alpha} = 0,\\ & \quad 0\leq \alpha_{i} \leq C, i=1,\cdots,l, \end{aligned} $$

where e=[1,⋯,1]^T is the vector of all ones, Q is an l×l positive semidefinite matrix with \(Q_{ij} \equiv y_{i} y_{j} K(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\), and \(K(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\equiv \phi(\boldsymbol{x}_{i})^{T} \phi(\boldsymbol{x}_{j})\) is the kernel function. After the dual problem is solved, the optimal w is computed as follows:

$$ \boldsymbol{w} = \sum\limits_{i=1}^{l} y_{i}\alpha_{i} \phi(\boldsymbol{x}_{i}) $$
(6)

Finally, the decision function is

$$ \texttt{sgn}(\boldsymbol{w}^{T} \phi (\boldsymbol{x}) +b) = \texttt{sgn} \left({\sum\limits_{i=1}^{l} y_{i} \alpha_{i} K(\boldsymbol{x}_{i}, \boldsymbol{x}) +b} \right) $$
(7)

Such an approach is suitable only for the two-class gender problem. For race classification, a multi-class approach was used, resorting to the "one-against-one" procedure [91]. Let k be the number of classes; then k(k−1)/2 classifiers are constructed, each trained on data from two classes. The final estimation is returned by a voting scheme among all the classifiers. Many other methods are available for multi-class SVM classification; [92] gives a detailed comparison, concluding that "one-against-one" is a competitive approach.
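For illustration, scikit-learn's SVC (built on LIBSVM) applies the same one-against-one scheme internally; for the k=4 race classes it trains k(k−1)/2=6 pairwise classifiers and combines them by voting (dummy data below):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))      # LDA-projected vectors (c - 1 = 3 axes)
y = rng.integers(0, 4, size=200)   # four race labels

clf = SVC(kernel='rbf', C=1.0, decision_function_shape='ovo').fit(X, y)
print(clf.decision_function(X[:1]).shape)  # -> (1, 6) pairwise margins
race = clf.predict(X[:1])                  # majority vote over the 6 classifiers
```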

With regard to SVR, the ε-SVR learning task was used instead. Given training points {(x_1, z_1),⋯,(x_l, z_l)}, where \(\boldsymbol {x}_{i} \in \mathbb {R}^{n}\) is a features vector and \(z_{i} \in \mathbb {R}\) is the target output, ε-SVR [13, 14] solves the following primal optimization problem:

$$\begin{aligned} \min_{\boldsymbol{w}, b, \boldsymbol{\xi}, \boldmath{\xi}^{*}} & \quad \frac{1}{2}\boldsymbol{w}^{T}\boldsymbol{w}+C\sum_{i=1}^{l} \xi_{i}+C\sum_{i=1}^{l} \xi_{i}^{*},\\ \text{subject to} & \quad \boldsymbol{w}^{T} \phi(\boldsymbol{x}_{i})+b - z_{i} \le \epsilon + \xi_{i}, \\ &\quad z_{i} - \boldsymbol{w}^{T}\phi(\boldsymbol{x}_{i}) - b \le \epsilon + \xi_{i}^{*}, \\ &\quad \xi_{i}, \xi_{i}^{*}\geq 0, i=1,\cdots,l, \end{aligned} $$

with C>0 and ε>0 as parameters. In this case, the dual problem is as follows:

$${} {\fontsize{7.8pt}{9.6pt}\selectfont{\begin{aligned} \min_{\boldsymbol{\alpha}, \boldsymbol{\alpha}^{*}} & \quad \frac{1}{2}\left(\boldsymbol{\alpha} - \boldsymbol{\alpha}^{*}\right)^{T} Q (\boldsymbol{\alpha} - \boldsymbol{\alpha}^{*}) + \epsilon\sum\limits_{i=1}^{l} \left(\alpha_{i} + \alpha_{i}^{*}\right) + \sum\limits_{i=1}^{l} z_{i}\left(\alpha_{i} - \alpha_{i}^{*}\right) \\ \text{subject to} & \quad \boldsymbol{e}^{T} (\boldsymbol{\alpha} - \boldsymbol{\alpha}^{*})= 0,\\ & \quad 0\leq \alpha_{i},\alpha_{i}^{*} \leq C, i=1,\cdots,l, \end{aligned}}} $$

and the approximate solution function for the dual problem is as follows:

$$\sum_{i=1}^{l}\left(\alpha_{i}^{*}-\alpha_{i}\right)K(\boldsymbol{x}_{i},\boldsymbol{x})+b $$

Finally, the trained model is stored as the output of the SVM stage.
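A minimal ε-SVR usage sketch for age regression, again with dummy data and illustrative parameter values:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))         # LDA-projected face descriptors
age = rng.uniform(20, 60, size=300)   # annotated ages (dummy targets)

# epsilon is the half-width of the tube within which training errors
# are not penalized; C weights the slack variables, as in the primal above.
reg = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X, age)
predicted_ages = reg.predict(X[:5])
```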

4 Experimental setup and results

In this section, the experimental evaluation of the framework devoted to soft biometric estimations is reported. In particular, the abovementioned biasing factors were evaluated on different data benchmarks incorporating also real-world contexts with challenging operating conditions.

Three different system settings were considered:

  • the family of descriptors used to represent the facial features;

  • the composition of the training set in terms of representatives of the classes to be predicted;

  • the scaling of the input variables in the features vector to be supplied to the classifier.

Subsequently, the best configurations for training and scaling were tested against two real-world sources of bias:

  • the spatial resolution of the facial image to be analyzed;

  • the head pose of the subject in the scene.

To carry out the aforementioned analysis, two different experimental phases were performed. The first was carried out using a large set of faces (more than 55,000) built by combining images available from different public datasets. The aim of this phase was to evaluate the performance of the soft biometric estimation framework on images acquired under (quasi) ideal conditions (high resolution and frontal faces): in this way, the variability in classification performance when using balanced/unbalanced training sets, different descriptors, and data projection strategies, with and without data scaling, can be accurately evaluated. In the second phase, images acquired in a real environment were used instead: here, the boundary conditions were relaxed by introducing changes in head pose (i.e., non-perfectly frontal faces) and varying the distance from the camera (i.e., different spatial resolutions), with the aim of testing the robustness of the different framework configurations under these two additional real-world biases. Both experimental phases are detailed in the following subsections.

4.1 Experimental phase 1

This experimental phase was carried out on two datasets built by properly merging two of the most representative publicly available face image repositories: Morph [93] and Feret [94]. They consist of face images of people of different gender, race, and age, provided with complete annotations. Moreover, both repositories contain some subjects sampled on different dates (the same person pictured in different years). The resolution of the facial images is 200×240, 400×480, or 512×768 pixels.

Starting from the available images, a first experimental dataset was built without balancing the class cardinalities. It consists of 46,669 male and 9246 female subjects. The considered races were White, Black, Asian, and Hispanic, with cardinalities of 11,843, 41,334, 601, and 1889 subjects, respectively. Finally, the age distribution is shown in Fig. 7.

Fig. 7 Age distribution histogram: the age distribution of subjects in the merged dataset is affected by a critical imbalance

Moreover, two more balanced datasets were built as subsets of the main one. The first contains 8500 entries for each gender class, whereas the second, related to race, contains 600 entries for each class (more balanced age-group subsets could not be built given the data distribution in the starting repositories).

The experimental results were obtained, on the aforementioned datasets, by means of a k-fold cross-validation procedure: the dataset under investigation was randomly split into k folds (k=5) and, for each of the k validation steps, k−1 folds were used for training while the remaining fold was used to evaluate the estimation capabilities. The folds were built so as to retain, in each of them, the same class ratios as the considered dataset: for example, in each fold, the male/female ratio was kept at about 1:10. Class separation for the gender estimation task was straightforward. For race estimation, facial images were split into four categories: White, Black, Asian, and Hispanic. For the age task, subjects were split into as many categories as the considered ages; the number of classes to be predicted was 42.

Face detection and normalization were then performed on each image of the selected k−1 training folds; the feature vectors were then extracted, projected onto the subspace defined by the LDA algorithm, and finally supplied to the SVMs. After the training phase, the held-out fold was tested using the resulting models. At the end of the iterative training/test process, the accuracy results were averaged over the k folds.
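The procedure can be sketched with scikit-learn's StratifiedKFold, which preserves the class ratios within each fold; the data, descriptor dimensionality, and classifier settings below are dummy stand-ins:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))            # stand-in descriptor vectors
y = (rng.random(1000) < 0.1).astype(int)    # ~1:10 class ratio, as in the text

# StratifiedKFold keeps the class ratio of the full set in every fold (k = 5).
accs = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True,
                              random_state=0).split(X, y):
    lda = LinearDiscriminantAnalysis(n_components=1).fit(X[tr], y[tr])
    clf = SVC(kernel='rbf', C=1.0).fit(lda.transform(X[tr]), y[tr])
    accs.append(clf.score(lda.transform(X[te]), y[te]))
print(np.mean(accs))   # accuracy averaged over the k folds
```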

The LBP algorithm1 was computed over a 5×5 spatial grid with 8 neighbors and a 1-pixel radius; the CLBP algorithm was implemented as reported in [9] and applied with the same parameters as LBP. For the HOG operator, the VLFeat library2 was used with the standard parameters suggested in [10], leading to a feature vector of 2016 elements; the WLD operator was implemented following [35], setting T, M, and S to 8, 4, and 4, respectively, over a 4×4 grid of the image, yielding a final feature vector of 2048 elements. SVM (for gender and race classification and for age regression) was used as implemented in the LIBSVM library [90], with the radial basis function (RBF) kernel, \(K(\boldsymbol {x},\boldsymbol {y})=e^{-\gamma \Vert \boldsymbol {x}-\boldsymbol {y}\Vert ^{2}}\), as suggested in [95] for non-linearly separable problems. Finally, the regularization parameter C was set to 1 and γ to 1/N_f, where N_f is the number of features. When the data fed to the SVM were scaled, the target range was [0,1] for each variable of the LDA feature vector.
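For reference, the configuration just described can be summarized as follows (a hypothetical consolidated view, not code from the original system):

```python
# Phase-1 settings as described in the text; the dictionary layout is ours.
CONFIG = {
    "lbp":  {"grid": (5, 5), "neighbors": 8, "radius": 1},
    "clbp": {"grid": (5, 5), "neighbors": 8, "radius": 1},     # as in [9]
    "hog":  {"library": "VLFeat", "feature_len": 2016},        # defaults of [10]
    "swld": {"T": 8, "M": 4, "S": 4, "grid": (4, 4), "feature_len": 2048},
    "svm":  {"kernel": "rbf", "C": 1.0, "gamma": "1 / N_f"},   # LIBSVM [90]
    "scaling_range": (0.0, 1.0),   # used only in the 'scaled' experiments
}
```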

The results of the experimental phase 1 for gender and race are reported in Tables 1 and 2, respectively.

Table 1 Gender confusion tables
Table 2 Race confusion tables

For age estimation, which is a regression problem, the pairs of predicted/real age over the whole k-fold procedure were considered; the average estimation error (blue line) and the standard deviations (red lines) are reported on the left side of Figs. 8, 9, 10, and 11. Let i be a specific age and N_i the corresponding number of entries; the age estimation error for the j-th entry of age i is defined as:

$$ e_{i,j}=\hat{i}_{j}-i_{j} $$
(8)
Fig. 8 LBP age mean error: the left column shows the mean age estimation error (blue line) and its standard deviation (red line); the right column shows the empirical age estimation error distribution over all ages. (a) LBP, not scaled. (b) LBP, scaled

Fig. 9 HOG age mean error: the left column shows the mean age estimation error (blue line) and its standard deviation (red line); the right column shows the empirical age estimation error distribution over all ages. (a) HOG, not scaled. (b) HOG, scaled

Fig. 10 SWLD age mean error: the left column shows the mean age estimation error (blue line) and its standard deviation (red line); the right column shows the empirical age estimation error distribution over all ages. (a) SWLD, not scaled. (b) SWLD, scaled

Fig. 11 CLBP age mean error: the left column shows the mean age estimation error (blue line) and its standard deviation (red line); the right column shows the empirical age estimation error distribution over all ages. (a) CLBP, not scaled. (b) CLBP, scaled

where \(\hat {i}_{j}\) is the predicted age and i_j is the age annotated in the dataset. Using the previous definition, it follows that the mean age estimation error for age i can be defined as

$$ \overline{e}_{i} = \frac{1}{N_{i}} \sum_{j=1}^{N_{i}} e_{i,j} $$
(9)

and the error variance as

$$ {\sigma^{2}_{i}} = \frac{1}{N_{i}}\sum_{j=1}^{N_{i}}\left(e_{i,j}-\overline{e}_{i}\right)^{2} $$
(10)

Finally, for each age i, the RMS error (RMSE i ) was computed as

$$ {\text{RMSE}}_{i} = \sqrt{\frac{1}{N_{i}}\sum_{j=1}^{N_{i}}(e_{i,j})^{2}} $$
(11)

and the total RMS error (RMSE) as

$$ {\text{RMSE}} = \frac{1}{N_{a}}\sum_{j=1}^{N_{a}}{{\text{RMSE}}}_{j} $$
(12)

where N_a is the number of ages between 20 and 60.
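The error statistics of Eqs. (8)–(12) can be computed straightforwardly; a sketch:

```python
import numpy as np

def age_error_stats(pred, true):
    """Per-age mean error, variance, and RMSE, plus the total RMSE."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    stats = {}
    for age in np.unique(true):
        e = pred[true == age] - true[true == age]             # Eq. (8)
        stats[int(age)] = {"mean": e.mean(),                  # Eq. (9)
                           "var": e.var(),                    # Eq. (10)
                           "rmse": np.sqrt((e ** 2).mean())}  # Eq. (11)
    # Eq. (12): total RMSE as the mean of the per-age RMSEs.
    total_rmse = np.mean([s["rmse"] for s in stats.values()])
    return stats, total_rmse
```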

The histograms of the age estimation errors over all ages and samples were constructed and are plotted on the right side of Figs. 8, 9, 10, and 11.

Results for gender and race are presented as confusion tables, where rows refer to the prediction (subscript P) and columns to the true class (subscript T). In detail, the first two rows represent the predictions for the non-scaled case (superscript N), while the next two rows refer to the scaled one (superscript S). Subtables (a), (c), (e), and (g) show the results for the unbalanced dataset, whereas subtables (b), (d), (f), and (h) refer to the balanced subset. The total accuracy (TA) for both the scaled and non-scaled cases is reported in the last two rows of each subtable.

More specifically, Table 1 shows that, for gender estimation, the CLBP descriptor, using a balanced dataset and non-scaled values as input to the classifier, returned the best total classification accuracy (94.5 %). This trend was also confirmed when considering the average of correct estimations for each descriptor (over all combinations of data scaling and dataset balancing): the CLBP classification rate was 92.2 %, surpassing the comparing descriptors (HOG 87.3 %, SWLD 88.4 %, and LBP 87.6 %).

With regard to scaling the input features to the range [0,1], the experiments showed that non-scaled data better preserved the embedded information: the average correct estimation (over all descriptors and dataset balancing combinations) was 92.3 % for non-scaled input versus 85.4 % for scaled input.

Finally, concerning the use of a balanced training dataset, the experiments showed that the average correct estimation (over all descriptors, with no data scaling) was 93.7 % for the balanced dataset versus 91.0 % for the unbalanced one.

To summarize, for gender estimation, the best configuration of the proposed framework uses the CLBP descriptor, trained on a balanced dataset and operating on non-scaled input data.

Race estimation is a rather different problem due to the larger number of classes (four), the presence of borderline cases, and the critical imbalance of subjects among classes (there are just about 600 Asian subjects against more than 41,000 Black subjects).

Using the same evaluation criteria reported above, Table 2 shows that, for race estimation, the CLBP descriptor, using an unbalanced dataset and non-scaled values as input to the classifier, returned the best total classification accuracy (79.4 %).

The average correct estimations for each descriptor (taking into account all the possible combinations of data scaling and dataset balancing) were LBP 71.5 %, HOG 56.1 %, SWLD 62.6 %, and CLBP 72.4 %.

Concerning the scaling of the feature vector variables to the range [0,1], the experiments showed that, over all descriptors and balancing combinations, scaled data gave the best average correct estimation (67.0 %). However, an in-depth analysis of the confusion tables reveals that this result is mainly biased by the bad outcome of the HOG descriptor operating on the unbalanced dataset with non-scaled data (48.3 %). For all the other descriptors, the non-scaled approach is not worse than the scaled one. Over all descriptors, the unbalanced training dataset led to better average correct estimations with both scaled data (75.1 % unbalanced vs. 58.8 % balanced) and non-scaled data (69.0 % unbalanced vs. 59.7 % balanced).

The unbalanced dataset worked better because the number of Asian representatives is much lower than that of the remaining classes; in the balanced case, therefore, only a small number of samples can be used per class. In other words, if the number of representatives of each race class is sufficiently large, the combination of CLBP, a balanced dataset, and non-scaled input to the SVM seems to be the best choice.

A compact view of the performance, for gender and race estimation, is presented in Fig. 12 by means of bar diagrams. In particular, Fig. 12 a refers to gender estimation, whereas Fig. 12 b refers to race estimation.

Fig. 12 Gender (a) and race (b) accuracy comparison: each group of columns represents (from left to right) the unbalanced non-scaled (U-N), unbalanced scaled (U-S), balanced non-scaled (B-N), and balanced scaled (B-S) configurations

Finally, age estimation was addressed as a regression problem; in this case, with 42 classes to predict, no balanced dataset was built (since there were not enough representatives in each class). The experimental results, reported in Figs. 8, 9, 10, and 11, show that CLBP without data scaling was the best-performing descriptor (RMSE = 7.8 years). The best average estimation (over both scaling options) was again obtained with CLBP (RMSE = 8.9 years), which outperformed the other descriptors (LBP: RMSE = 9.5 years; HOG: RMSE = 10.4 years; SWLD: RMSE = 10.4 years). Concerning data scaling, the results show that non-scaled data led to better average regression accuracy (RMSE = 8.7 years) than data scaled to [0,1] before being fed to the SVR (RMSE = 10.9 years). It is interesting to observe that the error distribution (see the plots on the right of Figs. 8, 9, 10, and 11) exhibits a Gaussian-like trend around zero with a steep decrease, demonstrating that in most cases the age estimates were quite accurate.

4.2 Experimental phase 2

In this phase, experiments were carried out on facial images acquired in a real environment: in this way, the boundary conditions were relaxed with respect to the previous experimental phase. In particular, changes in head pose (i.e., non-perfectly frontal faces) and distance from the camera (i.e., different spatial resolutions) were also considered as sources of classification/regression bias. A dataset of seven subjects (five males and two females, all White, with ages ranging from 23 to 48 years) was built on purpose, with 900 images per subject: in 300 images, the subjects were at a distance of 75 cm from the camera in a near-frontal pose (this subset was labelled I_FC, where the subscript stands for frontal-close); in another 300 images, the subjects were at a distance of 75 cm from the camera but performed a head rotation in the range [10°, 30°] (labelled I_RC, rotated-close). Finally, in the remaining 300 images, the subjects were at a distance of 165 cm from the camera in a near-frontal pose (labelled I_FF, frontal-far). When the subjects were closer to the camera, the resolution of the facial patch was about 240×180 pixels; when they were farther away, it was about 100×70 pixels. Some examples of the three subsets are reported in Fig. 13, where columns 1–3 show frontal-close images, columns 4–6 rotated-close images, and columns 7–9 frontal-far images.

Fig. 13 Face examples: pose examples from the dataset acquired in the lab. The first three columns refer to the frontal pose, the next three to the rotated pose, and the last three to the far-distance pose

Experiments were carried out as follows: the facial datasets built in experimental phase 1 were used to learn the projection (LDA), classification/regression (SVM), and scaling models. These models were then stored and used, without any further intervention, for all the soft biometric estimations of experimental phase 2.

In this case, due to the dataset composition, the measurement of the age prediction error was slightly modified as follows: let N_s be the number of subjects and N_f the number of frames per subject (for a given pose configuration).

Age estimation was evaluated through an RMS age estimation error:

$$ \text{RMSE}_{j}=\sqrt{\frac{1}{N_{f}} \sum_{i=1}^{N_{f}} \left(\hat{\theta}_{i,j}- \theta_{j} \right)^{2}} $$
(13)

where \(\hat {\theta }_{i,j}\) is the age estimate for the i-th frame of the j-th subject and θ_j is the real age of the j-th subject. The mean RMS error (over all subjects) is defined as

$$ \text{RMSE}=\frac{1}{N_{s}}\sum_{j=1}^{N_{s}}\text{RMSE}_{j} $$
(14)

Finally, let \(e_{i,j}=\left (\hat {\theta }_{i,j}- \theta _{j} \right)\) be the age estimation error for the i-th frame of the j-th subject and \(\overline {e}_{j}\) its mean over the N_f frames. The error variance of the j-th subject is as follows:

$$ {\sigma^{2}_{j}} = \frac{1}{N_{f}}\sum_{i=1}^{N_{f}}(e_{i,j}-\overline{e}_{j})^{2} $$
(15)

and the total mean variance is defined as

$$ \sigma^{2} = \frac{1}{N_{s}}\sum_{j=1}^{N_{s}}{\sigma^{2}_{j}} $$
(16)

The experimental outcomes of this phase are reported in Tables 3, 4, and 5 and in Figs. 14 and 15. In particular, the tables report gender, race, and age estimation correctness for all the descriptors over the three experimental conditions (frontal-close images I_FC, rotated-close images I_RC, frontal-far images I_FF). It is important to highlight that the best configurations for balancing and scaling strategies (revealed by experimental phase 1) were used and that, due to the lack of statistical significance (i.e., few subjects), the aim of the test on I_FC images was not to evaluate the robustness of the framework (largely tested in the previous subsection) but to provide a reference point for evaluating the influence of pose and resolution. From this point of view, the tables show how much the performance of the various configurations of descriptor, dataset balancing, and scaling strategy is influenced by pose and resolution changes. The results make it evident that the descriptor least sensitive to image resolution is HOG (with a decrease of 0.4 % for gender, 0.9 % for race, and 0.4 years for age), owing to its edge-based nature (it captures coarse texture information, whereas the other investigated methods represent finer structures). Unfortunately, the HOG descriptor suffered the largest drop in performance under pose changes (12.6 % for gender, 7.3 % for race, and 2.4 years for age), where the most robust behavior was obtained by CLBP (0.7 % for gender, 2.9 % for race, and 0.3 years for age), which, thanks to its more sophisticated encoding strategy, was able to handle variations in the facial patterns.

Fig. 14 Gender (a) and race (b) accuracy comparison on the lab dataset: each group of columns refers to (from left to right) the frontal-close (I_FC), rotated-close (I_RC), and frontal-far (I_FF) configurations

Fig. 15 Age RMSE (a) and variance (b) for the lab dataset: each group of columns refers to (from left to right) the frontal-close (I_FC), rotated-close (I_RC), and frontal-far (I_FF) configurations

Table 3 Gender estimation performance for the lab dataset
Table 4 Race estimation performance rate for the lab dataset
Table 5 Age estimation RMSE for the lab dataset

5 Conclusions

In this paper, a detailed study of the performance of soft biometric estimation, in terms of accuracy as well as robustness under real-world conditions, has been presented. First, different facial descriptors (LBP, HOG, SWLD, CLBP), training datasets, and data scaling approaches were extensively tested in a recently introduced common framework that includes a data reduction step (based on linear discriminant analysis) and a final SVM classification step. Then, robustness in an unconstrained scenario, with scaling and rotation of the target with respect to the acquisition device, was studied. Experimental evaluations were carried out on benchmark data built both from publicly available datasets and from image sequences acquired in a lab environment. The results highlighted that the best average performance was reached by the CLBP operator in all the estimation problems. This work showed that non-scaled feature vectors are preferable to scaled ones and that balancing the class cardinalities when building the dataset can improve performance. Moreover, tests performed on the lab dataset with relaxed boundary conditions proved that the CLBP operator outperforms the other descriptors in terms of average estimation performance in the presence of scaling and rotation. Future work will extend the lab dataset, both in the number of subjects and in pose and resolution variability, in order to obtain more significant statistics.

6 Endnotes

1 The online implementation available at http://www.bytefish.de was used.

2 http://www.vlfeat.org.