1 Introduction

The development of telemedicine is progressing rapidly thanks to commercial measurement devices that are increasingly accessible in both cost and usability. In the Ambient Assisted Living (AAL) research area, one of the main objectives is the development of technologies that can help elderly people through the different stages of aging. Indeed, AAL aims to develop innovations that keep people connected, healthy, active, and happy into their old age. It concerns products and services that make a real difference in people's lives, both for those who face the challenges of aging and for those who care for older people who need help [1].

In the AAL context, continuous monitoring of vital parameters plays a key role in prevention, for example of diseases related to cardiac and/or respiratory activity. Specifically, human vital signs such as Heart Rate (HR), Breath Rate (BR), and Blood Oxygen Saturation (SpO2) are commonly used as indicators for monitoring the physiological status of the human body.

Traditional methods for vital signs measurement are usually contact-based. Wearable smart sensors such as smartwatches, smart garments, and wristbands have spread widely in the market, since they allow signal processing algorithms to derive, for example, discrete HR or BR values from a raw acquired signal [2,3,4]. There are also generally accepted clinical standards for the evaluation and monitoring of the vital signs introduced above. For HR monitoring, the most widely used techniques are pulse frequency extraction from a wearable photoplethysmogram [5] and ECG monitoring [6]. BR monitoring is typically performed via thoracic bioimpedance from ECG leads or capnography [7], while SpO2 is generally measured with a two-wavelength pulse oximeter, either as a standalone device or integrated into a vital sign monitor [8]. Wearable sensors offer high localization and tracking accuracy, but they are intrusive: they require older adults, with varying cognitive levels, to remember to wear the devices and to recharge them frequently. Contactless sensors, in contrast, are less intrusive and can monitor activities or vital signs in real-life, naturalistic environments without interfering with an individual's daily routine. The COVID-19 pandemic has further increased the use of noncontact technology for vital signs monitoring. This unobtrusiveness is clearly preferred by frail and elderly people, who are generally reluctant to use wearable sensors.

In the era of ubiquitous computing, mobile devices such as smartphones and tablets are omnipresent, and their cameras/webcams are already used as detectors in some applications (e.g., fall detection, activity recognition, or affective computing), making them particularly interesting for the noncontact measurement of vital signs. A literature survey shows that current approaches for contactless vital sign estimation from RGB images fall into two main classes: a) motion-based methods and b) methods that evaluate changes in color intensity. The latter are the most popular and best-performing methods; they are based on the analysis of the remote photoplethysmography (rPPG) signal acquired by the vision sensor [9, 10]. Compared to standard PPG techniques, this approach requires no wearable accessories and adapts easily to different requirements across applications, such as emotional communication in human–machine interaction [11], monitoring the driver's vital signs in the automotive industry [12], and estimation of cardiovascular parameters beyond HR [13, 14]. rPPG is a low-cost, non-contact, and pervasive technique based on the following observation: blood absorbs light more strongly than the surrounding tissues, and variations in blood volume influence the transmission and reflection of light (Fig. 1). This leads to subtle color changes on human skin that are invisible to the human eye but can be recorded by a simple RGB vision sensor [4].

Fig. 1
figure 1

Working principle of remote PPG (rPPG)

The scientific literature on rPPG signal extraction from video streams is very dynamic. Existing remote vital signs estimation methods mainly perform the following tasks: 1) extraction of a specific region of interest (ROI); 2) temporal signal estimation with subsequent denoising, detrending, and normalization; 3) consolidation of the obtained signal into an rPPG estimate; and 4) finally, vital sign estimation from the rPPG signal. The literature offers several ways to perform each of these steps, and these choices determine the efficacy of the vital sign estimation.

For example, in [15] the authors applied a blind source separation (BSS) technique (Independent Component Analysis, ICA) to the RGB color channels recorded by a webcam to estimate HR, using non-Gaussianity as the criterion for independence. In [16], the authors proposed a chrominance-based method (CHROM), grounded in the dichromatic reflection model, to separate the specular reflection component from the diffuse reflection component that contains the pulsatile physiological signal, both reflected from the skin. In 2014, De Haan and van Leest [17] defined a Blood-Volume Pulse (PBV) vector that contains the signature of the specific blood volume variations in the skin, removing noise and motion artifacts. Yan et al. [18] proposed a signal-weighted analysis method to extract the PPG signal, taking the weighted average of the raw traces of the RGB color channels; the main idea was to maximize the signal and minimize the noise using a wavelet transform. Poh et al. [19] investigated the spatial distribution of HR-related information on the subject's face, selecting an ROI of 60% of the width of the full face and its full height. By applying ICA, vital signs such as HR, BR, and Heart Rate Variability (HRV) were quantified and compared to corresponding measurements from Food and Drug Administration-approved sensors, with high agreement across all physiological parameters.

To extract HR using a low-cost vision sensor (i.e., a webcam), Cheng et al. [20] designed a method robust to illumination variation by combining joint blind source separation (JBSS) and ensemble empirical mode decomposition (EEMD) to suppress the noise such variation causes. However, the method is not free from motion artifacts, as it only considers stationary subjects; moreover, the illumination was mostly controlled, which restricts its real-time applicability.

More recently, as with many computer vision and signal processing applications, deep learning (DL) methodologies have shown promise in mapping the complex physiological processes underlying remote vital signs measurement. In the work of Yu et al. [21], a neural architecture search (NAS) was implemented to automatically find the best-suited backbone 3D Convolutional Neural Network (CNN) for remote PPG signal extraction. Specifically, a special 3D convolution operation, temporal difference convolution (TDC), was designed to help track the ROI and improve robustness in the presence of motion and poor illumination. NAS was then performed with two gradient-based NAS methods [22] to form a backbone network for rPPG signal extraction and vital signs estimation; the same work also proposed two data augmentation methods to counter data scarcity. In [23] a multi-hierarchical CNN able to estimate HR quickly was described, in which only 15 s of face video were required to effectively reconstruct the rPPG signal. A three-layer 3D CNN extracted low-level facial feature maps from RGB face videos, and a skin map was generated from these maps to emphasize skin regions with stronger signals. A weight mask was then constructed by fusing the skin map and the channel-wise feature map; finally, the high-level feature map was multiplied channel-wise by the weight mask and fed into an rPPG signal extractor. In [24] a 2D CNN was trained to extract color information from the ROI pixels, with noise further removed by a GAN-style module: a discriminator with access to high-quality rPPG signals guided a generator to reconstruct a noiseless rPPG signal from which HR was extracted. This noise-removal technique can be applied in other rPPG methods to improve performance. Lokendra et al. [25] proposed AND-rPPG, a novel HR estimation method that improves HR estimation by denoising the temporal signals using action units (AUs) and a DL architecture; different networks denoise specific ROIs, because different facial expressions induce different noise characteristics depending on the facial region.

However, both methodological macro-categories introduced above (classical contactless approaches and DL-based approaches) have limitations. The algorithmic pipelines of classical approaches are complicated and achieve lower accuracy than DL-based methods; moreover, some of them require special devices that are generally not low-cost. DL-based approaches, on the other hand, do not need complex signal processing and achieve high precision, but most of them estimate only heart rate and cannot run in real time because of the size of the DL model. Beyond methodological complexity, the topic presents additional critical issues that have been only partially addressed in scientific works, such as subject motion, image optimization, ambient light illumination, spectrum analysis, and motion artifacts. Also, most previous studies of non-contact video-based vital sign monitoring involved healthy volunteers, sometimes only in daylight, and usually concentrated on HR or BR estimation without considering other important vital signs such as SpO2. To the authors' knowledge, a critical issue poorly addressed in the relevant scientific literature is the behavior of such algorithms/pipelines when the observed subject is an elderly person.

Because the best-performing contactless vital parameter estimation methodologies rely on the time course of a signal derived from color changes within specific ROIs, the presence of wrinkles, moles, and skin folds can distort the estimated value of the vital parameter. In addition, ambient lighting conditions strongly influence the time course of the raw RGB signals used for plethysmography signal extraction. To summarize, in this paper we describe a novel pipeline designed specifically for estimating the vital signs of elderly people. The pipeline therefore integrates algorithmic methodologies capable of handling the well-known problems inherent in measuring vital signs for this category of subjects. Although optimized for elderly subjects, the pipeline also reliably estimates the vital parameters of other categories of end-users (middle-aged and young).

The remainder of this paper is structured as follows. Section 2 explains our proposed solution for vital signs estimation, providing details on both the hardware (the components used for the prototype) and the software (each step of the algorithmic pipeline). The considered datasets and the results are presented in Section 3. Section 4 discusses limitations and future work, and Section 5 draws our conclusions.

2 Material & methods

This section begins with an overview of the hardware involved and tested for the implementation of the proposed pipeline. Following this, a detailed description of the algorithmic steps that are designed and implemented for contactless vital signs estimation is given.

2.1 Hardware description

The proper operation of the pipeline is ensured by any low-cost commercial camera (even the built-in webcam of a laptop) together with a modern, widely available processing unit of the kind generally used for open-platform development. For video stream acquisition, the selected camera must provide the functionality and features detailed below. The first requirement is an RGB input image with a resolution of at least 640 × 480 pixels containing the end-user's facial region. In addition, the orientation of the face with respect to the optical center of the camera must lie between -30° and +30° horizontally (pan angle) and between -20° and +20° vertically (tilt angle), to guarantee the presence of the selected ROIs in the acquired video stream. A further requirement concerns the lighting of the environment, in which a brightness level of at least 100 lx must be measured (the light level refers to general lighting; 1 lx = 1 lm/m²). A final but fundamental requirement concerns the end-user's position relative to the optical center of the camera: to return a suitable extracted rPPG signal, the end-user must be positioned at a distance between 0.5 m and 1.5 m.

On the hardware side, the current version of the proposed pipeline integrates a webcam manufactured and marketed by Logitech™ (Fig. 2a). The selected webcam is very light (162 g) and has small dimensions (height: 43.3 mm, width: 94 mm, depth: 71 mm). These features make it versatile in terms of installation, as it can be placed on a table, on a laptop, or on a PC monitor. In addition, it is equipped with a 1.5 m USB connection cable that allows its use even at some distance from the processing unit.

Fig. 2
figure 2

(a) Logitech C920 HD Pro webcam and (b) Raspberry Pi 4 Model B

Regarding technical specifications, the Logitech C920 HD Pro webcam captures video/images at the following resolutions: 1920 × 1080 pixels (30 fps) and 720 × 576 pixels (30 fps). In addition, it has a fixed diagonal field of view of 78°, automatic illumination correction, and autofocus. To promote the deployment of the proposed solution while keeping its cost limited, the latest product in the popular Raspberry Pi computer family (Raspberry Pi 4 Model B) was selected as the processing unit (Fig. 2b). It features a Broadcom BCM2711 quad-core Cortex-A72 (ARM v8) 64-bit processor at 1.5 GHz, 4 GB of LPDDR4 memory, wireless LAN, Bluetooth 5.0, Gigabit Ethernet, two USB 3.0 and two USB 2.0 ports, 40 general-purpose input/output (GPIO) pins, and a micro-SD card slot for loading the operating system and for data storage.

2.2 Proposed pipeline for vital signs estimation

The input of the proposed pipeline is a stream of images acquired by a commercial camera. It is important to highlight that, unlike in previous literature, the camera need not be positioned frontally to the subject's face: effective pipeline operation is ensured even with non-frontal face images.

This is due to the algorithmic choices made for ROI detection. The algorithmic pipeline consists of two main blocks. The pre-processing stage integrates face detection, ROI extraction, and a series of algorithmic steps for managing any changes in ambient lighting that may occur. In addition, within the pre-processing stage, algorithmic methodologies were designed and implemented to improve the extraction of the plethysmography signal by removing any moles, wrinkles, and skin folds from the ROIs. An overview of the pipeline as a block diagram is shown in Fig. 3.

Fig. 3
figure 3

Proposed algorithmic pipeline designed and implemented for vital signs estimation

2.2.1 Pre-processing

In the pre-processing stage, one of the most important algorithmic steps is the detection of a face in the video stream acquired by the commercial camera, a topic still widely discussed within the scientific community. Face detection is the foundation of a contactless vital signs estimation system based on the temporal evaluation of signals extracted from the R, G, and B color channels, and the reliability of its results has a great impact on the performance of the entire pipeline. Here, to make face detection as independent as possible of orientation and/or distance (within the constraints given in Section 2.1 "Hardware description"), the Mediapipe library [26] was used. Mediapipe is commonly used for face mesh prediction since it is a lightweight face geometry solution: it predicts 468 3D face landmarks in real time, covering the different facial areas and inferring the estimated surface geometry of the human face. It uses Machine Learning (ML) to create the 3D surface geometry from a single camera input, without requiring a depth sensor. Thanks to its lightweight model architecture, the system provides the real-time performance necessary for a correct temporal interpretation of the signals. A facial image-based rPPG algorithm also requires the selection of an ROI within the detected facial region for efficient signal extraction; moreover, the literature notes that the choice of ROI also affects the computational load of the rPPG method. Several studies [27, 28] have addressed the problem of choosing the best ROI inside the facial region, evaluating areas where the blood vessel distribution is expected to be most prominent: forehead, left and right cheeks, nose, mouth, nasal dorsum, and chin. The cheeks emerged as an excellent ROI, since they are the regions where the rPPG signal is strongest. Accordingly, in the present work the rPPG signal is processed from the color information extracted from the right and left cheeks, identified by the corresponding facial landmarks returned by the Mediapipe library (Fig. 4a) and reported in Table 1.

Fig. 4
figure 4

(a) Identification of ROIs on the face, (b) Wrinkles and fold detection for ROI cleaning

Table 1 Indices of the 10 facial landmarks used for the extraction of ROIs of interest (left and right cheek)
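To make this step concrete, the following is a minimal sketch of cheek-ROI extraction with the Mediapipe Face Mesh. The landmark indices shown are illustrative placeholders only; the 10 indices actually adopted in this work are those listed in Table 1.

```python
# Cheek-ROI extraction with Mediapipe Face Mesh. The landmark indices below
# are illustrative placeholders: the 10 indices actually used are in Table 1.
import cv2
import mediapipe as mp
import numpy as np

LEFT_CHEEK = [117, 118, 101, 36, 205]     # placeholder indices (see Table 1)
RIGHT_CHEEK = [346, 347, 330, 266, 425]   # placeholder indices (see Table 1)

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False,
                                            max_num_faces=1)

def extract_cheek_rois(frame_bgr):
    """Return [left, right] cheek patches of a BGR frame, or None if no face."""
    h, w = frame_bgr.shape[:2]
    res = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not res.multi_face_landmarks:
        return None
    lm = res.multi_face_landmarks[0].landmark   # 468 normalized 3D landmarks
    rois = []
    for ids in (LEFT_CHEEK, RIGHT_CHEEK):
        pts = np.array([(int(lm[i].x * w), int(lm[i].y * h)) for i in ids],
                       dtype=np.int32)
        x, y, bw, bh = cv2.boundingRect(pts)    # box around the cheek landmarks
        rois.append(frame_bgr[y:y + bh, x:x + bw])
    return rois
```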

As pointed out in previous sections, the designed pipeline integrates specific algorithmic blocks to improve the quality of the rPPG signal extracted from the cheeks. Skin folds, wrinkles, and moles are commonly present within these facial regions, all the more so when the observed subjects are elderly. Consequently, downstream of ROI selection, algorithms were implemented to detect and remove any wrinkles, folds, and moles. Several algorithms in the scientific literature can detect such skin characteristics within an ROI, but these approaches have limitations: most previous research was validated only on the forehead region (which does not correspond to the ROI considered in this work); existing algorithms cannot separate coarse wrinkles from fine wrinkles; they identify almost exclusively wrinkles and folds in the horizontal direction; and they consider only lines rather than whole regions. To handle these issues, the proposed pipeline integrates a logical block called 'ROI cleaning'. For wrinkle and fold detection, the approach is inspired by the work proposed in [29] but incorporates methodological modifications that optimize its operation for the problem addressed in this paper. Starting from an input image (the cheek region shown in Fig. 4a), denoted ROIcheek after conversion to greyscale, the first step is to calculate the gradient (Grad_x, Grad_y) along the horizontal and vertical directions, using formula (1):

$$\nabla \mathrm{ROI}_{\mathrm{cheek}}\left(x,y\right)=\left(\mathrm{Grad}_{x},\mathrm{Grad}_{y}\right)=\left(\frac{\partial \mathrm{ROI}_{\mathrm{cheek}}}{\partial x}, \frac{\partial \mathrm{ROI}_{\mathrm{cheek}}}{\partial y}\right)$$
(1)

where the two components represent the directional gradients. In well-established approaches in the literature, the second component (Grad_y) is used as the input image for the Hybrid Hessian Filter (HHF), which extracts only horizontal lines. Here we modify the algorithm for vertical line detection, so the first component (Grad_x) is used as the input to the modified version of HHF. The Hessian matrix H at a given scale value Ω is then defined by expression (2):

$$H\left(x,y,\Omega \right)= \left[\begin{array}{cc}{H}_{a}& {H}_{b}\\ {H}_{b}& {H}_{c}\end{array}\right]$$
(2)

where Ha, Hb, and Hc are the second-derivative outputs. Each output approximates the convolution of ROIcheek with the Gaussian kernels G1(Ω) and G2(Ω), as in formulations (3), (4), and (5):

$$H_{a}\left(x,y,\Omega \right)= \mathrm{ROI}_{\mathrm{cheek}}\left(x,y\right)\ast G_{1}(\Omega )$$
(3)
$$H_{b}\left(x,y,\Omega \right)= \mathrm{ROI}_{\mathrm{cheek}}\left(x,y\right)\ast G_{2}(\Omega )$$
(4)
$$H_{c}= {\left|H_{a}\right|}^{T}$$
(5)

The remaining steps for wrinkle detection are described in [30]. The scale value Ω ranges between 0.5 and 3 and is determined dynamically by evaluating the width of the selected ROI. A visual representation of the results of the algorithm described above is shown in Fig. 4b.
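As an illustration of Eqs. (1)–(5), the sketch below computes the directional gradient and a Gaussian-scale Hessian response with OpenCV and scipy. Note that it uses the standard symmetric-Hessian eigenvalue analysis as a stand-in for the modified HHF of [29, 30], whose full line-extraction steps are not reproduced here; the scale value and any ridge threshold applied afterwards are assumptions.

```python
# Directional gradient (Eq. (1)) and Gaussian-scale Hessian response for
# vertical wrinkle/fold detection. The eigenvalue analysis below uses the
# standard symmetric Hessian as a stand-in for the modified HHF of [29, 30].
import cv2
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_ridge_response(roi_bgr, omega=1.5):
    roi = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)
    grad_x = cv2.Sobel(roi, cv2.CV_64F, 1, 0, ksize=3)        # Eq. (1): Grad_x
    # Second-derivative responses at scale omega (Gaussian-derivative kernels)
    h_a = gaussian_filter(grad_x, sigma=omega, order=(0, 2))  # d2/dx2
    h_b = gaussian_filter(grad_x, sigma=omega, order=(1, 1))  # d2/dxdy
    h_c = gaussian_filter(grad_x, sigma=omega, order=(2, 0))  # d2/dy2
    # Per-pixel eigenvalues of [[h_a, h_b], [h_b, h_c]]; a large magnitude in
    # one direction only indicates a wrinkle-like ridge.
    root = np.sqrt(((h_a - h_c) / 2.0) ** 2 + h_b ** 2)
    lam1 = (h_a + h_c) / 2.0 + root
    lam2 = (h_a + h_c) / 2.0 - root
    return lam1, lam2
```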

For the identification and removal of moles in the ROI, the proposed pipeline uses template matching, a widely diffused high-level pattern recognition approach in machine vision [30]. The objective of this technique is to find the patch of the source image that best matches a given template. Specifically, the presence of moles in the ROI is assessed through the template matching technique known as Normalized Cross-Correlation (NCC) [31], evaluating the NCC measure between two images via the Euclidean distance [32]. Since a mole tends to appear as a small, circular dark region bordering an area of different brightness, the second-derivative filter Laplacian of Gaussian (LoG) is used as the template for detecting this kind of region [33]. LoG combines a Gaussian smoothing filter and the Laplacian operator to detect points that differ sharply from their surroundings; since the Laplacian is a high-pass filter, the Gaussian smoothing is applied first to suppress noise. The Laplacian operator is the second-order derivative operator defined in Eq. (6); combined with the Gaussian filter as in Eq. (7), it forms the LoG.

$$L\left(x,y\right)=\frac{{\partial }^{2 }f(x,y)}{\partial {x}^{2}}+\frac{{\partial }^{2 }f(x,y)}{\partial {y}^{2}}$$
(6)
$$LoG\left(x,y\right)=-\frac{1}{\pi {\sigma }^{4}}\left[1-\frac{{x}^{2}+{y}^{2}}{2{\sigma }^{2}}\right]{e}^{-\frac{{x}^{2}+{y}^{2}}{2{\sigma }^{2}}}$$
(7)

With the application of LoG to the output image of the previous algorithmic step (third image of Fig. 4b), only the prominent facial features, including moles, are detected. To detect the moles accurately, morphological closing is performed to fill holes, and connected components are then extracted as mole candidates, which must be verified by height/width ratio and area size. The height/width aspect ratio in Eq. (8) filters out thin-shaped components:

$$0.3<\frac{height\_mole}{width\_mole}<2.0$$
(8)
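A simplified sketch of this mole-detection chain is given below. For brevity it thresholds the LoG response directly instead of performing the NCC template matching described above, and the threshold and area bounds are illustrative assumptions.

```python
# Simplified mole detection: LoG response (Eq. (7)), morphological closing,
# connected components, and the aspect-ratio check of Eq. (8).
# NOTE: thresholds and area bounds are illustrative assumptions.
import cv2
import numpy as np

def detect_moles(roi_gray, sigma=2.0):
    """roi_gray: uint8 greyscale cheek patch -> list of (x, y, w, h) boxes."""
    blurred = cv2.GaussianBlur(roi_gray, (0, 0), sigma)
    log = cv2.Laplacian(blurred, cv2.CV_64F)       # Gaussian + Laplacian = LoG
    # Dark blobs on brighter skin give a positive LoG response at their centre
    mask = (log > 2.0 * log.std()).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill holes
    n, _, stats, _ = cv2.connectedComponentsWithStats(closed)
    moles = []
    for i in range(1, n):                          # label 0 is the background
        x, y, w, h, area = stats[i]
        if 0.3 < h / w < 2.0 and 4 < area < 200:   # Eq. (8) plus area filter
            moles.append((x, y, w, h))
    return moles
```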

The visual result of the algorithmic step previously described is shown in Fig. 5.

Fig. 5
figure 5

(a) image after wrinkles and fold detection, (b) mole region detection, (c) image after ROI cleaning step

The last algorithmic step in the pre-processing stage minimizes the influence of ambient illumination on the time course of the individual R, G, and B signals extracted from the ROI after the "ROI cleaning" stage. This goal is achieved through the application of Contrast-Limited Adaptive Histogram Equalization (CLAHE) [34]. CLAHE differs from Conventional Histogram Equalization (CHE) in that it constrains contrast, and it was developed to prevent the over-amplification of noise that CHE produces. The algorithm divides the image into contextual regions, computes a clipped histogram and the corresponding gray-level mapping for each region, and finally interpolates these mappings to assemble the output CLAHE image.
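As an example, a minimal CLAHE application on the luminance channel using OpenCV is shown below; the clip limit and tile grid size are illustrative values, not parameters reported in this work.

```python
# CLAHE on the luminance channel of the cleaned ROI; clipLimit and
# tileGridSize are illustrative values, not parameters from the paper.
import cv2

def apply_clahe(roi_bgr, clip_limit=2.0, tiles=(8, 8)):
    lab = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)                        # equalize luminance only
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tiles)
    merged = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(merged, cv2.COLOR_LAB2BGR)
```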

2.2.2 Feature extraction and vital signs estimation

Following the application of the pre-processing blocks, the RGB component values are far less disturbed by subject motion and changes in ambient lighting. Moreover, the RGB values are regularized by the 'ROI cleaning' block, which removes the dependence of these signals on the age of the observed subject. In a continuous monitoring scenario, the video segment used for vital signs estimation is a sliding temporal window: the portion of video from which a discrete value of the vital parameter is obtained. Previous studies on HR, BR, and SpO2 estimation typically employed window sizes between 20 and 60 s. In our work, the window length was chosen to make the algorithm more responsive to physiological changes and more practical for real-time implementation: the raw RGB signals are segmented using a window of 30 s, with a 1-s slide between consecutive windows. The raw RGB temporal signals are obtained by averaging the pixel values within the output image of the pre-processing stage, frame by frame. After extracting the raw signals for each frame of the sliding window, specific signal processing techniques improve signal quality for the subsequent feature extraction steps: detrending removes linear trends from the raw signal and, since the interest lies in the periodicity of the signal, the result is normalized by dividing it by its maximum absolute value and smoothed with a sliding average filter. In this way, a clean raw signal is obtained for each color channel. From this point onwards, the algorithmic pipeline provides two different branches for the evaluation of vital signs, described in the following subsections.
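A compact sketch of this windowing and conditioning stage is given below, assuming a 30 fps stream and a 5-sample moving-average kernel (both values are assumptions, not parameters stated in the text).

```python
# Windowing and conditioning of the raw traces: spatial mean per frame,
# 30 s window with 1 s slide, linear detrending, max-abs normalization and
# moving-average smoothing. fps and the 5-sample kernel are assumptions.
import numpy as np
from scipy.signal import detrend

def raw_trace(roi_frames):
    """Mean color value of the cleaned ROI for each frame -> (N, 3) array."""
    return np.array([f.reshape(-1, 3).mean(axis=0) for f in roi_frames])

def windows(trace, fps=30, win_s=30, step_s=1, smooth=5):
    n_win, n_step = int(win_s * fps), int(step_s * fps)
    kernel = np.ones(smooth) / smooth                       # moving average
    for start in range(0, len(trace) - n_win + 1, n_step):
        seg = detrend(trace[start:start + n_win], axis=0)   # remove linear trend
        seg = seg / np.abs(seg).max(axis=0)                 # max-abs normalization
        seg = np.apply_along_axis(np.convolve, 0, seg, kernel, mode='same')
        yield start, seg
```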

HR and BR estimation

After the pre-processing steps, to obtain realistic HR and BR values it is necessary to filter the raw RGB signals to remove unrealistic frequencies. The raw RGB signals are therefore passed through a third-order band-pass filter with ideal behavior to delete high- and low-frequency noise. The filter removes components outside the following bands: 0.7–4 Hz for HR and 0.15–0.4 Hz for BR. These cut-off frequencies are commonly used in previous studies and correspond to HR between 42 and 240 bpm and BR between 9 and 24 breaths per minute. The pre-processed temporal traces in RGB space, denoted R(t), G(t), and B(t), are then the input to the time-domain rPPG method used for the extraction of the pulse signal and the subsequent HR/BR estimation (Fig. 6).

Fig. 6
figure 6

Schematic representation of the algorithmic steps implemented for the estimation of HR and BR
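The text does not name the filter family, so the sketch below realizes the third-order band-pass stage with a Butterworth design applied forward-backward (zero phase), one common choice under this assumption.

```python
# Third-order band-pass stage. The paper does not name the filter family,
# so a Butterworth design applied forward-backward (zero phase) is assumed.
from scipy.signal import butter, filtfilt

def bandpass(signal, fps, low, high, order=3):
    """Band-pass each column of `signal` between `low` and `high` Hz."""
    b, a = butter(order, [low, high], btype='band', fs=fps)
    return filtfilt(b, a, signal, axis=0)

# HR band: 0.7-4 Hz (42-240 bpm); BR band: 0.15-0.4 Hz (9-24 breaths/min)
# rgb_hr = bandpass(rgb_trace, fps=30, low=0.7, high=4.0)
# rgb_br = bandpass(rgb_trace, fps=30, low=0.15, high=0.4)
```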

In the literature there are well-known and widely used rPPG methods such as PCA, Independent Component Analysis (ICA), FastICA, and the Green Channel method, each with advantages and disadvantages. BSS methods show limited success, mainly in uncontrolled contexts; PCA suffers from the presence of motion artifacts in the original signals; and the Green Channel method rests on the limiting assumption that the plethysmography information is contained exclusively in that color channel. Following these considerations, and evaluating the most recent results in the scientific literature, the chrominance-based method (CHROM) is used for the temporal reconstruction of the rPPG signal; it was introduced precisely to address the weaknesses of the methodologies above. From a theoretical point of view, CHROM is based on the following considerations. According to the dichromatic reflection model described in [16], the light reflected from the skin consists of two main components: 'diffuse reflection' and 'specular reflection'. The intensity variations of the diffuse component are due to the cardiac cycle, whereas the specular component shows the color of the illuminant and contains no pulse signal. The two components together form the color we observe, and their relative contribution depends on the angles formed between the skin, the camera, and the light source in the environment. Consequently, these components depend heavily on the movement of the observed subject with respect to the camera, and on lighting variation, during the observation period. The advantage of the CHROM methodology is the elimination of the specular reflection component through the use of color differences (chrominance signals).

From a mathematical point of view, the CHROM method starts with a standardization of the colour channels (zero mean, unit standard deviation). Next, the three color components R(t), G(t), and B(t) are projected onto two orthogonal chrominance vectors, expressed through Eqs. (9) and (10):

$$X_{CHROM}\left(t\right)=3\,R\left(t\right)-2\,G(t)$$
(9)
$$Y_{CHROM}\left(t\right)=1.5\,R\left(t\right)+G\left(t\right)-1.5\,B(t)$$
(10)

The final pulse signal (corresponding to the rPPG signal) is calculated by combining the obtained chrominance vectors through formula (11):

$$rPPG\left(t\right)={X}_{CHROM}\left(t\right)-\mu {Y}_{CHROM}\left(t\right)$$
(11)

where µ is equal to the ratio of the standard deviations of \({X}_{CHROM}\left(t\right)\) and \({Y}_{CHROM}\left(t\right)\).
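Eqs. (9)–(11) translate directly into a few lines of code; the sketch below assumes channel-wise standardization of the filtered traces as the initial normalization step.

```python
# CHROM projection of Eqs. (9)-(11); channel-wise standardization of the
# filtered traces is assumed as the initial normalization step.
import numpy as np

def chrom_pulse(rgb):
    """rgb: (N, 3) array of band-passed R, G, B traces -> rPPG signal (N,)."""
    rgb = (rgb - rgb.mean(axis=0)) / rgb.std(axis=0)   # standardize channels
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    x = 3.0 * r - 2.0 * g                              # Eq. (9)
    y = 1.5 * r + g - 1.5 * b                          # Eq. (10)
    mu = x.std() / y.std()                             # ratio of std deviations
    return x - mu * y                                  # Eq. (11)
```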

The obtained pulse signal may still contain artifacts, so an appropriate removal technique is advantageous. Among the various methodologies for removing artifacts from a pulse signal, one of the most widely used and best-performing was integrated into the proposed pipeline. It is based on evaluating the deviation of the rPPG signal energy from the energy measured in time windows with a standard rPPG signal. Specifically, the following steps are performed: a) calculation of the squared signal rPPG(t)² within a predefined time window; b) calculation of the moving variance signal σ²(rPPG(t)²) and the moving median signal M(rPPG(t)²) over sliding windows of 10 s and 30 s respectively; c) application of a decision criterion for artifact removal at time t, expressed as: if σ²(rPPG(t)²) > K · M(rPPG(t)²), then t is an artifact. Through experimental pipeline pre-validation sessions, a K value of 20 was determined.
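A sketch of this criterion using pandas rolling statistics follows; centred window placement is an assumption.

```python
# Energy-based artifact criterion: moving variance (10 s window) and moving
# median (30 s window) of rPPG(t)^2; a sample is flagged when the variance
# exceeds K times the median (K = 20, as determined in pre-validation).
import numpy as np
import pandas as pd

def artifact_mask(rppg, fps=30, K=20):
    sq = pd.Series(np.asarray(rppg) ** 2)
    var10 = sq.rolling(int(10 * fps), center=True, min_periods=1).var()
    med30 = sq.rolling(int(30 * fps), center=True, min_periods=1).median()
    return (var10 > K * med30).to_numpy()   # True where t is an artifact
```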

At this point, given an estimate of the plethysmography signal, HR and BR can be estimated through frequency analysis. The signal, which contains a distinct periodicity, is converted to the frequency domain using the discrete Fourier transform; the algorithm preferred by most authors is the Fast Fourier Transform (FFT). To determine the average HR and BR, the frequencies of the power spectrum peaks are considered: specifically, the frequency fHR corresponding to the maximum peak intensity in the 0.7–4 Hz band for HR estimation, and the frequency fBR corresponding to the maximum peak intensity in the 0.15–0.4 Hz band for BR estimation. HR and BR values, expressed as average beats per minute and breaths per minute respectively, are obtained by multiplying fHR and fBR by 60.

A threshold on the maximum change in HR and BR between consecutive measurements improves the performance of the overall system. In this work the threshold is set to 6 bpm for HR and 3 breaths/min for BR: if the difference between the current HR/BR estimate and the last accepted value exceeds the threshold, the current estimate is rejected and the next-highest power frequency in the operational range is searched. If no suitable frequency peak is located, the last HR/BR value is retained.
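The spectral estimation and the continuity check can be sketched together as follows; for simplicity, every in-band frequency bin is treated as a candidate peak, which is a simplification of true peak picking.

```python
# Spectral HR/BR estimation plus the continuity check. Every in-band FFT bin
# is treated as a candidate peak (a simplification of true peak picking).
import numpy as np

def spectral_rate(pulse, fps, band, prev_bpm=None, max_delta=6.0):
    freqs = np.fft.rfftfreq(len(pulse), d=1.0 / fps)
    power = np.abs(np.fft.rfft(pulse)) ** 2
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    order = np.argsort(power[in_band])[::-1]        # strongest bins first
    for bpm in 60.0 * freqs[in_band][order]:        # Hz -> per-minute rate
        if prev_bpm is None or abs(bpm - prev_bpm) <= max_delta:
            return bpm
    return prev_bpm                                 # no acceptable peak: keep last

# hr = spectral_rate(rppg, fps=30, band=(0.7, 4.0), prev_bpm=last_hr, max_delta=6)
# br = spectral_rate(resp, fps=30, band=(0.15, 0.4), prev_bpm=last_br, max_delta=3)
```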

SpO2 estimation

For the estimation of SpO2, it is appropriate to first define blood saturation. Blood saturation (or oxygen saturation) measures the percentage of oxyhemoglobin (oxygen-bound hemoglobin) in the blood, and it is represented as arterial oxygen saturation (SaO2) and venous oxygen saturation (SvO2). SaO2 is the parameter measured with oximetry, normally expressed as a percentage. Under normal physiological conditions, arterial blood is 97% saturated, whilst venous blood is 75% saturated. SpO2 is defined as the ratio of the concentration of oxygenated hemoglobin to the total concentration of hemoglobin in the blood (oxygenated + deoxygenated), as expressed by Eq. (12):

$$SpO_{2}=\frac{C(HbO_{2})}{C\left(HbO_{2}\right)+C(Hb)} \times 100\%$$
(12)

A good estimate of SpO2 can be obtained by evaluating the cardiovascular pulse wave signal at two wavelengths: 660 nm (red light) and 940 nm (near-infrared light). The literature suggests that, considering the RGB signal obtained at the end of the detrending and normalization stage, the B channel can stand in for the infrared wavelength. In the present work, the standard deviations of the R-channel and B-channel intensities are used as the pulsatile alternating current (AC) component, whereas the mean intensities are used as the non-pulsatile direct current (DC) component. Following these definitions, the calculation of SpO2 can be rewritten in terms of these AC and DC components with Eqs. (13) and (14):

$$Ratio=\frac{\frac{{AC}_{red}}{{DC}_{red}}}{\frac{{AC}_{blue}}{{DC}_{blue}}}$$
(13)
$$SpO_{2}\left(\%\right)=110-25\times Ratio$$
(14)

Equation (14) is often used in the literature on medical devices; however, it relies on calibration curves to ensure that this linear approximation provides a reasonable result.
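A minimal sketch of the ratio-of-ratios computation of Eqs. (13) and (14) follows.

```python
# Ratio-of-ratios SpO2 estimate of Eqs. (13)-(14): standard deviation as the
# AC component and mean as the DC component of the R and B traces.
import numpy as np

def estimate_spo2(red, blue):
    """red, blue: 1-D intensity traces from the cleaned ROI over one window."""
    ratio = (red.std() / red.mean()) / (blue.std() / blue.mean())  # Eq. (13)
    return 110.0 - 25.0 * ratio                                    # Eq. (14)
```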

3 Results and discussion

The validation of the overall proposed system for vital signs estimation was conducted at an early stage in the "Smart Living Technologies Laboratory" of the Institute of Microelectronics and Microsystems (IMM) in Lecce, Italy. At a later stage, experimentation was carried out on end-users provided by two project partners (an orthopaedic institute in Bologna, Italy, and the nephrology department of the Polyclinic of Bari, Italy). In the first validation phase, due to COVID-19 restrictions, it was only possible to test the entire platform with 14 aging subjects. In the subsequent trial phase, tests were performed on a further 19 end-users (13 at the orthopaedic institute in Bologna and 6 at the nephrology department in Bari). Among the selected subjects, eight had cardiac problems of various types (cardiac bypasses and/or stent implantation) and two suffered from renal problems. Before the experimentation stage, informed consent was obtained from all subjects involved. All the experiments were performed in Python (version 3.7.0) on the embedded hardware architecture introduced in Section 2.1. The Python libraries used for the development of the algorithmic pipeline are: numpy (vers. 1.21.6), opencv (vers. 4.5.3.56), scipy (vers. 1.4.1), pandas (vers. 1.2.0), mediapipe (vers. 0.8.11), and pyqt5 (vers. 5.12.3) for the software interface. The Logitech C920 HD Pro webcam was used for image stream acquisition. For each subject, a series of measurements was repeated at the following distances from the vision sensor: 0.5 m, 1.0 m, and 1.5 m. The experimental environment was always illuminated with natural light, whose intensity varied in the range 100 lx – 300 lx. The acquisition time was set to 120 s for each measurement session. The experimental setup adopted for validation and the implemented software interface are shown in Fig. 7:

Fig. 7
figure 7

(a) Experimental setup and (b) visual interface of the developed software for vital signs estimation

During the experimentation conducted at the IMM laboratory in Lecce (Italy), each older adult was asked to wear a commercial OXI-2 pulse oximeter produced by GIMA [35] to simultaneously collect baseline HR and SpO2 values to compare with those estimated by our system. Moreover, each end-user wore a commercial smart garment produced by the Italian company Smartex s.r.l. [36], whose piezoresistive sensors provide the ground-truth BR values. Various metrics have been proposed in this research area for evaluating HR and BR measurement methods; here, the commonly used Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) were adopted. MAE is the mean of the absolute differences between the HR/BR estimates and the target values measured with the commercial devices, while RMSE is the root of the mean squared differences between our estimates and the commercial-device measurements. For HR and BR we also report the success rates SUCϕ and SUCµ, which indicate the percentage of cases in which the absolute error between ground truth and estimate falls below ϕ bpm for HR and below µ resp/min for BR.
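For clarity, the evaluation metrics can be written out as follows (est and ref denote paired per-window estimates and ground-truth values).

```python
# Evaluation metrics used above; est and ref are paired per-window estimates
# and ground-truth values (numpy arrays).
import numpy as np

def mae(est, ref):
    return np.mean(np.abs(est - ref))

def rmse(est, ref):
    return np.sqrt(np.mean((est - ref) ** 2))

def success_rate(est, ref, tol):
    """SUC_phi / SUC_mu: percentage of estimates within tol of ground truth."""
    return 100.0 * np.mean(np.abs(est - ref) < tol)
```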

Since the innovative aspect of the proposed pipeline consists of algorithmic steps designed for estimating the vital signs of elderly subjects, it is worth highlighting the contribution of the logic blocks responsible for optimizing the R, G, and B temporal signals extracted from the ROIs of interest. Tables 2 and 3 therefore report the average MAE, RMSE, and SUCϕ for HR obtained at varying distances between the end-user's face and the vision sensor, without and with the application of "ROI cleaning" respectively. Tables 4 and 5 report the corresponding average MAE, RMSE, and SUCµ for BR.

Table 2 MAE, RMSE, and SUCϕ (3 bpm and 5 bpm) for HR estimation at varying distances from the vision sensor and without “ROI cleaning”
Table 3 MAE, RMSE, and SUCϕ (3 bpm and 5 bpm) for HR estimation at varying distances from the vision sensor after the application of “ROI cleaning”
Table 4 MAE, RMSE, and SUCµ (1 resp/min and 2 resp/min) for BR estimation at varying distances from the vision sensor and without “ROI cleaning”
Table 5 MAE, RMSE, and SUCµ (1 resp/min and 2 resp/min) for BR estimation at varying distances from the vision sensor after the application of “ROI cleaning”

The values in the above tables first show the pipeline's relative dependence on the distance between the subject's face and the vision sensor (within the distance range considered for validation). The reported metrics return very similar values at 0.5 m and 1.0 m, with a gradual worsening when the subject's face is about 1.5 m away; this correlates with the variation in ROI size. The results also demonstrate that the introduction of the 'ROI cleaning' block improves HR and BR measurement accuracy at all three distances. Furthermore, it should be emphasized that SUCϕ and SUCµ exceed 97% at the ideal distance of 1.0 m from the vision sensor, considering a deviation from the HR and BR values that, in long-term monitoring, would not affect any medical alerts.

Regarding SpO2, Figs. 8 and 9 plot the average SpO2 values obtained for all 14 subjects, measured at 0.5 m, 1.0 m, and 1.5 m by our proposed pipeline and compared to those of the OXI-2 pulse oximeter, without and with the application of the 'ROI cleaning' algorithmic block respectively.

Fig. 8
figure 8

Estimated SpO2 average values at varying distances and without the application of “ROI cleaning”

Fig. 9
figure 9

Estimated SpO2 average values at varying distances and with the application of “ROI cleaning”

Analyzing the numerical results, an improvement of approximately 1% is obtained when SpO2 is estimated via the pipeline that integrates the 'ROI cleaning' logic block, again confirming the design choice made. With respect to the ground truth, the following average percentage differences were obtained as the distance increased: 0.85%, 1.14%, and 1.92% with the 'ROI cleaning' block, and 1.64%, 2.07%, and 3.21% without it. In both cases the error is very low. The results of this first validation also show that Eq. (14), chosen for the final SpO2 estimate, returns fairly accurate values, although future developments may analyze parametric formulae that compute the calibration-curve parameters after an initial 'tuning' measurement. Regarding the tests conducted at the two clinical partners introduced above, it was only possible to assess the quality of the HR and SpO2 measurements but not of BR, because no commercial ground-truth sensor capable of estimating this quantity was available; both partners used a commercial pulse oximeter as ground truth for HR and SpO2. Table 6 shows the results in terms of MAE and standard deviation obtained by comparing our contactless system and the pulse oximeter. To prove the generality and transversality of the sensory node, users were divided into two categories: healthy and unhealthy end-users.

Table 6 MAE and standard deviation between our proposed system and a commercial pulse oximeter obtained in experiments carried out by project partners

The results obtained with the users recruited by the project partners are in line with those of the validation phase. Furthermore, the health conditions of the end-users evidently do not influence the results, providing tangible confirmation that the functioning of the entire implemented system generalizes.

3.1 Comparison with state-of-the-art

To compare the obtained results with the state of the art, our proposed pipeline was also evaluated on three literature datasets suitable for rPPG algorithm evaluation: COHFACE [37], UBFC-RPPG [38], and MAHNOB-HCI [39].

COHFACE is a publicly available dataset proposed by the Idiap Research Institute. It contains 40 subjects and 160 one-minute RGB videos, captured with a conventional webcam framing the participant's full face under two different illumination settings, to evaluate the effect of illumination variation: (1) ceiling lights plus a 400 W halogen spotlight (illuminating the subject's face evenly), and (2) natural light entering through the window blinds next to the subject. Respiratory signal and skin-reflectance PPG were recorded as ground-truth signals. The most important shortcoming of this dataset is the heavy compression of the videos, which adds noise artifacts.

UBFC-RPPG is a very recent dataset proposed mainly for rPPG algorithm evaluation. It comprises 43 two-minute videos, each synchronized with a commercial pulse oximeter finger-clip sensor. Notably, the videos were recorded with the same low-cost webcam used in our proposed pipeline. Participants sat in front of the camera at about one meter, and recording was conducted indoors with indoor illumination and slight changes in sunlight. One special aspect of the recording is that participants played a time-sensitive mathematical game, intended to raise their HR and thus simulate a realistic human–computer interaction scenario.

The last dataset used for comparison is MAHNOB-HCI. It is multimodal, originally recorded for emotion recognition and implicit tagging research; however, since ground-truth signals such as ECG and respiration amplitude were recorded, it is also suitable for rPPG algorithm evaluation. It includes 527 videos from 27 subjects, captured by six cameras providing different views (frontal, profile, wide-angle, close-ups), which makes the dataset useful for evaluating algorithms under varying pose angles.

The datasets introduced above have certain limitations. Firstly, none of them contains ground-truth data for SpO2, so it was not possible to compare our algorithm with the state of the art for this parameter, and only COHFACE provides BR data. Moreover, HR has to be derived from other recorded signals, such as the ECG and BVP signals. Finally, for the proposed application context, the age of the participants is not known for all three datasets, so no comparison could be made specifically for the elderly-subject category. Some examples from the three datasets are shown in Fig. 10, whereas Tables 7, 8 and 9 report the comparison with the most recent published results, for HR only, in terms of RMSE and/or MAE.

Fig. 10
figure 10

Images of 3 benchmark datasets: COHFACE (top), UBFC-RPPG (middle), MAHNOB-HCI (bottom)

Table 7 Performance comparison for HR estimation on the COHFACE dataset
Table 8 Performance comparison for HR estimation on the UBFC-RPPG dataset
Table 9 Performance comparison for HR estimation on the MAHNOB-HCI dataset

The first conclusion to be drawn from the MAE and RMSE values reported in the three tables above is that COHFACE is the most challenging dataset. A further observation is the high variability of the MAE and RMSE values, even among different methodologies applied to the same dataset.

4 Limitations and future work

It is worth pointing out some limitations of the proposed work. Firstly, the pipeline needs controlled lighting conditions within the living environment in which the acquisition takes place. Furthermore, its operation is closely tied to the maximum distance at which the subject's face is located. Another limitation observed during the pre-validation and testing phases concerns the size of the ROI when the face is not frontal to the camera; this last issue can be addressed in future developments of the work, which include the integration of the forehead as an additional ROI for vital signs estimation. A further future development involves testing the pipeline with a larger number of elderly subjects. The results obtained should also be compared with state-of-the-art publications estimating not only HR but also BR and SpO2. Finally, it will be interesting to test the pipeline not only with subjects at rest but also after low-to-moderate exercise.

5 Conclusions

The contactless measurement of vital signs has attracted the attention of many researchers in recent years. The motivation has grown especially in the last two years, when the spread of the coronavirus and its ease of contagion made social distancing advisable in family, work, and hospital settings. There is also considerable evidence that, for the elderly, this measuring procedure is much better accepted than measurement with wearable solutions. The pipeline presented in this paper allows the non-contact measurement of three important vital signs: HR, BR, and SpO2. To obtain measurements as close to reality as possible, the pipeline was supplemented with algorithmic blocks capable of handling the criticalities specific to measuring vital signs from a plethysmography signal extracted from facial ROIs of elderly subjects. Specifically, since the plethysmography signal is closely related to the time course of the R, G, and B color levels, an algorithmic layer was designed and implemented to eliminate pixels within the ROI that degrade the quality of the plethysmography signal: wrinkles, folds, and moles are automatically removed using computer vision and pattern recognition methods. The reported results show good measurement validity within a distance of 0.5–1.5 m from the optical center of the sensor. Furthermore, the methodology implemented for ROI detection allows the estimation of vital signs even for small rotations of the subject's face (both horizontal and vertical), an aspect that is very important for the category of subjects considered. Another relevant aspect of the proposed work is the use of low-cost hardware: the pipeline is designed to work with commercial cameras that may already be present in a living environment (e.g., webcams or cameras integrated into a laptop) and, as shown in the results section, the computational cost allows the use of embedded platforms costing less than €100 for data processing.