1 Introduction

As the world’s older population continues to grow at an unprecedented rate, the current supply of care providers is insufficient to meet the current and ongoing demand for care services (Dall et al. 2013). Researchers have explored the potential of socially assistive robots (Feil-Seifer et al. 2005; Tapus and Matarić 2006) that aim to support people with cognitive, sensory, and motor impairments or to assist the clinical workforce (Riek 2017). One potential application is socially assistive robots for rehabilitation therapy (Matarić et al. 2007; Lee et al. 2020, 2022). During rehabilitation, patients are required to complete a significant amount of self-directed exercises (O’Sullivan et al. 2019; Lee et al. 2022). However, low treatment adherence is a problem across several healthcare disciplines, including physiotherapy (Kåringen et al. 2011). To address these problems, there has been increasing attention on social robot coaching systems (Riek 2017; Matarić et al. 2007; Lee et al. 2020, 2022). These systems autonomously monitor patients’ exercises and provide encouragement to support patients’ engagement in well-being-related or rehabilitation exercises through social interaction (Tapus et al. 2007; Feil-Seifer et al. 2005).

Prior work on robotic exercise coaching systems has demonstrated that older adults or post-stroke subjects can successfully exercise and stay engaged with a robot over sessions (Fasola and Matarić 2013; Görer 2013). However, despite the potential of a robot to monitor and guide exercises, prior work is limited to providing generic, predefined corrective feedback on a patient’s exercise performance (e.g., checking the angular difference from a prespecified motion (Görer 2013; Fasola and Matarić 2013; Guneysu and Arnrich 2017)). It remains challenging to enable a social robot exercise coaching system to generate tailored corrective feedback on an individual patient’s motion (Görer 2013) and to adopt these systems broadly.

In this work, we design, develop, and evaluate a socially assistive robot coaching system that automatically monitors and coaches physical rehabilitation therapy. Specifically, we selected stroke as a test domain, as it is the second leading cause of death and disability (Feigin et al. 2017). We then conducted interviews with therapists to design and develop a socially assistive robot coaching system. This system integrates a machine learning (ML) model with a rule-based (RB) model and can be tuned with held-out user data to assess the performance of exercises for personalized post-stroke therapy (Fig. 1a) (Lee et al. 2020). Building upon the previous work (Lee et al. 2020), we demonstrated the benefit of our approach in adapting to a new user and providing personalized assessment compared to the commonly used transfer learning technique on a feed-forward neural network model (Zhuang et al. 2016) utilized a Gaussian mixture model to generate an ideal motion and arbitrarily set a threshold value to identify the differences of joints between ideal and observed motions. Although both Guneysu and Arnrich (2017) and Tanguy et al. (2016) support analyzing multiple variables for evaluating an exercise, they still rely on a predefined motion or a generic threshold. Prior work with generic threshold-based methods might not be applicable for patients with various characteristics (Lee et al. 2020).

In addition, researchers have also explored a machine learning model to monitor patients’ quality of motion. For instance, Kashi et al. (2020) evaluated the feasibility of a random-forest model to identify compensatory movements. However, it remains unclear how such a machine learning model can adapt to a new patient and perform well to assess patients’ quality of motion.

For personalized quantitative rehabilitation assessment, Lee et al. (2020) explored an approach of dynamic feature selection and a hybrid model that integrates a machine learning model with a rule-based model. However, this prior work is limited to providing an assessment after a motion is completed and does not support frame-level assessment, so it provides no information on when an erroneous motion has occurred.

Building upon prior work that explores a hybrid model for personalized assessment (Lee et al. 2020, 2020), we further investigated the system implementation of a socially assistive robot to automatically monitor and guide patients’ exercises. Specifically, we analyzed the benefit of our interactive hybrid approach compared to the commonly used transfer learning technique on a feedforward neural network model (Zhuang et al. 2020; Weiss et al. 2016) (i.e., pretraining the model using the dataset from other post-stroke survivors and then fine-tuning it based on data from a new post-stroke survivor). In addition, we conducted a real-world experiment to evaluate the feasibility of adapting to a new participant and providing personalized, real-time corrective feedback.

3 Study for stroke rehabilitation

This work focuses on the domain of stroke, which is the second leading cause of death and third most common contributor to disability (Feigin et al. 2017). First, we held iterative discussions with three therapists (TPs with check marks in the specification column of Table 1; \(\hbox {mean (M)} = 6.33\), \(\hbox {standard deviation (SD)} = 2.05\) years of experience in stroke rehabilitation) to specify the study design for stroke rehabilitation: exercises and performance components for assessment (Lee et al. 2019). We then held additional interviews with therapists to learn how they guide rehabilitation assessment in practice. Based on these interviews, we created an interactive social robot coaching system that automatically monitors and coaches rehabilitation exercises. We then conducted a real-world experiment with ten healthy participants to evaluate the potential benefits and limitations of our system. This section presents only the specifications of our study and the interviews with therapists to understand their practices. The evaluation is discussed later in Sect. 6.2, after the presentation of our system implementation.

Table 1 Profiles of therapists, who contributed to specify the study and share their practices to design our system

3.1 Three task-oriented upper-limb exercises

This work utilizes three upper-limb stroke rehabilitation exercises recommended by therapists (Lee et al. 2020). For Exercise 1, a user has to raise the user’s wrist to the mouth as if drinking water (Fig. 2a). For Exercise 2, a user has to raise the user’s wrist forward as if touching a light switch on the wall (Fig. 2c). For Exercise 3, a user has to extend the user’s elbow in the seated position to practice the usage of a cane (Fig. 2e).

3.2 Unaffected and affected sides

After a stroke, survivors often experience paralysis and limited functional ability of their limbs. In this work, we refer to the unparalyzed side of a post-stroke survivor as the “Unaffected” side and the paralyzed side of a post-stroke survivor with limited functional ability as the “Affected” side.

3.3 Performance components

We discussed commonly used stroke assessment tools (i.e., the Wolf Motor Function Test (Wolf et al. 2001) and the Fugl–Meyer Assessment (Sanford et al. 1993)) with therapists and specified three common performance components of stroke rehabilitation exercises: ‘Range of Motion (ROM)’, ‘Smoothness’, and ‘Compensation’ (Lee et al. 2020). The ‘ROM’ indicates how closely a patient performs the target position of a task-oriented exercise. The ‘Smoothness’ describes the degree of trembling and irregular movement of joints while performing an exercise. The ‘Compensation’ indicates whether a patient performs any unnecessary, compensatory movements to achieve a target movement. For instance, patients might lean their head or trunk to the side and elevate their shoulder to perform an exercise using their affected side with the limited functional ability (Fig. 2).

The descriptions and labels of performance components are summarized in Table 2. The labels of ‘ROM’ and ‘Smoothness’ are annotated at the end of a motion and represented as a binary label on each performance component: a correct/normal performance component (\(Y = 1\)) and an incorrect/abnormal performance component (\(Y = 0\)). The labels of ‘Compensation’ are annotated at every frame of the patient’s motion to indicate whether each of three major compensations (i.e., abnormal alignment of head, spine, and shoulder) occurs or not.

Fig. 2

Sample unaffected and affected motions of exercises: (a) a patient raises the patient’s unaffected side of the wrist to the mouth, (b) a patient compensated with trunk and shoulder joints when attempting to move the patient’s affected side of the wrist, (c) a patient raises the patient’s unaffected side of the wrist forward, (d) a patient elevated the shoulder to compensate for the limited functional ability of the patient’s affected side, (e) a patient extends the patient’s affected side of the elbow, and (f) a patient leaned the trunk forward to extend the patient’s elbow

Table 2 Performance components and labels of physical stroke rehabilitation exercises

3.4 Understanding therapists’ practices

We conducted a one-on-one interview with each of the three therapists (TPs with check marks in the interview column of Table 1; \(\hbox {mean (M)} = 11.00\), \(\hbox {standard deviation (SD)} = 8.52\) years of experience in stroke rehabilitation). During the 1-hour interview, the researcher asked therapists to speak aloud their strategies for coaching a rehabilitation session and providing feedback on a patient’s exercises (e.g., “what kinds of feedback do you generate for a post-stroke survivor?”). To assist the therapists’ speak-aloud process, the researcher showed them videos of post-stroke survivors with different functional abilities (i.e., high, moderate, and low capability to achieve an exercise) performing rehabilitation exercises. The detailed process of collecting these videos is described in Sect. 5.1.

We analyzed the transcripts of interviews with therapists through an iterative coding process (Gale et al. 2013). Specifically, we first open-coded interview transcripts, discussed emerging themes and ideas, and iteratively improved our codebook. We found that therapists oversee the treatments of a post-stroke survivor by providing a personalized rehabilitation session. Specifically, they monitor how their patients perform an exercise and provide their patients with feedback to support the correct execution of an exercise and encourage participation in rehabilitation (Lee et al. 2022). For guiding a rehabilitation session, we noticed that all three therapists follow a simple and common procedure (Lee et al. 2022). Specifically, when they start a session, they engage with their patients through brief greetings and describe the goal of the session (e.g., what kinds of exercises a patient will perform and how many repetitions are recommended) (Lee et al. 2022). If a patient is not familiar with an exercise motion, therapists might demonstrate the motion that the patient has to practice themselves (Lee et al. 2022). When a patient performs an exercise, therapists monitor the patient’s exercises to identify any part for improvement and provide corrective feedback (Lee et al. 2022). For instance, we found that therapists are particularly attentive to providing feedback on compensatory motions (Cirstea and Levin 2000) that might cause more severe pain. As rehabilitation requires patient engagement over an extended period, therapists also strive to provide positive encouragement to their patients (Lee et al. 2022).

4 Interactive approach of a socially assistive robot for personalized assessment and feedback

This work presents an interactive approach of a social robot exercise coaching system (Fig. 1a), which combines machine learning (ML) and rule-based (RB) models to assess the performance of a patient’s exercises and is tuned with the patient’s data to generate personalized feedback (Lee et al. 2020). The ML model of our approach aims to extract meaningful patterns from a large amount of data and to support a generic assessment of the patient’s quality of motion (Lee et al. 2020). Because such a generic ML model might not perform well on an unobserved new patient’s motion with unique characteristics, our approach also integrates the ML model with a personalized RB model that can be tuned with the patient’s unaffected motions to derive patient-specific threshold values. This RB model can be easily updated and is combined with the generic ML model through a weighted-average ensemble technique (Lee et al. 2020) into a hybrid model (HM), which is used to generate personalized corrective feedback on the patient’s exercises. To provide feedback at the moment an erroneous motion occurs, we explored an ensemble voting method that leverages predictions on multiple consecutive frames for a more accurate frame-level assessment (Lee et al. 2020). In the following subsections, we describe the components of our approach: feature extraction, ML models, RB models, hybrid models, an ensemble voting method, and an interface of a socially assistive robot for personalized rehabilitation therapy.

4.1 Feature extraction

This work represented an exercise motion with sequential joint coordinates from a Kinect v2 sensor (Microsoft, Redmond, USA) and extracted various kinematic features (Lee et al. 2019). For the ‘ROM’ performance component, we computed joint angles (e.g., elbow flexion, shoulder flexion, elbow extension), the distance to a target position, and normalized relative joint trajectories (i.e., the Euclidean distance between two joints: head and wrist, and head and elbow) (Lee et al. 2019). For the ‘Smoothness’ performance component, we computed the following speed-related features: speed and the zero-crossing ratio of acceleration (Lee et al. 2019). As our work focuses on upper-limb exercises, we computed these speed-related features on wrist and elbow joints. For the ‘Compensation’ performance component, we computed normalized joint trajectories: the displacements of the head, spine, and shoulder joint positions along the x, y, and z axes from the initial frame to the current frame (Lee et al. 2019).
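As an illustration of the kinematic features above, the following minimal sketch computes a joint angle and an inter-joint Euclidean distance from 3D joint coordinates. The function names and the convention of measuring the angle at the middle joint (e.g., at the elbow, between the shoulder and the wrist) are assumptions for illustration; the exact feature definitions are given in Lee et al. (2019).

```python
import math

def joint_distance(a, b):
    """Euclidean distance between two 3D joint positions."""
    return math.dist(a, b)

def joint_angle(parent, joint, child):
    """Angle in degrees at `joint` between the segments joint->parent and joint->child.

    Assumed convention: for an elbow angle, parent is the shoulder and child is the wrist.
    """
    u = [p - j for p, j in zip(parent, joint)]
    v = [c - j for c, j in zip(child, joint)]
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        raise ValueError("degenerate segment: two joints coincide")
    cosine = sum(ui * vi for ui, vi in zip(u, v)) / norm
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))
```

For example, `joint_angle(shoulder, elbow, wrist)` evaluated at every frame would yield one column of the per-frame feature matrix.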

A moving average filter with a window size of five frames was applied to reduce the noise of the estimated joint positions from a Kinect sensor, similar to Lee et al. (2019). Given an exercise motion, we computed a feature matrix \({\textbf {F}} = \{f_1, \ldots , f_T\} \in R^{T \times d}\) with T frames and d features, and then computed the statistics (e.g., maximum, minimum, range, average, and standard deviation) of the feature matrix over all frames of the exercise to summarize a motion into a feature vector, \(X \in R^{5d}\). This summarized feature vector was utilized for the assessment of ‘ROM’ and ‘Smoothness’ performance components. In addition, unlike Lee et al. (2019), which only supports offline assessment on the ‘Compensation’ performance component, we extracted a feature vector at each frame for real-time, frame-level assessment on the ‘Compensation’ performance component. Overall, we extracted 30 features for the ‘ROM’ performance component, 60 features for the ‘Smoothness’ performance component, and 9 features for the ‘Compensation’ performance component.
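The smoothing and summarization steps can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the trailing (causal) window, the shortened window at the start of the signal, and the use of the population standard deviation are assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def moving_average(values, window=5):
    """Smooth a 1-D signal with a trailing moving average.

    Assumption: the first frames use a shorter window instead of padding.
    """
    smoothed = []
    for t in range(len(values)):
        start = max(0, t - window + 1)
        smoothed.append(mean(values[start:t + 1]))
    return smoothed

def summarize(feature_matrix):
    """Collapse a T x d feature matrix (list of T rows) into a 5d summary vector.

    For each of the d features: maximum, minimum, range, average, standard deviation.
    """
    summary = []
    for column in zip(*feature_matrix):  # iterate over the d feature columns
        summary.extend([
            max(column),
            min(column),
            max(column) - min(column),
            mean(column),
            pstdev(column),
        ])
    return summary
```

With d features per frame, `summarize` returns a vector of length 5d, matching the dimensionality of \(X\) above.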

4.2 Machine learning (ML) model

For the machine learning (ML) model, we applied a supervised learning algorithm with leave-one-patient-out cross-validation, which trains on data from all patients except the one held out for testing. The ML model computes the score of being correct on a performance component (\(P_{\textrm{ML}}\)) and predicts the quality of motion. Among various supervised learning algorithms (a decision tree, linear regression, a support vector machine, a feedforward neural network, and a long short-term memory (LSTM) network), we utilized a feedforward neural network (NN) model because it outperformed the others, as shown in Lee et al. (2020). For the implementation of the NN model, we explored various architectures (i.e., one to three layers with 32, 64, 128, 256, and 512 hidden units) and an adaptive learning rate with different initial learning rates (i.e., 0.0001, 0.005, 0.001, 0.01, 0.1). We applied ‘ReLu’ activation functions and ‘AdamOptimizer’ and trained a model with cross-entropy loss, a mini-batch size of 1, and an epoch of 1.
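The leave-one-patient-out protocol can be sketched as a simple index split. The function name and the representation of the dataset as a list of per-trial patient identifiers are assumptions for illustration.

```python
def leave_one_patient_out(patient_ids):
    """Yield (held_out_patient, train_indices, test_indices) for each patient.

    `patient_ids[i]` is the (hypothetical) identifier of the patient who
    performed trial i; all trials of one patient are held out together.
    """
    for held_out in sorted(set(patient_ids)):
        train_idx = [i for i, p in enumerate(patient_ids) if p != held_out]
        test_idx = [i for i, p in enumerate(patient_ids) if p == held_out]
        yield held_out, train_idx, test_idx
```

Holding out all trials of a patient at once keeps that patient’s motions entirely out of the training set, so each fold estimates performance on an unseen patient.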

4.3 Rule-based (RB) model

A rule-based (RB) model leverages the set of feature-based, if-then rules from therapists to estimate the quality of a motion (Lee et al. 2020). In addition, our system computes statistics of kinematic features from user data and generates patient-specific rules for personalized assessment. For the initial development of the RB model, semi-structured interviews were conducted with two therapists (\(\hbox {mean (M)} = 5.05\), \(\hbox {standard deviation (SD)} = 1.05\) years of experience in stroke rehabilitation) to elicit their knowledge of assessing stroke rehabilitation exercises. The knowledge of therapists has been formalized as 15 independent if-then rules (Appendix Table 5). For example, the assessment on the ROM component for Exercise 1 is specified as follows (Lee et al. 2020):

$$\begin{aligned} \hat{Y} ={\left\{ \begin{array}{ll} 1 &{} \text {if}\quad p^{\textrm{max}}(\hbox {wr}, c_y) \ge p^{\textrm{max}}(\hbox {spsh}, c_y) \\ 0 &{} \text {else} \\ \end{array}\right. } \end{aligned}$$
(1)

where \(p(j, c)\) indicates the position (p) of a joint j (e.g., wrist \((\hbox {wr})\) and spine shoulder, the top of the spine, \((\hbox {spsh})\)) along a coordinate c in the set \(\{c_x, c_y, c_z\}\). \(\hat{Y}\) denotes the predicted label on a performance component.

This rule simply checks the maximum position of a wrist joint, \(p^{\textrm{max}}(\hbox {wr}, c_y)\), relative to that of a spine shoulder joint, \(p^{\textrm{max}}(\hbox {spsh}, c_y)\), in the y-coordinate to roughly estimate whether a patient achieves the target position of Exercise 1. For the prediction with multiple rules, we apply a majority voting algorithm; because the number of rules is odd, no tie-breaking method is needed.
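The rule in Eq. (1) and the majority vote over several rules can be sketched as follows. The function names and the input format (per-frame y-coordinates as plain lists) are assumptions for illustration; only the Exercise 1 ROM rule is taken from the text.

```python
def rom_rule_exercise1(wrist_y, spine_shoulder_y):
    """Eq. (1): return 1 if the maximum wrist height reaches the maximum
    spine-shoulder height over the motion, and 0 otherwise."""
    return 1 if max(wrist_y) >= max(spine_shoulder_y) else 0

def majority_vote(rule_outputs):
    """Combine binary rule outputs (0 or 1) by majority voting.

    An odd number of rules is required, so no tie can occur.
    """
    if len(rule_outputs) % 2 == 0:
        raise ValueError("majority voting here expects an odd number of rules")
    return 1 if 2 * sum(rule_outputs) > len(rule_outputs) else 0
```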

The score of being correct on each performance component using the RB model (\(P_{\textrm{RB}}\)) is computed with the following equation:

$$\begin{aligned} P_{\textrm{RB}} = \frac{1}{|\mathbb {R}|}\sum _{r \in \mathbb {R}} \min \left( \frac{{x}_{r}}{{\tau }_{r}}, 1\right) \end{aligned}$$
(2)

where \(x_r\) indicates the feature value of a rule r from a trial (e.g., \(p^{\textrm{max}}(\hbox {wr}, c_y)\) for the example above), \(\tau _{r}\) describes the threshold value of a rule r (e.g., \(p^{\textrm{max}}(\hbox {spsh}, c_y)\) for the example above), and \(\mathbb {R}\) describes the set of rules elicited from the therapists. The \(\min \) function is applied so that this equation assigns a value of 1 if the feature value of a rule reaches or exceeds the threshold of that rule. Otherwise, the equation normalizes the feature value of a rule by the threshold of that rule to compute the score of being correct.
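Equation (2) can be written as a short function. This is a minimal sketch assuming strictly positive thresholds and one feature value per rule; the function name is hypothetical.

```python
def rule_based_score(feature_values, thresholds):
    """Eq. (2): mean over rules of min(x_r / tau_r, 1).

    Assumes each threshold tau_r is strictly positive.
    """
    if not feature_values or len(feature_values) != len(thresholds):
        raise ValueError("need one positive threshold per rule feature value")
    capped = [min(x / tau, 1.0) for x, tau in zip(feature_values, thresholds)]
    return sum(capped) / len(capped)
```

For two rules with feature values 0.5 and 1.2 and thresholds of 1.0, the per-rule terms are 0.5 and 1 (capped), so the score is 0.75.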

In addition, as the initial threshold values of rules are generic, our approach can further tune a rule-based (RB) model with the held-out user’s unaffected motions to update its threshold values for each patient (Fig. 1a). For the computation of personalized threshold values, we utilize the held-out user’s unaffected motions to learn a Gaussian probability density function \(f(x_r) \sim N(\mu _r, \sigma _r^2)\). Specifically, when a patient first interacts with the system and there is no prestored unaffected data for the patient, the system prompts the patient to perform an exercise with the unaffected side. Once the system has the patient’s unaffected data, it processes them to extract the feature value of a rule (\(x_r\)). We then utilized the maximum likelihood estimate (MLE) (Gopinath 1998) to estimate the parameters of a Gaussian probability density function, \(\mu _r\) and \(\sigma _r\), as the mean and standard deviation of \(x_r\), respectively. We then update the threshold value for a rule r using either \(2\sigma _r\) or \(3\sigma _r\) (i.e., \(\tau _r \in \{\mu _r + 2\sigma _r, \mu _r + 3\sigma _r\}\)).
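A minimal sketch of the threshold personalization, assuming the feature values \(x_r\) from the patient’s unaffected motions are available as a list. The function name and the parameter `k` (2 or 3) are hypothetical; the Gaussian MLE of the standard deviation uses the 1/N (population) form.

```python
from statistics import mean, pstdev

def personalized_threshold(unaffected_values, k=2):
    """Fit a Gaussian by maximum likelihood and return mu + k * sigma.

    `unaffected_values` are feature values x_r of one rule, extracted from
    the patient's unaffected-side motions; k is 2 or 3 per the text.
    """
    mu = mean(unaffected_values)
    sigma = pstdev(unaffected_values, mu)  # MLE: divides by N, not N - 1
    return mu + k * sigma
```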

4.4 Hybrid model

A hybrid model (HM) applies a weighted-average ensemble technique (Baltrušaitis et al. 2019) to combine machine learning (ML) and rule-based (RB) models to assess patients’ quality of motion (Lee et al. 2020). For the prediction on the quality of motion, the HM computes the weighted average of prediction scores from the two models, in which the contribution weight of each model is the performance of that model (i.e., the F1-score of each model in the range of [0, 1]). Given training data, this weight can be precomputed and updated as the system collects additional data. The equation for computing the score of being correct using the HM, \(P_{\textrm{HM}}\), is as follows:

$$\begin{aligned} P_{\textrm{HM}} = \frac{\rho _{\textrm{ML}}}{\rho _{\textrm{ML}} + \rho _{\textrm{RB}}}P_{\textrm{ML}} + \frac{\rho _{\textrm{RB}}}{\rho _{\textrm{ML}} + \rho _{\textrm{RB}}}P_{\textrm{RB}} \end{aligned}$$
(3)

where \(P_{\textrm{ML}}\) and \(P_{\textrm{RB}}\) indicate the scores of the machine learning (ML) and rule-based (RB) models, and \(\rho _{\textrm{ML}}\) and \(\rho _{\textrm{RB}}\) describe the weights, i.e., the F1-scores of the ML and RB models.
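Equation (3) translates directly into code. The function and argument names below are hypothetical; the weights are the precomputed F1-scores of the two models.

```python
def hybrid_score(p_ml, p_rb, f1_ml, f1_rb):
    """Eq. (3): F1-weighted average of the ML and RB scores."""
    total = f1_ml + f1_rb
    if total <= 0:
        raise ValueError("at least one model needs a positive F1-score")
    return (f1_ml / total) * p_ml + (f1_rb / total) * p_rb
```

As a worked example, with \(P_{\textrm{ML}} = 0.9\), \(P_{\textrm{RB}} = 0.5\), \(\rho _{\textrm{ML}} = 0.8\), and \(\rho _{\textrm{RB}} = 0.2\), the hybrid score is \(0.8 \times 0.9 + 0.2 \times 0.5 = 0.82\).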

4.5 Ensemble voting method for frame-level assessment

Our approach can detect a compensation motion at the frame level in real time so that a social robot exercise coaching system can provide a patient with corrective feedback when an erroneous motion has occurred. However, such a frame-level assessment, identifying the exact boundaries of a compensation motion, is challenging (Hasan and Roy-Chowdhury 2014). Thus, our approach applies an ensemble voting method (Dietterich 2000) that utilizes predictions on multiple consecutive \(V_f\) frames for a more robust frame-level assessment. The process of this method consists of (1) initial, continuous frame-level predictions and (2) the computation of a score to determine a winning prediction.

Let \(h(f_t)\) denote the predicted frame-level assessment at frame t with an assessment model h (e.g., a machine learning model, a rule-based model, or a hybrid model) and a feature vector \(f_t\). The first process, the initial frame-level prediction, runs continuously with an assessment model to generate the predicted frame-level assessment \(h(f_t)\) at each frame t. When \(V_f\) initial frame-level predictions are available, our method computes a score of detecting a compensation motion at frame t over all label classes \(Y \in \mathcal {Y}\). Then, the winning prediction at frame t is selected as follows:

$$\begin{aligned} \hat{Y}_t = {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{Y\in \mathcal {Y}}} \sum _{f_t \in \bar{F}} \delta (h(f_t), Y) \end{aligned}$$
(4)

where \(\bar{F}\) indicates the set of the \(V_f\) most recent feature vectors accumulated up to frame t, and \(\delta (h(f_t), Y)\) assigns 1 if \(h(f_t) = Y\) and 0 otherwise. The \(\delta \) function counts how many of the predictions from the \(V_f\) frames equal Y. \(\hat{Y}_t\) indicates the predicted frame-level assessment at frame t on a compensation motion, i.e., the label with the largest number of votes among the predictions from the \(V_f\) frames. In the case of tied votes, our method assigns \(\hat{Y}_t\) the latest prediction \(h(f_t)\). By leveraging votes from the past \(V_f - 1\) frames together with the current frame t, our approach can support a more robust frame-level assessment.
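The voting procedure of Eq. (4), including the tie-breaking rule, can be sketched with a sliding window over the most recent frame-level predictions. The class name and the choice to return `None` until \(V_f\) predictions are available are assumptions for illustration.

```python
from collections import Counter, deque

class FrameVoter:
    """Sliding-window majority vote over the last `window` (V_f) predictions."""

    def __init__(self, window):
        self.window = window
        self.recent = deque(maxlen=window)  # drops the oldest prediction automatically

    def update(self, prediction):
        """Add the prediction h(f_t) for the current frame and return the vote.

        Returns None until `window` predictions have been accumulated.
        Ties are resolved in favor of the latest prediction, as in the text.
        """
        self.recent.append(prediction)
        if len(self.recent) < self.window:
            return None
        counts = Counter(self.recent)
        top = max(counts.values())
        winners = [label for label, count in counts.items() if count == top]
        return winners[0] if len(winners) == 1 else prediction
```

Feeding the predictions 0, 1, 1, 0, 0 into `FrameVoter(3)` yields `None`, `None`, 1, 1, 0: the isolated 0 at the fourth frame is outvoted by the two preceding 1s.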

4.6 Interface of a socially assistive robot

Based on our findings from the interviews with therapists (Sect. 3.4), we designed and developed a state machine that enables our social robot exercise coaching system to interact with users as a therapist would. This state machine (Fig. 3) includes ten states: ‘Greeting/Briefing’, ‘Demonstration’, ‘Initial’, ‘Movement’, ‘Terminate’, ‘Feedback’, ‘Notify’, ‘Encourage’, ‘Correction’, and ‘Wrap-up’. Depending on the user inputs (e.g., clicking a button to start the system) and the results from our motion analysis component, the state machine transitions to the corresponding state, generates audio and visual feedback, and controls the behaviors of our social robot exercise coaching system (e.g., gestures).

In the ‘Greeting/Briefing’ state, our robotic exercise coaching system will summarize the main goal of a rehabilitation session as specified by a therapist. The system will show the video of a prescribed motion in the ‘Demonstration’ state if a new exercise is prescribed and a user requests it. In the ‘Initial’ state, the system will ask whether a user is ready to start an exercise. Once a user confirms the start of an exercise, the system will transition to the ‘Movement’ state and alert the user in the ‘Notify’ state that monitoring has started. While a user performs an exercise, the system will provide various types of feedback in the ‘Feedback’ state. For instance, if the system detects any compensated motion in real time, the system will provide the user with corrective feedback on which unnecessary joints are involved in the ‘Correction’ state. Once a user completes an exercise, the state machine of our system transitions to the ‘Terminate’ state, in which it will summarize the predicted assessment on the quality of motion in the ‘Notify’ state and provide encouragement in the ‘Encourage’ state. When a user completes all prescribed exercises or requests to finish a session, the system will summarize what the user achieved in the session and remind the user of the next session in the ‘Wrap-up’ state.
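The top-level flow described above can be sketched as a table-driven state machine. This is a simplified illustration only: it covers a subset of the ten states, the event names are hypothetical, and the actual transitions are those of Fig. 3, which may differ.

```python
# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    ("Greeting/Briefing", "demo_requested"): "Demonstration",
    ("Greeting/Briefing", "briefing_done"): "Initial",
    ("Demonstration", "demo_done"): "Initial",
    ("Initial", "user_ready"): "Movement",
    ("Movement", "exercise_completed"): "Terminate",
    ("Terminate", "next_exercise"): "Initial",
    ("Terminate", "session_finished"): "Wrap-up",
}

def next_state(state, event):
    """Return the next state, or stay in the current state for unknown events."""
    return TRANSITIONS.get((state, event), state)

def run(events, state="Greeting/Briefing"):
    """Replay a sequence of events and return the list of visited states."""
    visited = [state]
    for event in events:
        state = next_state(state, event)
        visited.append(state)
    return visited
```

In this sketch, states such as ‘Notify’, ‘Feedback’, ‘Correction’, and ‘Encourage’ would be entered from within ‘Movement’ or ‘Terminate’ based on the motion analysis results; they are omitted here for brevity.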

For the socially assistive robot, we used an NAO robot (SoftBank Robotics Europe, France) that offers competitive hardware capabilities and a user-friendly software development environment at a reduced cost (Gouaillier et al.