
1 Introduction

Social communication skills are fundamental in globalized and multicultural societies, as they are central to education, work, and daily life. Although there is great interest in the notion of communication skills in scientific and real-life applications, the concept is difficult to define in general terms due to the complexity of communication, the wide variety of related cognitive and social abilities, and the large situational variability [7]. Techniques that use nonverbal behaviors to estimate communication skills have been receiving much attention. For example, researchers have developed models for estimating public speaking skills [29, 32], persuasiveness [28], communication skills during job interviews [25] and group work [26], and leadership [31].

We are working on constructing models for estimating “the empathy skill level” in multi-party discussions. Empathy, which is the ability to understand and share the feelings of others, is one of the most important social skills and has long been studied in psychology [4, 5]. Davis’ Interpersonal Reactivity Index (IRI) [3] includes four dimensions of empathy: perspective taking (PT), i.e., the tendency to adopt another’s psychological perspective; fantasy (FS), i.e., the tendency to strongly identify with fictitious characters; empathic concern (EC), i.e., the tendency to experience feelings of warmth, sympathy, and concern toward others; and personal distress (PD), i.e., the tendency to have feelings of discomfort and concern when witnessing others’ negative experiences. Davis’ IRI has been translated into many languages [6] and used in a wide variety of fields such as neuroscience [1] and genetics [30].

In our previous study, we developed a model for estimating an individual’s EC score that uses gaze behavior and the dialogue act (DA) near the end of utterances during turn-keeping/changing as feature values [17]. The model has higher estimation accuracy than one using the overall values of verbal/nonverbal behaviors in an entire discussion, such as the amount of utterances and physical motion, which were used in many previous studies on skill estimation [7, 25, 26, 28, 29, 31, 32]. This suggests that behavior during turn-keeping/turn-changing in a very short time interval is useful for estimating individual empathy skill level. We demonstrated that gaze behavior towards the end of utterances and the DA, i.e., verbal-behavior information indicating the intention of an utterance, during turn-keeping/changing are important for estimating an individual’s EC score.

Since we focused on estimating only an individual’s EC score, it is necessary to verify whether gaze behavior and DA are useful for estimating the scores of the other three dimensions, i.e., PT, PD, and FS.

In this research, we explored whether gaze behavior and DA during turn-keeping/changing are useful in estimating the scores of the remaining three Davis’ IRI dimensions by constructing and evaluating estimation models based on these dimensions. We found that gaze behavior and DA are useful for estimating the scores of PT, PD, and FS. Therefore, gaze behavior and DA during turn-changing/keeping are useful for estimating the scores of all four Davis’ IRI dimensions.

Fig. 1. Photograph of multi-party discussion.

2 Corpus Data

We previously created a face-to-face conversation corpus for developing our estimation model of an individual’s empathy skill level using gaze behavior and DA in multi-party discussions [13, 17]. In this section, we give details of this corpus. The corpus includes eight face-to-face four-person discussions held by four groups of four different people (16 participants in total). In each group, the four participants were Japanese women in their 20s and 30s who had never met before. They sat facing each other (Fig. 1). We labeled the participants, from left to right, P1, P2, P3, and P4. They argued and gave opinions in response to highly divisive questions, such as “Is marriage the same as love?”, and needed to reach a conclusion within ten minutes. Each of the four groups took part in two discussions.

The participants’ voices were recorded with pin microphones attached to their chests, and the entire discussions were videoed; upper-body shots of each participant were also recorded at 30 Hz. From the recorded data of all eight discussions (80 min in total), we constructed a multimodal corpus consisting of the following verbal/nonverbal behaviors and the participants’ empathy skill levels.

  • Utterances and DAs: We defined the utterance unit using the inter-pausal unit (IPU) [23]. Utterance intervals were extracted manually from the speech waveform, and a stretch of speech followed by 200 ms of silence was treated as one IPU. Backchannels were excluded from the created IPUs, and consecutive IPUs from the same person were treated as one utterance turn. Adjacent IPU pairs were then grouped into turn-keeping and turn-changing cases (a minimal segmentation sketch follows this list). The data for speech overlaps, i.e., when a listener interrupted during a speaker’s utterance or two or more participants spoke simultaneously at turn-changing, were excluded from the IPU pairs for analysis. Eventually, there were 1227 IPUs during turn-keeping and 129 during turn-changing.

  • Gaze objects: A skilled annotator manually annotated the gaze objects by using bust/head and overhead views in each video frame. The gaze objects were the four participants (labeled P1, P2, P3, and P4, as mentioned above) and non-persons, i.e., the walls or floor. Three annotators annotated the gaze behavior in our conversation dataset to verify the annotation quality. Conger’s Kappa coefficient was 0.887. Based on the benchmarks of [8], the gaze annotations were of excellent quality.

  • Empathy skill level: All participants were asked to answer a questionnaire that was based on Davis’ IRI [3]. We collected the scores of the four Davis’ IRI dimensions to estimate the participants’ empathy skill levels.
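As a rough illustration of the segmentation rule described in the first item above, the following is a minimal sketch, assuming utterances are available as (speaker, start, end) intervals in seconds and that backchannels and overlapping speech have already been removed; the function names and data format are hypothetical, not those of the original corpus pipeline.

```python
# Minimal sketch of IPU-based turn segmentation (hypothetical data format).
# A stretch of speech followed by >= 200 ms of silence closes one IPU;
# adjacent IPU pairs are labeled turn-keeping (same speaker) or
# turn-changing (different speaker).

SILENCE_THRESHOLD = 0.2  # seconds

def to_ipus(intervals):
    """intervals: list of (speaker, start, end) tuples sorted by start time."""
    ipus = []
    for speaker, start, end in intervals:
        if ipus and ipus[-1][0] == speaker and start - ipus[-1][2] < SILENCE_THRESHOLD:
            # Pause shorter than 200 ms: extend the current IPU of the same speaker.
            ipus[-1] = (speaker, ipus[-1][1], end)
        else:
            ipus.append((speaker, start, end))
    return ipus

def ipu_pairs(ipus):
    """Label each adjacent IPU pair as turn-keeping or turn-changing."""
    return [(prev, curr, "turn-keeping" if prev[0] == curr[0] else "turn-changing")
            for prev, curr in zip(ipus, ipus[1:])]
```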

All verbal and nonverbal behavior data were integrated at 30 Hz for visual display using the NTT Multimodal Meeting Viewer [27]. This viewer enables us to annotate the multimodal data frame-by-frame and observe the data intuitively.

3 Feature Values

We used the gaze behavior and DA during turn-changing/turn-keeping as feature values for developing our estimation model of empathy skill level [13, 17]. In this section, we give details of these feature values.

We first introduce the feature values of gaze behavior. For analyzing gaze behavior, we focused on the gaze transition pattern (GTP), i.e., the temporal transition of a participant’s gaze behavior near the end of an utterance [10, 15, 18, 19, 21]. A GTP is expressed as an n-gram, defined as a sequence of gaze-direction shifts. We previously demonstrated that the occurrence frequencies of GTPs differ significantly between the speaker and listeners during turn-keeping and among the speaker, the listener who becomes the next speaker (hereafter called “next-speaker”), and the listeners who do not become the next speaker (hereafter called “listeners”) during turn-changing. We also demonstrated that GTPs are effective for estimating the next speaker in multi-party discussions. Thus, we used GTPs as the gaze parameters for analysis. To generate a GTP, we focused on the gazed-at object during a 1200-ms window, from 1000 ms before to 200 ms after the end of the utterance, since the GTP during this interval is important for turn-taking [10, 15, 18, 19, 21]. Each element of a GTP is the gazed-at person or object, classified as the speaker, a listener, or a non-person. We also considered whether there was mutual gaze and classified gaze behavior using the following seven types of gaze labels.

  • S: Person looks at the speaker without mutual gaze (the speaker does not look back at the person).

  • SM: Person looks at the speaker with mutual gaze (the speaker looks back at the person).

  • L1\(\sim \)L3: Person looks at another listener without mutual gaze. Labels L1, L2, and L3 indicate different people; the sitting position does not matter. For example, if P1, who is speaking, looks at P2, then P3, then P2 again, P1’s GTP is L1-L2-L1.

  • LM1\(\sim \)LM3: Person looks at another listener with mutual gaze. Labels LM1, LM2, and LM3 indicate different people.

  • N: Person looks at the next speaker without mutual gaze only during turn-changing.

  • NM: Person looks at the next speaker with mutual gaze only during turn-changing.

  • X: Person looks at non-persons, such as the floor or ceiling, i.e., gaze aversion.

Fig. 2. Example of generating GTPs in a turn-changing situation.

Figure 2 shows how GTPs are constructed: P1 finishes speaking, then P2 starts to speak. Person P1 gazes at P2 after she gazes at a non-person during the analysis interval. When P1 looks at P2, P2 looks at P1; that is, there is mutual gaze. Therefore, P1’s GTP is X-NM. Person P2 looks at P4 after making eye contact with P1; thus, P2’s GTP is SM-L1. Person P3 looks at a non-person after looking at P1; thus, P3’s GTP is S-X. Person P4 looks at P2 and P3 after looking at a non-person; thus, P4’s GTP is X-N-L1.
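To make the labeling above concrete, here is a minimal sketch of GTP generation, assuming frame-by-frame gaze-target annotations for every participant over the 1200-ms analysis window; the label set follows the seven types listed above, while the data layout and helper names are hypothetical.

```python
# Minimal sketch of GTP generation (hypothetical data layout).
# gaze[p][t] gives the gaze target of participant p at frame t:
# another participant's id or "NONPERSON".

def gaze_label(person, target, gaze_back, speaker, next_speaker, listener_ids):
    """Map one gaze target to one of the seven types of GTP labels."""
    if target == "NONPERSON":
        return "X"                              # gaze aversion
    mutual = (gaze_back == person)              # the gazed-at person looks back
    if target == speaker:
        return "SM" if mutual else "S"
    if next_speaker is not None and target == next_speaker:
        return "NM" if mutual else "N"
    idx = listener_ids.index(target) + 1        # listeners indexed by order of first gaze
    return f"LM{idx}" if mutual else f"L{idx}"

def gtp(person, frames, gaze, speaker, next_speaker=None):
    """Build the n-gram of gaze labels for `person` over the analysis window."""
    listener_ids, labels = [], []
    for t in frames:
        target = gaze[person][t]
        if target not in ("NONPERSON", speaker, next_speaker) and target not in listener_ids:
            listener_ids.append(target)
        gaze_back = gaze[target][t] if target != "NONPERSON" else None
        label = gaze_label(person, target, gaze_back, speaker, next_speaker, listener_ids)
        if not labels or labels[-1] != label:   # keep only transitions, not per-frame runs
            labels.append(label)
    return "-".join(labels)
```

Running this sketch on the Fig. 2 scenario would yield X-NM for P1, SM-L1 for P2, S-X for P3, and X-N-L1 for P4.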

For DA analysis, the DA of each IPU was extracted using an estimation technique for Japanese [9, 24]. This technique estimates the DA of a sentence from among 33 DA categories using word n-grams, semantic categories (obtained from the Japanese thesaurus Goi-Taikei), and character n-grams. We grouped the 33 output categories into the following five major categories.

  • Provision: Utterance for providing information

  • Self-disclosure: Utterance for disclosing oneself

  • Empathy: Utterance intending empathy

  • Turn-yielding: Utterance intending a listener to speak next (e.g., a question, suggestion, or confirmation)

  • Others: Utterance not included in the above four categories

About 90% of utterances fell into the Provision, Self-disclosure, Empathy, and Turn-yielding categories.
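A minimal sketch of how fine-grained DA tags can be folded into the five major categories; the tag names on the left are hypothetical placeholders, since the full 33-category inventory of [9, 24] is not reproduced here.

```python
# Hypothetical mapping from fine-grained DA tags to the five major categories.
# The keys are illustrative placeholders; the estimator of [9, 24] defines
# the actual 33-category inventory.
DA_GROUPS = {
    "inform": "Provision",
    "explain": "Provision",
    "self-disclosure": "Self-disclosure",
    "agreement": "Empathy",
    "sympathy": "Empathy",
    "question": "Turn-yielding",
    "suggestion": "Turn-yielding",
    "confirmation": "Turn-yielding",
}

def major_da(fine_grained_tag):
    """Collapse a fine-grained DA tag into one of the five major categories."""
    return DA_GROUPS.get(fine_grained_tag, "Others")
```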

In our previous study, we demonstrated that the occurrence frequencies of GTPs accompanying each DA category for the speaker and listeners during turn-keeping, and for the speaker, next-speaker, and listeners during turn-changing, are effective for estimating a participant’s EC score in multi-party discussions [13, 17]. We used them as feature values in this study in a similar manner as in our previous study.
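Concretely, the feature vector for one participant can be viewed as normalized co-occurrence counts over (role, DA category, GTP) triples; the sketch below makes that assumption explicit, with hypothetical variable names.

```python
from collections import Counter

# Roles distinguished in the feature space.
ROLES = ("speaker_keep", "listener_keep",                       # turn-keeping
         "speaker_change", "next_speaker", "listener_change")   # turn-changing

def gtp_da_features(events):
    """
    events: list of (role, da_category, gtp) tuples observed for one participant,
    one entry per IPU pair in which that participant took part.
    Returns occurrence frequencies keyed by (role, da_category, gtp).
    """
    counts = Counter(events)
    total = sum(counts.values()) or 1
    return {key: n / total for key, n in counts.items()}
```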

4 Empathy-Skill-Estimation Models

The goal of this study was to demonstrate that gaze behavior and DA during turn-keeping/changing are useful for estimating individuals’ empathy skill levels. To assess the usefulness of GTP and DA information, we constructed a model that estimates each score using GTP and DA information, one using utterance information, such as duration of speaking and number of speaking turns, and one using simple gaze information (the duration of looking at a speaker or listener in a discussion). We also constructed a model using GTP, DA, utterance information, and simple gaze information together to evaluate the effectiveness of multimodal fusion.

We constructed the estimation models using SMOreg [22], which implements a support vector machine (SVM) for regression in Weka [2], and evaluated the accuracy of the models and the effectiveness of each feature. The settings of the SVM, i.e., the polynomial kernel, cost parameter (C), and kernel hyperparameter (\(\gamma \)), were determined using a grid-search technique. The objective variable is the score of each Davis’ IRI dimension for each person.
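The original models were built with SMOreg in Weka; as a rough analogue only, the same setup can be sketched with scikit-learn’s SVR and a grid search over the polynomial-kernel parameters, assuming the features of Sect. 3 are already assembled into a matrix X and the target scores into a vector y. The parameter ranges below are illustrative, not those used in the study.

```python
# Rough scikit-learn analogue of the Weka SMOreg setup (a sketch, not the
# authors' implementation); parameter ranges are illustrative only.
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

param_grid = {
    "kernel": ["poly"],
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-3, 1e-2, 1e-1, 1],
}

def fit_regressor(X, y):
    """X: feature matrix (participants x features); y: one IRI dimension score."""
    search = GridSearchCV(SVR(), param_grid,
                          scoring="neg_mean_absolute_error", cv=10)
    search.fit(X, y)
    return search.best_estimator_
```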

The details of the five estimation models are as follows.

  • Chance level: This model outputs the mean value of all participants.

  • Utterance amount model: This model uses the ratio of utterances and turns in the discussion.

  • Gaze amount model: This model uses the duration a person was looking at the speaker and listeners in the discussion.

  • GTP+DA model: This model uses the occurrence frequencies of GTPs for each DA category when the person is either the speaker or a listener during turn-keeping and the speaker, next-speaker, or a listener during turn-changing.

  • All model: This model uses the ratio of utterances and turns, duration of looking, and occurrence frequencies of GTPs for each DA category. In other words, the features are integrated with an early-fusion method.

Fig. 3. Mean absolute errors of estimation models based on Davis’ IRI dimensions during turn-keeping/changing.

We used ten-fold cross validation with the data of the 16 participants. The mean absolute error of each estimation model is shown in Fig. 3. The GTP+DA model (0.081 for PT, 0.052 for EC, 0.076 for FS, 0.055 for PD) performed significantly better than the Utterance amount model (0.0840 for PT, 0.742 for EC, 0.530 for FS, 0.454 for PD) and the Gaze amount model (0.529 for PT, 0.477 for EC, 0.322 for FS, 0.383 for PD) for all scores. Moreover, there was no difference in mean absolute error between the GTP+DA model and the All model (0.081 for PT, 0.052 for EC, 0.076 for FS, 0.055 for PD). These results indicate that the combination of GTP and DA information during turn-keeping/changing is a good estimator of an individual’s empathy skill level even without utterance information or simple gaze information. We therefore demonstrated that gaze behavior and DA during turn-keeping/changing are useful in estimating the scores of PT, FS, and PD as well as EC. In other words, they are useful for estimating the scores of all four Davis’ IRI dimensions.
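Under the same assumptions as the sketch in Sect. 4, the per-model comparison can be reproduced as a ten-fold cross-validated mean absolute error; the feature-matrix and score-vector names below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def cv_mae(X, y, folds=10):
    """Ten-fold cross-validated mean absolute error for one IRI dimension score."""
    scores = cross_val_score(SVR(kernel="poly"), X, y,
                             scoring="neg_mean_absolute_error", cv=folds)
    return -scores.mean()

# Hypothetical usage: compare feature sets on the PT score.
# print(cv_mae(X_gtp_da, y_pt), cv_mae(X_utterance, y_pt), cv_mae(X_gaze, y_pt))
```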

5 Conclusion

We explored whether gaze behavior and DA during turn-keeping/changing are useful for estimating the scores of the three Davis’ IRI dimensions of PT, PD, and FS. We constructed and evaluated five estimation models based on these dimensions. We found that gaze behavior and DA are useful for estimating the scores of PT, PD, and FS. Therefore, the gaze behavior and DA during turn-changing/keeping are useful for estimating the scores of all four Davis’ IRI dimensions.

In the future, we plan to verify how effective gaze behavior and DA are for estimating the scores of other social-skill indices and personal traits. We will also explore how effective other behaviors, such as head nods [11, 14], respiration [16, 20], and mouth movement [12], during turn-taking are for estimating an individual’s social skills.