Abstract
We explored the effectiveness of externally observable behaviors in multi-party discussions for estimating an individual’s empathy skill level. In our previous research, we estimated personal empathy skills from externally observable behavior in multi-person dialogues. We demonstrated that gaze behavior towards the end of utterances and the dialogue act (DA), i.e., verbal-behavior information indicating the intention of an utterance, during turn-keeping/changing are important for estimating empathy level. As the estimation target, we focused on Davis’ Interpersonal Reactivity Index (IRI), which measures empathy skill level and consists of four dimensions of empathy: empathic concern (EC), perspective taking (PT), personal distress (PD), and fantasy (FS). We particularly focused on estimating an individual’s EC score. In this research, we explored whether gaze behavior and DA during turn-keeping/changing are also useful regarding the other three dimensions, i.e., PT, PD, and FS, by constructing and evaluating estimation models based on these dimensions. We found that gaze behavior and DA are useful for estimating the scores of these three dimensions. Therefore, gaze behavior and DA during turn-changing/keeping are useful for estimating the scores of all four Davis’ IRI dimensions.
1 Introduction
Social communication skills are fundamental for successful communication in globalized and multi-cultural societies as they are central to education, work, and daily life. Although there is great interest in the notion of communication skills in scientific and real-life applications, the concept is difficult to generally define due to the complexity of communication, wide variety of related cognitive and social abilities, and huge situational variability [7]. Techniques that involve nonverbal behaviors to estimate communication skills have been receiving much attention. For example, researchers have developed models for estimating public speaking skills [29, 32], persuasiveness [28], communication skills during job interviews [25] and group work [26], and leadership [31].
We are working on constructing models for estimating “the empathy skill level” in multi-party discussions. Empathy, which is the ability to understand and share the feelings of others, is one of the most important social skills and has long been studied in psychology [4, 5]. Davis’ Interpersonal Reactivity Index (IRI) [3] includes four dimensions of empathy: perspective taking (PT), i.e., the tendency to adopt another’s psychological perspective; fantasy (FS), i.e., the tendency to strongly identify with fictitious characters; empathic concern (EC), i.e., the tendency to experience feelings of warmth, sympathy, and concern toward others; and personal distress (PD), i.e., the tendency to have feelings of discomfort and concern when witnessing others’ negative experiences. Davis’ IRI has been translated into many languages [6] and used in a wide variety of fields such as neuroscience [1] and genetics [30].
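As background on what is being estimated: the IRI is a self-report questionnaire whose four subscale scores are sums of Likert-scale items (in the full instrument, 7 items per subscale on a 0–4 scale, some reverse-scored). The following is a minimal scoring sketch; the item-to-subscale assignment and the reversed-item set are illustrative placeholders, not Davis’ actual scoring key.

```python
# Sketch of IRI subscale scoring. The real IRI has 28 items (7 per
# subscale, 0-4 Likert, some reverse-scored); the item assignments and
# reversed-item indices below are hypothetical placeholders.
SUBSCALES = {"PT": [0, 1, 2], "FS": [3, 4, 5], "EC": [6, 7, 8], "PD": [9, 10, 11]}
REVERSED = {1, 4}  # hypothetical reverse-scored item indices

def score_iri(responses):
    """responses: list of 0-4 Likert answers, indexed by item number.
    Returns one summed score per subscale."""
    adjusted = [4 - r if i in REVERSED else r for i, r in enumerate(responses)]
    return {dim: sum(adjusted[i] for i in items) for dim, items in SUBSCALES.items()}
```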
In our previous study, we developed a model for estimating an individual’s EC score that uses the gaze behavior and dialogue act (DA) near the end of utterances during turn-keeping/changing as feature values [17]. The model has higher estimation accuracy than one using overall values of verbal/nonverbal behaviors in an entire discussion, such as the amount of utterances and physical motion, used in many previous studies on skill estimation [7, 25, 26, 28, 29, 31, 32]. This suggests that behavior during turn-keeping/turn-changing in a very short time interval is useful for estimating individual empathy skill level. We demonstrated that gaze behavior towards the end of utterances and the DA, i.e., verbal-behavior information indicating the intention of an utterance, during turn-keeping/changing are important for estimating an individual’s EC score.
Since we focused on estimating only an individual’s EC score, it is necessary to verify whether gaze behavior and DA are useful for estimating the scores of the other three dimensions, i.e., PT, PD, and FS.
In this research, we explored whether gaze behavior and DA during turn-keeping/changing are useful in estimating the scores of the remaining three Davis’ IRI dimensions by constructing and evaluating estimation models based on these dimensions. We found that gaze behavior and DA are useful for estimating the scores of PT, PD, and FS. Therefore, gaze behavior and DA during turn-changing/keeping are useful for estimating the scores of all four Davis’ IRI dimensions.
2 Corpus Data
We previously created a face-to-face conversation corpus for developing our estimation model of an individual’s empathy skill level using gaze behavior and DA in multi-party discussions [13, 17]. In this section, we give details of our corpus. The corpus includes eight face-to-face four-person discussions held by four groups of four different people (16 participants in total). In each group, the four participants were Japanese women in their 20s and 30s who had never met before. They sat facing each other (Fig. 1). We labeled the participants, from left to right, P1, P2, P3, and P4. They argued and gave opinions in response to highly divisive questions, such as “Is marriage the same as love?”, and needed to reach a conclusion within ten minutes. Each of the four groups took part in two discussions.
The participants’ voices were recorded with pin microphones attached to their chests, and the entire discussions were video-recorded, including upper-body shots of each participant (recorded at 30 Hz). From the recorded data for all eight discussions (80 min in total), we constructed a multimodal corpus consisting of the following verbal/nonverbal behaviors and the participants’ empathy skill levels.
- Utterances and DAs: We segmented utterances using the inter-pausal unit (IPU) [23]. Utterance intervals were extracted manually from the speech wave, and a stretch of speech followed by at least 200 ms of silence was treated as one IPU. Backchannels were excluded from the created IPUs, and consecutive IPUs from the same person were considered one utterance turn. We then created temporally adjacent IPU pairs and IPU groups during turn-keeping/changing. The data for speech overlaps, i.e., when a listener interrupted during a speaker’s utterance or when two or more participants spoke simultaneously at turn-changing, were excluded from the IPU pairs for analysis. Eventually, there were 1227 IPUs during turn-keeping and 129 during turn-changing.
- Gaze objects: A skilled annotator manually annotated the gaze objects using bust/head and overhead views in each video frame. The gaze objects were the four participants (labeled P1, P2, P3, and P4, as mentioned above) and non-persons, i.e., the walls or floor. To verify annotation quality, three annotators annotated the gaze behavior in our conversation dataset; Conger’s kappa coefficient was 0.887, which, based on the benchmarks of [8], indicates excellent quality.
- Empathy skill level: All participants were asked to answer a questionnaire based on Davis’ IRI [3]. We collected the scores of the four Davis’ IRI dimensions to estimate the participants’ empathy skill levels.
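The IPU segmentation described above (speech intervals merged into one unit until a pause of at least 200 ms occurs) can be sketched in a few lines; the representation of speech as time-ordered (start, end) tuples in seconds is our assumption about the input format.

```python
def merge_into_ipus(speech_intervals, max_pause=0.2):
    """Merge time-ordered (start, end) speech intervals into inter-pausal
    units: consecutive intervals separated by <= max_pause seconds of
    silence belong to the same IPU (200 ms threshold, as in the corpus)."""
    ipus = []
    for start, end in speech_intervals:
        if ipus and start - ipus[-1][1] <= max_pause:
            ipus[-1][1] = end          # short pause: continue the current IPU
        else:
            ipus.append([start, end])  # pause too long: start a new IPU
    return [tuple(i) for i in ipus]
```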
All verbal and nonverbal behavior data were integrated at 30 Hz for visual display using the NTT Multimodal Meeting Viewer [27]. This viewer enables us to annotate the multimodal data frame-by-frame and observe the data intuitively.
3 Feature Values
We used the gaze behavior and DA during turn-changing/turn-keeping as feature values for developing our estimation model of empathy skill level [13, 17]. In this section, we give details of these feature values.
We first introduce the feature values of gaze behavior. For analyzing gaze behavior, we focused on the gaze transition pattern (GTP), a temporal transition of a participant’s gaze behavior near the end of an utterance [10, 15, 18, 19, 21]. A GTP is expressed as an n-gram, defined as a sequence of gaze-direction shifts. We previously demonstrated that the occurrence frequencies of GTPs differ significantly between the speaker and listeners during turn-keeping, and among the speaker, the listener who becomes the next speaker (hereafter “next-speaker”), and the listeners who do not become the next speaker (hereafter “listeners”) during turn-changing. We also demonstrated that the GTP is effective for estimating the next speaker in multi-party discussions. Thus, we used GTPs as gaze parameters for analysis. To generate a GTP, we focused on the gazed object during a 1200-ms interval, 1000 ms before and 200 ms after the end of the utterance, since the GTP during these 1200 ms is important for turn-taking [10, 15, 18, 19, 21]. Each gazed person or object is classified as a speaker, listener, or non-person and labeled accordingly. We also considered whether there was mutual gaze and classified gaze behavior using the following seven gaze labels.
- S: Person looks at the speaker without mutual gaze (the speaker does not look back at the listener).
- SM: Person looks at the speaker with mutual gaze (the speaker looks back at the listener).
- L1–L3: Person looks at another listener without mutual gaze. Labels L1, L2, and L3 indicate different people; the sitting position does not matter. For example, if P1, who is speaking, looks at P2, then P3, then P2 again, the gaze transition pattern of P1 is L1-L2-L1.
- LM1–LM3: Person looks at another listener with mutual gaze. Labels LM1, LM2, and LM3 indicate different people.
- N: Person looks at the next speaker without mutual gaze (only during turn-changing).
- NM: Person looks at the next speaker with mutual gaze (only during turn-changing).
- X: Person looks at a non-person, such as the floor or ceiling, i.e., gaze aversion.
Figure 2 shows how GTPs are constructed: P1 finishes speaking, then P2 starts to speak. Person P1 gazes at P2 after she gazes at a non-person during the analysis interval. When P1 looks at P2, P2 looks at P1; that is, there is mutual gaze. Therefore, P1’s GTP is X-NM. Person P2 looks at P4 after making eye contact with P1; thus, P2’s GTP is SM-L1. Person P3 looks at a non-person after looking at P1; thus, P3’s GTP is S-X. Person P4 looks at P2 and P3 after looking at a non-person; thus, P4’s GTP is X-N-L1.
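Given a frame-wise sequence of these gaze labels over the 1200-ms analysis window, a GTP n-gram such as P1’s X-NM can be derived by collapsing consecutive duplicate labels. A minimal sketch, assuming the per-frame labels (e.g., 30 Hz samples) are already available as a list:

```python
def gaze_transition_pattern(frame_labels):
    """Collapse a frame-wise gaze-label sequence sampled over the 1200-ms
    analysis window into a GTP n-gram such as 'X-NM': consecutive
    duplicates are merged and the transitions are kept in order."""
    pattern = []
    for label in frame_labels:
        if not pattern or pattern[-1] != label:
            pattern.append(label)
    return "-".join(pattern)
```

For example, P4’s gaze in Fig. 2 (non-person, then P2, then P3) collapses to X-N-L1.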
For DA analysis, a DA for each IPU was extracted using an estimation technique for Japanese [9, 24]. This technique estimates the DA of a sentence from among 33 DA categories using word n-grams, semantic categories (obtained from the Japanese thesaurus Goi-Taikei), and character n-grams. We grouped the 33 categories into the following five major categories.
- Provision: Utterance for providing information
- Self-disclosure: Utterance for disclosing oneself
- Empathy: Utterance intending empathy
- Turn-yielding: Utterance intending a listener to speak next (e.g., an utterance of question, suggestion, or confirmation)
- Others: Utterance not included in the above four categories
About 90% of the utterances fell into the Provision, Self-disclosure, Empathy, and Turn-yielding categories.
In our previous study, we demonstrated that the occurrence frequencies of GTPs accompanying each DA category (for the speaker and listeners during turn-keeping, and for the speaker, next-speaker, and listeners during turn-changing) are effective for estimating a participant’s EC score [13, 17]. We used them as feature values in this study in the same manner as in our previous study.
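A sketch of how such occurrence-frequency features could be assembled per participant. The event representation (dicts with 'role', 'da', and 'gtp' keys) and the normalization by the total event count are our assumptions, not the paper's exact feature pipeline:

```python
from collections import Counter

def gtp_da_features(events):
    """Count each (role, DA category, GTP) combination for one participant
    and normalize to occurrence frequencies so that discussions of
    different lengths are comparable. `events` is a list of dicts with
    hypothetical keys: 'role' (e.g. 'speaker', 'next-speaker', 'listener'),
    'da' (one of the five major DA categories), and 'gtp' (e.g. 'X-NM')."""
    counts = Counter((e["role"], e["da"], e["gtp"]) for e in events)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}
```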
4 Empathy-Skill-Estimation Models
The goal of this study was to demonstrate that gaze behavior and DA during turn-keeping/changing are useful for estimating individuals’ empathy skill levels. To compare the usefulness of GTP and DA information, we constructed a model that estimates each dimension’s score using GTP and DA information; one using utterance information, such as the duration of speaking and the number of speaking turns; and one using simple gaze information (the duration of looking at a speaker or listener in a discussion). We also constructed a model using GTP, DA, utterance information, and simple gaze information together to evaluate the effectiveness of multimodal fusion.
We constructed the estimation models using SMOreg [22], which implements a support vector machine (SVM) for regression in Weka [2], and evaluated the accuracy of the models and the effectiveness of each feature. The settings of the SVM, i.e., the polynomial kernel, the cost parameter (C), and the kernel hyperparameter (γ), were determined using a grid-search technique. The objective variable is each person’s score on the Davis’ IRI dimension being estimated.
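The paper’s models were built with Weka’s SMOreg; as a rough analogue (our own sketch, not the authors’ code), support vector regression with a polynomial kernel and a grid search over C and gamma can be set up in scikit-learn as follows. The grid values shown are arbitrary placeholders:

```python
# Sketch: SVR with a polynomial kernel, with C and gamma chosen by grid
# search, mirroring the paper's SMOreg setup in scikit-learn. The grid
# values are placeholders, not the values used in the paper.
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

def fit_svr(features, scores):
    """features: (n_participants, n_features); scores: one IRI dimension
    score per participant. Returns the best estimator found by the grid
    search under mean absolute error."""
    grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
    search = GridSearchCV(SVR(kernel="poly"), grid,
                          scoring="neg_mean_absolute_error", cv=3)
    search.fit(features, scores)
    return search.best_estimator_
```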
The details of the five estimation models are as follows.
- Chance level: This model outputs the mean value of all participants.
- Utterance amount model: This model uses the ratio of utterances and turns in the discussion.
- Gaze amount model: This model uses the duration a person was looking at the speaker and listeners in the discussion.
- GTP+DA model: This model uses the occurrence frequencies of GTPs for each DA category when the person is either the speaker or a listener during turn-keeping and the speaker, next-speaker, or a listener during turn-changing.
- All model: This model uses the ratio of utterances and turns, the duration of looking, and the occurrence frequencies of GTPs for each DA category. In other words, the features are integrated with an early-fusion method.
We used ten-fold cross-validation with the data of the 16 participants. The mean absolute error of each estimation model is shown in Fig. 3. The GTP+DA model (0.081 for PT, 0.052 for EC, 0.076 for FS, 0.055 for PD) performed significantly better than the Utterance amount model (0.840 for PT, 0.742 for EC, 0.530 for FS, 0.454 for PD) and the Gaze amount model (0.529 for PT, 0.477 for EC, 0.322 for FS, 0.383 for PD) for all scores. Moreover, there was no difference in mean absolute error between the GTP+DA model and the All model (0.081 for PT, 0.052 for EC, 0.076 for FS, 0.055 for PD). These results indicate that the combination of GTP and DA information during turn-keeping/changing is a good estimator of an individual’s empathy skill level even without utterance information or simple gaze information. Therefore, we demonstrated that gaze behavior and DA during turn-keeping/changing are useful in estimating the scores of PT, FS, and PD as well as EC; in other words, they are useful for estimating the scores of all four Davis’ IRI dimensions.
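The evaluation protocol (k-fold cross-validation compared by mean absolute error, with the chance level predicting the mean of all participants) can be sketched in plain Python. The interleaved fold assignment and the `fit`/`predict` callback interface are our own simplifications:

```python
from statistics import mean

def kfold_mae(X, y, fit, predict, folds=10):
    """Mean absolute error under k-fold cross-validation. `fit(X, y)`
    returns a model and `predict(model, X)` returns estimates, so any
    regressor can be plugged in. Folds are assigned by interleaving."""
    n = len(y)
    errors = []
    for k in range(folds):
        test_idx = set(range(k, n, folds))
        train = [i for i in range(n) if i not in test_idx]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in sorted(test_idx):
            errors.append(abs(predict(model, [X[i]])[0] - y[i]))
    return mean(errors)

# Chance-level baseline: predict the mean score of the training
# participants, ignoring the features, as the paper's chance model does.
def fit_chance(X, y):
    return mean(y)

def predict_chance(model, X):
    return [model] * len(X)
```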
5 Conclusion
We explored whether gaze behavior and DA during turn-kee**/changing are useful for estimating the scores of the three Davis’ IRI dimensions of PT, PD, and FS. We constructed and evaluated five estimation models based on these dimensions. We found that gaze behavior and DA are useful for estimating the scores of PT, PD, and FS. Therefore, the gaze behavior and DA during turn-changing/kee** are useful for estimating the scores of all four Davis’ IRI dimensions.
In the future, we plan to verify how effective gaze behavior and DA are for estimating scores on other social-skill indices and personality traits. We will also explore how effective other behaviors during turn-taking, such as head nods [11, 14], respiration [16, 20], and mouth movement [12], are for estimating individuals’ social skills.
References
Banissy, M.J., Kanai, R., Walsh, V., Rees, G.: Inter-individual differences in empathy are reflected in human brain structure. NeuroImage 62, 2034–2039 (2012)
Bouckaert, R.R., et al.: WEKA-experiences with a Java open-source project. J. Mach. Learn. Res. 11, 2533–2541 (2010)
Davis, M.H.: A multidimensional approach to individual differences in empathy. 10 (1980)
de Waal, F.B.M.: The antiquity of empathy. Science 336, 874–876 (2012)
Decetya, J., Svetlova, M.: Putting together phylogenetic and ontogenetic perspectives on empathy. Dev. Cogn. Neurosci. 2(1), 1–24 (2012)
Fernandez, A., Dufey, M., Kramp, U.: Testing the psychometric properties of the interpersonal reactivity index (IRI) in Chile: empathy in a different cultural context. Eur. J. Psychol. Assess. 27, 179–185 (2011)
Greene, J.O., Burleson, B.R.: Handbook of Communication and Social Interaction Skills. Psychology Press, Philadelphia (2003)
Gwet, K.L.: Handbook of inter-rater reliability: the definitive guide to measuring the extent of agreement among raters. In: Advanced Analytics, LLC (2014)
Higashinaka, R., et al.: Towards an open-domain conversational system fully based on natural language processing. In: International Conference on Computational Linguistics, pp. 928–939 (2014)
Ishii, R., Kumano, S., Otsuka, K.: Multimodal fusion using respiration and gaze behavior for predicting next speaker in multi-party meetings. In: ICMI, pp. 99–106 (2015)
Ishii, R., Kumano, S., Otsuka, K.: Predicting next speaker using head movement in multi-party meetings. In: ICASSP, pp. 2319–2323 (2015)
Ishii, R., Kumano, S., Otsuka, K.: Analyzing mouth-opening transition pattern for predicting next speaker in multi-party meetings. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 209–216 (2016)
Ishii, R., Kumano, S., Otsuka, K.: Analyzing gaze behavior during turn-taking for estimating empathy skill level. In: Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI 2017, pp. 365–373. ACM, New York (2017)
Ishii, R., Kumano, S., Otsuka, K.: Prediction of next-utterance timing using head movement in multi-party meetings. In: Proceedings of the 5th International Conference on Human Agent Interaction, HAI 2017, pp. 181–187. ACM, New York (2017)
Ishii, R., Otsuka, K., Kumano, S., Yamamoto, J.: Predicting of who will be the next speaker and when using gaze behavior in multiparty meetings. ACM Trans. Interact. Intell. Syst. 6(1), 4 (2016)
Ishii, R., Otsuka, K., Kumano, S., Yamamoto, J.: Using respiration to predict who will speak next and when in multiparty meetings. ACM Trans. Interact. Intell. Syst. 6(2), 20 (2016)
Ishii, R., Otsuka, K., Kumano, S., Higashinaka, R., Tomita, J.: Analyzing gaze behavior and dialogue act during turn-taking for estimating empathy skill level. In: Proceedings of the 20th ACM International Conference on Multimodal Interaction, ICMI 2018, pp. 31–39. ACM, New York (2018)
Ishii, R., Otsuka, K., Kumano, S., Matsuda, M., Yamato, J.: Predicting next speaker and timing from gaze transition patterns in multi-party meetings. In: Proceedings of the International Conference on Multimodal Interaction, pp. 79–86 (2013)
Ishii, R., Otsuka, K., Kumano, S., Yamato, J.: Analysis and modeling of next speaking start timing based on gaze behavior in multi-party meetings. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 694–698 (2014)
Ishii, R., Otsuka, K., Kumano, S., Yamato, J.: Analysis of respiration for prediction of who will be next speaker and when? In multi-party meetings. In: Proceedings of the International Conference on Multimodal Interaction, pp. 18–25 (2014)
Ishii, R., Otsuka, K., Kumano, S., Yamato, J.: Analysis of timing structure of eye contact in turn-changing. In: Proceedings of the 7th Workshop on Eye Gaze in Intelligent Human Machine Interaction, GazeIn 2014, pp. 15–20. ACM, New York (2014)
Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., Murthy, K.R.K.: Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Comput. 13(3), 637–649 (2001)
Koiso, H., Horiuchi, Y., Tutiya, S., Ichikawa, A., Den, Y.: An analysis of turn-taking and backchannels based on prosodic and syntactic features in Japanese map task dialogs. Lang. Speech 41, 295–321 (1998)
Meguro, T., Higashinaka, R., Minami, Y., Dohsaka, K.: Controlling listening-oriented dialogue using partially observable Markov decision processes. In: International Conference on Computational Linguistics, pp. 761–769 (2010)
Nguyen, L., Frauendorfer, D., Mast, M., Gatica-Perez, D.: Hire me: computational inference of hirability in employment interviews based on nonverbal behavior. IEEE Trans. Multimed. 16(4), 1018–1031 (2014)
Okada, S., et al.: Estimating communication skills using dialogue acts and nonverbal features in multiple discussion datasets. In: Proceedings of the International Conference on Multimodal Interaction, pp. 169–176 (2016)
Otsuka, K., Araki, S., Mikami, D., Ishizuka, K., Fujimoto, M., Yamato, J.: Realtime meeting analysis and 3D meeting viewer based on omnidirectional multimodal sensors. In: ACM International Conference on Multimodal Interfaces and Workshop on Machine Learning for Multimodal Interaction, pp. 219–220 (2009)
Park, S., Shim, H.S., Chatterjee, M., Sagae, K., Morency, L.-P.: Computational analysis of persuasiveness in social multimedia: a novel dataset and multimodal prediction approach. In: Proceedings of the ACM ICMI, pp. 50–57 (2014)
Ramanarayanan, V., Leong, C.W., Feng, G., Chen, L., Suendermann-Oeft, D.: Evaluating speech, face, emotion and body movement time-series features for automated multimodal presentation scoring. In: Proceedings of the ACM ICMI, pp. 23–30 (2015)
Rodrigues, S.M., Saslow, L.R., Garcia, N., John, O.P., Keltner, D.: Oxytocin receptor genetic variation relates to empathy and stress reactivity in humans. Proc. Nat. Acad. Sci. U.S.A. 106, 21437–21441 (2009)
Sanchez-Cortes, D., Aran, O., Mast, M.S., Gatica-Perez, D.: A nonverbal behavior approach to identify emergent leaders in small groups. IEEE Trans. Multimed. 14(3), 816–832 (2012)
Wortwein, T., Chollet, M., Schauerte, B., Morency, L.-P., Stiefelhagen, R., Scherer, S.: Multimodal public speaking performance assessment. In: Proceedings of the ACM ICMI, pp. 43–50 (2015)
© 2019 Springer Nature Switzerland AG
Ishii, R., Otsuka, K., Kumano, S., Higashinaka, R., Tomita, J. (2019). Estimating Interpersonal Reactivity Scores Using Gaze Behavior and Dialogue Act During Turn-Changing. In: Meiselwitz, G. (eds) Social Computing and Social Media. Communication and Social Communities. HCII 2019. Lecture Notes in Computer Science(), vol 11579. Springer, Cham. https://doi.org/10.1007/978-3-030-21905-5_4
Print ISBN: 978-3-030-21904-8
Online ISBN: 978-3-030-21905-5