Multimodal Design for Interactive Collaborative Problem-Solving Support

  • Conference paper
  • In: Human Interface and the Management of Information (HCII 2024)

Abstract

When analyzing interactions during collaborative problem solving (CPS) tasks, many different communication modalities are likely to be present and interpretable. These modalities may include speech, gesture, action, affect, pose, and object position in physical space, among others. As AI becomes more prominent in day-to-day use and in learning environments such as classrooms, it has the potential to offer additional insight into how small groups work together to complete CPS tasks. Designing interactive AI to support CPS requires building a system that handles multiple modalities. In this paper we discuss the importance of multimodal features for modeling CPS, how different modal channels must interact in a multimodal AI agent that supports a wide range of tasks, and the design considerations that require forethought when building such a system so that it most effectively interacts with and aids small groups in successfully completing CPS tasks. We also outline tool sets that can be leveraged to support each of the individual features and their integration, as well as potential applications for such a system.
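As an illustration of how several modal channels might be brought together before any joint reasoning, the sketch below aligns per-modality observations (speech, gesture, object position) into shared time windows. This is a minimal, hypothetical example under assumptions of our own, not the system described in the paper; names such as ModalityEvent and fuse_windows are introduced purely for illustration.

```python
# Minimal sketch (assumption, not the authors' implementation): aligning
# timestamped observations from several modality streams into shared
# windows so a downstream CPS model can reason over them jointly.
from dataclasses import dataclass
from collections import defaultdict
from typing import Any


@dataclass
class ModalityEvent:
    """A single observation from one channel (speech, gesture, object pose, ...)."""
    modality: str      # e.g. "speech", "gesture", "object_position"
    timestamp: float   # seconds from session start
    payload: Any       # transcript text, gesture label, object pose, etc.


def fuse_windows(events: list[ModalityEvent],
                 window_size: float = 2.0) -> dict[int, dict[str, list[Any]]]:
    """Group events into fixed-length windows, keyed by window index and modality."""
    windows: dict[int, dict[str, list[Any]]] = defaultdict(lambda: defaultdict(list))
    for ev in events:
        idx = int(ev.timestamp // window_size)
        windows[idx][ev.modality].append(ev.payload)
    return {i: dict(mods) for i, mods in windows.items()}


if __name__ == "__main__":
    stream = [
        ModalityEvent("speech", 0.4, "put the red block on the left"),
        ModalityEvent("gesture", 0.9, "deictic:point"),
        ModalityEvent("object_position", 1.6, {"object": "red_block", "xyz": (0.1, 0.3, 0.0)}),
        ModalityEvent("speech", 2.3, "yeah, that looks right"),
    ]
    for idx, channels in sorted(fuse_windows(stream).items()):
        print(f"window {idx}: {channels}")
```

In a full system, each window's channels would presumably feed modality-specific models (e.g. speech recognition, keypoint-based gesture recognition, object pose estimation) whose outputs are then combined to estimate collaborative states; the windowing here only shows one simple way the channels could be temporally aligned for that purpose.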


Notes

  1. Participants are conventionally indexed 1–3 from left to right in the video frame.


Acknowledgments

This work was partially supported by the National Science Foundation under subcontracts to Colorado State University and Brandeis University on award DRL 2019805. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government. All errors and mistakes are, of course, the responsibility of the authors.

Author information

Corresponding author: Hannah VanderHoeven



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

VanderHoeven, H. et al. (2024). Multimodal Design for Interactive Collaborative Problem-Solving Support. In: Mori, H., Asahi, Y. (eds) Human Interface and the Management of Information. HCII 2024. Lecture Notes in Computer Science, vol 14689. Springer, Cham. https://doi.org/10.1007/978-3-031-60107-1_6


  • DOI: https://doi.org/10.1007/978-3-031-60107-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-60106-4

  • Online ISBN: 978-3-031-60107-1

  • eBook Packages: Computer Science, Computer Science (R0)
