Multimodal Design for Interactive Collaborative Problem-Solving Support

  • Conference paper
  • In: Human Interface and the Management of Information (HCII 2024)

Abstract

When analyzing interactions during collaborative problem solving (CPS) tasks, many different communication modalities are likely to be present and interpretable. These modalities may include speech, gesture, action, affect, pose, and object position in physical space, among others. As AI becomes more prominent in day-to-day use and in learning environments such as classrooms, it has the potential to offer additional insight into how small groups work together to complete CPS tasks. Designing interactive AI to support CPS requires building a system that handles multiple modalities. In this paper we discuss the importance of multimodal features for modeling CPS, how different modal channels must interact in a multimodal AI agent that supports a wide range of tasks, and the design considerations that require forethought when building such a system so that it most effectively interacts with and aids small groups in successfully completing CPS tasks. We also outline tool sets that can be leveraged to support each of the individual features and their integration, as well as potential applications for such a system.
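As an illustration of how several modal channels might be brought together before any joint reasoning, the sketch below aligns per-modality observations (speech, gesture, object position) into shared time windows. This is a minimal, hypothetical example under assumptions of our own, not the system described in the paper; names such as ModalityEvent and fuse_windows are introduced purely for illustration.

```python
# Minimal sketch (assumption, not the authors' implementation): aligning
# timestamped observations from several modality streams into shared
# windows so a downstream CPS model can reason over them jointly.
from dataclasses import dataclass
from collections import defaultdict
from typing import Any


@dataclass
class ModalityEvent:
    """A single observation from one channel (speech, gesture, object pose, ...)."""
    modality: str      # e.g. "speech", "gesture", "object_position"
    timestamp: float   # seconds from session start
    payload: Any       # transcript text, gesture label, object pose, etc.


def fuse_windows(events: list[ModalityEvent],
                 window_size: float = 2.0) -> dict[int, dict[str, list[Any]]]:
    """Group events into fixed-length windows, keyed by window index and modality."""
    windows: dict[int, dict[str, list[Any]]] = defaultdict(lambda: defaultdict(list))
    for ev in events:
        idx = int(ev.timestamp // window_size)
        windows[idx][ev.modality].append(ev.payload)
    return {i: dict(mods) for i, mods in windows.items()}


if __name__ == "__main__":
    stream = [
        ModalityEvent("speech", 0.4, "put the red block on the left"),
        ModalityEvent("gesture", 0.9, "deictic:point"),
        ModalityEvent("object_position", 1.6, {"object": "red_block", "xyz": (0.1, 0.3, 0.0)}),
        ModalityEvent("speech", 2.3, "yeah, that looks right"),
    ]
    for idx, channels in sorted(fuse_windows(stream).items()):
        print(f"window {idx}: {channels}")
```

In a full system, each window's channels would presumably feed modality-specific models (e.g. speech recognition, keypoint-based gesture recognition, object pose estimation) whose outputs are then combined to estimate collaborative states; the windowing here only shows one simple way the channels could be temporally aligned for that purpose.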


Notes

  1. Participants are conventionally indexed 1–3 from left to right in the video frame.


Acknowledgments

This work was partially supported by the National Science Foundation under subcontracts to Colorado State University and Brandeis University on award DRL 2019805. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government. All errors and mistakes are, of course, the responsibility of the authors.

Author information

Corresponding author: Hannah VanderHoeven



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

VanderHoeven, H. et al. (2024). Multimodal Design for Interactive Collaborative Problem-Solving Support. In: Mori, H., Asahi, Y. (eds) Human Interface and the Management of Information. HCII 2024. Lecture Notes in Computer Science, vol 14689. Springer, Cham. https://doi.org/10.1007/978-3-031-60107-1_6


  • DOI: https://doi.org/10.1007/978-3-031-60107-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-60106-4

  • Online ISBN: 978-3-031-60107-1

  • eBook Packages: Computer Science, Computer Science (R0)
