Expanding Multilingual Co-Speech Interaction: The Impact of Enhanced Gesture Units in Text-to-Gesture Synthesis for Digital Humans
- Authors
- Ali, Ghazanfar; Kim, Woojoo; Anwar, Muhammad Shahid; Hwang, Jae-In; Choi, Ahyoung
- Issue Date
- 2025-08
- Publisher
- Institute of Electrical and Electronics Engineers Inc.
- Citation
- IEEE Access, v.13, pp.145144 - 145157
- Abstract
- In this study, we explore the effects of co-speech gesture generation on user experience in 3D digital human interaction by testing two hypotheses. The first posits that increasing the number of gestures enhances the user experience across criteria such as naturalness, human-likeness, temporal consistency, semantic consistency, and social presence. The second suggests that language translation does not degrade the user experience across these criteria. To test these hypotheses, we investigated three conditions with a digital human: voice only with no gestures, limited co-speech gestures (56 gestures), and full system functionality with over 2,000 unique gestures. For the second hypothesis, we provided multilingual support through language translation, retrieving gestures from an English rule base. We extracted text and 2D pose from English videos and matched the poses with gesture units derived from Korean speakers' motion-capture sequences, building a comprehensive rule base used to retrieve gestures for a given text input; non-English input was translated to English for text matching. Our method uses an improved pipeline to extract text, 2D pose data, and 3D gesture units, together with a gesture-pose matching model trained with deep contrastive learning, to retrieve gestures from a rule base containing 210,000 rules. This approach optimizes alignment and generates realistic, semantically consistent co-speech gestures adaptable to various languages. A user study evaluated both hypotheses. The results underscored the positive impact of diverse gestures, supporting the first hypothesis, and showed that multilingual capabilities did not degrade the user experience, confirming the second. Highlighting the scalability and flexibility of our method, this study offers insights into cross-lingual data and expert systems for gesture generation, contributing to more engaging and immersive digital human interactions and to the broader field of human-computer interaction.
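The abstract's translate-then-retrieve design pairs a machine-translation step with a contrastive text-gesture matcher. Below is a minimal PyTorch sketch of what such a dual-encoder matcher with a symmetric InfoNCE loss could look like; the class and function names (DualEncoder, info_nce_loss, retrieve_gesture) and all feature dimensions are illustrative assumptions, not the paper's actual architecture or rule-base format.

```python
# Minimal sketch of a dual-encoder text-gesture matcher with contrastive
# (InfoNCE) training. All names and dimensions are hypothetical; the paper's
# real pipeline, encoders, and 210,000-rule base are not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Projects text features and gesture-unit features into a shared space."""
    def __init__(self, text_dim=768, pose_dim=150, embed_dim=256):
        super().__init__()
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))
        self.pose_proj = nn.Sequential(
            nn.Linear(pose_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))

    def forward(self, text_feats, pose_feats):
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        g = F.normalize(self.pose_proj(pose_feats), dim=-1)
        return t, g

def info_nce_loss(t, g, temperature=0.07):
    """Symmetric InfoNCE: matched (text, gesture) pairs in the batch are
    positives; every other pairing serves as a negative."""
    logits = t @ g.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

@torch.no_grad()
def retrieve_gesture(model, text_feat, rule_base_poses):
    """Return the index of the gesture unit whose embedding best matches
    the (already English) text embedding."""
    t, g = model(text_feat.unsqueeze(0), rule_base_poses)
    return int((t @ g.T).argmax())
```

At inference time, a non-English utterance would first be translated to English with an off-the-shelf MT system before encoding, mirroring the translation-for-text-matching step described in the abstract.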
- Keywords
- Nonverbal behavior; Appearance; Beat; User experience; Videos; Motion capture; Digital humans; Three-dimensional displays; Translation; Semantics; Multilingual; Animation; Contrastive learning; Co-speech gestures; gesture generation; HCI; machine learning; augmented/virtual/mixed realities
- URI
- https://pubs.kist.re.kr/handle/201004/153162
- DOI
- 10.1109/ACCESS.2025.3596328
- Appears in Collections:
- KIST Article > Others
- Files in This Item:
There are no files associated with this item.