Combining Voice and Gesture for Human Computer Interaction
Supervisors: Denis Lalanne, Matthias Schwaller
Student: Haleh Chizari
Project status: Finished
Year: 2013
Recently, there has been a great deal of interest in multimodal interfaces thanks to their potential to provide more natural human-machine interaction, particularly in applications where the use of a mouse or keyboard is tedious or inappropriate. Among others, two input types are increasingly integrated in multimodal interfaces: voice and gesture. There are many ways to temporally fuse speech and gesture inputs, but user-friendly ones are the most relevant for real-world applications. In this context, this research project focuses on user effort and perceived quality as two important criteria of user-friendliness in the interaction between the user and the computer.
The project starts with the study and design of multimodal commands for several typical operations such as selection, dragging and dropping, rotation, and resizing. Depending on whether gesture or voice is used to initiate or to parameterize an operation, different sets of combined voice-gesture commands are proposed. The multimodal set that seems most suitable for the majority of users is then selected and implemented in C# using the Microsoft Kinect SDK.
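For illustration only, the sketch below shows one way such a temporal voice-gesture fusion could be wired up in C# with the Kinect SDK 1.x skeleton stream and a small keyword grammar. The command vocabulary, confidence threshold, and one-second fusion window are assumptions made for the example, not the command set designed in the project, and the default audio device stands in for the Kinect microphone array for brevity.

```csharp
// Minimal sketch: a spoken keyword initiates an operation, the tracked hand
// position parameterizes it, and the two are fused if they arrive within a
// short time window. Illustrative values throughout.
using System;
using System.Linq;
using Microsoft.Kinect;
using Microsoft.Speech.Recognition;

class VoiceGestureFusion
{
    static string lastCommand;                 // most recent recognized keyword
    static DateTime lastCommandTime;           // when it was recognized
    static readonly TimeSpan FusionWindow = TimeSpan.FromSeconds(1);

    static void Main()
    {
        var sensor = KinectSensor.KinectSensors
            .FirstOrDefault(s => s.Status == KinectStatus.Connected);
        if (sensor == null) return;

        // Voice channel: a small keyword grammar ("select", "rotate", ...).
        var engine = new SpeechRecognitionEngine();
        var commands = new Choices("select", "move", "rotate", "resize", "drop");
        engine.LoadGrammar(new Grammar(new GrammarBuilder(commands)));
        engine.SpeechRecognized += (s, e) =>
        {
            if (e.Result.Confidence < 0.6) return;   // illustrative threshold
            lastCommand = e.Result.Text;
            lastCommandTime = DateTime.UtcNow;
        };
        engine.SetInputToDefaultAudioDevice();          // simplified audio input
        engine.RecognizeAsync(RecognizeMode.Multiple);

        // Gesture channel: track the right hand from the skeleton stream.
        sensor.SkeletonStream.Enable();
        sensor.SkeletonFrameReady += (s, e) =>
        {
            using (var frame = e.OpenSkeletonFrame())
            {
                if (frame == null) return;
                var skeletons = new Skeleton[frame.SkeletonArrayLength];
                frame.CopySkeletonDataTo(skeletons);
                var user = skeletons.FirstOrDefault(
                    k => k.TrackingState == SkeletonTrackingState.Tracked);
                if (user == null) return;

                // Temporal fusion: combine the keyword with the hand position
                // only if both inputs fall within the fusion window.
                if (lastCommand != null &&
                    DateTime.UtcNow - lastCommandTime < FusionWindow)
                {
                    var hand = user.Joints[JointType.HandRight].Position;
                    Console.WriteLine("{0} at ({1:F2}, {2:F2})",
                        lastCommand, hand.X, hand.Y);
                    lastCommand = null;   // consume the command once fused
                }
            }
        };

        sensor.Start();
        Console.ReadLine();   // keep both streams running until Enter
    }
}
```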
In the next step, the functionality of the implemented multimodal design is compared with that of the existing set in the laboratory, which uses unimodal gesture commands. To do this, we asked ten people to evaluate both the unimodal and the multimodal set against qualitative and quantitative criteria such as accuracy, speed, efficiency, perceived quality, effort, and error susceptibility. Statistical analyses show that the proposed multimodal set performs quantitatively on par with the existing unimodal set in selection, rotation, and mixed activities, but underperforms in the move and resize tasks. Furthermore, a t-test on the number of user errors demonstrates that the multimodal technology significantly reduces user errors in all considered activities. On the qualitative side, the questionnaire results show that users slightly favor the multimodal set over its unimodal counterpart, citing better overall performance as well as lower cognitive load and effort.
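As a reminder of how such a comparison is computed, the sketch below runs a paired t-test on hypothetical per-participant error counts for the two sets; the numbers are invented for the example and are not the study's data.

```csharp
// Illustrative paired t-test on made-up error counts (unimodal vs. multimodal)
// for ten participants: t = mean(d) / (sd(d) / sqrt(n)), with n - 1 degrees
// of freedom, where d is the per-participant difference.
using System;
using System.Linq;

class PairedTTest
{
    static void Main()
    {
        double[] unimodal   = { 6, 4, 7, 5, 8, 6, 5, 7, 6, 9 };
        double[] multimodal = { 3, 2, 4, 3, 5, 2, 3, 4, 3, 5 };

        double[] diff = unimodal.Zip(multimodal, (u, m) => u - m).ToArray();
        int n = diff.Length;
        double mean = diff.Average();
        // Sample standard deviation of the differences (n - 1 denominator).
        double sd = Math.Sqrt(diff.Sum(d => (d - mean) * (d - mean)) / (n - 1));
        double t = mean / (sd / Math.Sqrt(n));

        Console.WriteLine("t({0}) = {1:F3}", n - 1, t);
        // Compare |t| against the two-tailed critical value for n - 1 degrees
        // of freedom (about 2.262 for df = 9 at alpha = 0.05).
    }
}
```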
Document: report.pdf