We introduce an immersive system prototype that integrates face, gesture, and speech recognition techniques to support multi-modal human-computer interaction. Embedded in an indoor room setting, a multi-camera system is developed to monitor the user's facial behavior, body gestures, and spatial location in the room. A server fuses the different sensor inputs in a time-sensitive manner so that the system knows who is doing what, and where, in real time. When these visual cues are correlated with speech input, the system can better understand the user's intention for interaction. We evaluate the performance of the core recognition techniques on both benchmark and self-collected datasets and demonstrate the benefit of the system in various use cases.
Reference
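Rui Zhao, Kang Wang, Rahul Divekar, Robert Rouhani, Hui Su, and Qiang Ji. An immersive system with multi-modal human-computer interaction. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 517-524. IEEE, 2018.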
Bibtex
@inproceedings{zhao2018immersive,
  title={An immersive system with multi-modal human-computer interaction},
  author={Zhao, Rui and Wang, Kang and Divekar, Rahul and Rouhani, Robert and Su, Hui and Ji, Qiang},
  booktitle={2018 13th IEEE International Conference on Automatic Face \& Gesture Recognition (FG 2018)},
  pages={517--524},
  year={2018},
  organization={IEEE}
}