Uncertainty-aware Audiovisual Activity Recognition using Deep Bayesian Variational Inference

Efficient multimodal fusion should intelligently weigh the relative significance of each modality, or fall back to the more reliable modes of sensing. To design robust and reliable multimodal AI systems, it is essential to quantify the uncertainty estimates of individual modalities within deep neural networks (DNNs) for effective multimodal fusion. We illustrate the proposed method on activity recognition with vision and audio modalities, and show that it outperforms both a deterministic DNN baseline and the Monte Carlo (MC) dropout method.
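The general idea of uncertainty-aware fusion can be sketched as follows: draw stochastic forward passes per modality (as variational Bayesian or MC-dropout methods do), estimate each modality's predictive uncertainty, and fuse predictions with weights that shrink for uncertain modalities. This is an illustrative sketch only, not the paper's actual architecture; the entropy-based weighting scheme and all function names are assumptions.

```python
import numpy as np

def predictive_stats(sample_logits):
    """sample_logits: (S, C) array of logits from S stochastic forward
    passes for one modality. Returns the mean class probabilities and
    the predictive entropy, used here as an uncertainty estimate."""
    z = sample_logits - sample_logits.max(axis=1, keepdims=True)  # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    mean_probs = probs.mean(axis=0)
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12))
    return mean_probs, entropy

def uncertainty_weighted_fusion(modality_samples):
    """Fuse per-modality predictions, down-weighting modalities whose
    predictive entropy (uncertainty) is high."""
    stats = [predictive_stats(s) for s in modality_samples]
    weights = np.array([1.0 / (h + 1e-12) for _, h in stats])
    weights /= weights.sum()
    fused = sum(w * p for w, (p, _) in zip(weights, stats))
    return fused / fused.sum()

# Toy example: a confident audio modality and an uninformative video modality.
audio = np.tile(np.array([[5.0, 0.0, 0.0]]), (10, 1))   # peaked on class 0
video = np.zeros((10, 3))                                # uniform, high entropy
fused = uncertainty_weighted_fusion([audio, video])
```

In this toy case the fused prediction follows the low-entropy audio modality, which is the fallback-to-reliable-sensing behavior the abstract describes.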