Multimodal 2DCNN action recognition from RGB-D data with video summarization
Tutor / director / evaluatorEscalera, Sergio
Document typeMaster thesis
Rights accessOpen Access
Human action recognition is nowadays within the most active computer vision research areas. The problem of action recognition is challenging due to the large intra-class variations, low video resolution and high dimension of video data, among others things. Recent development of affordable depth sensors like Microsoft Kinect leads to new opportunities in this field by providing both RGB and depth data. Multimodal fusion in this scenario can greatly help to boost performance of action recognition methods. Recently, although handcrafted features are still widely used owing to their high performance and low computational complexity, there has been a migration from traditional handcrafting towards deep learning. In this work, 2DCNN is extended to multimodal (MM2DCNN) by introducing scene flow fields as the new input for an additional stream. Then, model outputs are integrated in a late fusion fashion. Furthermore, this work also focuses on analyzing the impact of video summarization in action recognition models. To this end, four different summarization techniques have been applied and compared to uniform random selection. Video summarization algorithms aim to select the most discriminative frames of each video, providing keyframe sequences as a result. Each of these methods has been performed over the two different types of data available, extracting keyframe sequences from RGB and depth videos separately. On top of that, we also perform a novel hybrid-like summarization, namely RGB-D synopsis, by combining results from both sequences. Finally, we evaluate and compare the results of each modality in three state-of-the-art action datasets, integrating them with a late fusion for every summarization sequence modality along with uniform random selection. Experimental results show that our new representation improves the accuracy in comparison to 2DCNNs. Besides, the use of video summarization succeeds in boosting the final performance when compared to random frames.