Action tube extraction based 3D-CNN for RGB-D action recognition
Document type: Conference report
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Rights access: Open Access
In this paper we propose a novel action tube extractor for RGB-D action recognition in trimmed videos. The action tube extractor takes as input a video and outputs an action tube. The method consists of two parts: spatial tube extraction and temporal sampling. The first part is built upon MobileNet-SSD and its role is to define the spatial region where the action takes place. The second part is based on the structural similarity index (SSIM) and is designed to remove frames without obvious motion from the primary action tube. The final extracted action tube has two benefits: 1) a higher ratio of ROI (subjects of action) to background; 2) most frames contain obvious motion change. We propose to use a two-stream (RGB and Depth) I3D architecture as our 3D-CNN model. Our approach outperforms the state-of-the-art methods on the OA and NTU RGB-D datasets. © 2018 IEEE.
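The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify these details, so the following are assumptions: per-frame bounding boxes are taken as given (in the paper they would come from MobileNet-SSD), the spatial tube is the union of those boxes, SSIM is computed globally over the whole crop rather than with a sliding window, each frame is compared against the last kept frame, and the threshold of 0.95 is a hypothetical value. The function names are illustrative only.

```python
import numpy as np


def global_ssim(a, b, data_range=255.0):
    """Single-window (global) SSIM between two grayscale frames.

    Simplification: standard SSIM averages over local windows; here the
    statistics are computed once over the whole image.
    """
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def union_box(boxes):
    """Smallest (x0, y0, x1, y1) box covering all per-frame boxes."""
    boxes = np.asarray(boxes)
    return (int(boxes[:, 0].min()), int(boxes[:, 1].min()),
            int(boxes[:, 2].max()), int(boxes[:, 3].max()))


def extract_action_tube(frames, boxes, ssim_threshold=0.95):
    """Crop frames to the union box, then drop near-duplicate frames.

    frames: list of 2D grayscale arrays of equal shape.
    boxes: one (x0, y0, x1, y1) detection per frame (assumed input).
    ssim_threshold: hypothetical value; a frame is kept only if its SSIM
        with the last kept frame falls below it (i.e. visible change).
    Returns the kept crops and their original frame indices.
    """
    x0, y0, x1, y1 = union_box(boxes)
    cropped = [f[y0:y1, x0:x1] for f in frames]
    kept = [0]  # always keep the first frame
    for i in range(1, len(cropped)):
        if global_ssim(cropped[kept[-1]], cropped[i]) < ssim_threshold:
            kept.append(i)
    return [cropped[i] for i in kept], kept


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    first = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)
    changed = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)
    frames = [first, first.copy(), changed]  # frame 1 duplicates frame 0
    boxes = [(4, 4, 20, 20), (6, 2, 24, 18), (5, 5, 22, 22)]
    tube, kept = extract_action_tube(frames, boxes)
    print(kept, tube[0].shape)
```

Cropping before computing SSIM means that background changes outside the subject region do not influence which frames are kept, which matches the stated goal of a higher ROI-to-background ratio.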
Citation: Xu, Z.; Vilaplana, V.; Morros, J.R. Action tube extraction based 3D-CNN for RGB-D action recognition. In: International Workshop on Content-Based Multimedia Indexing. "16th International Conference on Content-Based Multimedia Indexing: 4-6 September, 2018 La Rochelle, France". Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 1-6.