Liu Jinfu; Ding Runwei; Wen Yuhang; Dai Nan; Meng Fanyang; Zhao Shen; Liu Mengyuan

Abstract
Multimodal action recognition methods have achieved high success using the pose and RGB modalities. However, skeleton sequences lack appearance depiction, and RGB images suffer from irrelevant noise, owing to the limitations of each modality. To address this, we introduce the human parsing feature map as a novel modality, since it can selectively retain effective semantic features of the body parts while filtering out most irrelevant noise. We propose a new dual-branch framework called Ensemble Human Parsing and Pose Network (EPP-Net), which is the first to leverage both skeleton and human parsing modalities for action recognition. The first branch, the human pose branch, feeds robust skeletons into a graph convolutional network to model pose features, while the second, the human parsing branch, leverages depictive parsing feature maps to model parsing features via convolutional backbones. The two high-level features are combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of our proposed EPP-Net, which outperforms existing action recognition methods. Our code is available at: https://github.com/liujf69/EPP-Net-Action.
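The late fusion step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each branch outputs one score per action class and that the two outputs are combined as a weighted sum of softmax probabilities; this is one common form of late fusion and may differ from what the EPP-Net repository does. The function names, the `pose_weight` parameter, and the toy scores are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(
    pose_logits: Sequence[float],
    parsing_logits: Sequence[float],
    pose_weight: float = 0.5,  # hypothetical weight, not from the paper
) -> List[float]:
    """Fuse two branches by a weighted sum of their class probabilities."""
    if len(pose_logits) != len(parsing_logits):
        raise ValueError("both branches must score the same set of classes")
    if not 0.0 <= pose_weight <= 1.0:
        raise ValueError("pose_weight must lie in [0, 1]")
    pose_probs = softmax(pose_logits)
    parsing_probs = softmax(parsing_logits)
    return [
        pose_weight * p + (1.0 - pose_weight) * q
        for p, q in zip(pose_probs, parsing_probs)
    ]


def predict(fused_probs: Sequence[float]) -> int:
    """Return the index of the most probable class."""
    return max(range(len(fused_probs)), key=lambda i: fused_probs[i])


# Toy scores for three classes: the pose branch prefers class 0,
# while the parsing branch strongly prefers class 1.
pose_scores = [2.0, 1.0, 0.1]
parsing_scores = [0.1, 3.0, 0.2]
fused = late_fusion(pose_scores, parsing_scores)
predicted_class = predict(fused)
```

Because fusion happens only at the score level, each branch can be trained independently, and the weight can be tuned on a validation set without retraining either network.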
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-recognition-in-videos-on-ntu-rgbd | EPP-Net (Parsing + Pose) | Accuracy (CS): 94.7; Accuracy (CV): 97.7 |
| action-recognition-in-videos-on-ntu-rgbd-120 | EPP-Net (Parsing + Pose) | Accuracy (Cross-Setup): 92.8; Accuracy (Cross-Subject): 91.1 |