8 months ago

Computer Vision

Video Processing

Method/Architecture

Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, Zhengming Ding

Abstract

Transformer architectures have become the model of choice in natural language processing and are now being introduced into computer vision tasks such as image classification, object detection, and semantic segmentation. However, in the field of human pose estimation, convolutional architectures still remain dominant. In this work, we present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos without convolutional architectures involved. Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure to comprehensively model the human joint relations within each frame as well as the temporal correlations across frames, then output an accurate 3D human pose of the center frame. We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments show that PoseFormer achieves state-of-the-art performance on both datasets. Code is available at https://github.com/zczcwh/PoseFormer.
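
The two-stage idea in the abstract (attention over joints within each frame, then attention over frames, then a pose for the center frame) can be illustrated with a toy sketch. This is not the authors' implementation; their code is at the repository linked above. Everything concrete here is an assumption made for illustration: the input is taken to be 2D joint coordinates per frame (the abstract does not specify the input representation), the function name `lift_center_frame` and the tiny embedding size are hypothetical, the weights are random rather than trained, and attention is single-head with no learned projections, positional embeddings, normalization, or MLP blocks.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of token vectors.

    Simplification (assumption): queries, keys and values are the tokens
    themselves, so there are no learned projection matrices.
    """
    dim = len(tokens[0])
    transposed = [list(col) for col in zip(*tokens)]
    scores = matmul(tokens, transposed)  # (n, n) pairwise similarities
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, tokens)  # (n, dim) mixed tokens


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def lift_center_frame(poses_2d, embed_dim=4, seed=0):
    """Toy spatial-then-temporal attention over a clip of 2D poses.

    poses_2d: list of F frames, each a list of J [x, y] joints (assumed input).
    Returns J [x, y, z] triples for the center frame (untrained, so the
    values are meaningless; only the data flow is illustrated).
    """
    rng = random.Random(seed)
    num_frames = len(poses_2d)
    num_joints = len(poses_2d[0])
    w_embed = random_matrix(2, embed_dim, rng)
    w_out = random_matrix(num_joints * embed_dim, num_joints * 3, rng)

    frame_tokens = []
    for frame in poses_2d:
        joint_tokens = matmul(frame, w_embed)        # (J, embed_dim)
        joint_tokens = self_attention(joint_tokens)  # spatial: joints attend to joints
        # Flatten all joints of this frame into one frame-level token.
        frame_tokens.append([v for token in joint_tokens for v in token])

    frame_tokens = self_attention(frame_tokens)      # temporal: frames attend to frames
    center = frame_tokens[num_frames // 2]           # read out the center frame only
    flat = matmul([center], w_out)[0]                # (J * 3,)
    return [flat[i:i + 3] for i in range(0, len(flat), 3)]


# Usage: a random 9-frame clip with 17 joints per frame (both sizes are arbitrary).
rng = random.Random(1)
clip = [[[rng.random(), rng.random()] for _ in range(17)] for _ in range(9)]
pose_3d = lift_center_frame(clip)
print(len(pose_3d), len(pose_3d[0]))  # prints: 17 3
```

The point of the sketch is the token layout: joints are the tokens in the spatial stage, whole frames are the tokens in the temporal stage, and only the center frame's token is decoded into 3D coordinates.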
