8 months ago

Computer Vision

Video Understanding

Multi-Task Learning

Method/Architecture

Shaowei Liu* Hanwen Jiang* Jiarui Xu Sifei Liu Xiaolong Wang

Abstract

Estimating 3D hand and object pose from a single image is an extremely challenging problem: hands and objects are often self-occluded during interactions, and 3D annotations are scarce, as even humans cannot directly label the ground truth from a single image perfectly. To tackle these challenges, we propose a unified framework for estimating 3D hand and object poses with semi-supervised learning. We build a joint learning framework in which a Transformer performs explicit contextual reasoning between hand and object representations. Going beyond limited 3D annotations in a single image, we leverage the spatial-temporal consistency in large-scale hand-object videos as a constraint for generating pseudo labels in semi-supervised learning. Our method not only improves hand pose estimation on a challenging real-world dataset, but also substantially improves the object pose, which has fewer ground truths per instance. By training with large-scale diverse videos, our model also generalizes better across multiple out-of-domain datasets. Project page and code: https://stevenlsw.github.io/Semi-Hand-Object
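
The idea of using temporal consistency as a constraint on pseudo labels can be illustrated with a small sketch. This is not the authors' implementation: the function name `temporally_consistent_frames`, the `max_jump` threshold, and the neighbour-displacement criterion are all assumptions chosen for illustration. It keeps only those video frames whose predicted 3D joints move smoothly relative to both neighbouring frames, so that they could plausibly serve as pseudo labels.

```python
from math import dist


def temporally_consistent_frames(joints_per_frame, max_jump=0.02):
    """Return indices of frames whose predicted joints stay close to both neighbours.

    joints_per_frame: list of frames, each a list of (x, y, z) joint positions.
    max_jump: hypothetical threshold on the mean per-joint displacement
              between adjacent frames (same unit as the joint positions).
    """

    def mean_displacement(a, b):
        # Average Euclidean distance between corresponding joints.
        return sum(dist(p, q) for p, q in zip(a, b)) / len(a)

    kept = []
    # The first and last frames lack one neighbour, so they are skipped.
    for t in range(1, len(joints_per_frame) - 1):
        prev_d = mean_displacement(joints_per_frame[t - 1], joints_per_frame[t])
        next_d = mean_displacement(joints_per_frame[t], joints_per_frame[t + 1])
        if prev_d <= max_jump and next_d <= max_jump:
            kept.append(t)
    return kept


# Toy single-joint track with one implausible jump at frame 3.
frames = [
    [(0.00, 0.0, 0.0)],
    [(0.01, 0.0, 0.0)],
    [(0.02, 0.0, 0.0)],
    [(0.50, 0.0, 0.0)],
    [(0.04, 0.0, 0.0)],
]
kept = temporally_consistent_frames(frames)
```

In this toy track only frame 1 is kept: frame 3 jumps far from its neighbours, and frame 2 is rejected because it is adjacent to that jump. A real pipeline would likely combine such a temporal check with spatial constraints as well, as the abstract suggests.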

