CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
Kim Junho, Chung Hyungjin, Kim Byung-Hoon

Abstract
Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have begun exploring the use of text-based queries, which eliminate the need for support keypoints. However, the optimal use of textual descriptions for keypoints remains an underexplored area. In this work, we introduce CapeLLM, a novel approach that leverages a text-based multimodal large language model (MLLM) for CAPE. Our method employs only the query image and detailed text descriptions as input to estimate category-agnostic keypoints. We conduct extensive experiments to systematically explore the design space of LLM-based CAPE, investigating factors such as the choice of keypoint descriptions, neural network architectures, and training strategies. Thanks to the advanced reasoning capabilities of the pre-trained MLLM, CapeLLM demonstrates superior generalization and robust performance. Our approach sets a new state of the art on the MP-100 benchmark in the challenging 1-shot setting, marking a significant advancement in the field of category-agnostic pose estimation.
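The support-free pipeline the abstract describes — a query image plus per-keypoint text descriptions fed to an MLLM, whose textual answer is decoded into coordinates — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `KeypointQuery`, `build_prompt`, and `parse_keypoints` names, and the normalized-coordinate answer convention are all assumptions, and the MLLM call itself is stubbed.

```python
import re
from dataclasses import dataclass

@dataclass
class KeypointQuery:
    """A keypoint to localize, given only by name and text description."""
    name: str
    description: str

def build_prompt(keypoints: list[KeypointQuery]) -> str:
    """Compose a text query asking the MLLM to localize each keypoint.

    No annotated support image is involved: the only inputs are the query
    image (passed to the model separately) and these descriptions.
    """
    lines = [
        "Locate the following keypoints in the image.",
        "Answer each as `name: (x, y)` with coordinates normalized to [0, 1].",
    ]
    for kp in keypoints:
        lines.append(f"- {kp.name}: {kp.description}")
    return "\n".join(lines)

def parse_keypoints(response: str) -> dict[str, tuple[float, float]]:
    """Extract `name: (x, y)` pairs from the model's free-form text reply."""
    pattern = re.compile(r"(\w[\w ]*):\s*\(([\d.]+),\s*([\d.]+)\)")
    return {
        m.group(1).strip(): (float(m.group(2)), float(m.group(3)))
        for m in pattern.finditer(response)
    }

# Example usage with a stubbed model response (a real MLLM would go here).
queries = [
    KeypointQuery("left eye", "the eye on the left side of the animal's face"),
    KeypointQuery("nose tip", "the frontmost point of the snout"),
]
prompt = build_prompt(queries)
fake_response = "left eye: (0.31, 0.22)\nnose tip: (0.48, 0.35)"
print(parse_keypoints(fake_response))  # {'left eye': (0.31, 0.22), 'nose tip': (0.48, 0.35)}
```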
Benchmarks
| Benchmark | Methodology | Metric | Value |
|---|---|---|---|
| category-agnostic-pose-estimation-on-mp100 | CapeLLM | Mean PCK@0.2 (1-shot) | 92.60 |
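The reported metric is Mean PCK@0.2 (Percentage of Correct Keypoints): a prediction counts as correct when its distance to the ground truth falls within 0.2 of a normalizing length. The sketch below shows the standard computation; the exact normalization used on MP-100 (here assumed to be the longer side of the object bounding box) is a convention assumption, not taken from this page.

```python
import numpy as np

def pck(pred, gt, bbox_size, visible, alpha=0.2):
    """Percentage of Correct Keypoints (PCK@alpha).

    A predicted keypoint is correct when its Euclidean distance to the
    ground truth is at most alpha * bbox_size; alpha = 0.2 gives PCK@0.2.

    pred, gt:   (K, 2) arrays of keypoint coordinates in pixels
    bbox_size:  normalizing length, e.g. the longer side of the object box
    visible:    (K,) boolean mask selecting annotated keypoints
    """
    dist = np.linalg.norm(pred - gt, axis=-1)
    correct = dist[visible] <= alpha * bbox_size
    return correct.mean()

# Toy example: 3 keypoints, the third prediction is far off.
pred = np.array([[10.0, 12.0], [50.0, 48.0], [90.0, 10.0]])
gt   = np.array([[11.0, 11.0], [52.0, 50.0], [60.0, 40.0]])
vis  = np.array([True, True, True])
print(pck(pred, gt, bbox_size=100.0, visible=vis))  # ~0.667
```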