CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
Kim Junho, Chung Hyungjin, Kim Byung-Hoon

Abstract
Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have begun exploring the use of text-based queries, which eliminate the need for support keypoints. However, the optimal use of textual descriptions for keypoints remains an underexplored area. In this work, we introduce CapeLLM, a novel approach that leverages a text-based multimodal large language model (MLLM) for CAPE. Our method employs only the query image and detailed text descriptions as input to estimate category-agnostic keypoints. We conduct extensive experiments to systematically explore the design space of LLM-based CAPE, investigating factors such as the choice of keypoint descriptions, neural network architectures, and training strategies. Thanks to the advanced reasoning capabilities of the pre-trained MLLM, CapeLLM demonstrates superior generalization and robust performance. Our approach sets a new state of the art on the MP-100 benchmark in the challenging 1-shot setting, marking a significant advancement in the field of category-agnostic pose estimation.
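The support-free pipeline the abstract describes — a query image plus per-keypoint text descriptions fed to an MLLM, whose textual answer is decoded into coordinates — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `KeypointQuery`, `build_prompt`, and `parse_keypoints` names, and the normalized-coordinate answer convention are all assumptions, and the MLLM call itself is stubbed.

```python
import re
from dataclasses import dataclass

@dataclass
class KeypointQuery:
    """A keypoint to localize, given only by name and text description."""
    name: str
    description: str

def build_prompt(keypoints: list[KeypointQuery]) -> str:
    """Compose a text query asking the MLLM to localize each keypoint.

    No annotated support image is involved: the only inputs are the query
    image (passed to the model separately) and these descriptions.
    """
    lines = [
        "Locate the following keypoints in the image.",
        "Answer each as `name: (x, y)` with coordinates normalized to [0, 1].",
    ]
    for kp in keypoints:
        lines.append(f"- {kp.name}: {kp.description}")
    return "\n".join(lines)

def parse_keypoints(response: str) -> dict[str, tuple[float, float]]:
    """Extract `name: (x, y)` pairs from the model's free-form text reply."""
    pattern = re.compile(r"(\w[\w ]*):\s*\(([\d.]+),\s*([\d.]+)\)")
    return {
        m.group(1).strip(): (float(m.group(2)), float(m.group(3)))
        for m in pattern.finditer(response)
    }

# Example usage with a stubbed model response (a real MLLM would go here).
queries = [
    KeypointQuery("left eye", "the eye on the left side of the animal's face"),
    KeypointQuery("nose tip", "the frontmost point of the snout"),
]
prompt = build_prompt(queries)
fake_response = "left eye: (0.31, 0.22)\nnose tip: (0.48, 0.35)"
print(parse_keypoints(fake_response))  # {'left eye': (0.31, 0.22), 'nose tip': (0.48, 0.35)}
```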
Benchmarks
| Benchmark | Methodology | Metric | Value |
|---|---|---|---|
| category-agnostic-pose-estimation-on-mp100 | CapeLLM | Mean PCK@0.2 (1-shot) | 92.60 |
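The reported metric is Mean PCK@0.2 (Percentage of Correct Keypoints): a prediction counts as correct when its distance to the ground truth falls within 0.2 of a normalizing length. The sketch below shows the standard computation; the exact normalization used on MP-100 (here assumed to be the longer side of the object bounding box) is a convention assumption, not taken from this page.

```python
import numpy as np

def pck(pred, gt, bbox_size, visible, alpha=0.2):
    """Percentage of Correct Keypoints (PCK@alpha).

    A predicted keypoint is correct when its Euclidean distance to the
    ground truth is at most alpha * bbox_size; alpha = 0.2 gives PCK@0.2.

    pred, gt:   (K, 2) arrays of keypoint coordinates in pixels
    bbox_size:  normalizing length, e.g. the longer side of the object box
    visible:    (K,) boolean mask selecting annotated keypoints
    """
    dist = np.linalg.norm(pred - gt, axis=-1)
    correct = dist[visible] <= alpha * bbox_size
    return correct.mean()

# Toy example: 3 keypoints, the third prediction is far off.
pred = np.array([[10.0, 12.0], [50.0, 48.0], [90.0, 10.0]])
gt   = np.array([[11.0, 11.0], [52.0, 50.0], [60.0, 40.0]])
vis  = np.array([True, True, True])
print(pck(pred, gt, bbox_size=100.0, visible=vis))  # ~0.667
```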