4 months ago

A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs

Vaishnav Mohit ; Tammet Tanel

Abstract

A fundamental challenge in artificial intelligence involves understanding the cognitive mechanisms underlying visual reasoning in sophisticated models like Vision-Language Models (VLMs). How do these models integrate visual perception with abstract thought, especially when reasoning across multiple images or requiring fine-grained compositional understanding? Drawing inspiration from cognitive science, this paper introduces a structured evaluation framework using diverse visual reasoning tasks, Bongard Problems (BPs) and Winoground, to dissect the perception-reasoning interface in VLMs. We propose three distinct evaluation paradigms, mirroring human problem-solving strategies: Direct Visual Rule Learning (DVRL; holistic processing), Deductive Rule Learning (DRL; rule extraction and application), and Componential Analysis (CA; analytical decomposition via task-agnostic textual descriptions). These paradigms systematically vary cognitive load and probe processing stages. Notably, CA enables multi-image reasoning evaluation even for single-image architectures and isolates reasoning from perception by operating on textual descriptions. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance on challenging benchmarks including Bongard-OpenWorld, Bongard-HOI, and Winoground. Ablation studies confirm reasoning improves significantly when perceptual challenges are mitigated, revealing a critical perception bottleneck. Our framework provides a valuable diagnostic tool and suggests that decoupling perception (via rich, task-agnostic description) from reasoning is a promising direction for robust and general visual intelligence.
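The two-stage structure of Componential Analysis described in the abstract (describe each image independently, then reason over the text alone) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the prompt layout, and the toy captions are all assumptions, and `toy_describe` and `toy_reason` are stand-ins for a real captioning VLM and a real language model.

```python
from typing import Callable, List, Sequence, Set

# Hypothetical interfaces: any captioning VLM or text-only LLM could be wrapped to fit.
Describer = Callable[[str], str]   # image identifier -> textual description
Reasoner = Callable[[str], str]    # text prompt -> answer text


def build_prompt(pos: Sequence[str], neg: Sequence[str], query: str) -> str:
    """Assemble a text-only prompt from independently generated descriptions."""
    lines: List[str] = ["Positive examples:"]
    lines += [f"- {d}" for d in pos]
    lines.append("Negative examples:")
    lines += [f"- {d}" for d in neg]
    lines += ["Query:", f"- {query}", "Answer 'positive' or 'negative'."]
    return "\n".join(lines)


def componential_analysis(positives: Sequence[str], negatives: Sequence[str],
                          query: str, describe: Describer, reason: Reasoner) -> str:
    # Stage 1 (perception): each image is described on its own, without
    # reference to the task, so a single-image model is sufficient here.
    pos_desc = [describe(img) for img in positives]
    neg_desc = [describe(img) for img in negatives]
    query_desc = describe(query)
    # Stage 2 (reasoning): only text reaches the reasoner, never pixels.
    answer = reason(build_prompt(pos_desc, neg_desc, query_desc)).strip().lower()
    return "positive" if answer.startswith("positive") else "negative"


# Toy stand-ins so the sketch runs without any model (made-up captions).
TOY_CAPTIONS = {
    "p1.png": "a dog running on grass",
    "p2.png": "a dog sleeping on sofa",
    "n1.png": "a cat sitting on grass",
    "n2.png": "a bird on a sofa",
    "q_dog.png": "a dog swimming in lake",
    "q_cat.png": "a cat on a lake",
}


def toy_describe(image: str) -> str:
    return TOY_CAPTIONS[image]


def toy_reason(prompt: str) -> str:
    """Toy rule finder: words shared by all positives and absent from all negatives."""
    sections = {"Positive examples:": [], "Negative examples:": [], "Query:": []}
    current = None
    for line in prompt.splitlines():
        if line in sections:
            current = line
        elif line.startswith("- ") and current is not None:
            sections[current].append(set(line[2:].split()))
    pos, neg, qry = (sections[k] for k in sections)
    rule: Set[str] = set.intersection(*pos) - set().union(*neg)
    return "positive" if rule and rule <= qry[0] else "negative"


support_pos, support_neg = ["p1.png", "p2.png"], ["n1.png", "n2.png"]
print(componential_analysis(support_pos, support_neg, "q_dog.png", toy_describe, toy_reason))
print(componential_analysis(support_pos, support_neg, "q_cat.png", toy_describe, toy_reason))
```

Because the reasoner only ever sees text, the describer and the reasoner can be swapped independently, which is what makes it possible to attribute errors to the perception stage or to the reasoning stage.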

Benchmarks

Benchmark | Methodology | Metrics
visual-reasoning-on-bongard-openworld | Componential analysis - gpt-4o | 2-Class Accuracy: 92.8
visual-reasoning-on-bongard-openworld | componential analysis - gemini-2.0 | 2-Class Accuracy: 93.6
