A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs
Mohit Vaishnav, Tanel Tammet
Abstract
A fundamental challenge in artificial intelligence involves understanding the cognitive mechanisms underlying visual reasoning in sophisticated models like Vision-Language Models (VLMs). How do these models integrate visual perception with abstract thought, especially when reasoning across multiple images or requiring fine-grained compositional understanding? Drawing inspiration from cognitive science, this paper introduces a structured evaluation framework using diverse visual reasoning tasks, Bongard Problems (BPs) and Winoground, to dissect the perception-reasoning interface in VLMs. We propose three distinct evaluation paradigms, mirroring human problem-solving strategies: Direct Visual Rule Learning (DVRL; holistic processing), Deductive Rule Learning (DRL; rule extraction and application), and Componential Analysis (CA; analytical decomposition via task-agnostic textual descriptions). These paradigms systematically vary cognitive load and probe processing stages. Notably, CA enables multi-image reasoning evaluation even for single-image architectures and isolates reasoning from perception by operating on textual descriptions. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance on challenging benchmarks including Bongard-OpenWorld, Bongard-HOI, and Winoground. Ablation studies confirm that reasoning improves significantly when perceptual challenges are mitigated, revealing a critical perception bottleneck. Our framework provides a valuable diagnostic tool and suggests that decoupling perception (via rich, task-agnostic description) from reasoning is a promising direction for robust and general visual intelligence.
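To make the Componential Analysis paradigm concrete, the sketch below shows one plausible two-stage pipeline under the split the abstract describes: a VLM produces a task-agnostic description per image (perception), and a text-only language model induces the rule from those descriptions (reasoning). The function name `componential_analysis`, the `caption_model` and `llm` callables, and the prompt wording are illustrative assumptions, not the paper's exact protocol.

```python
from typing import Callable, List

def componential_analysis(
    caption_model: Callable[[str], str],  # hypothetical: image path -> rich description
    llm: Callable[[str], str],            # hypothetical: text prompt -> completion
    positive_images: List[str],
    negative_images: List[str],
    query_image: str,
) -> str:
    # Stage 1 (perception): describe every image independently and without
    # knowledge of the task, so the descriptions stay task-agnostic.
    pos_desc = [caption_model(p) for p in positive_images]
    neg_desc = [caption_model(n) for n in negative_images]
    query_desc = caption_model(query_image)

    # Stage 2 (reasoning): only text crosses this interface, which isolates
    # reasoning from perception and lets a single-image (or text-only) model
    # handle a multi-image Bongard-style problem as one prompt.
    prompt = (
        "Descriptions of positive examples:\n"
        + "\n".join(f"- {d}" for d in pos_desc)
        + "\n\nDescriptions of negative examples:\n"
        + "\n".join(f"- {d}" for d in neg_desc)
        + "\n\nInduce the rule separating the two sets, then decide "
        + f"whether this query satisfies it:\n- {query_desc}"
    )
    return llm(prompt)
```

Because the reasoning stage sees only text, swapping the captioner for ground-truth descriptions gives the ablation the abstract mentions: if accuracy rises when descriptions are perfect, the bottleneck is perception rather than reasoning.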