
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez


Abstract

Recent advancements in foundation models have enhanced AI systems' capabilities in autonomous tool usage and reasoning. However, their ability in location- or map-based reasoning - which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics - has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. MapEval features three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and compositional reasoning, all of which state-of-the-art foundation models find challenging. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries, MapEval evaluates foundation models' ability to handle spatial relationships, map infographics, travel planning, and navigation challenges. Using MapEval, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive performance overall. However, substantial performance gaps emerged, particularly in MapEval, where agents with Claude-3.5-Sonnet outperformed GPT-4o and Gemini-1.5-Pro by 16% and 21%, respectively, and the gaps became even more amplified when compared to open-source LLMs. Our detailed analyses provide insights into the strengths and weaknesses of current models, though all models still fall short of human performance by more than 20% on average, struggling with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval's critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.

Benchmarks

Benchmark                                      Methodology                  Metrics
question-answering-on-mapeval-api-1            GPT-3.5-Turbo (Chameleon)    Accuracy (%): 49.33
question-answering-on-mapeval-api-1            Claude-3.5-Sonnet (ReAct)    Accuracy (%): 64.00
question-answering-on-mapeval-textual          Claude-3.5-Sonnet            Accuracy (%): 66.33
visual-question-answering-on-mapeval-visual    Claude-3.5-Sonnet            Accuracy (%): 61.65
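The Accuracy (%) metric above can be sketched as a simple exact-match score over MapEval's multiple-choice questions. This is a minimal illustration, not the paper's evaluation harness; the option-letter format and variable names here are assumptions for the example.

```python
# Hedged sketch: exact-match accuracy over multiple-choice predictions,
# reported as a percentage as in the benchmark table above.
def accuracy_percent(predictions, gold):
    """Percentage of questions where the predicted option matches the gold option."""
    assert len(predictions) == len(gold), "one prediction per question"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# Toy example (illustrative answers, not real MapEval data): 3 of 4 correct.
preds = ["B", "C", "A", "D"]
golds = ["B", "C", "A", "A"]
print(f"Accuracy (%): {accuracy_percent(preds, golds):.2f}")  # Accuracy (%): 75.00
```

In practice the leaderboard rows above would each correspond to one model's predictions scored this way over the full question set of the respective MapEval split.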
