
DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

Yinger Zhang Shutong Jiang Renhao Li Jianhong Tu Yang Su Lianghao Deng Xudong Guo Chenxu Lv Junyang Lin

Abstract

While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.

One-sentence Summary

The Qwen Team at Alibaba Group introduces DeepPlanning, a benchmark for evaluating LLMs on real-world long-horizon planning tasks requiring proactive information gathering and global constraint optimization, revealing current models’ limitations and highlighting the need for explicit reasoning and parallel tool use.

Key Contributions

  • We introduce DeepPlanning, a new benchmark featuring multi-day travel and multi-product shopping tasks that require proactive information gathering, local constraint handling, and global optimization under time and budget limits, addressing gaps in existing LLM agent evaluations.
  • Evaluations on frontier agentic LLMs reveal consistent struggles with long-horizon planning, demonstrating that explicit reasoning patterns and parallel tool use are critical for improving effectiveness-efficiency trade-offs in complex, constrained scenarios.
  • Through error analysis, we identify key failure modes—such as missed tool calls, undetected implicit constraints, and lack of global consistency checking—and outline actionable directions to enhance agentic LLMs for extended planning tasks.

Introduction

The authors leverage real-world multi-day travel and multi-product shopping scenarios to evaluate LLM agents’ long-horizon planning abilities, where global constraints like time budgets and cumulative costs must be optimized across interdependent steps. Prior benchmarks often focus on local, step-level reasoning or oversimplified global constraints, failing to capture the need for proactive information gathering and fine-grained constraint handling typical in practical settings. Their main contribution is DeepPlanning, a new benchmark that enforces verifiable, multi-level constraints and exposes critical gaps in current agents (particularly their inability to reliably backtrack, maintain global consistency, or use tools in parallel), while highlighting explicit reasoning and tool orchestration as key to improving effectiveness and efficiency.

Dataset

The authors use DeepPlanning, a benchmark composed of two complex, real-world planning domains: Travel Planning and Shopping Planning. Each task runs in an isolated sandbox backed by domain-specific databases and accessed via custom Python toolkits, ensuring reproducibility.

  • Dataset Composition and Sources:

    • Travel Planning: 120 tasks (available in Chinese and English), sourced from real-world data via public APIs (Fliggy, Amap, Web Search) covering transportation, hotels, attractions, and dining in Chinese tourist cities.
    • Shopping Planning: 120 English-only tasks, built from synthesized product data including price, stock, ratings, and promotions, designed for controlled complexity.
  • Key Subset Details:

    • Travel Planning: Uses 7 sub-databases and 9 APIs (e.g., query_hotel_info). Tasks require multi-day itineraries with minute-level scheduling, itemized costs, and budget summaries. Constraints involve time, location, and budget coupling.
    • Shopping Planning: Uses 3 sub-databases and 15 APIs (e.g., search_products). Tasks require assembling a JSON shopping cart that satisfies user preferences, sizing, shipping, coupons, and budget. Challenges involve combinatorial optimization across the cart.
  • Data Construction Pipeline:

    1. Database & Toolbox Design: Domain-specific databases and hierarchical APIs are built to encourage multi-step agent interaction.
    2. Layered Task Generation: Tasks are reverse-engineered from solutions:
      • Base skeletons (e.g., travel route or product theme) are generated.
      • Personalized constraints (e.g., “book flight after 7 AM”) and environmental constraints (e.g., closed attractions or coupon stacking) are injected.
      • Database entries are adjusted to ensure exactly one optimal solution exists.
      • LLMs convert constraints into natural language queries.
    3. Manual Quality Control: Human experts validate fluency, clarity, and solvability of each task.
  • Processing and Use in Model:

    • Tasks are evaluated based on three core agent capabilities: proactive information acquisition, local constrained reasoning, and global constrained optimization.
    • Output formats are strictly structured: Travel Planning returns minute-level itineraries with cost breakdowns; Shopping Planning returns JSON carts with item details and coupon usage (a toy tool-call and cart sketch follows this list).
    • Evaluation uses 21 scoring criteria for commonsense reasoning in Travel Planning, covering logical consistency, constraint satisfaction, and budget accuracy.
    • No task truncation is applied; all tasks are full-scope, end-to-end planning challenges. Metadata is embedded in tool schemas and database fields (Tables 3–6), which define API parameters and data structures.
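
To make the interaction pattern concrete, here is a minimal Python sketch of one shopping step. The tool name search_products comes from the paper, but its signature, the returned fields, and the exact cart schema below are illustrative assumptions, not the benchmark's actual interface.

```python
import json

def search_products(keyword: str, max_price: float | None = None) -> list[dict]:
    """Stand-in for the benchmark's search_products tool (hypothetical signature)."""
    # A real run would query the sandboxed product database; values here are dummies.
    return [
        {"product_id": "P1001", "brand": "Acme", "price": 89.0, "rating": 4.7, "stock": 12},
        {"product_id": "P2042", "brand": "Borealis", "price": 74.5, "rating": 4.2, "stock": 0},
    ]

# The agent filters tool output against local constraints (rating, stock) ...
candidates = [p for p in search_products("running shoes", max_price=100.0)
              if p["rating"] >= 4.5 and p["stock"] > 0]

# ... and emits a structured JSON cart, mirroring the strict output format.
cart = {
    "items": [{"product_id": p["product_id"], "quantity": 1, "price": p["price"]}
              for p in candidates],
    "coupons_used": [],
    "total_cost": sum(p["price"] for p in candidates),
}
print(json.dumps(cart, indent=2))
```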

The authors do not train on this dataset; it serves as an evaluation benchmark for agent planning performance under real-world constraints.

Method

The authors leverage a layered task generation framework to decompose complex user queries into structured planning workflows for both travel and shopping domains. The architecture is organized into three sequential stages: Base Skeleton Generation, Personalized Constraint Injection, and Environment Constraint Injection. Each stage progressively refines the plan by integrating user preferences, domain-specific rules, and real-time environmental data.
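
As a rough illustration of the three stages, the sketch below threads a task specification through skeleton generation and the two constraint-injection passes. The function names and data shapes are invented for illustration; the authors' actual pipeline is LLM-driven rather than hand-coded.

```python
def generate_base_skeleton(spec: dict) -> dict:
    # Stage 1: a route/theme skeleton, e.g. destination and day count.
    return {"destination": spec["destination"], "days": spec["days"], "constraints": []}

def inject_personalized_constraints(task: dict, prefs: list[str]) -> dict:
    # Stage 2: user-level constraints such as "book flight after 7 AM".
    task["constraints"] += [{"level": "personal", "text": p} for p in prefs]
    return task

def inject_environment_constraints(task: dict, env: list[str]) -> dict:
    # Stage 3: environment-level constraints such as a closed attraction.
    task["constraints"] += [{"level": "environment", "text": e} for e in env]
    return task

task = inject_environment_constraints(
    inject_personalized_constraints(
        generate_base_skeleton({"destination": "Hangzhou", "days": 3}),
        ["hotel rated 4 stars or above"]),
    ["West Lake boat tours closed on Mondays"])
```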

In the travel planning module, the system begins by parsing the user’s high-level request—such as destination, dates, and budget—and generates an initial itinerary skeleton. This is followed by injecting personalized constraints, such as hotel star ratings or dining preferences, which are mapped to specific tool calls like query_hotel_info or recommend_around_restaurants. The final stage incorporates environmental constraints, such as flight availability or attraction opening hours, retrieved from external databases. The agent trajectory shown in the diagram illustrates how tool invocations are chained to populate the plan, ensuring all entities—flights, hotels, attractions—are sourced exclusively from tool outputs without fabrication.
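
A minimal sketch of such a chained trajectory follows. The tool names query_hotel_info and recommend_around_restaurants appear in the paper, but the parameters and return fields below are hypothetical.

```python
def query_hotel_info(city: str, min_stars: int) -> list[dict]:
    """Toy stub for the hotel-query tool (hypothetical signature and fields)."""
    return [{"hotel_id": "H77", "name": "Lakeside Inn", "stars": 4,
             "price_per_night": 520.0, "location": (30.25, 120.15)}]

def recommend_around_restaurants(location: tuple[float, float]) -> list[dict]:
    """Toy stub for the nearby-restaurant tool."""
    return [{"name": "Grandma's Kitchen", "cuisine": "Hangzhou", "avg_cost": 80.0}]

# Chaining: the hotel returned by one call anchors the next call, so every
# entity in the plan is grounded in a tool output rather than fabricated.
hotel = query_hotel_info("Hangzhou", min_stars=4)[0]
dinner = recommend_around_restaurants(hotel["location"])[0]
day_plan = {"hotel": hotel["name"], "dinner": dinner["name"],
            "day_cost": hotel["price_per_night"] + dinner["avg_cost"]}
```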

For shopping planning, the framework similarly initiates with a base skeleton derived from product categories and budget. Personalized constraints—such as brand, size, or rating thresholds—are then injected, triggering targeted product searches via tools like search_products. The environment constraint phase integrates user profile data, such as available coupons, and evaluates combinations against pricing logic. The system enforces strict coupon application rules: cross-store coupons apply to the entire cart total, while same-brand coupons are scoped to individual brand subtotals. Stacking is permitted only if each coupon’s threshold is met post-discount, and the final selection is the combination yielding the absolute minimum price within budget.
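
The pricing logic lends itself to a small worked sketch. The subset enumeration and the application order below (brand-scoped coupons first, then cross-store coupons against the discounted total) reflect one reading of the stated rules, not the benchmark's reference implementation.

```python
from itertools import combinations

def cart_price(items: list[dict], coupons: list[dict]) -> float | None:
    """Apply a coupon combination; return the total, or None if any threshold fails."""
    # Per-brand subtotals, since same-brand coupons are scoped to them.
    subtotals: dict[str, float] = {}
    for it in items:
        subtotals[it["brand"]] = subtotals.get(it["brand"], 0.0) + it["price"]
    for c in (c for c in coupons if c["scope"] == "brand"):
        if subtotals.get(c["brand"], 0.0) < c["threshold"]:
            return None  # threshold unmet: this stacking is invalid
        subtotals[c["brand"]] -= c["discount"]
    total = sum(subtotals.values())
    # Cross-store coupons apply to the whole (already discounted) cart total.
    for c in (c for c in coupons if c["scope"] == "cross_store"):
        if total < c["threshold"]:
            return None
        total -= c["discount"]
    return total

def best_price(items: list[dict], coupons: list[dict], budget: float) -> float | None:
    """Enumerate every coupon subset and keep the cheapest valid total within budget."""
    totals = (cart_price(items, list(combo))
              for r in range(len(coupons) + 1)
              for combo in combinations(coupons, r))
    in_budget = [t for t in totals if t is not None and t <= budget]
    return min(in_budget, default=None)
```

Brute-force enumeration is exponential in the number of coupons, tractable here only because carts carry a handful of them; the point is the combinatorial coupling the agent must reason about.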

The evaluation module in both domains assesses plan quality against predefined metrics. In travel, it checks for commonsense compliance, cost accuracy, and itinerary structure; in shopping, it validates product match accuracy against ground truth. The system mandates that all final decisions be based on the cart’s state, verified via getCart_info, ensuring that coupon usage and item selection are consistent with the final transactional context. All outputs are strictly derived from tool results, with no creative interpretation or data fabrication permitted.
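
A sketch of that final-state check might look as follows; getCart_info is named in the paper, while the stub's return shape and the comparison logic are assumptions.

```python
def getCart_info() -> dict:
    """Stand-in for the sandbox's cart-state tool (hypothetical return shape)."""
    return {"items": [{"product_id": "P1001", "quantity": 1}],
            "coupons_used": ["CROSS10"], "total_cost": 79.0}

def plan_matches_cart(plan: dict) -> bool:
    """Accept a submitted plan only if it mirrors the transactional cart state."""
    cart = getCart_info()
    key = lambda x: x["product_id"]
    return (sorted(plan["items"], key=key) == sorted(cart["items"], key=key)
            and plan["coupons_used"] == cart["coupons_used"]
            and abs(plan["total_cost"] - cart["total_cost"]) < 1e-6)
```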

Experiment

  • DeepPlanning evaluates LLM agents on complex, real-world planning tasks in travel and shopping domains using code-based, rule-driven metrics to ensure objectivity.
  • Agents show fragility in end-to-end execution: high constraint-level scores often mask critical global failures, leading to low case-level accuracy even in top models (see the metric sketch after this list).
  • Internal reasoning significantly improves performance and efficiency, reducing redundant tool use and enabling better cost-performance trade-offs.
  • Performance declines with task complexity—longer itineraries or multi-constraint shopping tasks expose agents’ difficulty with global optimization and constraint propagation.
  • Error analysis reveals three dominant failure modes: insufficient or misused information retrieval, local constraint violations, and systemic breakdowns in integrating interdependent decisions.
  • Tool usage correlates with performance: more calls improve outcomes, but reasoning models achieve higher scores with fewer interactions, indicating smarter planning.
  • Domain specialization is evident—some models excel in one domain (e.g., Gemini in shopping) but underperform in another, highlighting task-specific strengths.
  • The benchmark is currently limited to the travel and shopping domains; its synthesized queries may not reflect real user behavior, and multi-turn interactions remain unexplored.
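
To illustrate why high constraint-level scores can coexist with near-zero end-to-end success, as noted above, here is a toy scoring sketch; the paper's exact metric definitions may differ.

```python
def constraint_level_score(results: list[list[bool]]) -> float:
    """Fraction of individual constraints satisfied, pooled over all tasks."""
    flat = [ok for task in results for ok in task]
    return sum(flat) / len(flat)

def case_level_accuracy(results: list[list[bool]]) -> float:
    """Fraction of tasks where *every* constraint is satisfied end to end."""
    return sum(all(task) for task in results) / len(results)

# One missed constraint per task keeps the per-constraint score high (0.9)
# while case-level accuracy collapses to 0.0: the fragility noted above.
results = [[True] * 9 + [False] for _ in range(10)]
print(constraint_level_score(results))  # 0.9
print(case_level_accuracy(results))     # 0.0
```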

The authors use a code-based evaluation framework to assess LLM agents on complex travel and shopping planning tasks, revealing that even top models struggle with end-to-end coherence despite high component-level scores. Results show that models with internal reasoning capabilities consistently outperform non-reasoning counterparts, particularly in maintaining global plan integrity under multi-constraint conditions. Performance gains from reasoning come with trade-offs in interaction cost, as more deliberate planning reduces redundant tool use but may increase turn count or require more structured execution.

