SSRB: Direct Natural Language Querying to Massive Heterogeneous Semi-Structured Data

Abstract

Searching over semi-structured data with natural language (NL) queries has attracted sustained attention, as it lets broader audiences access information easily. As more applications, such as LLM agents and RAG systems, emerge to search and interact with semi-structured data, two major challenges have become evident: (1) the increasing diversity of domains and schema variations, which makes domain-customized solutions prohibitively costly; and (2) the growing complexity of NL queries, which combine exact field-matching conditions with fuzzy semantic requirements and often involve multiple fields and implicit reasoning. These challenges make formal-language querying and keyword-based search insufficient. In this work, we explore neural retrievers as a unified non-formal querying solution that directly indexes semi-structured collections and understands NL queries. We employ LLM-based automatic evaluation and build a large-scale semi-structured retrieval benchmark (SSRB) using LLM generation and filtering. It contains 14M semi-structured objects from 99 different schemas across 6 domains, along with 8,485 test queries that combine both exact and fuzzy matching conditions. Our systematic evaluation of popular retrievers shows that current state-of-the-art models can achieve acceptable performance, yet they still lack a precise understanding of matching constraints. In-domain training of dense retrievers, however, improves performance significantly. We believe that SSRB can serve as a valuable resource for future research in this area, and we hope it inspires further exploration of semi-structured retrieval with complex queries.
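To make the idea of querying heterogeneous objects without a formal query language more concrete, the following is a minimal sketch, not the method from the paper. It assumes each object is flattened into "field: value" text and ranked against the NL query by cosine similarity. The `embed` function is a toy bag-of-words stand-in for a neural retriever, and all names (`serialize`, `embed`, `search`) and the sample objects are hypothetical.

```python
import math
import re
from collections import Counter


def serialize(obj: dict) -> str:
    """Flatten a (possibly nested) JSON-like object into 'path: value' text."""
    parts = []

    def walk(prefix, value):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(f"{prefix}.{key}" if prefix else key, child)
        elif isinstance(value, list):
            for child in value:
                walk(prefix, child)
        else:
            parts.append(f"{prefix}: {value}")

    walk("", obj)
    return " | ".join(parts)


def embed(text: str) -> Counter:
    """Toy stand-in for a neural encoder: a bag of lowercase tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, objects: list, k: int = 1) -> list:
    """Rank objects of any schema against one NL query; return the top k."""
    q = embed(query)
    scored = [(cosine(q, embed(serialize(o))), o) for o in objects]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [o for _, o in scored[:k]]


# Hypothetical objects with three unrelated schemas.
objects = [
    {"title": "Wireless noise cancelling headphones", "brand": "Acme", "price": 199},
    {"name": "Trail running shoes", "specs": {"color": "red", "size": 42}},
    {"job": "Data engineer", "location": "Berlin", "skills": ["python", "sql"]},
]

top = search("red shoes for running on trails", objects)
```

Because every object is reduced to text before indexing, no per-schema query logic is needed. A token-overlap encoder like this one cannot enforce exact field constraints, which echoes the weakness in matching constraints that the abstract reports for current retrievers.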
