SemBench: A Benchmark for Semantic Query Processing Engines

🎉 Accepted to VLDB 2026! 🎉

What is SemBench?

SemBench is a systematic benchmark designed to evaluate and compare recent LLM-powered multimodal data systems in realistic settings. It includes well-defined use cases that integrate structured data (tables) with unstructured modalities (text, images, audio), enabling assessment of a system's ability to process complex semantic queries against ground truth. These systems are expected to balance multiple objectives, namely accuracy, cost, and efficiency, especially when handling large-scale datasets. SemBench emphasizes the trade-offs among these dimensions rather than performance in isolation.


Scalability Experiments

Scalability across scenarios. We report average execution time, monetary cost, result quality, and memory usage over a fixed set of queries that are supported by all systems. Error bars for execution time, cost, and result quality show standard deviations over five runs. For memory usage, error bars show the minimum and maximum values. Cross markers indicate a per-query timeout of 1 hour. After a timeout, the system is treated as failed and is not evaluated at larger scale factors. No timeout is applied in the memory experiments, so that complete memory usage can be reported at the largest scales.

Scalability and Memory Usage across Scenarios


Performance Comparison across All Scenarios

This analysis examines how different systems perform across all scenarios when allowing for tolerance in measurements. The tolerance analysis considers that systems with very similar results (e.g., 1.01s vs 1.04s latency) may not represent meaningful distinctions in real-world applications.

Performance Comparison across All Scenarios - Tolerance Analysis


The charts above show tolerance analysis for three metrics: Execution Time, Monetary Cost, and Quality. Each colored line represents a different system's win count as tolerance increases from strict (0%) to more lenient thresholds.

Understanding the Charts: The solid lines with markers show actual wins at each tolerance level, while the horizontal dashed lines represent the upper bound (maximum possible wins for each system). The vertical dotted lines indicate the convergence tolerance where each system reaches its upper bound for that metric. For execution time and monetary cost, tolerance is relative (percentage-based), while for quality metrics, tolerance is absolute (fixed amount).

As tolerance increases, more systems become winners on queries where performance differences are small. The win count for each system is therefore non-decreasing as tolerance grows: a system never loses a win, and it gains one whenever its result comes within the tolerance of the best result for that query.
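The win-counting logic described above can be sketched in a few lines. This is an illustrative reimplementation, not SemBench's actual evaluation code; it assumes a lower-is-better metric (such as latency or cost) and uses relative tolerance, as the charts do for execution time and monetary cost.

```python
def tolerance_wins(results, tolerance, relative=True):
    """Count, per system, how many queries it wins under a tolerance.

    results: dict mapping query -> {system: metric_value}, lower is better.
    tolerance: a fraction when relative=True (time/cost), an absolute
    amount when relative=False (quality-style metrics).
    """
    wins = {}
    for per_system in results.values():
        best = min(per_system.values())
        # Any system within tolerance of the best result counts as a winner.
        cutoff = best * (1 + tolerance) if relative else best + tolerance
        for system, value in per_system.items():
            if value <= cutoff:
                wins[system] = wins.get(system, 0) + 1
    return wins
```

For example, with latencies of 1.01s vs 1.04s on one query, both systems win once the relative tolerance exceeds about 3%, which is exactly why win counts converge toward their upper bounds as tolerance grows.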

Frequently Asked Questions

What types of queries does SemBench target?
SemBench targets complex semantic queries that integrate structured data (tables) with unstructured modalities, including text, images, and audio. Each scenario includes at least 10 diverse queries that test various aspects of multimodal data processing, ranging from simple retrieval to complex analytical tasks that require reasoning across multiple data types and sources. Our queries cover various semantic operators, including semantic filtering, semantic joining, semantic classification, semantic scoring, and semantic aggregation. Moreover, SemBench can be easily extended to support more scenarios with more semantic operators.
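To illustrate what a semantic operator looks like, here is a minimal sketch of a semantic filter, where a model call acts as the row-level predicate. The function and prompt format are hypothetical stand-ins, not SemBench's API; the judge is passed in so any LLM client can be plugged in.

```python
def semantic_filter(rows, condition, judge):
    """Keep rows for which the judge (e.g. an LLM call) answers 'yes'.

    rows: list of dict-like records; condition: natural-language predicate;
    judge: callable taking a prompt string and returning 'yes' or 'no'.
    """
    kept = []
    for row in rows:
        prompt = f"Does this row satisfy: {condition}? Row: {row}"
        if judge(prompt) == "yes":
            kept.append(row)
    return kept
```

Semantic joins, classification, scoring, and aggregation follow the same pattern, differing mainly in whether the model compares pairs of rows, assigns labels, produces numeric scores, or summarizes groups.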
Can we contribute to this benchmark?
Yes, we welcome contributions and can discuss building SemBench-2.0 together! The modular architecture supports easy extension:
  • Add New Use Cases: Implement scenario-specific runner and evaluator
  • Support New Systems: Create system-specific runner inheriting from GenericRunner
  • Custom Metrics: Extend evaluation framework with domain-specific quality measures
  • Additional Queries: Add query definitions in natural language and SQL formats
We welcome contributions for new scenarios, systems, metrics, and more semantic queries!
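As a rough sketch of the runner extension point described above: a new system is integrated by subclassing the generic runner and translating benchmark queries into that system's own API. The class and method names below are assumptions for illustration, not SemBench's exact interface.

```python
class GenericRunner:
    """Base class for system runners (names are illustrative)."""

    def run_query(self, query):
        raise NotImplementedError

class MySystemRunner(GenericRunner):
    """Hypothetical runner for a new system under test."""

    def __init__(self, client):
        self.client = client  # handle to the system being benchmarked

    def run_query(self, query):
        # Translate the benchmark query into the system's own API,
        # execute it, and return results in the benchmark's format.
        return self.client.execute(query)
```

A scenario-specific evaluator would then score whatever `run_query` returns against the ground truth for that scenario.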
Can we upload the results of our system?
Of course! We understand that every system is under rapid development, which is why we maintain an online leaderboard. We encourage you to submit your system's results and participate in the benchmark. Please reach out to discuss how to contribute your results to the leaderboard.