WIRELESSBENCH
A Tolerance-Aware Benchmark for LLM Agents on Wireless Network Intelligence
Jingwen Tong, Fang Liu, Linkai Xv, Shiliang Lu, Kangqi Li, Yiqian Zhang, Yijie Song, Zeyang Xue, Jun Zhang
Abstract
LLM agents are emerging as a key enabler for autonomous wireless network management. Reliably deploying them, however, demands benchmarks that reflect real engineering risk.
We present WIRELESSBENCH, the first tolerance-aware, tool-integrated benchmark for LLM-based wireless agents. WIRELESSBENCH is organized as a three-tier cognitive hierarchy: domain knowledge reasoning (WCHW, 1,392 items), intent-driven resource allocation (WCNS, 1,000 items), and proactive multi-step decisions under mobility (WCMSA, 1,000 items). It rests on three design principles: tolerance-aware scoring with catastrophic-error detection; tool-necessary tasks requiring a 3GPP-compliant ray-tracing query for channel quality; and Chain-of-Thought (CoT)-traceable items, each shipping with a complete CoT trajectory.
Our numerical results show that the direct-prompting model (GPT-4o) scores 68%, trailing a tool-integrated agent (84.64%) by 16.64 pp; 23% of errors are catastrophic failures invisible to exact-match metrics.
More importantly, the hierarchy decomposes errors into four actionable diagnostic categories that flat evaluation cannot reveal. Code and data: https://github.com/jwentong/WirelessBench.
Benchmark
Benchmark Comparison
| General benchmarks | WirelessBench |
|---|---|
| Evaluate isolated capabilities with flat scoring | Distinguish benign approximation from catastrophic engineering errors via tolerance-aware scoring |
| Treat all errors uniformly without catastrophic-error detection | Explicitly detect unit/magnitude confusions and penalize them to zero credit |
| Ignore cascaded multi-step decision chains | Tool-necessary workflow with per-item CoT: position prediction → (ray-traced) CQI → slice selection → bandwidth allocation → QoS verification |
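The cascaded workflow in the last row can be sketched end-to-end. Everything below is illustrative: the function names, the linear position predictor, the CQI threshold, the toy spectral-efficiency model, and the 100 MHz budget are assumptions for the sketch, not the released pipeline.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    position: tuple          # predicted (x, y) position in meters
    cqi: int                 # channel quality indicator
    slice_type: str          # "eMBB" or "URLLC"
    bandwidth_mhz: float
    qos_ok: bool

def wcmsa_pipeline(trajectory, query_cqi, demand_mbps):
    """Illustrative sketch of the five-step WCMSA chain:
    position prediction -> (ray-traced) CQI -> slice selection
    -> bandwidth allocation -> QoS verification."""
    # 1. Predict the next position by linear extrapolation of the trajectory.
    (x1, y1), (x2, y2) = trajectory[-2], trajectory[-1]
    pos = (2 * x2 - x1, 2 * y2 - y1)
    # 2. Query channel quality at the predicted position (the tool call).
    cqi = query_cqi(pos)
    # 3. Select a slice: low CQI favors the reliability-oriented URLLC slice.
    slice_type = "URLLC" if cqi < 7 else "eMBB"
    # 4. Allocate bandwidth from demand and a crude CQI-dependent efficiency.
    efficiency_bps_per_hz = max(cqi, 1) * 0.5   # toy spectral-efficiency model
    bandwidth_mhz = demand_mbps / efficiency_bps_per_hz
    # 5. Verify the allocation against a hypothetical 100 MHz budget.
    qos_ok = bandwidth_mhz <= 100.0
    return Decision(pos, cqi, slice_type, bandwidth_mhz, qos_ok)
```

The point of the chain is that an error in any early step (e.g. a wrong position) propagates into every later decision, which is what the per-item CoT traces make diagnosable.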
Dataset Statistics
| Task | Val Samples | Test Samples | Data Source | Avg Tokens | Reasoning Chain | Answer Type |
|---|---|---|---|---|---|---|
| WCHW | 348 | 1044 | Textbook | 85 | 4.2 | Numerical, Formula |
| WCNS | 250 | 750 | 3GPP Protocol + Papers | 156 | 3.0 | Structured |
| WCMSA | 250 | 750 | 3GPP Protocol + Papers | 203 | 5.5 | Structured |
| Total | 848 | 2544 | - | 128 | 4.1 | - |
(Figures: sample distribution, reasoning-chain length, and average token count across tasks.)
- WCHW: wireless communication homework, evaluating basic computational capabilities
- WCNS: network slicing resource allocation, simulating 5G scenarios
- WCMSA: mobile service assurance with proactive resource allocation
Task Design
Basic Computational Capability Assessment
Evaluates the agent's ability to perform fundamental wireless-communication calculations, covering three areas: signal processing, information theory, and system design.
Network Slicing Resource Allocation
Simulates real-time resource allocation scenarios for enhanced Mobile Broadband (eMBB) and Ultra-Reliable Low-Latency Communications (URLLC) slices in 5G networks.
Proactive Resource Allocation
Requires agents to perform proactive resource allocation based on mobility prediction, representing the most complex multi-step reasoning task.
Task Execution Workflow
Technical Features
- Generates realistic mobility trajectories following 3GPP protocol specifications
- Strictly follows standardized task execution workflows for data generation
- Achieves dataset expansion through data augmentation modules that introduce diverse mobility patterns and channel dynamics
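As a rough sketch of the trajectory-generation step, a random-waypoint-style generator is shown below. The benchmark's actual 3GPP-compliant generator is not reproduced here; the area size, speed range, and the mobility model itself are illustrative assumptions.

```python
import math
import random

def random_waypoint_trajectory(n_steps, area=(500.0, 500.0),
                               speed_range=(1.0, 15.0), seed=0):
    """Toy mobility-trace generator (random-waypoint style): a user walks
    toward a random target at a random speed, then picks a new target."""
    rng = random.Random(seed)
    x, y = rng.uniform(0, area[0]), rng.uniform(0, area[1])
    tx, ty = rng.uniform(0, area[0]), rng.uniform(0, area[1])
    speed = rng.uniform(*speed_range)        # meters per step
    traj = [(x, y)]
    for _ in range(n_steps - 1):
        dx, dy = tx - x, ty - y
        dist = math.hypot(dx, dy)
        if dist < speed:                      # waypoint reached: snap, re-target
            x, y = tx, ty
            tx, ty = rng.uniform(0, area[0]), rng.uniform(0, area[1])
            speed = rng.uniform(*speed_range)
        else:                                 # advance one step toward the target
            x, y = x + speed * dx / dist, y + speed * dy / dist
        traj.append((x, y))
    return traj
```

Varying the seed, speed range, and waypoint density is one simple way an augmentation module can introduce the diverse mobility patterns mentioned above.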
Data Cleaning and Validation
We design a rigorous data-cleaning process grounded in psychometric theory to ensure the quality and reliability of the dataset.
Step 1: Format Standardization
Unify data formats to ensure consistency:
- Standardize unit representations (kHz / MHz / GHz)
- Normalize scientific notation (10⁻⁶ vs 1e-6)
- Standardize answer formats
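A minimal sketch of the unit-standardization step, assuming a simple regex-based normalizer. The actual cleaning rules are not published; the pattern and the `normalize_frequency` name are hypothetical.

```python
import re

# Scale factors for the frequency units the cleaning step standardizes.
_UNIT_SCALE = {"hz": 1.0, "khz": 1e3, "mhz": 1e6, "ghz": 1e9}

def normalize_frequency(text):
    """Parse strings like '3.5 GHz', '200kHz', or '2e6 Hz' into a value in Hz."""
    m = re.fullmatch(r"\s*([0-9.eE+-]+)\s*(Hz|kHz|MHz|GHz)\s*", text)
    if m is None:
        raise ValueError(f"unrecognized frequency: {text!r}")
    return float(m.group(1)) * _UNIT_SCALE[m.group(2).lower()]
```

Collapsing every surface form to a single canonical value is what lets the scorer compare answers numerically instead of string-matching.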
Step 2: Deduplication
Use semantic similarity algorithms to identify and remove near-duplicate questions:
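As one concrete stand-in for this step: the paper uses semantic similarity, but plain token-set Jaccard similarity is substituted below so the sketch stays self-contained; the 0.9 threshold is likewise an assumption.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two questions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def deduplicate(questions, threshold=0.9):
    """Greedy near-duplicate removal: keep a question only if it is not
    too similar to any question already kept."""
    kept = []
    for q in questions:
        if all(jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept
```

Swapping `jaccard` for cosine similarity over sentence embeddings recovers the semantic variant without changing the greedy filtering logic.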
Step 3: Quality Assessment
Employ multi-dimensional metrics to identify low-quality questions:
- Item-Total Correlation
- Mokken Scale Analysis
- Inter-item Consistency Metrics
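The first metric can be sketched as a corrected item-total correlation: the Pearson correlation between each item's scores and the rest-of-test totals. The exact psychometric pipeline is not reproduced here, so treat this as an illustrative computation.

```python
from statistics import mean

def item_total_correlation(scores):
    """Corrected item-total correlation per item. `scores` is a list of
    per-respondent score vectors (one row per respondent). Items with a
    low or negative correlation are flagged as low quality."""
    def pearson(xs, ys):
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy) if sx and sy else 0.0
    n_items = len(scores[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]  # total excluding item j
        result.append(pearson(item, rest))
    return result
```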
Step 4: AI-Assisted Review
Employ advanced large language models as "reviewers" to diagnose and correct flagged issues, ensuring dataset accuracy and completeness.
Tolerance-Aware Scoring Mechanism
This work adopts a tolerance-aware scoring mechanism that distinguishes acceptable engineering errors from catastrophic unit or formula mistakes, better aligning with real-world engineering application scenarios.
Scoring bands:
- |error| ≤ 1%
- 1% < |error| ≤ 5%
- 5% < |error| ≤ 10%
- |error| > 10%, or unit mismatch (dB/dBm): catastrophic, zero credit
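A scorer matching these bands might look as follows. The band boundaries come from the benchmark, but the partial-credit values (1.0 / 0.7 / 0.4) are illustrative placeholders, since only the zero-credit catastrophic band is pinned down in the text.

```python
def tolerance_score(pred, gold, pred_unit=None, gold_unit=None):
    """Tolerance-aware scorer over the relative error |pred - gold| / |gold|.
    Partial-credit values are placeholders, not the benchmark's exact ones."""
    if pred_unit != gold_unit:          # unit mismatch (e.g. dB vs dBm): catastrophic
        return 0.0
    rel_err = abs(pred - gold) / abs(gold) if gold != 0 else abs(pred)
    if rel_err <= 0.01:
        return 1.0
    if rel_err <= 0.05:
        return 0.7
    if rel_err <= 0.10:
        return 0.4
    return 0.0                          # >10% relative error: catastrophic
```

Unlike exact match, this scorer never rewards a dB/dBm confusion, and unlike a flat relative-error metric it still distinguishes a 3% approximation from a 20% blunder.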
Performance Comparison on WirelessBench
| Method | WCHW | WCNS | WCMSA | Average |
|---|---|---|---|---|
| Qwen-Turbo-Latest | 58.34 | 62.13 | 66.43 | 62.30 |
| GPT-4o | 60.32 | 72.45 | 71.22 | 68.00 |
| CoT-SC (k=5) | 60.01 | 74.82 | 73.56 | 69.46 |
| MedPrompt | 61.22 | 73.18 | 72.89 | 69.10 |
| ADAS | 53.13 | 68.42 | 65.41 | 62.32 |
| AFlow | 69.92 | 76.12 | 73.90 | 73.29 |
| WIRELESSBENCH-REF (Ours) | 81.02 | 86.18 | 86.72 | 84.64 |
All values are in percentage (%)
Performance Comparison
(Figures: average performance ranking and task performance heatmap.)
Conclusion
We presented WIRELESSBENCH, an extensible benchmark for evaluating the operational reliability of LLM agents on wireless network intelligence.
Reliability emerged as the principal deployment bottleneck during our development of wireless AI agents. Built around a three-tier cognitive hierarchy (WCHW, WCNS, WCMSA) that spans static domain knowledge, intent-to-allocation mapping, and proactive mobility-aware service assurance, WIRELESSBENCH operationalizes reliability through three design commitments: tolerance-aware scoring, tool-necessary tasks, and per-item CoT.
Evaluating seven methods reveals that the strongest frontier model achieves only 68.00% under direct prompting, and that formula misapplication (31%), reasoning-path breaks (28%), and unit confusion (23%) dominate the failure landscape (the last of which is nearly eliminated by tool-integrated verification).
While current limitations in environment coverage, data augmentation, and baseline diversity remain open, we hope WIRELESSBENCH provides a rigorous and community-accessible foundation for advancing the reliability of AI agents in mobile-network operations.
Accuracy Comparison
- Reliability Gap: direct prompting (GPT-4o) reaches only 68.00% accuracy on WirelessBench
- Performance Ceiling: WIRELESSBENCH-REF reaches 84.64% accuracy
- Tool Value: domain-specific tool integration accounts for the 16.64 pp gap between the two
Citation
Paper Information
Title: WirelessBench: A Tolerance-Aware Benchmark for LLM Agents on Wireless Network Intelligence
Venue: arXiv preprint
arXiv: 2603.21251v1
BibTeX Citation:
@article{tong2026wirelessbench,
title={WirelessBench: A Tolerance-Aware LLM Agent Benchmark for Wireless Network Intelligence},
author={Jingwen Tong and Fang Liu and Linkai Xv and Shiliang Lu and Kangqi Li and Yiqian Zhang and Yijie Song and Zeyang Xue and Jun Zhang},
journal={arXiv preprint arXiv:2603.21251v1},
year={2026}
}