WIRELESSBENCH

A Tolerance-Aware Benchmark for LLM Agents on Wireless Network Intelligence

Jingwen Tong, Fang Liu, Linkai Xv, Shiliang Lu, Kangqi Li, Yiqian Zhang, Yijie Song, Zeyang Xue, Jun Zhang

Abstract

LLM agents are emerging as a key enabler for autonomous wireless network management. Reliably deploying them, however, demands benchmarks that reflect real engineering risk.

We present WIRELESSBENCH, the first tolerance-aware, tool-integrated benchmark for LLM-based wireless agents. WIRELESSBENCH is organized as a three-tier cognitive hierarchy: domain-knowledge reasoning (WCHW, 1,392 items), intent-driven resource allocation (WCNS, 1,000 items), and proactive multi-step decisions under mobility (WCMSA, 1,000 items). It is built on three design principles: tolerance-aware scoring with catastrophic-error detection; tool-necessary tasks that require a 3GPP-compliant ray-tracing query for channel quality; and Chain-of-Thought (CoT)-traceable items, each shipping with a complete CoT trajectory.

Our numerical results show that the strongest direct-prompting model (GPT-4o) scores 68.00%, trailing our tool-integrated agent (84.64%) by 16.64 percentage points, and that 23% of its errors are catastrophic failures invisible to exact-match metrics.

More importantly, the hierarchy decomposes errors into four actionable diagnostic categories that flat evaluation cannot reveal. Code and data: https://github.com/jwentong/WirelessBench.

Benchmark

Benchmark Comparison

General benchmarks vs. WirelessBench:
  • Flat scoring of isolated capabilities → tolerance-aware scoring that distinguishes benign approximation from catastrophic engineering errors
  • Uniform treatment of all errors → explicit detection of unit/magnitude confusions, penalized to zero credit
  • No cascaded multi-step decision chains → tool-necessary workflow with per-item CoT: position prediction → (ray-traced) CQI → slice selection → bandwidth allocation → QoS verification

Dataset Statistics

Task    Val Samples   Test Samples   Data Source              Avg Tokens   Reasoning Chain   Answer Type
WCHW    348           1,044          Textbook                 85           4.2               Numerical, Formula
WCNS    250           750            3GPP Protocol + Papers   156          3.0               Structured
WCMSA   250           750            3GPP Protocol + Papers   203          5.5               Structured
Total   848           2,544          -                        128          4.1               -
Figure: per-task sample distribution, reasoning-chain length, and average token count (WCHW: 85, WCNS: 156, WCMSA: 203 tokens).
WCHW

Wireless communication homework, evaluating basic computational capabilities

WCNS

Network slicing resource allocation, simulating 5G scenarios

WCMSA

Mobile service assurance, proactive resource allocation

Task Design

1. Basic Computational Capability Assessment (WCHW)

Evaluates the agent's capabilities in wireless communication fundamental calculations, covering three major areas: signal processing, information theory, and system design.
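As an illustration of this task class, a WCHW-style item might ask for the Shannon capacity of a channel. The sketch below is illustrative only, not an actual benchmark item:

```python
import math

def shannon_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon capacity C = B * log2(1 + SNR), with SNR converted from dB to linear."""
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)

# A 10 MHz channel at 20 dB SNR supports roughly 66.6 Mbit/s.
print(f"{shannon_capacity_bps(10e6, 20.0) / 1e6:.2f} Mbit/s")
```

Under tolerance-aware scoring, a rounded answer (say, 66.5 Mbit/s) would still earn substantial credit, while a dB/linear confusion would not.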

2. Network Slicing Resource Allocation (WCNS)

Simulates real-time resource allocation scenarios for enhanced Mobile Broadband (eMBB) and Ultra-Reliable Low-Latency Communications (URLLC) slices in 5G networks.

3. Proactive Resource Allocation (WCMSA)

Requires agents to perform proactive resource allocation based on mobility prediction, representing the most complex multi-step reasoning task.

Task Execution Workflow

Position Prediction → Path Loss Calculation → CQI Estimation → Slice Selection → Bandwidth Allocation → QoS Verification
Technical Features
  • Generates realistic mobility trajectories following 3GPP protocol specifications
  • Strictly follows standardized task execution workflows for data generation
  • Achieves dataset expansion through data augmentation modules that introduce diverse mobility patterns and channel dynamics
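The six-step workflow above can be sketched end to end as follows. Every helper here is a deliberately simplified assumption, not the benchmark's actual tooling: free-space path loss stands in for 3GPP-compliant ray tracing, the SNR-to-CQI thresholds, link-budget constant, and 180 kHz allocation granularity are all made up for illustration:

```python
import math

def predict_position(pos, velocity, dt):
    """Step 1: linear extrapolation; real items use 3GPP mobility trajectories."""
    return (pos[0] + velocity[0] * dt, pos[1] + velocity[1] * dt)

def path_loss_db(distance_m, freq_hz=3.5e9):
    """Step 2: free-space path loss, FSPL = 20*log10(d) + 20*log10(f) - 147.55."""
    return 20 * math.log10(distance_m) + 20 * math.log10(freq_hz) - 147.55

def cqi_from_snr(snr_db):
    """Step 3: coarse SNR-to-CQI mapping with assumed thresholds."""
    return max(1, min(15, int((snr_db + 6) / 2)))

def select_slice(latency_req_ms):
    """Step 4: URLLC for tight latency budgets, eMBB otherwise."""
    return "URLLC" if latency_req_ms <= 10 else "eMBB"

def allocate_bandwidth_hz(rate_req_bps, snr_db):
    """Step 5: invert Shannon capacity, B = R / log2(1 + SNR)."""
    return rate_req_bps / math.log2(1 + 10 ** (snr_db / 10))

def verify_qos(allocated_hz, required_hz):
    """Step 6: final check that the allocation covers the requirement."""
    return allocated_hz >= required_hz

# Chain the steps for one hypothetical user moving east at 10 m/s.
pos = predict_position((0.0, 0.0), (10.0, 0.0), dt=1.0)
dist = math.hypot(pos[0] - 100.0, pos[1])            # assumed gNB at (100, 0)
snr = 120.0 - path_loss_db(dist)                     # toy link budget
cqi = cqi_from_snr(snr)
slice_name = select_slice(latency_req_ms=5)
required_hz = allocate_bandwidth_hz(rate_req_bps=1e6, snr_db=snr)
allocated_hz = math.ceil(required_hz / 180e3) * 180e3  # round up to 180 kHz blocks
print(slice_name, cqi, verify_qos(allocated_hz, required_hz))
```

The point of the chain is that an error in an early step (say, path loss) propagates into every later decision, which is what makes WCMSA the hardest tier.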

Data Cleaning and Validation

We design a rigorous data cleaning process grounded in psychometric theory to ensure the quality and reliability of the dataset.

Step 1: Format Standardization

Unify data formats to ensure consistency:

  • Standardize unit representations (kHz / MHz / GHz)
  • Normalize scientific notation (10⁻⁶ vs. 1e-6)
  • Standardize answer formats
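A minimal sketch of this normalization step, assuming a small fixed unit table and caret-style scientific notation (both are illustrative choices, not the pipeline's actual conventions):

```python
import re

# Assumed canonical unit table: everything is mapped to Hz.
_UNIT_SCALE = {"hz": 1.0, "khz": 1e3, "mhz": 1e6, "ghz": 1e9}

def normalize_frequency(text: str) -> float:
    """Parse strings like '2.4 GHz' or '2400000 kHz' into a canonical value in Hz."""
    m = re.fullmatch(r"\s*([0-9.eE+-]+)\s*([kKmMgG]?[hH][zZ])\s*", text)
    if not m:
        raise ValueError(f"unrecognized frequency: {text!r}")
    value, unit = float(m.group(1)), m.group(2).lower()
    return value * _UNIT_SCALE[unit]

def normalize_scientific(text: str) -> float:
    """Map caret-style notation ('3×10^8', '10^-6') and plain '1e-6' to a float."""
    text = text.replace("×10^", "e").replace("10^", "1e")  # order matters
    return float(text)
```

After this pass, '2.4 GHz' and '2400000 kHz' compare equal, so tolerance scoring operates on values rather than surface strings.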
Step 2: Deduplication

Use semantic similarity algorithms to identify and remove near-duplicate questions:

Cosine similarity based on Term Frequency-Inverse Document Frequency (TF-IDF)
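The TF-IDF cosine check can be sketched in pure Python as below; the tokenization, idf smoothing, and any deduplication threshold are assumptions, since the pipeline does not specify them here:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) with smoothed idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log((1 + n) / (1 + df[t]))
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["compute the shannon capacity of the channel",
        "compute the shannon capacity of this channel",
        "derive the bit error rate of bpsk over awgn"]
vecs = tfidf_vectors(docs)
# Near-duplicate pair scores much higher than an unrelated pair.
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Pairs whose similarity exceeds a chosen threshold are flagged and only one copy is kept.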
Step 3: Quality Assessment

Employ multi-dimensional metrics to identify low-quality questions:

  • Item-Total Correlation
  • Mokken Scale Analysis
  • Inter-item Consistency Metrics
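As one example of these metrics, the corrected item-total correlation compares each item's scores against the total score with that item removed; the binary score matrix below is made up for illustration:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def item_total_correlations(scores):
    """scores[r][i] = score of respondent r on item i; low values flag bad items."""
    n_items = len(scores[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]  # total minus the item itself
        result.append(pearson(item, rest))
    return result

# Toy 5-respondent x 3-item score matrix (illustrative only).
scores = [[1, 1, 0], [1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
print([round(r, 3) for r in item_total_correlations(scores)])
```

Items with low or negative correlation against the rest of the scale are candidates for removal or AI-assisted review in the next step.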
Step 4: AI-Assisted Review

Employ advanced large language models as "reviewers" to diagnose and correct flagged issues, ensuring dataset accuracy and completeness.

Tolerance-Aware Scoring Mechanism

This work adopts a tolerance-aware scoring mechanism that distinguishes acceptable engineering errors from catastrophic unit or formula mistakes, better aligning with real-world engineering application scenarios.

Score   Condition                                  Interpretation
1.0     |error| ≤ 1%                               Exact match
0.9     1% < |error| ≤ 5%                          Engineering acceptable
0.7     5% < |error| ≤ 10%                         Approximate
0.0     |error| > 10%, or unit mismatch (dB/dBm)   Unacceptable
Design Philosophy: This scoring mechanism can distinguish between rounding errors in numerical calculations (e.g., 8.4 dB vs 8.5 dB) and catastrophic unit confusion errors (e.g., dB vs dBm), more accurately reflecting the reliability of agents in real-world engineering tasks.
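The scoring rule above can be sketched directly; the unit check here is a simplified stand-in for the benchmark's full catastrophic-error detection:

```python
def tolerance_score(predicted, truth, pred_unit=None, truth_unit=None):
    """Tolerance-aware score; a unit mismatch (e.g., dB vs dBm) is catastrophic."""
    if pred_unit != truth_unit:
        return 0.0  # zero credit regardless of numeric closeness
    if truth == 0:
        return 1.0 if predicted == 0 else 0.0
    rel_error = abs(predicted - truth) / abs(truth)
    if rel_error <= 0.01:
        return 1.0   # exact match
    if rel_error <= 0.05:
        return 0.9   # engineering acceptable
    if rel_error <= 0.10:
        return 0.7   # approximate
    return 0.0       # unacceptable

print(tolerance_score(8.4, 8.5, "dB", "dB"))   # rounding error -> 0.9
print(tolerance_score(8.5, 8.5, "dB", "dBm"))  # unit confusion -> 0.0
```

The 8.4 dB answer lands in the 1–5% relative-error band and keeps most of its credit, while the dBm answer is numerically identical yet scores zero.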

Performance Comparison on WirelessBench

Method                     WCHW    WCNS    WCMSA   Average
Qwen-Turbo-Latest          58.34   62.13   66.43   62.30
GPT-4o                     60.32   72.45   71.22   68.00
CoT-SC (k=5)               60.01   74.82   73.56   69.46
MedPrompt                  61.22   73.18   72.89   69.10
ADAS                       53.13   68.42   65.41   62.32
AFlow                      69.92   76.12   73.90   73.29
WIRELESSBENCH-REF (Ours)   81.02   86.18   86.72   84.64

All values are percentages (%).

Figure: average-performance ranking and per-task heatmap for the seven methods (values as in the table above; color scale 50%–95%).

Conclusion

We presented WIRELESSBENCH, an extensible benchmark for evaluating the operational reliability of LLM agents on wireless network intelligence.

Reliability, rather than raw capability, emerged as the principal deployment bottleneck during our development of wireless AI agents. Built around a three-tier cognitive hierarchy (WCHW, WCNS, WCMSA) that spans static domain knowledge, intent-to-allocation mapping, and proactive mobility-aware service assurance, WIRELESSBENCH operationalizes reliability through three design commitments: tolerance-aware scoring, tool-necessary tasks, and per-item CoT trajectories.

Evaluating seven methods reveals that the strongest frontier model achieves only 68.00% under direct prompting, and that formula misapplication (31%), reasoning-path breaks (28%), and unit confusion (23%) dominate the failure landscape (the last of which is nearly eliminated by tool-integrated verification).

While current limitations in environment coverage, data augmentation, and baseline diversity remain open, we hope WIRELESSBENCH provides a rigorous and community-accessible foundation for advancing the reliability of AI agents in mobile-network operations.

Accuracy Comparison
  • Reliability gap: direct prompting (GPT-4o) achieves 68.00% accuracy on WirelessBench
  • Performance ceiling: WIRELESSBENCH-REF achieves 84.64% accuracy
  • Tool value: domain-specific tools improve reliability by 16.64 percentage points

Citation

Paper Information

Title: WirelessBench: A Tolerance-Aware Benchmark for LLM Agents on Wireless Network Intelligence
Venue: arXiv preprint
arXiv: 2603.21251v1

BibTeX Citation:
@article{tong2026wirelessbench,
  title={WirelessBench: A Tolerance-Aware LLM Agent Benchmark for Wireless Network Intelligence},
  author={Jingwen Tong and Fang Liu and Linkai Xv and Shiliang Lu and Kangqi Li and Yiqian Zhang and Yijie Song and Zeyang Xue and Jun Zhang},
  journal={arXiv preprint arXiv:2603.21251v1},
  year={2026}
}