WIRELESSBENCH
A Tolerance-Aware Benchmark for LLM Agents on Wireless Network Intelligence
Jingwen Tong, Fang Liu, Linkai Xv, Shiliang Lu, Kangqi Li, Yiqian Zhang, Yijie Song, Zeyang Xue, Jun Zhang
Abstract
LLM agents are emerging as a key enabler for autonomous wireless network management. Reliably deploying them, however, demands benchmarks that reflect real engineering risk.
We present WIRELESSBENCH, the first tolerance-aware, tool-integrated benchmark for LLM-based wireless agents. WIRELESSBENCH is organized as a three-tier cognitive hierarchy: domain knowledge reasoning (WCHW, 1,392 items), intent-driven resource allocation (WCNS, 1,000 items), and proactive multi-step decisions under mobility (WCMSA, 1,000 items). It rests on three design principles: tolerance-aware scoring with catastrophic-error detection; tool-necessary tasks requiring a 3GPP-compliant ray-tracing query for channel quality; and Chain-of-Thought (CoT)-traceable items, each shipping with a complete CoT trajectory.
Our numerical results show that the direct-prompting model (GPT-4o) scores 68%, trailing a tool-integrated agent (84.64%) by 16.64 pp; 23% of errors are catastrophic failures invisible to exact-match metrics.
More importantly, the hierarchy decomposes errors into four actionable diagnostic categories that flat evaluation cannot reveal. Code and data: https://github.com/jwentong/WirelessBench.
Benchmark
Benchmark Comparison
| General benchmarks | WirelessBench |
|---|---|
| Evaluate isolated capabilities with flat scoring | Distinguish benign approximation from catastrophic engineering errors via tolerance-aware scoring |
| Treat all errors uniformly without catastrophic-error detection | Explicitly detect unit/magnitude confusions and penalize them to zero credit |
| Ignore cascaded multi-step decision chains | Tool-necessary workflow with per-item CoT: position prediction → (ray-traced) CQI → slice selection → bandwidth allocation → QoS verification |
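The cascaded workflow in the last row can be sketched end-to-end. Everything below is illustrative: the function names, the linear position predictor, the CQI threshold, the toy spectral-efficiency model, and the 100 MHz budget are assumptions for the sketch, not the released pipeline.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    position: tuple          # predicted (x, y) position in meters
    cqi: int                 # channel quality indicator
    slice_type: str          # "eMBB" or "URLLC"
    bandwidth_mhz: float
    qos_ok: bool

def wcmsa_pipeline(trajectory, query_cqi, demand_mbps):
    """Illustrative sketch of the five-step WCMSA chain:
    position prediction -> (ray-traced) CQI -> slice selection
    -> bandwidth allocation -> QoS verification."""
    # 1. Predict the next position by linear extrapolation of the trajectory.
    (x1, y1), (x2, y2) = trajectory[-2], trajectory[-1]
    pos = (2 * x2 - x1, 2 * y2 - y1)
    # 2. Query channel quality at the predicted position (the tool call).
    cqi = query_cqi(pos)
    # 3. Select a slice: low CQI favors the reliability-oriented URLLC slice.
    slice_type = "URLLC" if cqi < 7 else "eMBB"
    # 4. Allocate bandwidth from demand and a crude CQI-dependent efficiency.
    efficiency_bps_per_hz = max(cqi, 1) * 0.5   # toy spectral-efficiency model
    bandwidth_mhz = demand_mbps / efficiency_bps_per_hz
    # 5. Verify the allocation against a hypothetical 100 MHz budget.
    qos_ok = bandwidth_mhz <= 100.0
    return Decision(pos, cqi, slice_type, bandwidth_mhz, qos_ok)
```

The point of the chain is that an error in any early step (e.g. a wrong position) propagates into every later decision, which is what the per-item CoT traces make diagnosable.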
Dataset Statistics
| Task | Val Samples | Test Samples | Data Source | Avg Tokens | Reasoning Chain | Answer Type |
|---|---|---|---|---|---|---|
| WCHW | 348 | 1044 | Textbook | 85 | 4.2 | Numerical, Formula |
| WCNS | 250 | 750 | 3GPP Protocol + Papers | 156 | 3.0 | Structured |
| WCMSA | 250 | 750 | 3GPP Protocol + Papers | 203 | 5.5 | Structured |
| Total | 848 | 2544 | - | 128 | 4.1 | - |
(Figures: sample distribution, reasoning-chain length, and average token count across tasks.)
- WCHW: wireless communication homework, evaluating basic computational capabilities
- WCNS: network slicing resource allocation, simulating 5G scenarios
- WCMSA: mobile service assurance with proactive resource allocation
Task Design
Basic Computational Capability Assessment
Evaluates the agent's ability to perform fundamental wireless-communication calculations, covering three areas: signal processing, information theory, and system design.
Network Slicing Resource Allocation
Simulates real-time resource allocation scenarios for enhanced Mobile Broadband (eMBB) and Ultra-Reliable Low-Latency Communications (URLLC) slices in 5G networks.
Proactive Resource Allocation
Requires agents to perform proactive resource allocation based on mobility prediction, representing the most complex multi-step reasoning task.
Task Execution Workflow
Technical Features
- Generates realistic mobility trajectories following 3GPP protocol specifications
- Strictly follows standardized task execution workflows for data generation
- Achieves dataset expansion through data augmentation modules that introduce diverse mobility patterns and channel dynamics
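As a rough sketch of the trajectory-generation step, a random-waypoint-style generator is shown below. The benchmark's actual 3GPP-compliant generator is not reproduced here; the area size, speed range, and the mobility model itself are illustrative assumptions.

```python
import math
import random

def random_waypoint_trajectory(n_steps, area=(500.0, 500.0),
                               speed_range=(1.0, 15.0), seed=0):
    """Toy mobility-trace generator (random-waypoint style): a user walks
    toward a random target at a random speed, then picks a new target."""
    rng = random.Random(seed)
    x, y = rng.uniform(0, area[0]), rng.uniform(0, area[1])
    tx, ty = rng.uniform(0, area[0]), rng.uniform(0, area[1])
    speed = rng.uniform(*speed_range)        # meters per step
    traj = [(x, y)]
    for _ in range(n_steps - 1):
        dx, dy = tx - x, ty - y
        dist = math.hypot(dx, dy)
        if dist < speed:                      # waypoint reached: snap, re-target
            x, y = tx, ty
            tx, ty = rng.uniform(0, area[0]), rng.uniform(0, area[1])
            speed = rng.uniform(*speed_range)
        else:                                 # advance one step toward the target
            x, y = x + speed * dx / dist, y + speed * dy / dist
        traj.append((x, y))
    return traj
```

Varying the seed, speed range, and waypoint density is one simple way an augmentation module can introduce the diverse mobility patterns mentioned above.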
Data Cleaning and Validation
We design a rigorous data-cleaning process grounded in psychometric theory to ensure the quality and reliability of the dataset.
Step 1: Format Standardization
Unify data formats to ensure consistency:
- Standardize unit representations (kHz / MHz / GHz)
- Normalize scientific notation (10⁻⁶ vs 1e-6)
- Standardize answer formats
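A minimal sketch of the unit-standardization step, assuming a simple regex-based normalizer. The actual cleaning rules are not published; the pattern and the `normalize_frequency` name are hypothetical.

```python
import re

# Scale factors for the frequency units the cleaning step standardizes.
_UNIT_SCALE = {"hz": 1.0, "khz": 1e3, "mhz": 1e6, "ghz": 1e9}

def normalize_frequency(text):
    """Parse strings like '3.5 GHz', '200kHz', or '2e6 Hz' into a value in Hz."""
    m = re.fullmatch(r"\s*([0-9.eE+-]+)\s*(Hz|kHz|MHz|GHz)\s*", text)
    if m is None:
        raise ValueError(f"unrecognized frequency: {text!r}")
    return float(m.group(1)) * _UNIT_SCALE[m.group(2).lower()]
```

Collapsing every surface form to a single canonical value is what lets the scorer compare answers numerically instead of string-matching.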
Step 2: Deduplication
Use semantic similarity algorithms to identify and remove near-duplicate questions:
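As one concrete stand-in for this step: the paper uses semantic similarity, but plain token-set Jaccard similarity is substituted below so the sketch stays self-contained; the 0.9 threshold is likewise an assumption.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two questions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def deduplicate(questions, threshold=0.9):
    """Greedy near-duplicate removal: keep a question only if it is not
    too similar to any question already kept."""
    kept = []
    for q in questions:
        if all(jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept
```

Swapping `jaccard` for cosine similarity over sentence embeddings recovers the semantic variant without changing the greedy filtering logic.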
Step 3: Quality Assessment
Employ multi-dimensional metrics to identify low-quality questions:
- Item-Total Correlation
- Mokken Scale Analysis
- Inter-item Consistency Metrics
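The first metric can be sketched as a corrected item-total correlation: the Pearson correlation between each item's scores and the rest-of-test totals. The exact psychometric pipeline is not reproduced here, so treat this as an illustrative computation.

```python
from statistics import mean

def item_total_correlation(scores):
    """Corrected item-total correlation per item. `scores` is a list of
    per-respondent score vectors (one row per respondent). Items with a
    low or negative correlation are flagged as low quality."""
    def pearson(xs, ys):
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy) if sx and sy else 0.0
    n_items = len(scores[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]  # total excluding item j
        result.append(pearson(item, rest))
    return result
```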
Step 4: AI-Assisted Review
Employ advanced large language models as "reviewers" to diagnose and correct flagged issues, ensuring dataset accuracy and completeness.
Tolerance-Aware Scoring Mechanism
This work adopts a tolerance-aware scoring mechanism that distinguishes acceptable engineering errors from catastrophic unit or formula mistakes, better aligning with real-world engineering application scenarios.
Scoring bands:
- |error| ≤ 1%
- 1% < |error| ≤ 5%
- 5% < |error| ≤ 10%
- |error| > 10%, or unit mismatch (dB/dBm): catastrophic, zero credit
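A scorer matching these bands might look as follows. The band boundaries come from the benchmark, but the partial-credit values (1.0 / 0.7 / 0.4) are illustrative placeholders, since only the zero-credit catastrophic band is pinned down in the text.

```python
def tolerance_score(pred, gold, pred_unit=None, gold_unit=None):
    """Tolerance-aware scorer over the relative error |pred - gold| / |gold|.
    Partial-credit values are placeholders, not the benchmark's exact ones."""
    if pred_unit != gold_unit:          # unit mismatch (e.g. dB vs dBm): catastrophic
        return 0.0
    rel_err = abs(pred - gold) / abs(gold) if gold != 0 else abs(pred)
    if rel_err <= 0.01:
        return 1.0
    if rel_err <= 0.05:
        return 0.7
    if rel_err <= 0.10:
        return 0.4
    return 0.0                          # >10% relative error: catastrophic
```

Unlike exact match, this scorer never rewards a dB/dBm confusion, and unlike a flat relative-error metric it still distinguishes a 3% approximation from a 20% blunder.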
Performance Comparison on WirelessBench
| Method | WCHW | WCNS | WCMSA | Average |
|---|---|---|---|---|
| Qwen-Turbo-Latest | 58.34 | 62.13 | 66.43 | 62.30 |
| GPT-4o | 60.32 | 72.45 | 71.22 | 68.00 |
| CoT-SC (k=5) | 60.01 | 74.82 | 73.56 | 69.46 |
| MedPrompt | 61.22 | 73.18 | 72.89 | 69.10 |
| ADAS | 53.13 | 68.42 | 65.41 | 62.32 |
| AFlow | 69.92 | 76.12 | 73.90 | 73.29 |
| WIRELESSBENCH-REF (Ours) | 81.02 | 86.18 | 86.72 | 84.64 |
All values are in percentage (%)
Performance Comparison
(Figures: average performance ranking and task performance heatmap.)
Conclusion
We presented WIRELESSBENCH, an extensible benchmark for evaluating the operational reliability of LLM agents on wireless network intelligence.
Reliability emerged as the principal deployment bottleneck during our development of wireless AI agents. Built around a three-tier cognitive hierarchy (WCHW, WCNS, WCMSA) that spans static domain knowledge, intent-to-allocation mapping, and proactive mobility-aware service assurance, WIRELESSBENCH operationalizes reliability through three design commitments: tolerance-aware scoring, tool-necessary tasks, and per-item CoT.
Evaluating seven methods reveals that the strongest frontier model achieves only 68.00% under direct prompting, and that formula misapplication (31%), reasoning-path breaks (28%), and unit confusion (23%) dominate the failure landscape (the last of which is nearly eliminated by tool-integrated verification).
While current limitations in environment coverage, data augmentation, and baseline diversity remain open, we hope WIRELESSBENCH provides a rigorous and community-accessible foundation for advancing the reliability of AI agents in mobile-network operations.
Accuracy Comparison
- Reliability Gap: direct prompting (GPT-4o) reaches only 68.00% accuracy on WirelessBench
- Performance Ceiling: WIRELESSBENCH-REF reaches 84.64% accuracy
- Tool Value: domain-specific tool integration accounts for the 16.64 pp gap between the two
Citation
Paper Information
Title: WirelessBench: A Tolerance-Aware Benchmark for LLM Agents on Wireless Network Intelligence
Venue: arXiv preprint
arXiv: 2603.21251v1
BibTeX Citation:
@article{tong2026wirelessbench,
title={WirelessBench: A Tolerance-Aware LLM Agent Benchmark for Wireless Network Intelligence},
author={Jingwen Tong and Fang Liu and Linkai Xv and Shiliang Lu and Kangqi Li and Yiqian Zhang and Yijie Song and Zeyang Xue and Jun Zhang},
journal={arXiv preprint arXiv:2603.21251v1},
year={2026}
}