
TRON: Targeted Rule-Verifiable Online Environments
for Visual Reasoning RL

A substrate for visual-reasoning RL where every training rollout is generated on demand by a controllable generator–verifier program — fresh, difficulty-controlled, and exactly graded.

Tianze Yang*   Yucheng Shi*   Ruitong Sun   Jingyuan Huang   Ninghao Liu   Jin Sun
University of Georgia  ·  *Equal contribution
520 rule-verifiable envs
5 ability buckets
Unbounded fresh instances / run
10 external benchmarks
01 Abstract

Generate environments, don't collect data

Existing visual-RL post-training relies on static, curated datasets bounded by their collection budget. TRON replaces these datasets with an online substrate.

Reinforcement learning for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training trains on static curated datasets, with fixed image–question–answer samples bounded by their collection budget. We introduce TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate: a training rollout is generated on demand by a controllable generator–verifier program that samples a fresh latent visual state, renders an image, asks a question, and exactly verifies the answer.

A single run can therefore draw an unbounded stream of fresh instances at the difficulty level required by the current curriculum. The current TRON suite contains 520 environments organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting); the same substrate supports both a single full model trained on all buckets and per-bucket ability-specialist models, with no additional data collection. We also introduce a substrate analysis covering generation reliability, instance and level diversity, cross-environment near-duplicates, and base-model pass rate by difficulty level. RL post-training with TRON-DAPO consistently improves performance on ten external multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B.
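As a rough illustration of drawing instances "at the difficulty level required by the current curriculum", the sketch below advances the level once the recent pass rate clears a threshold. The page does not state TRON's actual curriculum rule, so the `LevelCurriculum` class, the threshold, and the window size are all hypothetical.

```python
from collections import deque


class LevelCurriculum:
    """Hypothetical rule: move up one level when the recent pass rate is high."""

    def __init__(self, threshold=0.75, window=50, max_level=9):
        self.level = 0
        self.threshold = threshold
        self.max_level = max_level
        self.recent = deque(maxlen=window)  # most recent binary rewards

    def update(self, reward: int) -> int:
        """Record one binary reward and return the level for the next rollout."""
        self.recent.append(reward)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) >= self.threshold:
            self.level = min(self.level + 1, self.max_level)
            self.recent.clear()  # start a fresh window at the new level
        return self.level


curriculum = LevelCurriculum(threshold=0.75, window=5)
levels = [curriculum.update(r) for r in [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]]
```

Because instances are generated on demand, a rule like this only changes the `level` argument passed to the generator; no new data has to be collected for the harder levels.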

CONTRIBUTION 01

An online substrate

520 generator–verifier programs that produce fresh image–question rollouts at training time, with no cap on instances per run.

CONTRIBUTION 02

Full + specialist models

One substrate trains both a single full model on all five ability buckets and per-bucket specialists, with no extra data, which makes ability transfer between buckets directly measurable.

CONTRIBUTION 03

Audited & validated

We audit generation reliability, diversity, and difficulty calibration, and report consistent gains across three open VLM families on ten external benchmarks.
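Two of the audit dimensions, generation reliability and instance diversity, could be estimated along the following lines. This is a minimal sketch and not the paper's audit code: `toy_generate`, its simulated failure mode, and the use of the hashable latent state as the uniqueness key are assumptions made for illustration.

```python
import random
from typing import Callable, Hashable


def toy_generate(seed: int, level: int) -> Hashable:
    """Hypothetical generator returning a latent state for a counting task."""
    rng = random.Random(seed)
    n_objects = rng.randint(1, 3 + level)  # harder levels allow more objects
    if n_objects > 10:
        raise ValueError("simulated generation failure")
    return (level, n_objects)


def audit(generate: Callable[[int, int], Hashable], level: int, n_seeds: int) -> dict:
    """Estimate reliability and instance diversity at one difficulty level."""
    states = []
    failures = 0
    for seed in range(n_seeds):
        try:
            states.append(generate(seed, level))
        except Exception:
            failures += 1
    reliability = (n_seeds - failures) / n_seeds
    # Fraction of successful generations that produced a distinct state.
    diversity = len(set(states)) / len(states) if states else 0.0
    return {"reliability": reliability, "diversity": diversity}


report = {level: audit(toy_generate, level, n_seeds=200) for level in (0, 5, 9)}
```

In this toy case level 0 has only three possible states, so its diversity is at most 3/200; a check like this flags environments whose generators saturate quickly.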

02 How it works

A generator–verifier pair per environment

Each environment owns both a generator and a verifier. Because the answer is fixed before the image is drawn, the reward is exact — no LLM judge, no parsing.

01

Construct

One Python program per task type targets a single reasoning mechanism, with a built-in difficulty ladder (levels 0–9).

# one environment type
state = sample(seed, )
# exact ground truth, by code
answer = solve(state)
02

Generate

A fresh seed samples a latent state, renders the image with the answer slot left blank, and asks a question. No two steps see the same instance.

image, q = render(state)
# answer fixed first
# verifier is exact
03

Verify & train

The deterministic verifier scores the policy's answer with a binary reward, feeding a DAPO policy update and an advancing curriculum.

r = verify(ã, answer)
# deterministic 0/1
# reward → DAPO update
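The three steps above can be sketched end to end with a toy counting environment. Everything in it is an assumption for illustration: the `CountDotsEnv` class, the text grid standing in for a rendered image, and the mapping from level to grid size. The actual TRON environments are not shown on this page.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    size: int
    dots: frozenset  # (row, col) cells that contain a dot


class CountDotsEnv:
    """Hypothetical environment: count the dots in a grid."""

    def sample(self, seed: int, level: int) -> State:
        if not 0 <= level <= 9:
            raise ValueError("level must be in 0..9")
        rng = random.Random(seed)
        size = 3 + level  # assumed difficulty ladder: larger grid per level
        cells = [(r, c) for r in range(size) for c in range(size)]
        k = rng.randint(1, len(cells) // 2)
        return State(size=size, dots=frozenset(rng.sample(cells, k)))

    def solve(self, state: State) -> str:
        # Ground truth is computed from the latent state, not from the image.
        return str(len(state.dots))

    def render(self, state: State):
        rows = [
            "".join("o" if (r, c) in state.dots else "." for c in range(state.size))
            for r in range(state.size)
        ]
        return "\n".join(rows), "How many dots are in the grid?"

    def verify(self, prediction: str, answer: str) -> int:
        return int(prediction == answer)  # deterministic 0/1 reward


env = CountDotsEnv()
state = env.sample(seed=7, level=2)
answer = env.solve(state)             # answer fixed before rendering
image, question = env.render(state)
reward_right = env.verify(answer, answer)
reward_wrong = env.verify("-1", answer)
```

Because `sample` depends only on the seed and level, a rollout can be regenerated exactly for debugging, while a new seed at each training step yields a new instance.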
Figure: TRON method overview: constructing environments, generating training units, and DAPO RL training.

TRON organizes 520 rule-verifiable generators into ability buckets. Each environment produces fresh
difficulty-controlled image–question rollouts with a deterministic verifier, audited before mixed or ability-specialist RL.
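To show how a binary verifier reward might feed a DAPO-style update, the sketch below computes group-normalized advantages over several rollouts of one instance and drops groups whose rewards are all identical, since those carry no learning signal (the dynamic-sampling idea from DAPO). This is an assumption about how the reward is consumed, not TRON's training code; the function names, the instance keys, and the epsilon are hypothetical, and the population standard deviation is one of several possible choices.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize binary rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards):
    """Keep only groups with mixed outcomes (neither all 0 nor all 1)."""
    return 0 < sum(rewards) < len(rewards)


# Rewards for several rollouts of three hypothetical instances.
groups = {
    "inst_a": [1, 0, 0, 1],
    "inst_b": [0, 0, 0, 0],  # too hard right now: filtered out
    "inst_c": [1, 1, 1, 1],  # too easy right now: filtered out
}
kept = {k: group_advantages(v) for k, v in groups.items() if keep_group(v)}
```

With an online generator, filtered groups can simply be replaced by freshly sampled instances, possibly at a different difficulty level.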

03 The suite

Five ability buckets, 520 environments

Each bucket groups generator–verifier programs around reusable visual-reasoning mechanisms. Diversity enters at three levels: mechanism, instance, and difficulty.

Spatial (111 environments): 3D rotation, cube nets & folding, navigation & pathfinding, perspective shifts, mechanical layout.
Math (131 environments): Geometric theorems, analytic geometry, algebra, probability over visual quantities.
Diagram (144 environments): Chart aggregation, tables, graph algorithms, flowcharts, scientific figures.
Pattern / Logic (104 environments): Constraint puzzles, visual analogies, syllogistic deduction, state-space planning.
Counting (30 environments): Visual enumeration of objects, cells & regions; path counting; feature estimation.
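The bucket sizes above sum to 520, and the same registry can serve a full run or a specialist run by changing which buckets are in the pool. The sketch below assumes a flat registry with made-up environment ids and uniform sampling over environments; how TRON actually mixes buckets during training is not stated on this page.

```python
import random

# Environment counts per bucket, as listed above.
BUCKET_SIZES = {
    "spatial": 111,
    "math": 131,
    "diagram": 144,
    "pattern_logic": 104,
    "counting": 30,
}


def env_ids(buckets):
    """Hypothetical ids of the form '<bucket>/<index>' for the chosen buckets."""
    return [f"{b}/{i:03d}" for b in buckets for i in range(BUCKET_SIZES[b])]


def sample_env(rng, buckets):
    """Pick one environment uniformly from the chosen buckets."""
    return rng.choice(env_ids(buckets))


full_pool = env_ids(BUCKET_SIZES)          # full model: all five buckets
specialist_pool = env_ids(["counting"])    # specialist: one bucket only
total = len(full_pool)
counting_share = BUCKET_SIZES["counting"] / total
picked = sample_env(random.Random(0), ["counting"])
```

Under uniform sampling over environments, Counting receives only about 6% of draws in a full run, so a bucket-balanced mix is a natural alternative design.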
04 Sampled rollouts

What an environment produces

Each rollout pairs a rendered visual instance given to the policy with the answer accepted by the executable verifier. Below are illustrative renders per bucket.

Renders above are stylized reconstructions for the web. See the paper's appendix for sampled outputs across Levels 0, 5, and 9.

05 Results

Consistent gains across families & benchmarks

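TRON improves all three VLMs on every external benchmark, with mean gains of +2.62 (Qwen3-VL-4B), +2.50 (Qwen2.5-VL-7B), and +3.13 (MiMo-VL-7B) points. The ability-specialist analysis shows that transfer is driven by underlying capability, not by visual format alone.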

Main results

Model       Run    WeMath-S  WeMath-L  MathV     Dyna      MME-R     Spat.     Logic     HELIX     CharXiv   ChartQA   Puzzle    Mean
Qwen3-4B    Base   52.86     70.95     43.15     66.11     42.85     77.52     57.05     21.67     39.80     34.41     72.35     52.61
            +TRON  58.29     74.86     44.54     67.49     45.88     80.47     59.73     25.56     42.40     35.04     73.25     55.23
            Δ      +5.43     +3.91     +1.39     +1.38     +3.03     +2.95     +2.68     +3.89     +2.60     +0.63     +0.90     +2.62
Qwen2.5-7B  Base   36.10     55.71     43.55     53.55     26.60     57.67     46.98     4.44      37.40     40.50     46.80     40.85
            +TRON  39.24     58.48     46.50     55.35     28.62     62.03     47.43     5.16      38.90     42.64     52.55     43.35
            Δ      +3.14     +2.77     +2.95     +1.80     +2.02     +4.36     +0.45     +0.72     +1.50     +2.14     +5.75     +2.50
MiMo-7B     Base   62.10     80.57     70.89     74.37     45.29     78.77     63.31     26.19     58.70     59.65     77.20     63.37
            +TRON  68.86     82.19     73.65     76.23     46.46     86.58     66.89     30.32     62.60     60.34     77.35     66.50
            Δ      +6.76     +1.62     +2.76     +1.86     +1.17     +7.81     +3.58     +4.13     +3.90     +0.69     +0.15     +3.13
All scores are percentages from VLMEvalKit. Qwen3-4B = Qwen3-VL-4B-Instruct · Qwen2.5-7B = Qwen2.5-VL-7B-Instruct · MiMo-7B = MiMo-VL-7B-SFT. ChartQA Pro for MiMo uses a format-normalized rejudge.
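The Mean column appears to be the unweighted average of the eleven score columns, and each Δ the difference between the +TRON and Base rows. That reading is inferred from the numbers, not stated on the page; the check below uses the Qwen3-4B rows as reported and is consistent with it.

```python
# Qwen3-4B scores in column order, WeMath-S through Puzzle (Mean excluded).
qwen3_base = [52.86, 70.95, 43.15, 66.11, 42.85, 77.52, 57.05, 21.67, 39.80, 34.41, 72.35]
qwen3_tron = [58.29, 74.86, 44.54, 67.49, 45.88, 80.47, 59.73, 25.56, 42.40, 35.04, 73.25]


def mean(xs):
    return sum(xs) / len(xs)


base_mean = round(mean(qwen3_base), 2)    # reported Mean: 52.61
tron_mean = round(mean(qwen3_tron), 2)    # reported Mean: 55.23
deltas = [round(t - b, 2) for b, t in zip(qwen3_base, qwen3_tron)]
mean_delta = round(tron_mean - base_mean, 2)   # reported Δ Mean: +2.62
all_improved = all(d > 0 for d in deltas)
```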
Ability specialists

RQ1 · within-bucket

Does a format-defined bucket train its own domain?

Each specialist substantially improves external subtasks whose visual format matches its bucket — Math +11.2 on WeMath angles, Spatial +16.7 on MM-HELIX maze.

→ Visual-format alignment has a clear effect.
RQ2 · cross-bucket

Does capability transfer across formats?

Each specialist also lifts subtasks outside its bucket. Math transfers to MM-HELIX maze (+20.0); Counting to MathVerse volume (+7.8); Diagram to PuzzleVQA rect-height (+10.0).

→ Trained capability transfers across formats.
RQ3 · format alone

Is visual format alone sufficient?

The visually-aligned specialist wins only 1 of 10 benchmarks. On MathVerse, Diagram (figure-reading) beats Math; on CharXiv, Math (inference chains) beats Diagram.

→ Cover both format and capability.
06 Citation

Cite TRON

If you find this work useful, please cite the paper.

@article{yang2025tron,
  title   = {TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL},
  author  = {Yang, Tianze and Shi, Yucheng and Sun, Ruitong and Huang, Jingyuan and Liu, Ninghao and Sun, Jin},
  journal = {arXiv preprint},
  year    = {2025},
  url     = {https://github.com/YangTianze009/TRON}
}