TRON — Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

01 Abstract

Generate environments, don't collect data

Existing visual-RL post-training relies on static, curated datasets bounded by their collection budget. TRON replaces them with an online substrate.

Reinforcement learning for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training relies on static curated datasets whose fixed image–question–answer samples are bounded by their collection budget. We introduce TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate in which every training rollout is generated on demand by a controllable generator–verifier program that samples a fresh latent visual state, renders an image, asks a question, and exactly verifies the answer.

A single run can therefore draw an unbounded stream of fresh instances at the difficulty level required by the current curriculum. The current TRON suite contains 520 environments organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting); the same substrate supports both a single full model trained on all buckets and per-bucket ability-specialist models, with no additional data collection. We also introduce a substrate analysis covering generation reliability, instance and level diversity, cross-environment near-duplicates, and base-model pass rate by difficulty level. RL post-training with TRON-DAPO consistently improves performance on ten external multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B.

CONTRIBUTION 01

An online substrate

520 generator–verifier programs that produce fresh image–question rollouts at training time, with no cap on instances per run.

CONTRIBUTION 02

Full + specialist models

One substrate trains both a single full model on all five ability buckets and per-bucket specialists, with no extra data, and comparing them reveals how abilities transfer.

CONTRIBUTION 03

Audited & validated

We audit generation reliability, diversity, and difficulty calibration, and report consistent gains across three open VLM families on ten benchmarks.

02 How it works

A generator–verifier pair per environment

Each environment owns both a generator and a verifier. Because the answer is fixed before the image is drawn, the reward is exact — no LLM judge, no parsing.

Construct

One Python program per task type targets a single reasoning mechanism, with a built-in difficulty ladder (levels 0–9).

# one environment type
state = sample(seed, ℓ)
# exact ground truth, by code
answer = solve(state)
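As a minimal sketch of what one environment type might look like, the toy example below implements `sample` and `solve` for a missing-triangle-angle task. The `State` class, the mapping from level to angle granularity, and the seeding scheme are assumptions made for illustration; none of them are taken from the TRON codebase.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Latent state of a toy 'missing interior angle' task (hypothetical)."""
    angle_a: int  # first known interior angle, in degrees
    angle_b: int  # second known interior angle, in degrees


def sample(seed: int, level: int) -> State:
    """Draw a latent state deterministically from (seed, level)."""
    if not 0 <= level <= 9:
        raise ValueError("level must be in 0..9")
    # Assumed difficulty ladder: finer angle granularity at higher levels.
    step = 10 if level <= 2 else 5 if level <= 5 else 1
    # Distinct random stream per (seed, level) for non-negative seeds.
    rng = random.Random(seed * 10 + level)
    angle_a = rng.randrange(20, 81, step)
    angle_b = rng.randrange(20, 81, step)
    return State(angle_a, angle_b)


def solve(state: State) -> int:
    """Exact ground truth: interior angles of a triangle sum to 180 degrees."""
    return 180 - state.angle_a - state.angle_b


state = sample(seed=7, level=3)
answer = solve(state)
```

Because both angles are drawn from 20 to 80 degrees, the unknown angle always lies between 20 and 140 degrees, so every sampled state is a valid triangle.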

Generate

From a fresh seed, the generator samples a latent state, renders the image with the answer slot left blank, and asks a question. No two training steps see the same instance.

image, q = render(state)
# answer fixed first ✦
# verifier is exact
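To make the "answer slot left blank" idea concrete, here is a rough sketch of a `render` step for a triangle task. It emits an SVG string instead of a raster image so that it runs with the standard library only. The canvas size, label offsets, and question wording are illustrative assumptions, not the TRON renderer.

```python
import math
from typing import Tuple


def render(known_angles: Tuple[int, int]) -> Tuple[str, str]:
    """Return (svg_image, question) for a triangle with two labelled angles.

    The third angle is labelled 'x' rather than with its value.
    """
    a_deg, b_deg = known_angles
    x_deg = 180 - a_deg - b_deg
    if min(a_deg, b_deg, x_deg) <= 0:
        raise ValueError("angles must form a valid triangle")
    a, b, x = (math.radians(d) for d in (a_deg, b_deg, x_deg))
    # Put A and B on a unit base; the law of sines gives |AC| = sin(b) / sin(x).
    ac = math.sin(b) / math.sin(x)
    pts = [(0.0, 0.0), (1.0, 0.0), (ac * math.cos(a), ac * math.sin(a))]
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    scale = 240 / max(max(xs) - min(xs), max(ys) - min(ys))

    def to_px(p):
        # Map into a 300x300 canvas; SVG's y axis points down.
        return (30 + (p[0] - min(xs)) * scale, 270 - (p[1] - min(ys)) * scale)

    (ax, ay), (bx, by), (cx, cy) = (to_px(p) for p in pts)
    svg = (
        '<svg xmlns="http://www.w3.org/2000/svg" width="300" height="300">'
        f'<polygon points="{ax:.1f},{ay:.1f} {bx:.1f},{by:.1f} {cx:.1f},{cy:.1f}" '
        'fill="none" stroke="black"/>'
        f'<text x="{ax + 12:.1f}" y="{ay - 6:.1f}">{a_deg}°</text>'
        f'<text x="{bx - 40:.1f}" y="{by - 6:.1f}">{b_deg}°</text>'
        f'<text x="{cx - 4:.1f}" y="{cy + 24:.1f}">x</text>'
        "</svg>"
    )
    question = f"Two interior angles are {a_deg}° and {b_deg}°. Find x."
    return svg, question


image, q = render((50, 60))
```

The triangle is drawn to scale from the sampled angles, so the picture and the hidden answer come from the same latent state.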

Verify & train

The deterministic verifier scores the policy's answer with a binary reward, feeding a DAPO policy update and an advancing curriculum.

r = verify(ã, answer)
# deterministic 0/1
# reward → DAPO update
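The sketch below shows one way a binary verifier and its rewards could feed a DAPO-style update. The answer normalization in `verify` is an assumption. `group_advantages` follows the published DAPO recipe in spirit (rewards normalized within a group of rollouts for the same prompt, and groups with identical rewards skipped, as in dynamic sampling); how TRON configures these details is not stated on this page.

```python
from statistics import mean, pstdev
from typing import List, Optional


def verify(predicted: str, answer: int) -> int:
    """Deterministic 0/1 reward: 1 only if the prediction equals the answer."""
    text = predicted.strip().rstrip("°").strip()  # assumed normalization
    try:
        return int(float(text) == answer)
    except ValueError:
        return 0  # non-numeric output earns no reward


def group_advantages(rewards: List[int]) -> Optional[List[float]]:
    """Group-normalized advantages, or None if the group carries no signal."""
    spread = pstdev(rewards)
    if spread == 0:
        # Every rollout was right, or every rollout was wrong: skip the group.
        return None
    avg = mean(rewards)
    return [(r - avg) / spread for r in rewards]


# Four hypothetical rollouts for one prompt whose verified answer is 55.
rollouts = ["55", "55°", "fifty-five", " 70 "]
rewards = [verify(p, 55) for p in rollouts]
advantages = group_advantages(rewards)
```

With exact binary rewards, a group that is all correct or all incorrect has zero spread, which is why it contributes nothing to the update in this sketch.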


TRON method overview: constructing environments, generating training units, and DAPO RL training

TRON organizes 520 rule-verifiable generators into ability buckets. Each environment produces fresh, difficulty-controlled image–question rollouts with a deterministic verifier, and is audited before mixed or ability-specialist RL.
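The page does not specify how the curriculum advances through the difficulty ladder. One simple possibility, sketched below, is to move up a level once the recent pass rate at the current level crosses a threshold. The class name, window size, and threshold are all hypothetical.

```python
from collections import deque


class LevelCurriculum:
    """Advance one difficulty level when the recent pass rate is high enough.

    The window size and threshold are illustrative assumptions.
    """

    def __init__(self, window: int = 4, threshold: float = 0.75, max_level: int = 9):
        self.level = 0
        self.max_level = max_level
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def update(self, reward: int) -> int:
        """Record one binary reward and return the level for the next rollout."""
        self.recent.append(reward)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) >= self.threshold:
            if self.level < self.max_level:
                self.level += 1
                self.recent.clear()  # re-estimate the pass rate at the new level
        return self.level


curriculum = LevelCurriculum()
levels = [curriculum.update(r) for r in [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]]
```

Because the verifier is deterministic, the pass rate used here is an exact count of correct rollouts rather than a judged estimate.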

03 The suite

Five ability buckets, 520 environments

Each bucket groups generator–verifier programs around reusable visual-reasoning mechanisms. Diversity enters at three levels: mechanism, instance, and difficulty.

Spatial

111

environments

3D rotation, cube nets & folding, navigation & pathfinding, perspective shifts, mechanical layout.

Math

131

environments

Geometric theorems, analytic geometry, algebra, probability over visual quantities.

Diagram

144

environments

Chart aggregation, tables, graph algorithms, flowcharts, scientific figures.

Pattern / Logic

104

environments

Constraint puzzles, visual analogies, syllogistic deduction, state-space planning.

Counting

30

environments

Visual enumeration of objects, cells & regions; path counting; feature estimation.
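A small sketch of how one substrate could serve both a full run and a per-bucket specialist run. The bucket sizes are the counts listed above, with Counting taken as the remainder of the 520 total (an inference). Sampling environments uniformly, and the function name `sample_environment`, are assumptions; the actual mixing weights are not given here.

```python
import random
from typing import Optional, Tuple

TOTAL_ENVIRONMENTS = 520
BUCKET_SIZES = {"spatial": 111, "math": 131, "diagram": 144, "pattern_logic": 104}
# Counting is taken as the remainder of the 520 total (an inference).
BUCKET_SIZES["counting"] = TOTAL_ENVIRONMENTS - sum(BUCKET_SIZES.values())


def sample_environment(rng: random.Random, bucket: Optional[str] = None) -> Tuple[str, int]:
    """Pick (bucket, environment index).

    bucket=None models a full run over all buckets; a bucket name models an
    ability-specialist run restricted to that bucket.
    """
    if bucket is None:
        names = list(BUCKET_SIZES)
        # Weighting by bucket size makes every environment equally likely.
        weights = [BUCKET_SIZES[name] for name in names]
        bucket = rng.choices(names, weights=weights, k=1)[0]
    return bucket, rng.randrange(BUCKET_SIZES[bucket])


rng = random.Random(0)
full_pick = sample_environment(rng)
math_pick = sample_environment(rng, bucket="math")
```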

04 Sampled rollouts

What an environment produces

Each rollout pairs a rendered visual instance given to the policy with the answer accepted by the executable verifier. Below are illustrative renders per bucket.

Spatial · maze navigation

Shortest path

"Minimum number of moves from S to G?"

✓ verified answer 8

Math · angle chasing

Triangle interior angle

"Two interior angles are 55° and 70°. Find x."

✓ verified answer 55°

Diagram · chart aggregation

Bar-chart reading

"Which category has the highest value?"

✓ verified answer C

Pattern · Raven matrix

Matrix completion

"Which shape completes the bottom-right cell?"

✓ verified answer ▲

Counting · object enumeration

Occluded counting

"How many circles are in the image?"

✓ verified answer 7

Math · clock angle

Angle between hands

"What is the angle between the hour and minute hands?"

✓ verified answer 47.5°

Renders above are stylized reconstructions for the web. See the paper's appendix for sampled outputs across Levels 0, 5, and 9.
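To illustrate what "accepted by the executable verifier" can mean for three of the rollouts above, the sketch below computes ground truth by code: breadth-first search for the maze, the angle-sum rule for the triangle, and hand positions for the clock. The maze layout and the time 3:25 are hypothetical inputs chosen to reproduce the displayed answers; the page does not give the underlying states.

```python
from collections import deque
from typing import List


def shortest_path_moves(grid: List[str]) -> int:
    """Minimum number of 4-neighbour moves from 'S' to 'G'; -1 if unreachable."""
    cells = {(r, c): ch for r, row in enumerate(grid) for c, ch in enumerate(row)}
    start = next(pos for pos, ch in cells.items() if ch == "S")
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), dist = queue.popleft()
        if cells[(r, c)] == "G":
            return dist
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            # Cells outside the grid are treated as walls.
            if cells.get(nxt, "#") != "#" and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1


def clock_angle(hour: int, minute: int) -> float:
    """Smaller angle in degrees between the hour and minute hands."""
    hour_hand = 30 * (hour % 12) + 0.5 * minute  # 30 deg per hour, 0.5 deg per minute
    minute_hand = 6 * minute                     # 6 deg per minute
    diff = abs(hour_hand - minute_hand)
    return min(diff, 360 - diff)


MAZE = [  # hypothetical layout; '#' marks a wall
    "S.#..",
    ".##..",
    "...#.",
    ".#...",
    "...#G",
]
print(shortest_path_moves(MAZE))  # 8
print(clock_angle(3, 25))         # 47.5
print(180 - 55 - 70)              # 55
```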

05 Results

Consistent gains across families & benchmarks

TRON improves all three open VLMs on every reported external benchmark. The ability-specialist analysis shows that transfer is driven by underlying capability rather than by visual format alone.

Main results (first table) · Ability specialists (second table; its Base and Full rows match the Qwen3-4B rows of the first)

Model	Run	WeMath-S	WeMath-L	MathV	Dyna	MME-R	Spat.	Logic	HELIX	CharXiv	ChartQA	Puzzle	Mean
Qwen3-4B	Base	52.86	70.95	43.15	66.11	42.85	77.52	57.05	21.67	39.80	34.41	72.35	52.61
	+TRON	58.29	74.86	44.54	67.49	45.88	80.47	59.73	25.56	42.40	35.04	73.25	55.23
	Δ	+5.43	+3.91	+1.39	+1.38	+3.03	+2.95	+2.68	+3.89	+2.60	+0.63	+0.90	+2.62
Qwen2.5-7B	Base	36.10	55.71	43.55	53.55	26.60	57.67	46.98	4.44	37.40	40.50	46.80	40.85
	+TRON	39.24	58.48	46.50	55.35	28.62	62.03	47.43	5.16	38.90	42.64	52.55	43.35
	Δ	+3.14	+2.77	+2.95	+1.80	+2.02	+4.36	+0.45	+0.72	+1.50	+2.14	+5.75	+2.50
MiMo-7B	Base	62.10	80.57	70.89	74.37	45.29	78.77	63.31	26.19	58.70	59.65	77.20	63.37
	+TRON	68.86	82.19	73.65	76.23	46.46	86.58	66.89	30.32	62.60	60.34	77.35	66.50
	Δ	+6.76	+1.62	+2.76	+1.86	+1.17	+7.81	+3.58	+4.13	+3.90	+0.69	+0.15	+3.13

All scores are percentages from VLMEvalKit. Qwen3-4B = Qwen3-VL-4B-Instruct · Qwen2.5-7B = Qwen2.5-VL-7B-Instruct · MiMo-7B = MiMo-VL-7B-SFT. ChartQA Pro for MiMo uses a format-normalized rejudge.
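As a quick consistency check on the table, the snippet below recomputes the Δ row and the Mean column for Qwen3-4B from the per-benchmark scores. It assumes Mean is the unweighted average of the eleven score columns, which is consistent with the printed values.

```python
COLUMNS = ["WeMath-S", "WeMath-L", "MathV", "Dyna", "MME-R", "Spat.",
           "Logic", "HELIX", "CharXiv", "ChartQA", "Puzzle"]
# Qwen3-4B rows copied from the main results table.
BASE = [52.86, 70.95, 43.15, 66.11, 42.85, 77.52, 57.05, 21.67, 39.80, 34.41, 72.35]
TRON = [58.29, 74.86, 44.54, 67.49, 45.88, 80.47, 59.73, 25.56, 42.40, 35.04, 73.25]

# Per-benchmark improvement, rounded to the table's two decimals.
deltas = {name: round(t - b, 2) for name, b, t in zip(COLUMNS, BASE, TRON)}

# Assumed definition of Mean: unweighted average over the eleven columns.
base_mean = round(sum(BASE) / len(BASE), 2)
tron_mean = round(sum(TRON) / len(TRON), 2)
mean_delta = round(tron_mean - base_mean, 2)

print(deltas["WeMath-S"], base_mean, tron_mean, mean_delta)
```

Under that assumption the recomputed means are 52.61 and 55.23, a gain of 2.62, matching the table.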

Run	WeMath-S	WeMath-L	MathV	Dyna	MME-R	Spat.	Logic	HELIX	CharXiv	ChartQA	Puzzle	Mean
Base	52.86	70.95	43.15	66.11	42.85	77.52	57.05	21.67	39.80	34.41	72.35	52.61
Math	58.76	75.90	41.22	66.25	44.70	77.61	59.51	26.90	42.00	35.52	70.45	54.44
Spatial	56.38	75.62	42.56	65.41	44.11	78.75	59.06	23.65	39.30	35.52	71.85	53.84
Count	57.14	74.29	44.14	66.27	45.20	77.52	59.28	23.97	42.00	34.08	69.90	53.98
Pattern	54.48	71.81	41.57	65.33	43.27	76.22	56.38	23.49	41.60	34.80	71.45	52.76
Diagram	57.81	74.76	44.67	66.69	44.70	79.01	60.18	25.40	41.90	33.94	71.40	54.59
Full	58.29	74.86	44.54	67.49	45.88	80.47	59.73	25.56	42.40	35.04	73.25	55.23

RQ1 · within-bucket

Does a format-defined bucket train its own domain?

Each specialist substantially improves external subtasks whose visual format matches its bucket — Math +11.2 on WeMath angles, Spatial +16.7 on MM-HELIX maze.

→ Visual-format alignment has a clear effect.

RQ2 · cross-bucket

Does capability transfer across formats?

Each specialist also lifts subtasks outside its bucket. Math transfers to MM-HELIX maze (+20.0); Count to MathVerse volume (+7.8); Diagram to PuzzleVQA rect-height (+10.0).

→ Trained capability transfers across formats.

RQ3 · format alone

Is visual format alone sufficient?

The visually aligned specialist is the top specialist on only 1 of 10 benchmarks. On MathVerse, Diagram (figure-reading) beats Math; on CharXiv, Math (inference chains) beats Diagram.

→ Cover both format and capability.
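The comparisons in RQ3 can be read directly off the ability-specialist table. The sketch below finds the best specialist per column from the table's numbers. It does not encode which bucket counts as "visually aligned" with which benchmark, since that mapping is not given on this page, and the helper name `best_specialists` is ours.

```python
COLUMNS = ["WeMath-S", "WeMath-L", "MathV", "Dyna", "MME-R", "Spat.",
           "Logic", "HELIX", "CharXiv", "ChartQA", "Puzzle"]
# Specialist rows copied from the ability-specialist table.
SPECIALISTS = {
    "Math":    [58.76, 75.90, 41.22, 66.25, 44.70, 77.61, 59.51, 26.90, 42.00, 35.52, 70.45],
    "Spatial": [56.38, 75.62, 42.56, 65.41, 44.11, 78.75, 59.06, 23.65, 39.30, 35.52, 71.85],
    "Count":   [57.14, 74.29, 44.14, 66.27, 45.20, 77.52, 59.28, 23.97, 42.00, 34.08, 69.90],
    "Pattern": [54.48, 71.81, 41.57, 65.33, 43.27, 76.22, 56.38, 23.49, 41.60, 34.80, 71.45],
    "Diagram": [57.81, 74.76, 44.67, 66.69, 44.70, 79.01, 60.18, 25.40, 41.90, 33.94, 71.40],
}


def best_specialists(column: str) -> list:
    """Return every specialist that attains the top score in a column (ties kept)."""
    idx = COLUMNS.index(column)
    top = max(scores[idx] for scores in SPECIALISTS.values())
    return [name for name, scores in SPECIALISTS.items() if scores[idx] == top]


for column in COLUMNS:
    print(column, best_specialists(column))
```

On these numbers Diagram leads the MathV column ahead of Math, while Math (tied with Count) leads CharXiv ahead of Diagram, which is the pattern RQ3 describes.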
