Chapter 6

Generalisations

Real-world deployment demands that learned policies generalise beyond their training distribution—a requirement difficult to validate when benchmarks comprise dozens of manually designed tasks rather than diverse families of environments. This chapter introduces EnvCraft, a validation-first system that generates thousands of validated Gymnasium environments from natural-language concepts, enabling larger-scale generalisation studies than are usually practical with hand-built benchmarks. A multi-stage pipeline combines large language model code generation with automated testing and agent-based validation, producing environments that share a fixed observation-action interface whilst varying in dynamics, reward structures, and win conditions. Privileged agents with source-code access screen for unsuitable difficulty extremes and generate demonstration trajectories that bootstrap vision-based learning. Cross-validation experiments centred on procedurally generated Tetris variants provide within-family evidence that broader training distributions can improve performance on held-out tasks. This infrastructure addresses the scarcity of validated benchmark diversity identified in earlier chapters and provides a basis for future evaluation of whether modular systems genuinely generalise or merely overfit to narrow task distributions.


6.1 Introduction

Real-world deployment demands generalisation. Chapter 2 traced the division of labour from pin factories to flight computers, establishing that specialisation enables productivity gains only when workers—or policy units—transfer skills across contexts. Chapter 3 examined how the A320's flight computers, the French power grid's hierarchical control, and the Kangduo surgical robot's dual-console handover achieve reliability through architectural patterns: specialisation, redundancy, constrained transitions. Chapter 4 identified deployment challenges in sepsis treatment and telesurgery, revealing that learned policies must generalise beyond their training distribution whilst maintaining interpretability and bounded execution. The modular systems developed in Chapter 5—policy graphs with hard routing, commitment bounds, and distributed execution—inherit these principles. However, evaluating whether such systems genuinely generalise or merely overfit to narrow task distributions requires benchmark diversity at scales beyond existing suites.

Traditional RL benchmarks comprise dozens of manually designed tasks. The Arcade Learning Environment standardised evaluation across Atari games; DeepMind Control Suite provided continuous control tasks; Procgen introduced procedurally generated levels. These contributions enabled algorithmic progress, yet benchmark scarcity creates a fundamental tension: as agents approach human-level performance on fixed suites, distinguishing genuine competence from task-specific memorisation becomes difficult. Procgen demonstrated that agents trained on limited level seeds catastrophically overfit when evaluated on held-out levels. NetHack and Crafter push complexity further, yet represent singular rule systems rather than diverse families of mechanics.

Prior approaches to environment diversity operate at different granularities. Procedural content generation varies layouts and textures within fixed rules. Automatic environment design methods such as POET and PAIRED co-evolve tasks and agents but rarely certify pixel-learnability. Game description languages like VGDL enable compact specification but require bespoke tooling. What remains scarce is validated diversity at the level of rules—new dynamics, reward structures, and win conditions.

EnvCraft addresses this gap through a validation-first pipeline. Ideas become design briefs via gpt-oss-20b (https://openai.com/index/introducing-gpt-oss/); briefs become code via gpt-oss-120b; code is tested and repaired; agent-based checks screen for degenerate cases. A privileged agent with full access to environment internals screens out environments at difficulty extremes and generates demonstration data for bootstrapping vision-based policies. The final corpus comprises environments that are syntactically correct, API-compliant, and screened for obvious degeneracies—pixel-based learnability is not systematically verified.

Every EnvCraft environment exposes a fixed specification enabling cross-game training:

  • Observation: 84\(\times\)84\(\times\)3 RGB array (uint8)
  • Action: MultiDiscrete([5,2,2])—five movement options plus two binary buttons
  • Episode: Maximum 1,000 steps
  • API: Gymnasium-compliant with deterministic seeding
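
The fixed interface above can be sketched as a minimal environment skeleton. This is an illustrative example, not a generated EnvCraft environment, and it uses plain numpy rather than `gymnasium.spaces` to stay self-contained; the class and method names follow the Gymnasium `reset`/`step` convention described in the text.

```python
import numpy as np

class FixedInterfaceEnv:
    """Minimal sketch of the fixed EnvCraft interface (hypothetical example):
    84x84x3 uint8 observations, a MultiDiscrete([5, 2, 2]) action, and a
    1,000-step episode cap with deterministic seeding."""

    OBS_SHAPE = (84, 84, 3)
    ACTION_NVEC = (5, 2, 2)   # five movement options + two binary buttons
    MAX_STEPS = 1000

    def __init__(self, seed=None):
        self.rng = np.random.default_rng(seed)
        self.t = 0

    def reset(self, seed=None):
        if seed is not None:
            self.rng = np.random.default_rng(seed)  # deterministic seeding
        self.t = 0
        return self._render(), {}

    def step(self, action):
        move, button_a, button_b = action
        assert 0 <= move < 5 and button_a in (0, 1) and button_b in (0, 1)
        self.t += 1
        truncated = self.t >= self.MAX_STEPS  # timeout, not a game-over
        return self._render(), 0.0, False, truncated, {}

    def _render(self):
        # Placeholder pixels; a real environment renders its game state here.
        return self.rng.integers(0, 256, self.OBS_SHAPE, dtype=np.uint8)
```

Any agent written against this contract can run on every environment in the corpus without per-game adaptation, which is what makes cross-game training feasible.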

Representative examples of generated environments are shown in Figure 6.1.

Figure 6.1 Example EnvCraft environments. Six representative rendered observations from distinct generated games, illustrating diversity in mechanics and visual style whilst sharing the fixed 84\(\times\)84\(\times\)3 RGB observation and MultiDiscrete([5,2,2]) action interface.

The pipeline also generates privileged demonstration trajectories used to bootstrap vision-based learning; Section 6.4 describes this process.

This chapter makes three primary contributions:

  1. A multi-stage validation pipeline producing 9,694 validated environments from 20,000 initial concepts (48.5% yield), incorporating privileged agent screening and privileged-rollout replay seeding for vision agent pretraining.
  2. A privileged agent methodology that uses language model access to source code for difficulty screening and demonstration trajectory generation, bootstrapping vision-based learning via replay seeding and pretraining.
  3. Larger-scale within-family generalisation evaluation: Using 1,000 procedurally generated Tetris environments with 10-fold cross-validation, this chapter demonstrates that broader training distributions produce significant positive transfer to held-out tasks: 68.7% of 1,000 environments show gains (7.4% mean improvement on split 0 as a representative example), with a monotonic relationship between training diversity and generalisation performance under this protocol.

6.2 Related Work

6.2.1 Benchmarks and Procedural Content Generation

The Arcade Learning Environment established standardised evaluation across 57 Atari 2600 games, though subsequent work revealed issues with deterministic dynamics. Mnih et al. demonstrated that deep Q-networks could achieve human-level performance on many Atari games, whilst Rainbow pushed performance further by combining multiple algorithmic improvements. DeepMind Control Suite provided continuous control tasks with interpretable rewards. Whilst these suites enabled algorithmic progress, they prioritise consistency and reproducibility over rule-level diversity.

Recent benchmarks address overfitting through procedural content generation. Procgen generates unlimited level variants within 16 fixed game types using deterministic seeds, demonstrating that agents trained on limited seeds catastrophically overfit when evaluated on held-out seeds. MiniGrid provides gridworld navigation tasks with procedurally generated mazes and layouts, enabling studies of sample efficiency and partial observability. NetHack wraps the classic roguelike game as a Gymnasium environment, offering extraordinary complexity through procedural dungeon generation, though the symbolic state representation and ASCII rendering differ substantially from pixel-based benchmarks. MiniHack extracts NetHack mechanics into smaller, controllable tasks with faster episode turnover. Crafter provides a single open-world survival game with procedurally generated terrain, evaluating agents through achievement-based metrics across diverse skills.

These approaches vary content—level layouts, terrain, entity placements—within fixed rule systems. Our work operates at a different granularity: we generate the rules themselves, producing entirely distinct game mechanics, reward structures, and win conditions whilst maintaining a common observation and action interface.

6.2.2 Automatic Environment Design

A growing body of work explores automatic generation of training environments to improve generalisation and robustness. POET introduced open-ended co-evolution of environments and agents, progressively increasing difficulty through evolutionary selection whilst maintaining a diverse population of agent-environment pairs. PAIRED frames environment design as an adversarial game in which an antagonist generates challenging environments whilst a protagonist learns to solve them, leading to robust zero-shot transfer. PLR maintains a distribution over procedurally generated levels, prioritising replay of environments with high temporal-difference error to focus learning on the curriculum frontier. Quality-diversity methods such as MAP-Elites search for diverse, high-performing solutions across behavioural dimensions, producing archives of environments that cover different challenge characteristics.

These approaches verify learnability through agent training: environments that prove unlearnable within the training budget are discarded or down-weighted. However, this verification occurs during the training loop, requiring agents to attempt learning on potentially futile tasks. Our approach separates validation from training: privileged agents with state access provide rapid learnability probes before vision-based training begins, and the fixed interface enables independent validation once rather than per-training-run verification.

6.2.3 Language Models for Code Generation

Large language models now enable direct code synthesis from natural language. Game description languages such as VGDL provide declarative alternatives, but require bespoke interpreters; EnvCraft instead generates executable Gymnasium code directly, avoiding bespoke tooling whilst addressing correctness and learnability through automated testing and agent-based validation. Once environments are validated, DQfD-inspired replay seeding provides an efficient path to bootstrapping vision-based policies from privileged-agent trajectories using standard temporal-difference objectives only (no supervised margin loss).

6.3 Code Generation Pipeline

The EnvCraft system decomposes environment creation into code generation and agent-based validation phases, implementing progressive refinement with empirical gates between stages. Figure 6.2 presents the complete pipeline architecture, which transforms natural-language game concepts into validated Gymnasium environments.

Figure 6.2 Complete EnvCraft pipeline. Game concepts progress through code generation (idea → brief → implementation → testing), random agent filtering (removing InstaWin/InstaDeath cases), and privileged rollout generation (difficulty assessment removes too-hard/too-easy cases; rollouts seed replay buffer for pretraining). Successfully validated environments are paired with their prompts as training data for fine-tuning code generation models.

The first half of our system transforms natural-language concepts into executable Gymnasium environments through progressive refinement.

6.3.1 Concept Generation and Code Synthesis

The pipeline generates 20,000 diverse game concepts by sampling from curated pools comprising 42 genres, 56 mechanics, 51 themes, 36 twists, 19 mashups, and 30 experimental concepts. Four complementary strategies ensure broad coverage: genre-blend (combining three distinct genres), mechanical (assembling four individual mechanics), thematic (pairing visual theme with core mechanic and twist), and experimental (unusual single-concept games). Each idea specifies concrete mechanics, objects, win/loss conditions, and numerical parameters to ensure implementability. Deduplication via normalised text hashing removes exact duplicates whilst preserving meaningful variations.
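
The sampling scheme above can be sketched as follows. The pools here are illustrative miniatures (the real pipeline draws from curated pools of 42 genres, 56 mechanics, 51 themes, and so on), and the exact prompt templates are not reproduced; only the four coverage strategies and the hash-based deduplication follow the text.

```python
import hashlib
import random

# Illustrative miniature pools; stand-ins for the curated EnvCraft pools.
GENRES = ["platformer", "shooter", "puzzle", "racer", "tower-defence"]
MECHANICS = ["gravity flip", "wall jump", "combo chain", "resource mining"]
THEMES = ["neon city", "deep sea", "clockwork"]
TWISTS = ["time reverses on damage", "controls rotate every 10 s"]

def sample_idea(rng):
    """Sample one concept using one of the four coverage strategies."""
    strategy = rng.choice(["genre-blend", "mechanical", "thematic", "experimental"])
    if strategy == "genre-blend":
        return " + ".join(rng.sample(GENRES, 3))       # three distinct genres
    if strategy == "mechanical":
        return ", ".join(rng.sample(MECHANICS, 4))     # four individual mechanics
    if strategy == "thematic":
        return f"{rng.choice(THEMES)} / {rng.choice(MECHANICS)} / {rng.choice(TWISTS)}"
    return f"experimental: {rng.choice(MECHANICS)} only"

def dedup(ideas):
    """Drop exact duplicates via normalised text hashing (lowercase,
    whitespace-collapsed), preserving first occurrences."""
    seen, kept = set(), []
    for idea in ideas:
        key = hashlib.sha256(" ".join(idea.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(idea)
    return kept
```

Because deduplication normalises only case and whitespace, near-duplicates with genuinely different wording survive, matching the stated aim of preserving meaningful variations.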

These concepts are expanded into detailed 1,500–3,000-word design specifications using gpt-oss-20b, covering core mechanics, visual design, action space mapping to MultiDiscrete([5,2,2]), reward structure, and termination conditions. An early viability check critiques each brief for internal consistency and implementability, filtering out specifications with impossible physics, contradictory win conditions, or action space mismatches. Of 20,000 initial ideas, 18,878 briefs (94.4%) pass this filter.

Validated briefs are transformed into executable Python code using gpt-oss-120b, generating complete 500–1,000-line Gymnasium environments that implement the full API (reset(), step(), render()), produce 84\(\times\)84\(\times\)3 RGB observations, handle edge cases gracefully, and maintain deterministic behaviour under fixed seeding. This achieves a 94.9% success rate: 17,915 of 18,878 briefs produce syntactically valid, importable Python code. The 963 failures arise from malformed syntax, circular imports, or non-existent library references.

6.3.2 Testing and Repair

Generated code undergoes a comprehensive test suite covering syntactic correctness, API conformance, reinforcement learning invariants (bounded rewards, eventual termination, informative observations), and deterministic behaviour under fixed seeding. Of the 17,915 environments that pass code generation, 8,503 initially fail one or more tests. Rather than discarding these environments immediately, we implement an automated repair loop in which error messages, stack traces, and failing test descriptions are provided to a language model tasked with fixing the code whilst preserving the original design intent.

The repair process proceeds iteratively, with up to three attempts permitted per environment. The first repair pass successfully fixes 1,520 environments, representing 17.9% of the initial failures. Many of these are straightforward errors: incorrect variable references, off-by-one indexing mistakes, or missing imports. The second pass recovers an additional 701 environments (8.2% of failures), typically addressing more subtle issues such as edge cases in collision detection or state update ordering. The third and final pass fixes 192 environments (2.3% of failures), capturing a small number of complex multi-step repairs. In total, the iterative repair process recovers 2,413 environments that would otherwise have been lost.

Despite these efforts, 6,090 environments remain unfixable after three attempts and are discarded. Ultimately, 11,825 environments pass all code-level tests and proceed to agent-based validation. Figure 6.3 and Table 6.1 show the complete filtering cascade and exact counts at each stage.

Figure 6.3 Environment filtering cascade. Starting from 20,000 generated game concepts, progressive filtering through design brief validation (18,878 pass), code generation (17,915 valid), automated testing with repair (11,825 pass after up to three repair iterations), and agent-based checks (9,694 final). Major losses occur during testing/repair (6,090 irreparable), random agent filtering (935 InstaWin + 606 InstaDeath), and privileged agent assessment (590 unsuitable difficulty). The 48.5% overall yield represents environments that are syntactically correct, API-compliant, and free of degenerate reward structures.
Stage                   Input    Output   Pass Rate
S1: Ideas               —        20,000   —
S2: Brief generation    20,000   18,878   94.4%
S3: Code generation     18,878   17,915   94.9%
S4: Test + repair       17,915   11,825   66.0%
S5: Random agent        11,825   10,284   87.0%
S6: Privileged agent    10,284   9,694    94.3%
Overall                 20,000   9,694    48.5%

Table 6.1 Pipeline statistics showing input/output counts and pass rates at each stage.
6.3.3 Random Agent Filtering

Environments that pass code-level testing may nonetheless exhibit degenerate behaviours that render them unsuitable for reinforcement learning research. Two baseline agent checks are applied to identify and eliminate such edge cases.

The first check, InstaWin detection, executes a random policy for multiple episodes and monitors the reward distribution. Environments in which random actions consistently achieve high returns—indicating that success requires no learning whatsoever—are flagged as degenerate. Such environments typically arise from overly generous reward shaping, trivial win conditions, or bugs that inadvertently reward all actions equally. This check removes 935 environments from the corpus.

The second check, InstaDeath detection, verifies that a no-op policy (an agent that takes no actions) does not immediately fail. Whilst it is acceptable for a no-op agent to eventually lose by timeout, instant death without any agency indicates unavoidable failure states that make learning impossible. These failures often stem from spawn-point collisions, initial conditions that violate game constraints, or aggressive enemies that attack before the agent can react. This check removes an additional 606 environments.
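
The two checks can be sketched as functions over the fixed interface. The episode counts and the reward threshold below are illustrative (the chapter does not state the exact values used); the environment is assumed to follow the Gymnasium five-tuple `step` return described in Section 6.3.

```python
import numpy as np

def is_insta_win(env, episodes=20, reward_threshold=0.9, rng=None):
    """InstaWin check: flag environments where a purely random policy
    consistently achieves high returns, i.e. success requires no learning."""
    if rng is None:
        rng = np.random.default_rng(0)
    returns = []
    for _ in range(episodes):
        env.reset()
        total, done = 0.0, False
        while not done:
            # Uniform random MultiDiscrete([5, 2, 2]) action.
            action = [rng.integers(5), rng.integers(2), rng.integers(2)]
            _, r, term, trunc, _ = env.step(action)
            total += r
            done = term or trunc
        returns.append(total)
    return np.mean(returns) >= reward_threshold

def is_insta_death(env, episodes=5, min_steps=10):
    """InstaDeath check: flag environments where a no-op policy terminates
    (not merely times out) within the first few steps."""
    for _ in range(episodes):
        env.reset()
        for _ in range(min_steps):
            _, _, term, trunc, _ = env.step([0, 0, 0])  # no-op action
            if term:
                return True   # died with no agency exercised
            if trunc:
                break         # timeout is acceptable
    return False
```

Note the asymmetry: a truncation (timeout) is tolerated by the InstaDeath check, but any early termination under pure no-ops indicates an unavoidable failure state.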

Following these random agent filters, 10,284 environments remain and proceed to privileged agent evaluation. These environments are syntactically correct, API-compliant, and free of the most obvious degeneracies, though they have not yet been verified for learnability from pixel observations.

6.4 Privileged Rollout Generation

"Demonstrations" here means trajectories collected from a non-learning policy and used to seed or supervise a vision-based learner. In EnvCraft, demonstrations are generated automatically by the privileged code-access policy rather than collected from humans or external datasets. All vision-based training uses privileged-rollout replay seeding and a pretraining phase: the replay buffer is initialised with privileged rollouts, the vision agent is pretrained by sampling exclusively from this seeded replay, and training then continues online using temporal-difference objectives only (no supervised margin loss as in canonical DQfD); code access is never available to the learner.

Passing code-level tests and random agent checks does not guarantee an environment is suitable for reinforcement learning. A game might execute correctly yet be unlearnable due to hidden state dependencies, adversarial dynamics, or reward structures that require capabilities beyond current algorithms. We address this through a privileged agent that has access to information unavailable to a standard vision-based learner.

6.4.1 Privileged Policy Synthesis

For each environment that passes random agent filtering, we synthesise a privileged policy using gpt-oss-120b with read-only access to the environment's source code. The model analyses the game logic, state representation, reward function, and termination conditions, then generates a Python policy class with an act(state) method that maps internal game state to actions.

This privileged agent operates on the complete internal state—player positions, enemy locations, item inventories, timers, and any other variables defined in the code—rather than the 84\(\times\)84\(\times\)3 rendered observation. The agent "plays" the environment in a read-only capacity: it observes the full state at each timestep and selects actions, but cannot modify any game variables directly. This asymmetry is intentional: the privileged agent provides a high-performing baseline representing what could be achieved with complete state information.

6.4.2 Difficulty Assessment

The privileged agent serves as a pragmatic screening heuristic for environments at difficulty extremes. The synthesised privileged policy is executed for multiple episodes and the outcome distribution is analysed under the rollout budget. If the privileged agent—with full state access—cannot consistently achieve positive outcomes, the environment may have design issues that make it unsuitable for our benchmark: potentially impossible win conditions, adversarial dynamics, or reward structures with no readily discoverable optima. Whilst the privileged agent is not guaranteed optimal and may fail for reasons unrelated to intrinsic unsolvability, environments it cannot solve are unlikely to provide useful learning signal for vision-based policies and are removed from the corpus. Conversely, if the privileged agent achieves maximum possible performance with near-certainty, the environment may lack meaningful challenge or have degenerate solutions accessible even to simple heuristics. Whilst not as problematic as potentially unsolvable games, trivially easy environments (as assessed by this heuristic) provide limited value for evaluating agent capabilities and are likewise removed.

This pragmatic filtering removes 590 environments from the 10,284 candidates, yielding a final corpus of 9,694 environments. This heuristic may exclude some learnable environments (where the privileged agent fails but vision agents might succeed) and retain some poorly-designed ones (where the privileged agent succeeds by exploiting structure unavailable to vision agents); it biases the corpus towards environments whose challenge is visible to a code-access policy, and may therefore under-represent tasks where perceptual difficulty is the dominant obstacle.
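
The screening decision reduces to a threshold rule over privileged-rollout outcomes. The cut-offs below are assumptions for illustration (the chapter describes "consistently achieve positive outcomes" and "maximum possible performance with near-certainty" qualitatively rather than numerically), and a positive episode return stands in for "positive outcome".

```python
def screen_difficulty(episode_returns, too_hard_below=0.2, too_easy_above=0.99):
    """Classify an environment from the privileged agent's empirical success
    rate over its rollout budget. Thresholds are illustrative, not the
    pipeline's actual values."""
    successes = sum(1 for ret in episode_returns if ret > 0)
    rate = successes / len(episode_returns)
    if rate < too_hard_below:
        return "remove-too-hard"   # possibly unsolvable or no discoverable optimum
    if rate > too_easy_above:
        return "remove-too-easy"   # trivial even for a simple heuristic
    return "keep"
```

Both removal branches are heuristic, as the text stresses: a failing privileged agent does not prove unsolvability, and a succeeding one does not prove the task is meaningful from pixels.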

6.4.3 Privileged Rollout Generation and Replay-Seeded Pretraining

For all 9,694 environments, the privileged agent executes extended rollouts and complete trajectories are recorded as tuples:

\[\mathcal{D} = \{(o_t, a_t, r_t, o_{t+1}, d_t)\}_{t=1}^{T}\]
Equation 6.1

where \(o_t\) is the rendered 84\(\times\)84\(\times\)3 observation (not the internal state), \(a_t\) is the action selected by the privileged policy (based on internal state), \(r_t\) is the reward, \(o_{t+1}\) is the next observation, and \(d_t\) is the termination flag.

These demonstrations have a distinctive property: the actions are informed by information not present in the observations. A vision-only agent must infer, from pixel patterns alone, the action choices that the privileged agent made using complete state knowledge — the demonstrations thus encode implicit information about what visual features correlate with high-performing behaviour. For the generalisation experiments, 1,000 transitions per environment seed the replay buffer; complete pretraining and online training details are provided in Section 6.5.
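
Recording a rollout in the form of Equation 6.1 can be sketched as follows. The `env.state` attribute and `env.render()` method are assumed accessors for the internal state and the rendered observation respectively (the real environments expose internals through their source code); the key point is that the stored observations are pixels while the actions are chosen from full state.

```python
def record_privileged_rollout(env, policy, max_steps=1000):
    """Record one rollout as (o_t, a_t, r_t, o_{t+1}, d_t) tuples.
    Actions come from the privileged policy acting on internal state;
    only rendered observations are stored in the trajectory."""
    transitions = []
    env.reset()
    obs = env.render()                      # o_t: pixels, not internal state
    for _ in range(max_steps):
        action = policy.act(env.state)      # privileged: full internal state
        _, reward, term, trunc, _ = env.step(action)
        next_obs = env.render()             # o_{t+1}
        done = term or trunc                # d_t
        transitions.append((obs, action, reward, next_obs, done))
        if done:
            break
        obs = next_obs
    return transitions
```

A vision-only learner replaying these tuples therefore sees expert actions paired with the observations it will itself receive, which is exactly what makes them usable for replay seeding.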

6.5 Generalisation Experiments

The scale of the corpus—9,694 validated environments—enables experimental designs that are rarely practical with hand-built benchmarks. Tetris is chosen as the primary evaluation domain because its objective is unambiguous: longer episodes are always better, irrespective of the underlying reward scale. The 1,000 generated Tetris environments differ markedly in board geometry, block distributions, gravity schedules, termination rules, and reward shaping, making raw episode lengths incomparable across environments; episode length relative to a random baseline provides a clean, monotonic, per-environment metric.

6.5.1 Experimental Protocol

One thousand distinct Tetris environments were procedurally generated within the EnvCraft framework and randomly partitioned into ten folds of 100 environments each. For each cross-validation split, one fold serves as the test set whilst the remaining 900 environments form the training set, yielding ten disjoint train–test partitions. This 900/100 split achieves two aims: (i) it provides sufficient diversity during training to support non-trivial generalisation, and (ii) it furnishes a held-out panel of 100 genuinely unseen environments on which to assess out-of-distribution performance. Each environment appears in the test set exactly once across the ten folds, providing 1,000 environment-level generalisation measurements.
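
The fold construction is standard k-fold partitioning over environment indices; a minimal sketch (the shuffle seed is arbitrary):

```python
import random

def make_cv_splits(n_envs=1000, n_folds=10, seed=0):
    """Partition environment indices into disjoint folds; each fold serves
    once as the test set against the union of the remaining folds."""
    ids = list(range(n_envs))
    random.Random(seed).shuffle(ids)
    fold_size = n_envs // n_folds
    folds = [ids[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    # (train_ids, test_ids) pairs, one per fold.
    return [(sorted(set(ids) - set(fold)), sorted(fold)) for fold in folds]
```

With the defaults this yields ten 900/100 partitions in which every environment is tested exactly once, matching the protocol above.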

The agent uses a Duelling Deep Q-Network architecture with approximately 12 million parameters, processing 84\(\times\)84\(\times\)3 RGB observations through a five-layer convolutional backbone before splitting into separate value and advantage streams. Training uses Double DQN with prioritised experience replay, \(\epsilon\)-greedy exploration (annealed from 1.0 to 0.1), Adam optimisation (learning rate \(3 \times 10^{-4}\)), and n-step returns (\(n=3\), \(\gamma = 0.99\)).

The training protocol (held constant across all diversity conditions) proceeds as follows:

  • Replay seeding: 1,000 transitions per training environment seed the replay buffer before any learning begins (10,000 collected per environment; the remainder are available for extended runs).
  • Pretraining: The vision agent is pretrained for 250,000 gradient updates, sampling minibatches exclusively from the seeded replay buffer, before any online interaction.
  • Online training: Standard Double/Duelling DQN with prioritised experience replay, sampling from the replay buffer containing both privileged-generated and agent-generated transitions.

The training curriculum cycles through the 900 training environments in shuffled order, with the agent experiencing 10,000 steps per environment before rotating to the next, yielding approximately 9 million total environment interactions per cross-validation split. Gradient updates occur every four environment steps.
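
The three phases and the curriculum cycle can be summarised as a control-flow sketch. The `learner` and `buffer` objects are stand-ins for the Duelling DQN and the prioritised replay buffer, and the environment methods are simplified; only the phase ordering, the per-environment step budget, and the update cadence follow the text.

```python
import random

def train_on_split(train_envs, learner, buffer, rng,
                   seed_per_env=1000, pretrain_updates=250_000,
                   steps_per_env=10_000, update_every=4):
    """Phase schedule for one cross-validation split (control flow only)."""
    # Phase 1: replay seeding -- privileged transitions enter the buffer
    # before any learning begins.
    for env in train_envs:
        for transition in env.privileged_transitions[:seed_per_env]:
            buffer.add(transition)
    # Phase 2: pretraining, sampling exclusively from the seeded buffer.
    for _ in range(pretrain_updates):
        learner.update(buffer.sample())
    # Phase 3: online training, cycling through environments in shuffled
    # order; one gradient update every `update_every` environment steps.
    order = list(train_envs)
    rng.shuffle(order)
    step = 0
    for env in order:
        env.reset()
        for _ in range(steps_per_env):
            buffer.add(env.step(learner.act()))
            step += 1
            if step % update_every == 0:
                learner.update(buffer.sample())
```

With 900 training environments and 10,000 steps each, Phase 3 accounts for the roughly 9 million interactions per split reported above.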

For each held-out environment, mean episode lengths under the trained and random policies are estimated from 1,000 episodes each (reducing Monte Carlo noise), with standard errors based on empirical episode-length variance. The 10-fold design ensures no individual environment dominates the evaluation and that results are not artefacts of a particular train–test partition.

6.5.2 Results and Analysis

Performance is measured as the percentage change in mean episode length of the trained policy relative to a random policy, normalised per environment so that each environment contributes equally to the aggregate statistics regardless of absolute episode scale.
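
The per-environment metric can be computed as below. The delta-method standard error is an assumption for illustration: the chapter reports confidence intervals based on empirical episode-length variance but does not state the exact estimator.

```python
import math

def percent_improvement(trained_lengths, random_lengths):
    """Percentage change in mean episode length of the trained policy
    relative to the random baseline, with an approximate (delta-method)
    standard error for the percentage."""
    mt = sum(trained_lengths) / len(trained_lengths)
    mr = sum(random_lengths) / len(random_lengths)
    improvement = 100.0 * (mt - mr) / mr

    def se(xs, m):
        # Standard error of the mean from the empirical variance.
        var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return math.sqrt(var / len(xs))

    se_pct = 100.0 * (mt / mr) * math.sqrt((se(trained_lengths, mt) / mt) ** 2
                                           + (se(random_lengths, mr) / mr) ** 2)
    return improvement, se_pct
```

Because each environment's improvement is normalised by its own random baseline, environments with very different absolute episode scales contribute comparably to the aggregate statistics.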

Across all ten folds, 687 of 1,000 environments (68.7%) show positive transfer from the trained policy over the random baseline. The overall mean improvement is approximately 1.96 steps (standard deviation 3.92 steps), indicating that training on 900 heterogeneous Tetris variants induces a systematic generalisation benefit despite considerable variability across individual environments. Figure 6.4 illustrates split 0 as a representative example: the mean improvement is 7.4% (95% CI [4.97%, 9.81%]), with the confidence band lying wholly to the right of zero, confirming statistically significant positive transfer. Figure 6.5 shows the scaling behaviour across diversity conditions.

Figure 6.4 Generalisation to held-out environments. Each horizontal bar represents percentage improvement in mean episode length for one of 100 test environments (split 0), sorted by performance. The solid vertical line (7.39%) marks mean improvement; shaded band shows 95% CI [4.97%, 9.81%]. Error bars indicate per-environment 95% CIs. The majority of environments show positive transfer.

Figure 6.5 Generalisation decreases with reduced training-set diversity. Mean percentage improvement in episode length as training set size varies (100, 300, 500, 700, 900 environments). Error bars show standard deviations across test environments and cross-validation splits (note: low-diversity conditions use fewer splits).

6.5.3 Scaling with Training Diversity

To assess scaling, policies were trained on subsets of 700, 500, 300, and 100 environments using the same evaluation framework. Reduced-diversity conditions use fewer cross-validation splits (five for the 700- and 500-environment conditions; one each for 300 and 100), so these results are indicative rather than a precise estimate of a change-point. Figure 6.5 shows a monotonic trend: as the number of training environments decreases, the mean generalisation effect degrades markedly, reaching near-zero in the 100–300 settings. This indicates that the generalisation benefit reflects broad environmental coverage during training rather than a few privileged environments.

6.6 Discussion

The pipeline has inherent constraints: the fixed action space MultiDiscrete([5,2,2]) excludes continuous-control settings; visual style is biased towards 2D arcade games; the three-attempt repair cap leaves complex multi-object interactions as the primary remaining failure mode. The privileged agent provides a useful heuristic but may not find optimal strategies for all games, biasing the corpus towards environments whose challenge is visible to a code-access policy. The Tetris generalisation results must be read within their scope: statistically significant transfer (68.7% of 1,000 environments, 1.96-step mean improvement) with modest effect sizes and high variability, all within a single game family. The core empirical claim is a within-family result; cross-domain generalisation remains an open question, and the benchmark is primarily a tool for making that question tractable at larger scale than existing hand-built suites allow.

6.7 Conclusion

This chapter presents EnvCraft, a validation-first system producing 9,694 diverse, validated Gymnasium environments from 20,000 initial concepts. The pipeline combines code generation with multi-stage agent-based filtering: random agents eliminate degenerate cases, whilst privileged agents with source code access screen for difficulty extremes and generate demonstration trajectories for replay seeding and pretraining of vision-based policies.

Large-sample within-family generalisation experiments on 1,000 procedurally generated Tetris environments show statistically significant positive transfer across all ten cross-validation folds: 68.7% of environments show gains, with a mean improvement of approximately 1.96 steps overall. Scaling experiments confirm a monotonic relationship between training diversity and generalisation performance, with near-zero transfer in the lowest-diversity settings. These results demonstrate that the benchmark supports generalisation studies at a scale still uncommon in reinforcement learning, whilst leaving cross-domain generalisation as an open question.

The complete corpus, interactive exploration tool, and open-source library are available at https://experiments.standardrl.com/envcraft.