Introduction
Reinforcement learning offers a route to autonomous decision-making through environmental interaction. Yet the path from simulated Atari games to real-world deployment confronts fundamental obstacles: policies overfit to narrow training distributions, learned behaviours lack interpretability, communication latency undermines reactive control, and edge devices impose severe computational constraints. This thesis addresses these challenges by asking a concrete design question: what happens if reinforcement learning borrows its architecture from real systems that already operate reliably under constraint—the A320's flight computers, the French power grid's layered control, and, at a still more abstract level, the division of labour in pin factories?
Chapter 2 (Principles) traces automation from first principles. Adam Smith observed that dividing pin-making into eighteen operations enabled ten workers to produce 48,000 pins daily—a 240-fold improvement over craftwork. Dopamine neuroscience reveals how phasic spikes encode reward prediction error, the brain's mechanism for reinforcing successful actions and chunking them into reusable routines. These threads converge: specialisation improves productivity, reward signals drive learning, and modular organisation enables both.
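Smith's figures are easy to verify: ten specialised workers producing 48,000 pins a day average 4,800 pins each, against the roughly twenty pins Smith estimated an unassisted worker could manage. A quick arithmetic check (the 20-pin craftwork baseline is Smith's own estimate):

```python
# Sanity check of Adam Smith's pin-factory arithmetic.
workers = 10
daily_output = 48_000          # pins produced by the specialised workshop
craftwork_per_worker = 20      # Smith's estimate for an unassisted worker

per_worker = daily_output / workers              # 4,800 pins each
improvement = per_worker / craftwork_per_worker  # 240-fold

print(f"per worker: {per_worker:.0f}, improvement: {improvement:.0f}x")
# prints "per worker: 4800, improvement: 240x"
```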
Chapter 3 (Lessons) examines how engineered systems achieve reliability through redundancy, sensor fusion, and failsafes. The A320 distributes responsibility across dedicated computers (ELACs for pitch and roll, SECs for spoilers and backup, FCGCs for autopilot); the power grid coordinates IEDs at substations with SCADA at national scale; the Kangduo surgical robot maintains sub-300ms latency through dual-console handover. These systems embody principles that learned policies must inherit: constrained transitions prevent mode confusion, commitment bounds enable predictable execution, explicit delegation provides accountability.
Chapter 4 (Works) surveys real-world RL deployments. Across sepsis treatment, surgical robotics, and autonomous driving, a consistent pattern emerges: policies overfit to training conditions, cannot explain their decisions, cannot guarantee bounded execution, and fail under the computational constraints of deployment hardware. These four gaps organise the contributions that follow.
Chapter 5 (Effects) introduces policy graphs, a formalism distilling real-world architectural patterns into reinforcement learning. A directed graph \(G=(V,E)\) defines callable policy units with hard routing—exactly one unit active at any moment—providing accountability (call traces identify responsible units), conditional computation (only active unit incurs cost), and distributed execution (units map to heterogeneous hardware). System 1 impulses execute on low-power edge devices near actuators; System 2 reasoning runs on remote GPU clusters. Commitment bounds \((k_{\min}, k_{\max})\) prevent unstable switching whilst ensuring progress. The chapter then studies two construction routes: a saliency-guided synthesis path that derives specialists from a teacher policy in controlled MiniGrid settings, and a hard-routing study over fixed specialists in deployment-motivated environments such as BrowserEnv, ViZDoom, and Procgen.
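The routing semantics can be sketched minimally: exactly one unit is active, switching is permitted only along graph edges, a per-unit commitment window \((k_{\min}, k_{\max})\) forces a minimum dwell time before re-routing and a maximum before the unit must yield, and the call trace records which unit was responsible at each handover. The class and unit names below are illustrative, not the thesis's actual API.

```python
# Minimal sketch of a policy graph with hard routing and commitment
# bounds; names are illustrative, not the thesis's actual API.
class PolicyGraph:
    def __init__(self, edges, bounds, start):
        self.edges = edges           # unit -> set of successor units
        self.bounds = bounds         # unit -> (k_min, k_max)
        self.active = start          # exactly one active unit at any moment
        self.steps_in_unit = 0
        self.trace = [start]         # call trace for accountability

    def step(self, propose_switch):
        """Advance one step; propose_switch maps the active unit to a
        proposed successor (or None to stay)."""
        self.steps_in_unit += 1
        k_min, k_max = self.bounds[self.active]
        target = propose_switch(self.active)
        must_switch = self.steps_in_unit >= k_max   # upper commitment bound
        may_switch = self.steps_in_unit >= k_min    # lower commitment bound
        if must_switch or (may_switch and target is not None):
            successors = self.edges[self.active]
            nxt = target if target in successors else next(iter(successors))
            self.active = nxt
            self.steps_in_unit = 0
            self.trace.append(nxt)
        return self.active
```

With `bounds={"navigate": (2, 5), ...}`, a switch proposed on the first step is refused (the lower bound holds the unit active), while any unit still active after its upper bound is forced to hand over; the resulting `trace` is the accountability record.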
Chapter 6 (Generalisations) addresses benchmark scarcity. Traditional RL benchmarks comprise dozens of manually designed tasks; distinguishing generalisation from memorisation requires diverse environment families. EnvCraft generates thousands of validated Gymnasium environments from natural-language concepts through a multi-stage pipeline combining code-generation LLMs (a lightweight model for brief generation, a larger model for implementation), automated testing, and agent-based validation. Cross-validation experiments on procedurally generated Tetris variants provide within-family evidence that training diversity can improve performance on held-out variants.
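The automated-testing stage amounts to checking that generated code honours the Gymnasium contract: `reset` returns an observation with an info dict, and `step` returns the five-tuple of observation, reward, terminated, truncated, and info. The sketch below is a hedged illustration of such a conformance check, not EnvCraft's actual validator; `check_env`, `sample_action`, and the toy environment are all assumed names.

```python
# Hedged sketch of an API-conformance check for generated environments.
# `check_env` and `CountdownEnv` are illustrative, not EnvCraft's code.
def check_env(env, episodes=2, max_steps=50):
    for _ in range(episodes):
        obs, info = env.reset()              # reset -> (obs, info)
        assert isinstance(info, dict)
        for _ in range(max_steps):
            result = env.step(env.sample_action())
            assert len(result) == 5          # obs, reward, terminated, truncated, info
            obs, reward, terminated, truncated, info = result
            assert isinstance(reward, (int, float))
            assert isinstance(terminated, bool) and isinstance(truncated, bool)
            if terminated or truncated:
                break
    return True

class CountdownEnv:
    """Toy conforming environment: episodes end after five steps."""
    def reset(self):
        self.t = 0
        return self.t, {}
    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 5, False, {}
    def sample_action(self):
        return 0
```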
Chapter 7 (Models) realises an edge-oriented split-policy deployment path. MiniConv provides compact convolutional encoders that compile to OpenGL fragment shaders for broad embedded GPU support. A split-policy architecture places lightweight encoders on-device (Raspberry Pi Zero 2 W, NVIDIA Jetson Nano), extracting compact features transmitted to remote policy heads. This reduces decision latency in bandwidth-limited settings and lowers server-side compute per request whilst remaining competitive with the Stable-Baselines3 Full-CNN baseline in the reported fixed-seed pixel-observation experiments.
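The bandwidth argument can be made concrete with illustrative numbers (an 84x84 RGB observation and a 256-dimensional float32 feature vector; neither figure is taken from the thesis): transmitting encoder features instead of raw pixels shrinks the per-step payload by roughly twenty-fold.

```python
# Per-step payload: raw pixels vs. on-device encoder features.
# Observation and feature sizes are illustrative assumptions.
raw_bytes = 84 * 84 * 3      # uint8 RGB frame: 21,168 bytes
feature_bytes = 256 * 4      # 256 float32 features: 1,024 bytes
reduction = raw_bytes / feature_bytes
print(f"{raw_bytes} -> {feature_bytes} bytes per step ({reduction:.1f}x smaller)")
# prints "21168 -> 1024 bytes per step (20.7x smaller)"
```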
Chapter 8 (Systems) extends the thesis to network-aware distributed execution. CALF treats environments and policy units as networked services, injecting latency, jitter, and packet loss during training. Without network-aware training, a CartPole policy loses over 80% of its return under degraded Wi-Fi; the same policy trained under CALF degrades by only 21%, a roughly four-fold reduction in the sim-to-real gap. Small hierarchical deployments then illustrate how time-critical units can remain local whilst higher-level coordination runs remotely.
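The injection idea can be sketched as an environment wrapper (a minimal illustration under assumed parameters, not CALF's implementation): observations are buffered in flight and released only after a sampled delay, jitter is drawn per message, and dropped messages leave the agent acting on the last observation it received.

```python
import random
from collections import deque

class NetworkDegradationWrapper:
    """Delay and drop observations to mimic a lossy link.
    A minimal sketch; not CALF's actual implementation."""
    def __init__(self, env, base_delay=2, jitter=1, loss_prob=0.1, seed=0):
        self.env = env
        self.base_delay = base_delay   # latency, in env steps
        self.jitter = jitter           # +/- steps of per-message jitter
        self.loss_prob = loss_prob     # probability a message is dropped
        self.rng = random.Random(seed)
        self.inflight = deque()        # (deliver_at_step, obs)
        self.t = 0
        self.last_obs = None

    def reset(self):
        obs, info = self.env.reset()
        self.inflight.clear()
        self.t = 0
        self.last_obs = obs            # first observation arrives intact
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.t += 1
        if self.rng.random() >= self.loss_prob:      # message survives the link
            delay = self.base_delay + self.rng.randint(-self.jitter, self.jitter)
            self.inflight.append((self.t + max(delay, 0), obs))
        while self.inflight and self.inflight[0][0] <= self.t:
            _, self.last_obs = self.inflight.popleft()   # deliver due messages
        return self.last_obs, reward, terminated, truncated, info
```

Training under such a wrapper exposes the policy to stale observations, which is the mechanism behind the reported gap between the naively trained and network-aware CartPole policies.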
Chapter 9 (Realisations) sketches an initial physical-device pathway: a USB-C device built around a Raspberry Pi Zero 2 W that captures DisplayPort video from a host computer, runs MiniConv inference locally, and returns HID actions over the same connection—placing a trained policy graph inside an unmodified host machine's input chain.
Chapter 10 (Endings) synthesises the contributions and returns the thesis argument to its origins. Smith's pin-factory productivity claim is revisited alongside Diderot's correction, and Plato's cave is extended to network delay: agents cannot escape incomplete observations, but network-aware training teaches them to navigate the shadows they perceive.
Taken together, the chapters argue that real-world deployment of reinforcement learning requires more than algorithmic performance on benchmarks. It demands operational semantics that provide interpretability and bounded execution, training infrastructure that tests generalisation beyond narrow distributions, and system architectures that distribute computation across heterogeneous hardware whilst maintaining accountability under communication constraints. By grounding modular RL in the architectural patterns of engineered systems—from pin factories to flight computers—this thesis offers a principled pathway from simulation to deployment.
A Note on Reproducibility
This thesis prioritises replicable work. Non-replicable publications attract more citations than reproducible ones, likely because reviewers “apply lower standards regarding reproducibility” for “more interesting” findings; this work does not aspire to that trade-off. Original contributions provide code at publication or shortly after. Where practicable, a QR symbol indicates browser-based reproducibility support: the associated QR code links to an executable artefact or live validation page for the claim in question. Server capacity is committed for one year post-publication, with some services potentially remaining available for longer.