Introduction
Reinforcement learning offers a route to autonomous decision-making through environmental interaction. Yet the path from simulated Atari games to real-world deployment confronts fundamental obstacles: policies overfit to narrow training distributions, learned behaviours lack interpretability, communication latency undermines reactive control, and edge devices impose severe computational constraints. This thesis addresses these challenges by asking a concrete design question: what happens if reinforcement learning borrows its architecture from real systems that already operate reliably under constraint—the A320's flight computers, the French power grid's layered control, and, at a still more abstract level, the division of labour in pin factories?
Chapter 2 (Principles) traces automation from first principles. Adam Smith observed that dividing pin-making into eighteen operations enabled ten workers to produce 48,000 pins daily—a 240-fold improvement over craftwork. Dopamine neuroscience reveals how phasic spikes encode reward prediction error, the brain's mechanism for reinforcing successful actions and chunking them into reusable routines. These threads converge: specialisation improves productivity, reward signals drive learning, and modular organisation enables both.
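Smith's figures are easy to verify: ten specialised workers producing 48,000 pins a day average 4,800 pins each, against the roughly twenty pins Smith estimated an unassisted worker could manage. A quick arithmetic check (the 20-pin craftwork baseline is Smith's own estimate):

```python
# Sanity check of Adam Smith's pin-factory arithmetic.
workers = 10
daily_output = 48_000          # pins produced by the specialised workshop
craftwork_per_worker = 20      # Smith's estimate for an unassisted worker

per_worker = daily_output / workers              # 4,800 pins each
improvement = per_worker / craftwork_per_worker  # 240-fold

print(f"per worker: {per_worker:.0f}, improvement: {improvement:.0f}x")
# prints "per worker: 4800, improvement: 240x"
```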
Chapter 3 (Lessons) examines how engineered systems achieve reliability through redundancy, sensor fusion, and failsafes. The A320 distributes responsibility across dedicated computers (ELACs for pitch and roll, SECs for spoilers and backup, FCGCs for autopilot); the power grid coordinates IEDs at substations with SCADA at national scale; the Kangduo surgical robot maintains sub-300ms latency through dual-console handover. These systems embody principles that learned policies must inherit: constrained transitions prevent mode confusion, commitment bounds enable predictable execution, explicit delegation provides accountability.
Chapter 4 (Works) surveys real-world RL deployments. Across sepsis treatment, surgical robotics, and autonomous driving, a consistent pattern emerges: policies overfit to training conditions, cannot explain their decisions, cannot guarantee bounded execution, and fail under the computational constraints of deployment hardware. These four gaps organise the contributions that follow.
Chapter 5 (Effects) introduces policy graphs, a formalism distilling real-world architectural patterns into reinforcement learning. A directed graph \(G=(V,E)\) defines callable policy units with hard routing—exactly one unit active at any moment—providing accountability (call traces identify responsible units), conditional computation (only active unit incurs cost), and distributed execution (units map to heterogeneous hardware). System 1 impulses execute on low-power edge devices near actuators; System 2 reasoning runs on remote GPU clusters. Commitment bounds \((k_{\min}, k_{\max})\) prevent unstable switching whilst ensuring progress. The chapter then studies two construction routes: a saliency-guided synthesis path that derives specialists from a teacher policy in controlled MiniGrid settings, and a hard-routing study over fixed specialists in deployment-motivated environments such as BrowserEnv, ViZDoom, and Procgen.
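The routing semantics can be sketched minimally: exactly one unit is active, switching is permitted only along graph edges, a per-unit commitment window \((k_{\min}, k_{\max})\) forces a minimum dwell time before re-routing and a maximum before the unit must yield, and the call trace records which unit was responsible at each handover. The class and unit names below are illustrative, not the thesis's actual API.

```python
# Minimal sketch of a policy graph with hard routing and commitment
# bounds; names are illustrative, not the thesis's actual API.
class PolicyGraph:
    def __init__(self, edges, bounds, start):
        self.edges = edges           # unit -> set of successor units
        self.bounds = bounds         # unit -> (k_min, k_max)
        self.active = start          # exactly one active unit at any moment
        self.steps_in_unit = 0
        self.trace = [start]         # call trace for accountability

    def step(self, propose_switch):
        """Advance one step; propose_switch maps the active unit to a
        proposed successor (or None to stay)."""
        self.steps_in_unit += 1
        k_min, k_max = self.bounds[self.active]
        target = propose_switch(self.active)
        must_switch = self.steps_in_unit >= k_max   # upper commitment bound
        may_switch = self.steps_in_unit >= k_min    # lower commitment bound
        if must_switch or (may_switch and target is not None):
            successors = self.edges[self.active]
            nxt = target if target in successors else next(iter(successors))
            self.active = nxt
            self.steps_in_unit = 0
            self.trace.append(nxt)
        return self.active
```

With `bounds={"navigate": (2, 5), ...}`, a switch proposed on the first step is refused (the lower bound holds the unit active), while any unit still active after its upper bound is forced to hand over; the resulting `trace` is the accountability record.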
Chapter 6 (Generalisations) addresses benchmark scarcity. Traditional RL benchmarks comprise dozens of manually designed tasks; distinguishing generalisation from memorisation requires diverse environment families. EnvCraft generates thousands of validated Gymnasium environments from natural-language concepts through a multi-stage pipeline combining code-generation LLMs (a lightweight model for brief generation, a larger model for implementation), automated testing, and agent-based validation. Cross-validation experiments on procedurally generated Tetris variants provide within-family evidence that training diversity can improve performance on held-out variants.
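The automated-testing stage amounts to checking that generated code honours the Gymnasium contract: `reset` returns an observation with an info dict, and `step` returns the five-tuple of observation, reward, terminated, truncated, and info. The sketch below is a hedged illustration of such a conformance check, not EnvCraft's actual validator; `check_env`, `sample_action`, and the toy environment are all assumed names.

```python
# Hedged sketch of an API-conformance check for generated environments.
# `check_env` and `CountdownEnv` are illustrative, not EnvCraft's code.
def check_env(env, episodes=2, max_steps=50):
    for _ in range(episodes):
        obs, info = env.reset()              # reset -> (obs, info)
        assert isinstance(info, dict)
        for _ in range(max_steps):
            result = env.step(env.sample_action())
            assert len(result) == 5          # obs, reward, terminated, truncated, info
            obs, reward, terminated, truncated, info = result
            assert isinstance(reward, (int, float))
            assert isinstance(terminated, bool) and isinstance(truncated, bool)
            if terminated or truncated:
                break
    return True

class CountdownEnv:
    """Toy conforming environment: episodes end after five steps."""
    def reset(self):
        self.t = 0
        return self.t, {}
    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 5, False, {}
    def sample_action(self):
        return 0
```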
Chapter 7 (Models) realises an edge-oriented split-policy deployment path. MiniConv provides compact convolutional encoders that compile to OpenGL fragment shaders for broad embedded GPU support. A split-policy architecture places lightweight encoders on-device (Raspberry Pi Zero 2 W, NVIDIA Jetson Nano), extracting compact features transmitted to remote policy heads. This reduces decision latency in bandwidth-limited settings and lowers server-side compute per request whilst remaining competitive with the Stable-Baselines3 Full-CNN baseline in the reported fixed-seed pixel-observation experiments.
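The bandwidth argument can be made concrete with illustrative numbers (an 84x84 RGB observation and a 256-dimensional float32 feature vector; neither figure is taken from the thesis): transmitting encoder features instead of raw pixels shrinks the per-step payload by roughly twenty-fold.

```python
# Per-step payload: raw pixels vs. on-device encoder features.
# Observation and feature sizes are illustrative assumptions.
raw_bytes = 84 * 84 * 3      # uint8 RGB frame: 21,168 bytes
feature_bytes = 256 * 4      # 256 float32 features: 1,024 bytes
reduction = raw_bytes / feature_bytes
print(f"{raw_bytes} -> {feature_bytes} bytes per step ({reduction:.1f}x smaller)")
# prints "21168 -> 1024 bytes per step (20.7x smaller)"
```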
Chapter 8 (Systems) extends the thesis to network-aware distributed execution. CALF treats environments and policy units as networked services, injecting latency, jitter, and packet loss during training. Without network-aware training, a CartPole policy loses over 80% of its return under degraded Wi-Fi; the same policy trained under CALF degrades by only 21%, a roughly four-fold reduction in the sim-to-real gap. Small hierarchical deployments then illustrate how time-critical units can remain local whilst higher-level coordination runs remotely.
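The injection idea can be sketched as an environment wrapper (a minimal illustration under assumed parameters, not CALF's implementation): observations are buffered in flight and released only after a sampled delay, jitter is drawn per message, and dropped messages leave the agent acting on the last observation it received.

```python
import random
from collections import deque

class NetworkDegradationWrapper:
    """Delay and drop observations to mimic a lossy link.
    A minimal sketch; not CALF's actual implementation."""
    def __init__(self, env, base_delay=2, jitter=1, loss_prob=0.1, seed=0):
        self.env = env
        self.base_delay = base_delay   # latency, in env steps
        self.jitter = jitter           # +/- steps of per-message jitter
        self.loss_prob = loss_prob     # probability a message is dropped
        self.rng = random.Random(seed)
        self.inflight = deque()        # (deliver_at_step, obs)
        self.t = 0
        self.last_obs = None

    def reset(self):
        obs, info = self.env.reset()
        self.inflight.clear()
        self.t = 0
        self.last_obs = obs            # first observation arrives intact
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.t += 1
        if self.rng.random() >= self.loss_prob:      # message survives the link
            delay = self.base_delay + self.rng.randint(-self.jitter, self.jitter)
            self.inflight.append((self.t + max(delay, 0), obs))
        while self.inflight and self.inflight[0][0] <= self.t:
            _, self.last_obs = self.inflight.popleft()   # deliver due messages
        return self.last_obs, reward, terminated, truncated, info
```

Training under such a wrapper exposes the policy to stale observations, which is the mechanism behind the reported gap between the naively trained and network-aware CartPole policies.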
Chapter 9 (Realisations) sketches an initial physical-device pathway: a USB-C device built around a Raspberry Pi Zero 2 W that captures DisplayPort video from a host computer, runs MiniConv inference locally, and returns HID actions over the same connection—placing a trained policy graph inside an unmodified host machine's input chain.
Chapter 10 (Endings) synthesises the contributions and returns the thesis argument to its origins. Smith's pin-factory productivity claim is revisited alongside Diderot's correction, and Plato's cave is extended to network delay: agents cannot escape incomplete observations, but network-aware training teaches them to navigate the shadows they perceive.
Taken together, the chapters argue that real-world deployment of reinforcement learning requires more than algorithmic performance on benchmarks. It demands operational semantics that provide interpretability and bounded execution, training infrastructure that tests generalisation beyond narrow distributions, and system architectures that distribute computation across heterogeneous hardware whilst maintaining accountability under communication constraints. By grounding modular RL in the architectural patterns of engineered systems—from pin factories to flight computers—this thesis offers a principled pathway from simulation to deployment.
A Note on Reproducibility
This thesis prioritises replicable work. Non-replicable publications attract more citations than reproducible ones, likely because reviewers “apply lower standards regarding reproducibility” for “more interesting” findings; this work does not aspire to that trade-off. Original contributions provide code at publication or shortly after. Where practicable, a QR symbol indicates browser-based reproducibility support: the associated QR code links to an executable artefact or live validation page for the claim in question. Server capacity is committed for one year post-publication, with some services potentially remaining available for longer.