Chapter 10

Endings



10.1 Ending

10.1.1 Synthesis of Contributions

The central contribution of this thesis is the development and initial empirical study of policy graphs, a modular reinforcement learning framework that decomposes complex control tasks into specialised units organised in directed graph structures with call-and-return semantics. Introduced in Chapter 5, policy graphs embody the division-of-labour principle at multiple levels: individual units specialise in particular environmental regimes or behavioural patterns, whilst the graph structure coordinates their deployment through learned routing decisions. In the settings studied here, this architecture offers a practical way to address a recurring limitation of monolithic policies—their brittleness when faced with diverse operational conditions—by enabling different specialists to handle different contexts, much as Smith's pin-makers each mastered a single operation rather than attempting to produce entire pins alone.

Chapter 5 develops the policy-graph formalism and provides two complementary empirical construction routes. The first is a controlled teacher-guided synthesis study, in which action-conditioned saliency traces from a competent teacher are clustered into candidate behavioural regimes and distilled into specialist units plus a router. The second is a hard-routing study with commitment horizons and anti-collapse penalties over a fixed pool of specialists. The chapter also introduces deployment-motivated proxy environments such as BrowserEnv and FilesEnv. Taken together, these elements establish policy graphs not merely as a theoretical construct but as a practical framework with both a controlled synthesis result and a concrete hard-routing evaluation across ViZDoom, Procgen, and BrowserEnv.
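The call-and-return control flow with a commitment horizon can be illustrated with a minimal sketch. Everything here is hypothetical scaffolding, not the thesis implementation: the class names, the placeholder specialist behaviour, and the toy router are stand-ins chosen only to make the mechanism concrete.

```python
class Specialist:
    """A unit policy; act() returns an action for the current observation."""
    def __init__(self, name):
        self.name = name

    def act(self, obs):
        # Placeholder behaviour; a real unit would be a trained policy.
        return hash((self.name, obs)) % 4

class PolicyGraph:
    """Minimal call-and-return controller: the router selects a specialist,
    control is committed to it for `horizon` steps, then control returns
    to the router for the next routing decision."""
    def __init__(self, specialists, router, horizon=8):
        self.specialists = specialists
        self.router = router          # maps an observation to a specialist index
        self.horizon = horizon
        self._active = None
        self._steps_left = 0

    def act(self, obs):
        if self._steps_left == 0:     # "call": route to a new specialist
            self._active = self.specialists[self.router(obs)]
            self._steps_left = self.horizon
        self._steps_left -= 1         # "return" occurs when the commitment expires
        return self._active.act(obs)

specialists = [Specialist("navigate"), Specialist("manipulate")]
graph = PolicyGraph(specialists, router=lambda obs: obs % 2, horizon=3)
actions = [graph.act(t) for t in range(6)]   # router invoked at t=0 and t=3
```

The commitment horizon is what prevents unstable rapid switching between units: between routing points, the active specialist has uninterrupted control.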

The motivation for these technical contributions emerged from the deployment challenges surveyed in Chapters 3 and 4. Chapter 3 analysed real-world systems—the Airbus A320's redundant flight computers, the French power grid's three-tier control hierarchy, the Kangduo surgical robot's handover protocols—identifying recurring patterns: modular decomposition with explicit delegation, commitment to phases that prevent unstable switching, and accountability through inspectable control flows. These architectural principles, refined over decades in safety-critical domains, directly informed the design of policy graphs with their call-and-return semantics and commitment bounds. Chapter 4 then established the empirical necessity of these features through case studies of RL deployments: sepsis treatment policies requiring interpretable decision traces, telesurgery systems demanding predictable latency, autonomous vehicles needing safety guarantees. Together, these chapters transformed the division-of-labour principle from philosophical abstraction into concrete system requirements that the subsequent technical chapters operationalise.

However, modular policies alone are insufficient for real-world deployment. Chapter 6 addressed the challenge of generalisation by introducing EnvCraft, a validation-first system that generates diverse Gymnasium environments from natural-language specifications using large language models. The pipeline produced 9,694 validated environments from 20,000 initial concepts (a 48.5% overall yield), and the empirical study showed a 7.4% mean improvement on held-out Tetris variants in one representative split, with 68.7% of held-out environments showing gains overall. These results provide useful within-family evidence that training diversity matters, whilst also showing that broader cross-domain generalisation remains to be established.
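The validation-first pattern can be caricatured as a generate-then-filter loop. The sketch below is purely illustrative: the concept format, the difficulty knob, and `privileged_agent_solves` are stand-ins for EnvCraft's actual LLM generation and privileged-agent checks, chosen so the filtering logic is runnable.

```python
import random

def generate_concepts(n, seed=0):
    """Stand-in for LLM generation: each concept carries a difficulty knob."""
    rng = random.Random(seed)
    return [{"id": i, "difficulty": rng.random()} for i in range(n)]

def privileged_agent_solves(concept):
    """Stand-in for the solvability check: a privileged agent attempts the
    environment; here we simply accept concepts below a difficulty threshold."""
    return concept["difficulty"] < 0.5

def validate(concepts):
    """Validation-first filter: keep only concepts the privileged agent solves."""
    return [c for c in concepts if privileged_agent_solves(c)]

concepts = generate_concepts(20_000)
corpus = validate(concepts)
yield_rate = len(corpus) / len(concepts)   # roughly half survive in this toy model
```

The point of the pattern is that quality control happens before training ever sees an environment, so corpus size can scale without admitting unsolvable or incoherent tasks.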

Real-world deployment imposes strict computational constraints, particularly on edge devices with limited processing power. Chapter 7 tackled this challenge through MiniConv, a library of compact convolutional encoders implemented as OpenGL fragment shaders. By executing visual encoding directly on GPU hardware using programmable graphics pipelines, MiniConv achieves efficient inference on resource-constrained devices whilst remaining competitive with the chapter's Stable-Baselines3 Full-CNN baseline in the reported fixed-seed evaluations. The split-policy architecture—where lightweight encoders run on-device whilst policy heads execute remotely—demonstrates how division of labour extends beyond algorithmic decomposition to encompass strategic distribution of computational workload across heterogeneous hardware.
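The split-policy idea can be shown in miniature. The real MiniConv encoders are OpenGL fragment shaders; this pure-Python stand-in (the block-averaging "encoder", the toy policy head, and all names are assumptions of the sketch) only illustrates the division of workload: a compact feature vector, not the raw frame, is what crosses the network.

```python
def edge_encode(frame):
    """Stand-in for an on-device encoder: downsample an H x W frame to
    coarse block averages, yielding a short feature vector."""
    h, w, block = len(frame), len(frame[0]), 4
    feats = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            cells = [frame[y][x] for y in range(i, i + block)
                                 for x in range(j, j + block)]
            feats.append(sum(cells) / len(cells))
    return feats

def remote_policy_head(feats):
    """Stand-in for the server-side policy head: act on the features alone."""
    return 0 if sum(feats) / len(feats) < 0.5 else 1

frame = [[(x + y) % 2 for x in range(8)] for y in range(8)]  # 8x8 toy frame
feats = edge_encode(frame)           # 4 numbers cross the network, not 64 pixels
action = remote_policy_head(feats)
```

The bandwidth asymmetry is the design point: the device does the cheap, local, data-heavy step, and the remote head does the decision-making on a small payload.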

One of the most significant barriers to deploying distributed policies is network communication. Chapter 8 introduced the Communication-Aware Learning Framework (CALF), which treats network conditions—latency, jitter, packet loss—as a distinct axis of the sim-to-real gap. Policies trained under idealised assumptions of instantaneous communication suffer severe performance degradation when deployed over realistic networks: in CartPole, the baseline policy's return fell by just over 80% under the degraded Wi-Fi setting. CALF addresses this through network-aware training that exposes policies to realistic communication dynamics during learning. This work shows that robust distributed deployment requires systems thinking: network conditions are not peripheral implementation details but environmental properties that must be accounted for during training.
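The core idea of treating the network as part of the environment can be sketched as a lossy, delayed observation channel. This is an illustrative toy, not CALF's implementation: the delay model (a step-count latency with integer jitter), the loss probability, and the stale-observation fallback are all assumptions of the sketch.

```python
import random

class DelayedChannel:
    """Toy CALF-style channel: each observation arrives after a sampled
    latency (measured in control steps) and may be dropped entirely, in
    which case the agent reuses its last received, now-stale observation."""
    def __init__(self, mean_delay=2, jitter=1, loss_prob=0.1, seed=0):
        self.rng = random.Random(seed)
        self.mean_delay, self.jitter, self.loss_prob = mean_delay, jitter, loss_prob
        self.in_flight = []        # list of (arrival_step, obs) pairs
        self.last_obs = None
        self.step_count = 0

    def send(self, obs):
        if self.rng.random() < self.loss_prob:
            return                 # packet lost; nothing enters the channel
        delay = max(0, self.mean_delay + self.rng.randint(-self.jitter, self.jitter))
        self.in_flight.append((self.step_count + delay, obs))

    def receive(self):
        self.step_count += 1
        arrived = [o for (t, o) in self.in_flight if t <= self.step_count]
        self.in_flight = [(t, o) for (t, o) in self.in_flight if t > self.step_count]
        if arrived:
            self.last_obs = arrived[-1]
        return self.last_obs       # stale if nothing arrived this step

channel = DelayedChannel()
for t in range(10):
    channel.send(t)               # environment emits observation t
    obs = channel.receive()       # agent sees a delayed, possibly stale copy
```

Wrapping training environments in a channel like this is what exposes the policy, during learning, to the temporal misalignment it will face at deployment.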

Finally, Chapter 9 sketches an initial physical hardware path for these ideas through a USB-C-connected device for distributed policy execution across computer peripherals. In its current form, that chapter functions as an early systems realisation and prototype path rather than as a completed empirical endpoint.

10.1.2 Lessons Learned

The trajectory of this research has yielded several insights that extend beyond the specific technical contributions, offering broader lessons for the field of real-world reinforcement learning.

Real-world RL requires systems thinking, not merely better algorithms. The performance collapse observed for baseline CartPole policies under degraded Wi-Fi occurred not because the learning algorithm was inadequate but because the training environment failed to model a critical aspect of the deployment context: communication dynamics. Similarly, MiniConv's deployment on $15 devices succeeded not through novel network architectures but by recognising that GPU shader pipelines provide the appropriate computational primitive for edge inference. Effective real-world RL demands a holistic perspective that treats hardware capabilities, network infrastructure, and environmental variability as integral components of the learning problem rather than as external implementation concerns.

Network conditions constitute a critical axis of the sim-to-real gap. CALF experiments reveal that communication latency and jitter introduce a distinct and substantial form of distributional shift: policies learn control strategies predicated on instantaneous observation-action-state feedback loops, and when observations arrive 80 ms late and actions apply to outdated state estimates, even simple control tasks become unmanageable. This gap cannot be closed through domain randomisation of visual or physical properties alone—it requires explicit modelling of temporal dynamics and communication delays during training.

Validation-first approaches to environment generation are tractable and effective. The success of EnvCraft demonstrates that automatically generated training environments need not sacrifice quality for quantity: by incorporating privileged agents to verify solvability and semantic coherence, the system retained 9,694 valid environments from 20,000 generated concepts whilst maintaining sufficient diversity to yield positive transfer on held-out Tetris variants. Quality assurance is not antithetical to automation—it is essential to it.

Modular architectures improve diagnosability and support incremental refinement. Policy graphs tend to exhibit failure modes—over-reliance on particular specialists, ineffective routing—that admit clearer diagnosis through routing pattern analysis than the unpredictable out-of-distribution outputs of monolithic policies. The modular structure also supports incremental improvement: poorly performing specialists can be replaced without retraining the entire system, and new specialists can be added to handle previously unseen regimes.
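The kind of routing-pattern analysis this diagnosability rests on can be sketched in a few lines. The report format, the collapse heuristic (share of the dominant specialist), and the toy log are all hypothetical; the point is only that a router's decisions form an inspectable trace in a way a monolithic policy's activations do not.

```python
from collections import Counter

def routing_report(route_log, n_specialists):
    """Summarise which specialists the router selected; a heavily skewed
    distribution is a simple flag for routing collapse onto one unit."""
    counts = Counter(route_log)
    total = len(route_log)
    shares = {i: counts.get(i, 0) / total for i in range(n_specialists)}
    dominant = max(shares, key=shares.get)
    return shares, dominant

# Hypothetical log of router decisions over an episode.
log = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
shares, dominant = routing_report(log, n_specialists=3)   # specialist 0 dominates
```

A skewed report like this one localises the fault: either specialist 0 genuinely covers most regimes, or the router has collapsed, and either hypothesis can be tested by inspecting that one unit rather than the whole system.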

Edge deployment is achievable with strategic architectural choices. MiniConv demonstrates that the perceived computational requirements of image-based policies are not fundamental but rather a consequence of architectural decisions: implementing visual encoding as GPU shader programs enabled deployment on $15 devices. The broader lesson is that perceived hardware constraints often reflect a mismatch between algorithmic design and available computational primitives rather than absolute capability limitations.

These lessons collectively point towards a maturation of reinforcement learning as an engineering discipline. Early RL research appropriately focused on establishing theoretical foundations and demonstrating feasibility on benchmark tasks. As the field progresses towards real-world impact, success increasingly depends on integrating insights from distributed systems, computer architecture, and domain-specific deployment constraints—moving beyond the boundaries of machine learning proper to embrace the full complexity of building systems that work outside the laboratory.

10.1.3 Returning to First Principles

In Chapter 2, the intellectual lineage of this work was traced from Adam Smith's observation that dividing pin-making into eighteen distinct operations enabled ten workers to produce 48,000 pins per day—a feat impossible through individual effort. Smith identified division of labour as “the greatest improvement in the productive powers of labour”, a principle that has since transformed manufacturing, software engineering, and now, as this thesis demonstrates, artificial intelligence.

Yet Smith's account of pin-making, as discussed, was likely embellished. Denis Diderot's Encyclopédie suggests the actual number of distinct operations was closer to six than eighteen, and the productivity gains, whilst real, may have been exaggerated for rhetorical effect. This historical footnote offers its own lesson for contemporary AI research: the specific mechanisms matter less than the underlying principle. Whether pins require six steps or eighteen, the insight that specialisation yields efficiency remains valid. Similarly, whether policy graphs comprise five units or fifty, structured in hierarchies or flat graphs, the central hypothesis remains the same: modular specialists coordinated through learned routing can offer operational and, in some settings, performance advantages over monolithic alternatives. The contribution here is not a single fixed architecture but a framework that instantiates the division-of-labour principle in reinforcement learning.

The philosophical trajectory traced in Chapter 2—from Epicurus through Bentham to the neuroscience of dopamine as reward signal—finds its computational endpoint in the temporal-difference learning and policy gradient methods on which these contributions build.

Kahneman explicitly describes the relationship between fast intuitive processing and slow deliberative reasoning as a “division of labor”. Policy graphs embody this same principle: routing decisions execute rapidly using learned heuristics (analogous to System 1), delegating to specialists that engage in deeper, context-specific computation (analogous to System 2). The hierarchical structure mirrors the human cognitive architecture, suggesting that effective intelligence—whether biological or artificial—may inherently require modular organisation with coordination mechanisms.

The history of automation, from Daedalus's mythical living statues to Leonardo Torres Quevedo's El Ajedrecista, reveals a persistent fascination with machines that exhibit apparent autonomy and decision-making capacity. Torres's 1914 essay insisted that automata should possess discernment—the ability to “weigh the circumstances surrounding them in determining their actions”. This vision, articulated decades before digital computers existed, anticipated the core challenge of reinforcement learning: how can machines select actions by evaluating context rather than blindly executing predetermined sequences? Policy graphs provide one answer: by learning both specialist behaviours and the routing logic that selects among them based on observed state, they approach the discernment Torres envisioned. The hardware device outlined in Chapter 9 points towards a modern realisation of that ambition, faithful to the same principle of context-dependent automated decision-making.

Finally, returning to the metaphor of Plato's cave, invoked in Chapter 2 to illustrate the relationship between environments and reality: the prisoners see only shadows cast on the wall, developing beliefs about the world based on incomplete projections of the truth. Similarly, RL environments present agents with state representations that are projections of underlying reality: a drone navigates using camera images that capture only partial information about the forest, a BrowserEnv policy interacts with web pages through DOM observations that omit server-side state. The Markov assumption—“if it looks the same, it is the same”—formalises this constraint: the state must contain sufficient information for optimal action selection, even if it does not represent complete ground truth.
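The assumption the passage paraphrases is the standard Markov property: conditioned on the current state and action, the history carries no further information about the next state.

```latex
P(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0)
  = P(s_{t+1} = s' \mid s_t, a_t)
```

When observations are mere shadows of the underlying state, this equality fails for the observation process itself, which is precisely why partial observability complicates the picture painted above.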

Network-aware training extends this metaphor in a subtle but important way. Conventional RL assumes that state transitions occur instantaneously in response to actions, maintaining the temporal coherence of the observation-action-reward loop. Real-world deployment breaks this assumption: observations arrive delayed, actions execute on outdated state estimates, and the agent effectively operates in a temporally distorted shadow-world. CALF addresses this by training policies to operate within these distorted shadows, learning control strategies robust to temporal misalignment. We cannot escape Plato's cave, but we can train agents to navigate it effectively by acknowledging the nature of the shadows they perceive.

From pins to policies, from dopamine to deployment, from philosophical first principles to engineered systems—this thesis has traversed a vast intellectual landscape. Yet the through-line remains constant: understanding how complex tasks can be decomposed into manageable components, how those components can be coordinated effectively, and how systems built on these principles can operate robustly in the unpredictable real world. Division of labour, whether applied to 18th-century manufacturing or 21st-century artificial intelligence, remains the greatest improvement in productive powers.

10.1.4 Future Work

The framework established in this thesis opens several promising directions for future research.

Scaling to larger, deeper policy graphs. The policy graphs demonstrated here comprised relatively shallow structures with modest numbers of specialist units. Exploring deeper hierarchies—where specialists themselves decompose into sub-specialists, forming tree or DAG structures of arbitrary depth—represents a natural extension that might enable more sophisticated abstractions. The primary challenge is credit assignment across multiple levels of delegation; recent work on feudal reinforcement learning and hierarchical actor-critic methods provides potential starting points, though adapting these to the policy graph formalism requires careful investigation.

Multi-hop network deployments and wide-area distribution. CALF demonstrated network-aware training for single-hop communication between edge devices and servers. Real-world deployments often involve multi-hop routing through heterogeneous network infrastructures spanning local area, wide area, and cellular connections. Extending CALF to model these more complex topologies—including congestion dynamics and quality-of-service constraints—would bring network-aware training closer to the conditions encountered in production robotics and distributed sensor systems.

Larger-scale EnvCraft corpus and cross-domain generalisation. The EnvCraft corpus of 9,694 validated environments demonstrates the viability of LLM-based environment generation. Scaling this by an order of magnitude would enable more ambitious generalisation experiments: could policies trained on 100,000 diverse environments exhibit zero-shot transfer to entirely new task families? Cross-domain generalisation—spanning multiple physics engines, visual styles, and control modalities—would test whether policies can learn genuinely abstract principles of control, requiring advances in meta-learning to prevent catastrophic forgetting as the training distribution expands.

Hardware device completion and broader deployment. The USB-C device of Chapter 9 remains a research prototype. Future iterations should close the task-level evaluation loop—coupling trained MiniConv encoders to the capture pipeline, connecting their output to a CALF channel, and measuring closed-loop performance on real interaction tasks. If policy graph deployment on edge hardware proves sufficiently valuable, the path from prototype to field deployment would require improved power management, hardened mechanical design, and supply-chain considerations beyond the scope of academic research.

These directions collectively point towards a research programme that deepens the theoretical foundations of modular reinforcement learning, extends empirical validation to more complex and realistic settings, and progresses towards systems capable of robust, adaptable operation in open-ended real-world environments. The work presented in this thesis provides initial evidence for policy graphs, network-aware training, and distributed edge deployment; the task ahead is to scale these insights to the full complexity of autonomous systems operating beyond laboratory control.

10.1.5 Broader Impact and Real-World Considerations

The deployment of autonomous systems trained through reinforcement learning carries implications that extend beyond technical performance metrics, touching on questions of safety, equity, and accountability. As this thesis has focused on making real-world RL deployment practical, it is essential to consider the contexts in which such deployment might occur and the responsibilities that accompany technological capability.

Safety and robustness in high-stakes domains. The techniques developed here—particularly network-aware training through CALF—improve the robustness of distributed policies under communication constraints, reducing the likelihood of catastrophic failures due to latency or packet loss. However, robustness to network conditions does not guarantee safety in absolute terms. Autonomous systems deployed in healthcare, transportation, or industrial control must satisfy stringent safety requirements that go beyond preventing communication-induced failures. The modular structure of policy graphs may facilitate formal verification by enabling per-specialist analysis, but substantial research is required to establish whether this architectural advantage translates to practical safety assurances in safety-critical domains.

Labour, accountability, and the broader context of automation. The policy graph framework, by enabling more capable and robust autonomous systems, contributes to the ongoing expansion of tasks amenable to automation. Deployment decisions occur within socio-political contexts where the benefits and costs of automation are unevenly distributed; responsible deployment requires consideration of how automation reshapes labour markets and who benefits from increased productivity. Policy graphs also offer improved interpretability relative to monolithic policies—routing patterns reveal which specialists are active in particular contexts—which may facilitate accountability when deployed systems make decisions with significant consequences. Understanding which specialist is active differs from understanding why a particular action was selected, however, and establishing accountability frameworks for modular RL systems requires both technical tools for inspecting behaviour and normative standards for what constitutes adequate justification in different deployment contexts.

These considerations underscore that technical advances in reinforcement learning, whilst necessary for real-world deployment, are insufficient on their own. Effective and responsible deployment requires engagement with regulatory frameworks, ethical norms, and societal priorities that lie outside the traditional scope of machine learning research. The contributions of this thesis—policy graphs as a formalism and hard-routing study, EnvCraft as benchmark-generation infrastructure with within-family evidence, MiniConv as an edge-model deployment study, CALF as network-aware systems infrastructure, and the hardware device as an early prototype path—provide tools for building more capable autonomous systems. How those tools are used, in what contexts, and to whose benefit, are questions that the broader research community, policymakers, and society as a whole must address collectively.

10.1.6 Closing Reflections

This thesis began with pins and ends with policies. In the space between, centuries of intellectual history have been traversed, connections drawn between neuroscience and neural networks, and systems built that instantiate abstract principles in silicon and code. The journey has been one of synthesis: bringing together ideas from disparate fields—philosophy, psychology, distributed systems, machine learning—and demonstrating that their integration yields capabilities greater than the sum of their parts.

The most surprising aspect of this work, in retrospect, was the extent to which systems considerations proved more consequential than algorithmic sophistication: the performance collapse under degraded Wi-Fi occurred not because the algorithm was inadequate but because the training environment failed to model its deployment context. A related lesson concerns abstraction: policy graphs abstract away low-level control details behind specialist units, EnvCraft abstracts environment generation behind natural-language specifications, and CALF abstracts network dynamics behind stochastic delay models. Each abstraction trades precision for tractability, and the art of engineering intelligent systems lies in choosing abstractions that capture the essential structure of the problem whilst still admitting efficient solutions. Throughout this work, the principle of division of labour has served as a guiding abstraction, and the results reported here suggest that it remains a productive one.

This research has also reinforced the value of grounding technical work in broader intellectual traditions. The connections drawn to Adam Smith, Plato, Torres Quevedo, and Kahneman are not mere ornamentation; they provide conceptual frameworks that shape how problems are formulated and solutions evaluated. Recognising that policy graphs instantiate division of labour clarifies their purpose and suggests directions for improvement. Understanding reinforcement learning as the computational formalisation of behaviourist psychology informs how reward structures and training curricula are designed. Viewing environments as shadows on Plato's cave wall is a reminder that state representations are always incomplete projections of reality. These historical and philosophical foundations do not replace rigorous empirical validation, but they provide the intellectual scaffolding within which technical contributions find meaning.

As reinforcement learning matures from a subfield of machine learning into a discipline for building deployed autonomous systems, the challenges ahead are as much about integration as innovation. Powerful learning algorithms, scalable computational infrastructure, and increasingly sophisticated simulators are all in hand. What remains is to combine these tools into systems that work reliably outside controlled settings, respecting the constraints of real hardware, real networks, and real-world variability. This thesis has taken steps in that direction by demonstrating that modular architectures, validation-first generation, network-aware training, and edge deployment are not merely desirable features but essential components of practical real-world RL.

The pins produced by Smith's divided factories were unremarkable objects: simple fasteners, each indistinguishable from the thousands produced alongside it. Yet their production revealed a profound insight about how complex tasks can be accomplished through the coordination of specialised labour. The policies produced by the framework developed in this thesis are similarly unremarkable in isolation—modular neural networks trained to play games or control simulated robots. But the principles they embody—specialisation, coordination, and robustness to distributional shift—point towards a future where autonomous systems operate not as fragile laboratory demonstrations but as more reliable tools for the messy, networked, heterogeneous real world. This thesis contributes a framework for moving in that direction: one that respects the complexity of deployment, embraces modularity as a design principle, and grounds technical innovation in the enduring insight that division of labour remains a powerful guide to organised action.