Reinforcement learning has achieved notable successes with large models, yet scaling monolithic policies to long-horizon, high-dimensional environments remains challenging in practice. This chapter introduces policy graphs, an implementation-centric formulation for modular reinforcement learning in which callable policy units (skills/options) are composed as nodes in a directed graph and coordinated by explicit routing decisions with well-defined delegation and return semantics. The formulation unifies common hierarchical patterns whilst enabling practical modular training and deployment, including constrained transitions, heterogeneous unit implementations, and bounded commitment to reduce unstable switching. To connect the abstraction to deployment-relevant interaction regimes, we also introduce BrowserEnv and FilesEnv: lightweight proxy environments with simple, reproducible dynamics but complex, real-world-like interaction requirements. The chapter then develops two complementary construction routes for policy graphs: a teacher-guided synthesis pipeline that discovers candidate specialists from action-conditioned saliency in controlled MiniGrid tasks, and a hard-routing instantiation over a fixed pool of specialists, compared against soft mixture-of-experts baselines, in deployment-motivated domains. Together these studies address both sides of the construction problem: where specialist units come from, and how routing over those units can be stabilised in practice.
Introduction
In Chapter 3, we examined how real-world systems manage complexity through carefully engineered patterns of specialisation, hierarchy, and delegation. The Airbus A320 achieves reliable flight control by distributing responsibility across dedicated computers—Elevator and Aileron Computers (ELACs) for pitch and roll, Spoiler and Elevator Computers (SECs) for backup control, and Flight Control and Guidance Computers (FCGCs) for higher-level coordination—each operating within well-defined interfaces and constrained transition rules embodied in the aircraft's flight laws. The French power transmission network maintains grid stability through a three-tier hierarchy: local Intelligent Electronic Devices (IEDs) respond autonomously to immediate faults, regional substations coordinate load balancing, and a central control system (CNES) manages nationwide demand. The Kangduo surgical robot enables remote telesurgery by implementing explicit handover semantics between local and remote surgeons, with sub-three-second delegation transitions and robust fallback to local control when network conditions degrade. These systems share a common architecture: specialised units with distinct responsibilities, coordinated through explicit delegation and return mechanisms, operating under hard constraints that ensure predictable, accountable behaviour. These principles trace to Adam Smith's division of labour—the insight, elaborated in Chapter 2, that specialisation and coordination drive productivity.
This chapter proposes policy graphs as a formalism that distils the architectural patterns observed in Chapter 3 into an implementation-ready abstraction for modular reinforcement learning. A policy graph is a directed graph \(G=(V,E)\) whose nodes are callable policy units—analogous to the A320's flight computers or the power grid's IEDs—and whose edges constrain permissible delegations, much as the A320's flight laws govern transitions between control modes. Execution follows explicit call-and-return semantics: at any time a single unit is active, and it may (i) act in the environment, (ii) delegate control to a permitted successor, or (iii) return control to its caller. This mirrors the surgical robot's dual-console handover, where control authority transfers cleanly between operators with unambiguous responsibility at each moment.
Policy graphs address three gaps left by existing hierarchical RL formulations. First, they provide operational semantics that are directly implementable: delegation is a first-class operation with defined preconditions (edge constraints), commitment bounds prevent unstable switching (analogous to the A320's phase-based cockpit communication rules), and call traces provide accountability (as required for debugging real systems). Second, they unify diverse hierarchical patterns—options, feudal hierarchies, manager--worker systems—within a single framework whilst enabling non-tree topologies that better reflect real-world redundancy and shared subskills, much as the A320's three hydraulic circuits provide overlapping coverage of critical actuators. Third, they expose explicit control points for deployment constraints: individual units can be trained, tested, swapped, or distributed across heterogeneous hardware independently, whilst routing decisions remain inspectable and constrained, addressing the transparency and modularity requirements identified in real-world automation systems.
In the environments that motivate this work—web browsing, file-system interaction, and similarly long-horizon, interface-driven domains—complexity arises from the need to compose many precise, low-level operations into coherent workflows. Monolithic end-to-end policies struggle in this regime: credit assignment becomes difficult under sparse rewards, training is unstable when perception and control are learned jointly, and inference cost remains constant even when only a small subset of behaviour is relevant. Policy graphs address these challenges by enabling specialisation (low-level units master recurring interaction primitives), coordination (a router sequences units to achieve long-horizon objectives), and conditional computation (only the active unit and router incur inference cost). These mechanisms mirror those that enable the A320 to operate safely with degraded systems, the power grid to isolate faults without cascading failures, and the surgical robot to maintain control authority during network handover.
Chapter 4 identified interpretability deficits and latency unpredictability as the most significant barriers to deployment, motivating policy graphs' hard-routing call traces and commitment bounds respectively.
This chapter has three core contributions:
- Policy graph formalism and training template (Contribution 1): We formalise policy graphs as directed graphs of callable policy units with explicit execution semantics (call-and-return, commitment bounds, constrained edges), and present a generic training template that supports modular data collection and updates. Policy graphs serve both as a learning structure—enabling skill specialisation and providing explainable routing decisions—and as a deployment framework that allows units to be distributed across different physical locations and hardware types, enabling System 1 impulses to execute on low-power edge devices near actuators whilst System 2 reasoning runs on remote GPU clusters.
- Real-world proxy environments (Contribution 2): This chapter introduces BrowserEnv and FilesEnv, evaluation settings that deliberately couple simple, controllable dynamics with interface complexity characteristic of real-world deployment. BrowserEnv is used directly in the hard-routing study reported here, whilst FilesEnv broadens the interface setting and provides an additional proxy environment for future evaluation.
- Two empirical construction routes (Contribution 3): First, this chapter shows that a competent monolithic teacher can be converted into a compact policy graph by clustering action-conditioned saliency traces into candidate behavioural regimes and distilling regime-specific specialists plus a router. Second, it evaluates hard attention routing over a fixed pool of specialists, with soft mixture-of-experts routing as a comparator, showing how the same policy-graph execution semantics can be realised when the unit inventory is fixed in advance.
Background and Related Work
The policy graph formulation synthesises insights from hierarchical reinforcement learning, real-world system design, and human skill acquisition. Chapter 2 established that division of labour—the principle that enabled Adam Smith's pin workers to achieve 240-fold productivity improvements—applies equally to learned control: specialised units coordinated through explicit mechanisms outperform monolithic approaches. Chapter 3 demonstrated how real-world systems (the A320's flight computers, the power grid's hierarchical control, the surgical robot's dual-console handover) embody these principles through redundancy, constrained transitions, and accountable delegation. Chapter 4 identified the deployment challenges (interpretability, latency predictability, safety constraints) that existing RL systems struggle to address. This section reviews the technical foundations that policy graphs build upon, connecting established hierarchical RL methods to the architectural patterns observed in engineered systems and the deployment demands identified in real-world applications.
Hierarchical Reinforcement Learning
Hierarchical reinforcement learning decomposes complex tasks into temporally extended subproblems, allowing policies to operate at multiple levels of abstraction—a computational analogue of the division of labour in Smith's pin factory (Chapter 2). The options framework formalises callable subpolicies with initiation sets and termination conditions, whilst feudal reinforcement learning emphasises hierarchical goal-setting between manager and worker levels. These methods often require careful design choices around termination, subgoal representations, and skill priors, and can struggle to provide an implementation-level interface that supports flexible composition, constrained transitions, and deployment-oriented modularity.
Modularity, Routing, and Conditional Computation
Modular architectures provide an alternative route to specialisation: rather than imposing a fixed hierarchy, they learn to route computation through a subset of available modules. Mixture-of-experts models exemplify this idea by using a gating mechanism to select which expert(s) process a given input, trading dense computation for conditional activation. In RL, similar routing mechanisms can be used to select among specialised encoders or policies on a per-state basis. A key practical distinction is between soft routing, which combines multiple modules in a weighted mixture, and hard routing, which selects a single module at a time. Hard routing enables three deployment-critical properties: first, simplicity—exactly one unit is responsible at any moment, making behaviour predictable; second, single-state efficiency—a routing decision can dispatch a single state to specific hardware without waiting for all units to complete processing; third, physical distribution—units can be separated across different locations and hardware types (low-power edge devices for reactive control, remote GPUs for compute-intensive reasoning) with routing determining which device becomes active. These properties align naturally with deployment constraints (latency, memory, and interpretability), but require explicit mechanisms to avoid collapse and unstable switching, which are central to the policy graph formulation.
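The soft/hard contrast can be made concrete with a toy sketch. All names and shapes here are illustrative (three linear "experts" over a 4-dimensional state): soft routing blends every expert's output, whereas hard routing dispatches to exactly one expert, so only one forward pass is paid.

```python
import numpy as np

# Toy experts: each is a fixed random linear map from a 4-dim state to 2 outputs.
rng = np.random.default_rng(0)
experts = [lambda s, W=rng.normal(size=(4, 2)): s @ W for _ in range(3)]
gate_logits = np.array([0.2, 2.0, -1.0])   # illustrative gating scores

def soft_route(s):
    # Soft routing: all experts run and their outputs are blended by the gate.
    w = np.exp(gate_logits) / np.exp(gate_logits).sum()
    return sum(wi * e(s) for wi, e in zip(w, experts))

def hard_route(s):
    # Hard routing: only the argmax expert runs; one unit is responsible.
    k = int(np.argmax(gate_logits))
    return k, experts[k](s)

s = rng.normal(size=4)
k, out = hard_route(s)
assert k == 1                       # the highest-logit expert is solely active
assert out.shape == (2,)            # and only its output is produced
assert soft_route(s).shape == (2,)  # soft routing yields a blended output
```

In a deployed policy graph, the `hard_route` decision additionally determines *where* computation happens, since the selected unit may live on different hardware from the gate.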
Teacher-guided Decomposition and Distillation
A complementary line of work uses a strong teacher policy to guide the training of smaller or more structured students. Distillation transfers behaviour from teacher to student via imitation objectives (e.g., KL divergence between action distributions), optionally followed by RL fine-tuning. In interactive settings, teacher-guided approaches are often paired with dataset aggregation methods such as DAgger that address covariate shift between expert and learner rollouts. For policy graphs, the central opportunity is to decompose the teacher's behaviour into reusable units with explicit interfaces and routing structure.
Motivation from Human Skill Acquisition
The motivation for modularity is not solely computational. Chapter 2 traced reward signals from ancient philosophy through modern neuroscience, culminating in Schultz's discovery that phasic dopamine spikes encode reward prediction error—the brain's mechanism for reinforcing successful actions and chunking them into reusable behavioural routines. Human skill learning exhibits a gradual progression from stimulus-driven responses to autonomous execution of behavioural chunks. Fitts and Posner's theory describes a transition from a cognitive stage (fragmented individual steps), through associative refinement (formation of chunks aided by dopamine reinforcement), to autonomous execution (refined chunks performed automatically with minimal conscious effort). These perspectives motivate a training strategy in which low-level policy units acquire reliable primitives from simple feedback—analogous to dopamine-driven chunking in the associative stage—whilst higher-level components learn to compose these primitives into long-horizon behaviour, mirroring the transition to autonomous skilled performance.
Taken together, existing HRL, modular routing, and teacher-guided learning provide powerful building blocks. However, they do not yield a single formulation that is simultaneously expressive (graph topologies), operational (explicit execution semantics), and implementation-ready (interfaces, buffers, and deployment constraints). Policy graphs are intended to fill this gap.
Policy Graphs as a Unifying and Generalising Framework
The policy graph formulation subsumes existing hierarchical RL approaches whilst addressing their practical deployment limitations. Options, feudal hierarchies, HAMs, and MAXQ each arise as special cases: options map to policy units with edge-encoded initiation sets; feudal managers map to routers; tree structures are relaxed to graphs that allow shared subskills and multiple callers. Soft MoE is evaluated as a comparator in Section 5.7. Policy graphs unify these approaches whilst adding:
- Explicit delegation semantics (call-and-return) that make execution reproducible and debuggable.
- Graph topologies that generalise trees, enabling redundancy, shared subskills, and constrained transitions.
- Commitment and termination bounds that prevent unstable switching and provide worst-case guarantees, essential for real-world deployment.
- Modular training interfaces (unit-local buffers, call-level transitions) that support independent testing and swapping of components.
- Hard routing semantics that enable accountability, conditional computation, and physical distribution across heterogeneous hardware.
These properties are distilled from the architectural patterns of engineered systems examined in Chapter 3, providing a pathway from the operational clarity of those systems to the adaptability of learned policies.
Policy Graph Formulation
Definition
A policy graph is a directed graph \(G=(V,E)\) whose nodes \(v\in V\) are callable policy units. Each unit implements a policy \(\pi_v\) (and optionally a value function or critic) that maps its inputs to either an environment action or a routing decision. Units may be trained with standard RL algorithms, including value-based methods such as DQN and policy-gradient methods.
Edges \((u\rightarrow v)\in E\) represent permitted delegations: from unit \(u\), control may be transferred to unit \(v\) only if the corresponding edge is present. The routing decision can be implemented as an explicit router policy \(\pi_H\) (which selects the next unit), or embedded in the action space of the currently active unit; the formulation supports both, but the chapter emphasises hard-routing execution in which a single unit is active at any step.
Policy graphs require explicit interfaces to support modularity. At minimum, all units observe the current environment observation (or a shared embedding). In addition, transitions may carry structured information such as the caller identity, a compact memory state, or an "effect achieved" flag that indicates whether a delegated objective was satisfied. These interfaces are intentionally lightweight: they are meant to be implementable and debuggable, rather than maximally expressive.
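The definition above can be sketched as a small data structure. This is a minimal illustration, not the chapter's reference implementation; all names (`PolicyGraph`, `may_delegate`, the unit names) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Set, Tuple

@dataclass
class PolicyGraph:
    # Nodes: named callable policy units; edges: permitted delegations (u -> v).
    units: Dict[str, Callable[[Any], Any]] = field(default_factory=dict)
    edges: Set[Tuple[str, str]] = field(default_factory=set)

    def add_unit(self, name: str, policy: Callable[[Any], Any]) -> None:
        self.units[name] = policy

    def add_edge(self, caller: str, callee: str) -> None:
        # An edge may only connect units that exist in the graph.
        assert caller in self.units and callee in self.units
        self.edges.add((caller, callee))

    def may_delegate(self, caller: str, callee: str) -> bool:
        # Delegation is permitted only if the corresponding edge is present.
        return (caller, callee) in self.edges

g = PolicyGraph()
g.add_unit("router", lambda obs: "navigate")   # routing decision
g.add_unit("navigate", lambda obs: 0)          # environment action
g.add_edge("router", "navigate")
assert g.may_delegate("router", "navigate")
assert not g.may_delegate("navigate", "router")  # no reverse edge
```

Because the edge set is explicit data rather than implicit in learned weights, constrained transitions can be audited and modified without retraining any unit.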
Goals and Effects as Interface Primitives
Policy graphs do not require an explicit notion of subgoal. In many environments, however, it is convenient to label delegations using goal-like or effect-like primitives: higher-level components can delegate what should be achieved rather than which low-level action should be taken next. This section briefly formalises goals and effects as optional interface choices used in parts of this chapter.
Goal-conditioned environments
RL environments are commonly formalised as Markov Decision Processes (MDPs). An MDP is a structure \(\mathrm{MDP}(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})\) where:
- \(\mathcal{S}\) is a set of states.
- \(\mathcal{A}\) is a set of actions.
- \(\mathcal{T}(s' \mid s, a) = \mathbb{P}(s_{t+1}=s' \mid s_t=s, a_t=a)\): the probability of a transition occurring between states \(s\) and \(s'\) when the agent takes action \(a\).
- \(\mathcal{R}(s, a, s') = \mathbb{E}(r_t \mid s_t=s, a_t=a, s_{t+1}=s')\): the expected reward gained when the system transitions from state \(s\) to \(s'\) under action \(a\).
Goal-conditioned formulations augment the MDP with a goal variable. A \(\mathrm{GMDP}\) extends an MDP with a goal space \(\mathcal{G}\) and a goal-satisfaction relation \(\textsc{Sat} \subseteq \mathcal{S} \times \mathcal{G}\). Episodes are conditioned on a sampled goal \(g\in\mathcal{G}\), and rewards may depend on whether the current state satisfies the selected goal.
Goal-conditioned reward functions
In a goal-conditioned setting, a common choice is a sparse reward that agrees with the satisfaction relation, for example \(\mathcal{R}((s,a,s'),g)=1\) if \(\textsc{Sat}(s',g)\) and \(0\) otherwise (or the corresponding change-based variant when success is defined by reaching a newly satisfied state). Techniques such as HER exploit this structure by relabelling goals post hoc to extract learning signal from trajectories that do not achieve the originally sampled goal. In policy graphs, goal labels can be used as part of the routing interface, but they are not required by the formulation.
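The sparse reward and the hindsight-relabelling idea can be sketched in a few lines. This toy uses integer states with \(\textsc{Sat}(s,g) := (s = g)\); the trajectory and goal values are arbitrary illustrations.

```python
# Sat(s, g) := (s == g) on a toy integer state space.
def sat(state, goal):
    return state == goal

# Sparse goal-conditioned reward agreeing with the satisfaction relation.
def reward(s_next, goal):
    return 1.0 if sat(s_next, goal) else 0.0

trajectory = [(0, 1), (1, 2), (2, 3)]   # (s, s') pairs from a rollout
goal = 9                                 # sampled goal, never reached
assert sum(reward(sn, goal) for _, sn in trajectory) == 0.0  # no signal

# HER-style relabel: treat the final achieved state as if it were the goal,
# turning a failed trajectory into a successful one for learning purposes.
relabelled_goal = trajectory[-1][1]
assert reward(trajectory[-1][1], relabelled_goal) == 1.0
```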
Effects
Goals specify desired end-states, whereas actions specify immediate state transitions. For modular control, it is sometimes useful to define an intermediate primitive that represents a desired change relative to the state at which it is chosen. We call such a primitive an effect. Formally, an effect \(e\) can be represented as a relation on states, and is satisfied when the environment transitions from an origin state \(s_0\) to a state that stands in relation \(e\) to \(s_0\). Figure 5.4 illustrates several design choices for effect structure.
One way to train effect-conditioned behaviour is to use an origin-relative reward that fires when the effect becomes satisfied, as shown in Figure 5.5. In practice, the origin \(s_0\) is carried as part of the unit interface when an effect is selected, allowing the unit to detect effect satisfaction relative to the origin.
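A minimal sketch of this origin-relative scheme, assuming a toy 2-D state and an effect defined as "increase the x coordinate by at least 2 relative to the origin". The effect relation and threshold are illustrative choices.

```python
# An effect as a relation on states, evaluated relative to the origin s0.
def effect_satisfied(s0, s):
    return s[0] - s0[0] >= 2

# Origin-relative reward: fires once, when the effect first becomes satisfied.
def origin_relative_reward(s0, s_next, prev_satisfied):
    now = effect_satisfied(s0, s_next)
    return (1.0 if (now and not prev_satisfied) else 0.0), now

s0 = (0, 0)                       # origin carried in the unit interface
states = [(1, 0), (2, 0), (3, 0)] # successive environment states
satisfied = False
rewards = []
for s in states:
    r, satisfied = origin_relative_reward(s0, s, satisfied)
    rewards.append(r)
assert rewards == [0.0, 1.0, 0.0]  # the reward fires exactly once
```

Carrying `s0` alongside the selected effect is what lets the unit detect satisfaction locally, without the caller having to monitor the rollout.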
In policy graphs, effects can be used as routing labels: selecting an effect is treated as selecting a particular unit (or class of units) expected to realise that effect, and satisfaction information can be returned to the caller via an interface flag. This is an optional modelling choice. The remainder of the chapter adopts the more general execution semantics in which units delegate to other units directly, with effects and goals available when they provide useful structure.
Execution Semantics
Policy graph execution is defined by an explicit control-flow mechanism. At any time, there is a single active unit \(u_t\). Let \(\mathcal{A}_{env}\) denote the environment action space and let \(\mathcal{A}_{route}\) denote routing decisions (delegate/return). At each step, the active unit produces one of:
- an environment action \(a_t\in\mathcal{A}_{env}\), which is applied to the environment;
- a delegate-to decision selecting a successor \(v\) such that \((u_t\rightarrow v)\in E\); or
- a return decision, which transfers control back to the calling unit.
Delegation induces a call stack: when unit \(u\) delegates to unit \(v\), \(u\) becomes the caller of \(v\) until \(v\) returns. This call-and-return semantics makes the execution model reproducible and debuggable, and it aligns with common HRL patterns whilst allowing non-tree topologies through shared descendants and constrained transitions.
Commitment and termination
To avoid degenerate rapid switching, execution includes an explicit notion of commitment. In the default semantics used throughout this chapter, each invocation of a unit has a minimum and maximum duration \((k_{\min},k_{\max})\). The unit cannot return before \(k_{\min}\) steps, and must return (or be force-returned) after \(k_{\max}\) steps if it has not already delegated or returned. Learned termination functions \(\beta_v(s)\) can be used in place of (or in addition to) fixed bounds, but in all cases the execution engine enforces hard limits to ensure bounded rollouts.
Loop prevention
Since \(G\) may contain cycles, practical safeguards are required. We assume: (i) a maximum call-stack depth, (ii) per-invocation timeouts \((k_{\max})\), and (iii) optional switching penalties or hysteresis in the router objective. These mechanisms do not provide theoretical guarantees of loop freedom, but they yield predictable behaviour under realistic deployment constraints.
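The execution semantics above can be condensed into a small engine. This is a sketch under stated assumptions, not the chapter's reference implementation: units here are callables that return `("act", a)`, `("delegate", v)`, or `("return", None)`, and the engine enforces commitment bounds, force-return timeouts, and a maximum stack depth exactly as described.

```python
K_MIN, K_MAX, MAX_DEPTH = 2, 5, 4   # illustrative bounds

def run(units, edges, root, horizon):
    stack = [[root, 0]]             # call stack of [unit, steps_in_call]
    trace = []                      # call trace for accountability
    for _ in range(horizon):
        unit, k = stack[-1]
        kind, arg = units[unit](k)
        if kind == "return" and (k < K_MIN or len(stack) == 1):
            kind, arg = "act", 0    # cannot return before k_min (or at root)
        if k >= K_MAX and len(stack) > 1:
            kind = "return"         # force-return at per-invocation timeout
        if kind == "delegate" and not ((unit, arg) in edges and len(stack) < MAX_DEPTH):
            kind, arg = "act", 0    # illegal delegation falls back to acting
        if kind == "delegate":
            stack[-1][1] += 1
            stack.append([arg, 0])  # delegation pushes the callee
        elif kind == "return" and len(stack) > 1:
            stack.pop()             # return pops back to the caller
        else:
            stack[-1][1] += 1       # acted in the environment
        trace.append((unit, kind))
    return trace

units = {
    "root": lambda k: ("delegate", "skill") if k == 0 else ("act", 0),
    "skill": lambda k: ("return", None) if k >= 1 else ("act", 0),
}
trace = run(units, {("root", "skill")}, "root", 8)
# skill tries to return after 1 step but is held until k_min = 2 steps
assert trace[:5] == [("root", "delegate"), ("skill", "act"),
                     ("skill", "act"), ("skill", "return"), ("root", "act")]
```

The returned `trace` is the call-level record referred to throughout this chapter: every step names exactly one responsible unit and the decision it took.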
Training Template
Policy graphs are intended to be trained with standard RL algorithms whilst retaining modularity. The key design choice is to make data and updates unit-local wherever possible.
Data collection
When unit \(v\) is active, its interactions with the environment are recorded in a unit-specific buffer (e.g., a replay buffer for off-policy learning). In addition, the router (or caller) can record boundary transitions at delegation and return events, including the identity of the callee, the cumulative reward accrued during the callee's execution, and termination information (timeout vs explicit return). This produces two complementary datasets: fine-grained environment transitions for training units, and coarse-grained call-level transitions for training routing policies.
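The two complementary datasets can be sketched as follows. Buffer layouts and field names are hypothetical; the point is the separation between unit-local environment transitions and caller-recorded boundary events.

```python
from collections import defaultdict

unit_buffers = defaultdict(list)   # unit -> fine-grained (s, a, r, s') steps
call_buffer = []                   # coarse (caller, callee, R, terminated_by)

def record_step(unit, s, a, r, s_next):
    # Environment transitions go to the active unit's local buffer only.
    unit_buffers[unit].append((s, a, r, s_next))

def record_call(caller, callee, cumulative_reward, terminated_by):
    # terminated_by is "return" (explicit) or "timeout" (forced at k_max).
    call_buffer.append((caller, callee, cumulative_reward, terminated_by))

# A delegated call: the callee acts for three steps, then returns.
for t, r in enumerate([0.0, 0.0, 1.0]):
    record_step("navigate", s=t, a=0, r=r, s_next=t + 1)
record_call("router", "navigate", cumulative_reward=1.0, terminated_by="return")

assert len(unit_buffers["navigate"]) == 3   # fine-grained data for the unit
assert call_buffer[0][2] == 1.0             # call-level return for the router
```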
Update schedule
Joint training induces non-stationarity because units and router co-adapt. A practical template is to alternate between (i) updating units using their local buffers under the current routing distribution, and (ii) updating the router using call-level transitions under (partially) stabilised units. Freezing subsets of units for short windows can further reduce drift when the router is learning rapidly.
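The alternating schedule can be sketched as a simple loop; the `update_*` callables stand in for whatever learner each component uses, and the freezing discipline is implicit in which callable runs in each phase.

```python
def train(unit_names, n_rounds, update_unit, update_router):
    log = []
    for _ in range(n_rounds):
        for name in unit_names:        # (i) unit phase: router held fixed
            update_unit(name)
            log.append(("unit", name))
        update_router()                # (ii) router phase: units held fixed
        log.append(("router", None))
    return log

# Stand-in updates; a real schedule would consume the buffers above.
log = train(["a", "b"], 2, lambda name: None, lambda: None)
assert log.count(("router", None)) == 2
assert log[:2] == [("unit", "a"), ("unit", "b")]
```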
Encouraging specialisation
Modularity is only useful when different units adopt distinct roles. One pragmatic approach is to initialise a pool of units using diverse auxiliary rewards. We refer to these as divergent rewards: rewards that admit multiple high-return behaviours and thereby encourage a heterogeneous set of skills, in contrast to goal-specific rewards that tightly constrain behaviour. Divergent rewards are related to intrinsic-motivation objectives. In the policy-graph context, such rewards are best viewed as an initialisation mechanism; subsequent training aligns units with the tasks they are actually assigned by the router.
Why Graphs?
The graph structure is not merely a notational convenience. Relative to trees or flat sets of options, graphs support shared subskills (a unit may have multiple callers), constrained transitions (edges encode permissible handoffs), and non-tree topologies that better reflect real execution constraints. These properties are useful both for learning and for deployment: units can be trained, tested, swapped, and cached independently, and routing constraints provide a natural place to encode safety rules or interface limitations. Critically, graphs enable distributed execution: policy graphs function both as a learning structure (skill specialisation, explainable routing decisions) and as a deployment framework (units distributed across networks, different physical locations, hardware-specific devices). For instance, System 1 impulses—rapid, reactive responses—can execute on low-power edge hardware near actuators, minimising latency for time-critical actions. Meanwhile, System 2 reasoning—deliberate, compute-intensive planning—runs on remote GPU clusters with abundant computational resources. This separation mirrors the architectural principles observed in real-world systems (Chapter 3): the power grid's IEDs handle immediate local faults whilst SCADA coordinates higher-level nationwide decisions. The systems-level scheduling and placement questions are fully addressed in Chapter 8, which extends the framework to network-aware training and deployment.
Correspondence to Real-World System Design Principles
Table 5.1 maps each design principle observed in Chapter 3 to its policy graph realisation and the real-world analogue that motivates it.
| Property | Policy Graph Feature | Real-World Analogue |
|---|---|---|
| Specialisation | Policy units \(v \in V\) with unit-local buffers | A320 flight computers (ELACs, SECs, FCGCs) |
| Hierarchy | Call stacks; router delegates to specialists | Power grid: IEDs \(\to\) substations \(\to\) CNES |
| Constrained transitions | Edges \(E\) encode permissible delegations | A320 flight laws; mode transition rules |
| Commitment | Bounds \((k_{\min}, k_{\max})\) per invocation | A320 sterile-cockpit phase rules |
| Redundancy | Multiple units with overlapping capabilities | A320 three hydraulic circuits; Kangduo dual-console |
| Accountability | Hard routing; call traces with unit identities | A320 fault codes; flight data recorder |
| Distributed execution | Units deployable on heterogeneous hardware | Power grid IEDs (local) + SCADA (central) |
By embedding these principles as first-class components, policy graphs provide a pathway for reinforcement learning to inherit the operational clarity of engineered systems whilst retaining the adaptability of learned policies.
Evaluation Setting: BrowserEnv
Many of the core design choices in this chapter—hard routing, explicit commitment, unit-local buffers, and call-and-return traces—are motivated by the practicalities of deploying agents in interactive computing environments. A substantial class of deployment-relevant problems is characterised by the need to act through high-dimensional interfaces with long horizons and discrete, stateful structure. Web browsing is a useful proxy for this regime: the transition dynamics induced by mouse and keyboard events are straightforward to implement and instrument, yet successful behaviour requires robust perception, precise low-level control, and the composition of many small interactions into coherent workflows.
Implementation
Browser environments exhibit long horizons, high-dimensional observations, and sparse rewards—characteristics that motivate the policy graph framework: modular units can specialise in distinct interaction regimes (navigation, form-filling, content extraction), whilst explicit routing and commitment bounds provide the structure required for interpretable, debuggable behaviour.
BrowserEnv is a Gymnasium-compatible environment that exposes a real browser instance to an RL agent. Each environment instance runs Firefox inside a Docker container configured with a fixed display resolution and a controlled profile. The containerised design supports parallel training by allocating each instance a static IP on an isolated Docker network, as illustrated in Figure 5.7. Agents interact with the browser through low-level input primitives. In the reference implementation, these inputs are realised via a VNC connection: a client issues mouse movements and clicks (and, when required, keyboard events) and captures screenshots of the rendered viewport. This design keeps the environment mechanics simple, whilst maintaining the interaction bottlenecks that matter in practice: pixel-level perception, delayed feedback, and long-horizon credit assignment.
A lightweight browser extension provides structured instrumentation in addition to pixels. The extension forwards navigation events and records interaction signals such as the text and bounding rectangle of clicked elements, scroll deltas, link-hover notifications, and text selections extracted as complete sentences. It also enumerates hyperlinks on page load, which enables bookkeeping of previously observed URLs and supports curricula that reset to pages discovered earlier in training. These signals are surfaced to the agent through the environment's info dictionary alongside the current URL and simple flags indicating whether a navigation occurred and whether the page was novel within the current episode.
The observation and action interfaces are designed to support both end-to-end and modular approaches. Observations may be taken as full RGB frames of the browser viewport, or as a foveated crop centred on the current cursor location, padded where necessary. The latter provides a compact observation that reduces input dimensionality whilst requiring active scanning. Actions may similarly be specified either as a discrete set of relative cursor nudges with a click action, or as absolute \((x,y)\) coordinates for pixel-precise pointing. In both cases, the intent is to provide an interaction substrate that is compatible with standard RL libraries whilst remaining faithful to the practical constraints of GUI control.
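The foveated-crop observation described above can be sketched as a padding-and-slicing operation. This is an illustration of the idea, not the released BrowserEnv code; the window size and coordinate convention are assumptions.

```python
import numpy as np

def foveate(frame, cx, cy, h=2, w=2):
    # Zero-pad the frame so a (2h+1, 2w+1) window centred on the cursor
    # (cx, cy) is always fully inside the padded array, even at edges.
    padded = np.pad(frame, ((h, h), (w, w), (0, 0)))
    return padded[cy:cy + 2 * h + 1, cx:cx + 2 * w + 1]

# Toy 10x10 RGB frame with values >= 1, so padded regions (zeros) are visible.
frame = np.arange(10 * 10 * 3).reshape(10, 10, 3) + 1
crop = foveate(frame, cx=0, cy=0)          # cursor at the top-left corner
assert crop.shape == (5, 5, 3)             # compact, fixed-size observation
assert crop[0, 0, 0] == 0                  # out-of-frame region is padding
assert crop[2, 2, 0] == frame[0, 0, 0]     # centre pixel is under the cursor
```

The trade-off mirrors the prose: input dimensionality drops from the full viewport to a fixed small window, at the cost of requiring the agent to scan actively by moving the cursor.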
The default BrowserEnv reward is intentionally simple. It provides an exploration-style shaping signal by rewarding discovery of previously unseen pages and domains, and includes small incentives for interactions that reflect content engagement, such as meaningful scrolling or non-trivial text selection. This shaping is not intended to define a single canonical task; rather, it provides a lightweight scaffold for learning stable interaction primitives in an environment where sparse objectives are otherwise difficult to specify. In downstream settings, the same environment can be paired with task-specific reward and termination criteria, either by modifying the environment or by wrapping it to consume the rich event stream exposed by the extension.
Practical safeguards are included to maintain robustness during long runs. For example, the environment can detect stale sessions in which no messages are received for an extended period and can trigger a clean reconnection. The implementation of BrowserEnv is released as open source at https://github.com/StandardRL-Components/BrowserEnv
As a smaller companion environment, we also provide FilesEnv, which applies the same containerised, VNC-driven approach to interaction with a desktop file manager. Typical tasks involve browsing directories, selecting files, and carrying out simple multi-step file operations. FilesEnv therefore broadens the scenario set beyond web navigation without carrying the same empirical weight in this chapter. Taken together, the two environments offer a practical substrate for studying modular policies for general computer interaction, whilst preserving the controllability and instrumentation required for systematic evaluation.
Two Ways to Construct Policy Graphs
The policy graph formulation is intended to be a practical interface, not merely a descriptive framework. A central question is therefore how to obtain a useful graph: how to define units, how to define routing, and how to train them jointly under realistic constraints. This chapter presents two complementary construction recipes. Both are motivated by the same observation surfaced by BrowserEnv and FilesEnv: in interface-rich environments, routing decisions and their traces are an operational requirement for debugging, reliability, and compute control.
The first recipe is teacher-guided graph synthesis in which a strong teacher policy provides trajectories and attribution signals that are used to discover behavioural regimes; these regimes define candidate units, which are then trained and routed in a compact student graph. The second recipe assumes a fixed pool of specialist modules and focuses on learning a robust hard-routing mechanism with explicit commitment and regularisation, whilst comparing against soft mixture-based routing under matched budgets. Both recipes instantiate the same template: nodes (units), routing (a router or embedded decisions), commitment/termination, and a training objective that balances task performance against stability and efficiency constraints.
Mini-paper I: Saliency-guided graph synthesis
The first construction route answers a question left open by the fixed-specialist setting: where should units themselves come from? In many deployment-relevant domains the difficult part is not routing between known specialists, but discovering a specialist inventory without hand-labelling subtasks. This section develops a teacher-guided answer: a competent monolithic teacher generates trajectories and action-conditioned saliency traces; recurring saliency structure is treated as evidence of candidate behavioural regimes, and these regimes are distilled into the units of a compact student graph. The route complements the hard-routing study in Section 5.7—the present section addresses unit discovery, the later section addresses routing robustness—and both share the same policy-graph semantics from Section 5.3.
The intended deployment setting is interface-rich control (BrowserEnv, FilesEnv); the empirical treatment here uses MiniGrid as a controlled proof of concept, designed for later extension to BrowserEnv and FilesEnv.
Problem setting and synthesis pipeline
Let \(\pi_T\) denote a frozen teacher policy and let \(\mathcal{D}=\{(o_t,a_t,r_t)\}\) denote trajectories generated by rolling out \(\pi_T\). At each step we compute an action-conditioned saliency map \[ S_t = \mathrm{Norm}\!\left(\left|\nabla_{o_t}\, \pi_T(a_t \mid o_t)\right|\right), \] where \(a_t\) is the action chosen by the teacher and \(\mathrm{Norm}\) denotes per-frame normalisation. The working hypothesis is deliberately modest: these maps need only function as a practical signal of what parts of the observation matter when the teacher behaves in different ways. They are used as a regime-discovery feature, not as a proof of deep causal interpretability.
Teacher, saliency, and regime discovery. For each teacher rollout step, the saliency map retains the same channel and spatial structure as the observation. The maps are flattened, projected with PCA to retain \(95\%\) of variance, and clustered with \(K\)-means over \(K\in\{2,3,4,5,6\}\). Because raw cluster assignments flicker near behavioural boundaries, we smooth them independently within each episode using a majority filter (window \(W=5\)) followed by a minimum-segment-length merge (\(L_{\min}=3\)). The resulting labels \(\bar{z}_t\) are treated as candidate behavioural regimes: recurrent attribution patterns coherent enough to support specialist construction, but not claimed to be the task's true latent options.
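The two-stage label smoothing can be sketched as follows, with window \(W=5\) and minimum segment length \(L_{\min}=3\) as in the text; tie-breaking and boundary handling are illustrative choices:

```python
import numpy as np

def smooth_labels(z, window=5, min_len=3):
    """Majority-filter cluster labels, then merge segments shorter than min_len.

    A sketch of the two-stage smoothing applied within each episode. Short
    segments are absorbed into the preceding segment; a short segment at the
    very start of an episode is left as-is (an illustrative choice).
    """
    z = np.asarray(z)
    half = window // 2
    # Stage 1: sliding-window majority vote.
    maj = z.copy()
    for t in range(len(z)):
        win = z[max(0, t - half): t + half + 1]
        vals, counts = np.unique(win, return_counts=True)
        maj[t] = vals[np.argmax(counts)]
    # Stage 2: merge segments shorter than min_len into the previous segment.
    out = maj.copy()
    t = 0
    while t < len(out):
        s = t
        while t < len(out) and out[t] == out[s]:
            t += 1
        if (t - s) < min_len and s > 0:
            out[s:t] = out[s - 1]
    return out
```

The majority filter removes single-step flicker; the segment merge removes residual fragments too short to train a specialist on.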
From regimes to policy-graph units. Each regime \(k\) is mapped to a specialist unit \(\pi_k\) with training dataset \[ D_k=\left\{(o_t,a_t)\in\mathcal{D} : \bar{z}_t = k\right\}. \] The router supervision signal is the smoothed label sequence \(\{\bar{z}_t\}\). The resulting student graph is a flat policy graph with one router and \(K\) specialists. At invocation boundaries the router selects a specialist; the selected specialist then executes for a fixed commitment horizon \(H\) before routing may be reconsidered. In other words, the discovered regimes are not merely descriptive clusters: they become the nodes of an executable policy graph under the same call-and-return and commitment semantics defined in Section 5.3.
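Building the per-regime datasets \(D_k\) is then a simple partition of teacher steps by smoothed label. This sketch assumes trajectories flattened to (observation, action) pairs aligned with the label sequence:

```python
def partition_by_regime(steps, labels):
    """Group (observation, action) pairs by smoothed regime label.

    `steps` is a sequence of (obs, action) pairs and `labels` the aligned
    smoothed labels; returns a dict mapping regime k to its dataset D_k.
    A sketch of the dataset construction, not the released pipeline.
    """
    datasets = {}
    for (obs, act), k in zip(steps, labels):
        datasets.setdefault(k, []).append((obs, act))
    return datasets
```

Each \(D_k\) then feeds the distillation objective for specialist \(\pi_k\), while the label sequence itself supervises the router.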
Training schedule and saliency validation. Each specialist is pretrained by KL distillation on its regime-specific dataset, \[ \mathcal{L}_{\text{spec}}^{(k)} = \mathbb{E}_{o \sim D_k}\!\left[\mathrm{KL}\!\left(\pi_T(\cdot \mid o)\,\big\|\,\pi_k(\cdot \mid o)\right)\right], \] with inverse-frequency weighting so that rare regimes are not ignored. The router is pretrained by cross-entropy on the smoothed regime labels. The full graph is then fine-tuned with PPO, using an auxiliary imitation term and a small router load-balancing penalty adapted from mixture-of-experts training. Before relying on saliency for clustering, we perform a simple masking validation: the top \(20\%\) of salient input components are masked on held-out evaluation rollouts and compared against random masking at the same fraction. The intention is only to confirm that saliency carries decision-relevant structure; stronger interpretability claims are unnecessary for the present construction route.
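A sketch of the inverse-frequency weighting and the per-batch KL term, written over explicit probability vectors for clarity (the actual training operates on network logits; function names are ours):

```python
import numpy as np

def inverse_frequency_weights(labels, num_regimes):
    """Per-regime weights proportional to 1/count, normalised to mean 1.

    Sketch of the weighting that prevents rare regimes from being ignored;
    empty regimes are guarded with a count floor of 1.
    """
    counts = np.bincount(labels, minlength=num_regimes).astype(float)
    counts = np.maximum(counts, 1.0)
    w = 1.0 / counts
    return w * num_regimes / w.sum()

def kl_distillation_loss(teacher_probs, student_probs, eps=1e-8):
    """Mean KL(teacher || student) over a batch of action distributions."""
    t = np.asarray(teacher_probs, dtype=float)
    s = np.asarray(student_probs, dtype=float)
    return float(np.mean(np.sum(t * (np.log(t + eps) - np.log(s + eps)), axis=-1)))
```

For the KeyCorridor clustering in Table 5.3, the \(3\%\) key-pick-up regime would receive roughly a \(17\times\) larger weight than the \(51\%\) navigation regime under this scheme.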
Experimental design
The primary benchmark is MiniGrid-KeyCorridorS3R3 . The agent observes a \(7\times7\times3\) egocentric grid and must locate a key, collect it, navigate to the correct locked room, open the door, and reach the target object. This makes KeyCorridor a useful main environment for the synthesis route: behaviour is clearly multi-stage, yet the observation space remains small enough for saliency extraction, clustering, and visualisation to be reproducible. Three auxiliary environments are also used. FourRooms provides a simpler two-phase navigation problem; UnlockPickup provides a longer dependency chain; and MemoryS13 is used more narrowly as a saliency diagnostic, asking whether the pipeline identifies the memory-critical token in a task where the teacher itself remains near chance.
For KeyCorridor, the teacher is a compact convolutional PPO policy with two convolutional layers (16 and 32 filters, \(2\times2\) kernels), a 64-unit fully connected layer, and a seven-action output head. The teacher is trained for \(3\) million environment steps across three seeds; the best checkpoint under a short validation run is frozen, then re-evaluated over 200 episodes to obtain the authoritative teacher reference of \(21.5\%\) success. Specialists reuse the same backbone but are independently parameterised. The router is a lightweight two-layer MLP operating on a shared convolutional encoder. The default fine-tuning horizon is \(H=10\).
| Environment | Teacher budget | \(T_{\max}\) | \(K\) | Fine-tune |
|---|---|---|---|---|
| KeyCorridorS3R3 | 3\,M | 200 | 5 | 1\,M |
| FourRooms | 1\,M | 300 | 2 | 500\,K |
| UnlockPickup | 3\,M | 500 | 2 | 1\,M |
| MemoryS13 | 1.5\,M | 200 | 2 | 1\,M |
The baseline set is intentionally compact. The most important comparator is a monolithic student matched in total parameter count to the full specialist pool and distilled from the teacher on the full dataset. Additional baselines test whether any segmentation would suffice: random regime assignments, clustering on raw observations, clustering on teacher hidden states, and a saliency pipeline without temporal smoothing. On KeyCorridor and UnlockPickup we also compare against a DDO-style latent-variable segmentation baseline based on an HMM fitted to PCA-embedded observation sequences. The key question is not whether the graph must beat the teacher, but whether it remains viable as an explicit modular controller where compact monolithic distillation does not.
Results
Teacher saliency exposes coherent candidate regimes
Figures 5.9--5.11 show the three pieces of evidence needed for regime discovery. First, the saliency maps themselves vary qualitatively across phases of behaviour. Second, the PCA projection suggests that these attribution patterns occupy partially distinct regions in embedding space even after aggressive dimensionality reduction. Third, the labels persist across multi-step segments within an episode rather than oscillating chaotically at every timestep.
The masking validation supports the use of saliency as a clustering feature, though only in the bounded sense required here. On FourRooms, masking the top-saliency region collapses success from the mid-fifties to \(4\%\), whereas random masking at the same fraction leaves success around \(39\%\). On MemoryS13 the contrast is sharper still: top-saliency masking reduces success to \(0\%\), whilst random masking leaves it around \(53\%\). On KeyCorridor and UnlockPickup both masking strategies are highly destructive, which is itself informative: in these compact control tasks the teacher depends on much of the frame at once, so masking is too blunt an instrument to serve as a causal test. Even there, however, the relative saliency structure across timesteps remains rich enough to support regime discovery.
| Regime | Cluster size (%) | Frames | Silhouette | Interpretation |
|---|---|---|---|---|
| \(k=1\) | \(10.4\) | \(7\,980\) | 0.295 | Explore / room entry |
| \(k=2\) | \(50.8\) | \(38\,833\) | 0.295 | Navigate to goal |
| \(k=3\) | \(11.0\) | \(8\,398\) | 0.295 | Search near key |
| \(k=4\) | \(24.7\) | \(18\,866\) | 0.295 | Corridor navigation |
| \(k=5\) | \(3.0\) | \(2\,293\) | 0.295 | Pick up key |
Table 5.3 highlights a structural tension that recurs throughout this section: the silhouette-selected construction is usable, but the discovered regimes are uneven in size. In particular, the key-pick-up regime occupies only about \(3\%\) of all frames. This does not invalidate the construction route, but it helps explain why cluster quality and downstream control quality are not identical objectives.
KeyCorridorS3R3 yields a viable graph where monolithic distillation does not
| Condition | Return \(\uparrow\) | SR (%) \(\uparrow\) | Entropy | Collapse \(\downarrow\) |
|---|---|---|---|---|
| Teacher (reference) | 0.190 | 21.5 | — | — |
| Saliency graph (ours) | \(\mathbf{0.248 \pm 0.062}\) | \(\mathbf{28.5 \pm 7.4}\) | 0.240 | 0.31 |
| Random decomposition | \(0.162 \pm 0.001\) | \(18.5 \pm 0.0\) | 0.338 | 0.16 |
| Raw observation clustering | \(0.162 \pm 0.018\) | \(17.0 \pm 1.8\) | 0.279 | 0.28 |
| Hidden-state clustering | \(0.147 \pm 0.039\) | \(16.8 \pm 3.7\) | 0.323 | 0.14 |
| No temporal smoothing | \(0.055 \pm 0.028\) | \(7.0 \pm 3.4\) | 0.216 | 0.38 |
| Monolithic (std.\ HP) | 0.000 | 0.0 | — | — |
| Monolithic (teacher HP) | 0.000 | 0.0 | — | — |
The main no-label result is the silhouette-selected \(K=5\) graph in Table 5.4. On that construction, the routed student reaches \(28.5\%\pm7.4\%\) success, modestly above the teacher reference of \(21.5\%\), whilst the parameter-matched monolithic student fails completely under both standard and teacher-level hyperparameters. This is the central empirical point of the section: structured specialist decomposition remains workable in a setting where naïve monolithic distillation does not. The weaker decomposition baselines also fall clearly below the saliency graph, and removing temporal smoothing is particularly harmful, reducing success to \(7.0\%\pm3.4\%\).
Figure 5.12 shows that the discovered graph is not merely numerically viable but structurally inspectable. The router activates different specialists during room transitions, key search, key collection, corridor navigation, and goal approach, producing a trace that can in principle be logged, debugged, or routed onto different hardware in later systems chapters.
| Ablation | Return | SR (%) | Collapse |
|---|---|---|---|
| Commitment horizon \(H\) for the \(K=5\) graph | | | |
| \(H=1\) | 0.213 | 24.0 | 0.54 |
| \(H=5\) | 0.277 | 31.5 | 0.60 |
| \(H=10\) | 0.273 | 31.0 | 0.44 |
| \(H=20\) | 0.267 | 30.5 | 0.50 |
| \(H=50\) | 0.270 | 31.5 | 0.56 |
| Number of specialists \(K\) with \(H=10\) | | | |
| \(K=2\) | \(\mathbf{0.473 \pm 0.031}\) | \(\mathbf{54.2 \pm 3.4}\) | 0.39 |
| \(K=3\) | \(0.324 \pm 0.009\) | \(37.0 \pm 1.0\) | 0.41 |
| \(K=4\) | \(0.273 \pm 0.054\) | \(31.3 \pm 6.3\) | 0.32 |
| \(K=5\) | \(0.248 \pm 0.062\) | \(28.5 \pm 7.4\) | 0.31 |
| \(K=6\) | 0.227 | 26.0 | 0.31 |
| DDO-style latent-variable baseline at \(K=2\), \(H=10\) | | | |
| HMM segmentation | \(0.398 \pm 0.038\) | \(45.7 \pm 4.3\) | — |
Figure 5.13 and Table 5.5 clarify an important nuance. The silhouette criterion selects \(K=5\), which yields a viable graph and is therefore the honest no-label construction result. Downstream control, however, peaks at \(K=2\) with \(54.2\%\pm3.4\%\) success. In practice this means the clustering score is a sensible starting point rather than an oracle: cluster separation and control utility are related, but not identical. The commitment ablation tells a more stable story. With no commitment (\(H=1\)), success drops to \(24.0\%\) and collapse worsens; intermediate horizons around \(H=5\)--\(10\) are clearly better, which supports the chapter-wide argument that bounded commitment is not merely a formal embellishment but a practically stabilising design choice.
The HMM comparison is also revealing. On KeyCorridor, a DDO-style HMM segmentation reaches \(45.7\%\pm4.3\%\) at \(K=2\): much stronger than the clustering baselines, but still below the saliency graph at the same \(K\). The implication is not that saliency is universally superior, but that teacher attribution can add useful signal when behavioural modes differ more in what the teacher attends to than in the raw appearance of the frame.
Auxiliary environments, limitations, and thesis linkage
The auxiliary environments help delimit the claim. On FourRooms, the saliency graph slightly exceeds the teacher, but raw-observation and hidden-state clustering are already close, which is consistent with a simpler two-phase task whose structure is visible directly in appearance space. On UnlockPickup, the saliency graph reaches \(94.7\%\pm1.0\%\) success, yet an HMM baseline performs almost identically. Here the large gain appears to come primarily from the downstream specialist-and-router pipeline rather than from saliency alone. MemoryS13 should be read differently again: the teacher is a memoryless MLP operating near chance, so the point is not that the graph outperforms it, but that saliency correctly identifies the memory-critical observation token.
These results expose the main limitations of the route. The method depends on teacher quality: if the teacher is inconsistent, the regime labels inherit that inconsistency. Gradient saliency is sensitive to representation choice and does not by itself prove causal necessity. Diffuse saliency on compact observations can make masking uninformative even when relative saliency patterns still support clustering. Regime boundaries remain brittle, and the route is still demonstrated here only in controlled MiniGrid tasks rather than directly in BrowserEnv or FilesEnv. Those limitations should narrow the claim, not erase it. What this section establishes is that policy graphs need not assume a hand-specified unit inventory: teacher behaviour can itself be used to synthesise candidate units and a router under the operational semantics of this chapter.
This fills the first of the two construction routes promised at the start of Chapter 5. The section below turns to the complementary problem. If a specialist pool is already available—whether hand-designed, inherited from prior work, or discovered by the synthesis route above—how should routing be learned, regularised, and compared against soft mixtures under matched budgets?
Hard Routing Over Specialists
This section presents the second construction recipe outlined in Section 5.5: hard attention routing over a fixed pool of specialist policies. It instantiates the policy-graph execution semantics defined in Section 5.3 (single active unit, explicit commitment, call-and-return traces) and evaluates whether hard routing improves stability, interpretability, and conditional-compute efficiency relative to soft mixture-of-experts (MoE) baselines across ViZDoom, Procgen, and BrowserEnv. Hard routing is compared against soft MoE under matched parameter budgets and a compute-matched top-\(k\) soft baseline to isolate the effect of softness from compute; all experiments report compute proxies alongside performance and interpretability metrics.
Problem Statement and Motivation
In long-horizon visual control, a single end-to-end policy must simultaneously learn perception, control, and regime-dependent behaviour selection. This often yields high variance across seeds, brittle boundary behaviour, and inference costs that scale with the full model even when only a subset of computation is relevant at a given moment. Policy graphs provide an implementation-ready abstraction (Section 5.3) in which reusable policy units are composed with explicit call-and-return semantics and bounded commitment, making routing decisions inspectable and deployment constraints enforceable. These properties are particularly valuable in the interface-rich settings exemplified by BrowserEnv, introduced earlier in this chapter.
This section focuses on the construction recipe described in Section 5.5: hard attention routing over a fixed pool of specialists, with soft routing (mixtures) treated as a strong comparator. Our central question is: “Given a fixed pool of specialist policy units, can we learn a router that selects one unit at a time with explicit commitment, and how does this compare to soft mixtures under matched budgets?”
The investigation connects to broader thesis themes: the division-of-labour principles established in earlier chapters motivate specialisation, whilst the efficient edge models developed in Chapter 7 and the distributed infrastructure provided by Chapter 8 enable deployment of such modular systems across heterogeneous hardware.
Method: Policy-Graph Hard Routing Over Specialists
Policy graph instantiation, hard routing, and commitment
This section instantiates a two-level policy graph: a router (manager) delegates to one of \(K\) specialist policy units, each of which executes for multiple environment steps before returning control. Only a single specialist is active at any time. At a call boundary, the router outputs logits \(z=g_{\theta}(s)\in\mathbb{R}^{K}\) and samples specialist index \(i\sim\mathrm{Cat}(z)\); the selected unit executes environment actions \(a\sim\pi_{\phi_i}(a\mid s)\) until it returns. Each invocation obeys the explicit commitment bounds \((k_{\min},k_{\max})\) from Section 5.3; in the primary experiments we use fixed-horizon calls \(k_{\min}=k_{\max}=H=10\) (ablated below). The router is trained with PPO on macro-transitions \((s_{\text{call}}, i, r_{\text{call}}, d, s_{\text{return}}, \Delta t)\) using discount \(\gamma^{\Delta t}\); each specialist is trained with PPO on its unit-local step buffer.
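The semi-MDP bookkeeping for one router macro-transition can be sketched as follows; the helper name is ours:

```python
def macro_return(rewards, gamma=0.99):
    """Collapse one specialist call into router-level quantities.

    Returns (r_call, gamma_dt): the discounted sum of step rewards collected
    during the call, and gamma ** Delta_t, the discount applied when
    bootstrapping the router's value estimate from s_return. A sketch of
    standard semi-MDP accounting, not the released training code.
    """
    r_call = 0.0
    for t, r in enumerate(rewards):
        r_call += (gamma ** t) * r
    return r_call, gamma ** len(rewards)
```

With fixed-horizon calls, \(\Delta t = H\) on every non-terminal invocation, so the router effectively learns on a time-dilated MDP with one decision every \(H\) environment steps.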
Training objectives and anti-collapse
Hard routing risks collapse: one unit dominates whilst others fail to specialise. This section uses a usage-threshold penalty on the router's action distribution (minimum usage 0.10, maximum usage 0.40; underuse weight 5.0, overuse weight 10.0, coefficient 5.0) plus an optional switching penalty at delegation boundaries (ablated). The soft MoE comparator replaces discrete delegation with per-step mixture weights \(w(s)=\mathrm{softmax}(g_{\theta}(s))\), sampling from \(\pi_{\text{mix}}(a\mid s)=\sum_{i} w_i(s)\,\pi_{\phi_i}(a\mid s)\); a compute-matched soft-top-\(k\) variant (with \(k=2\)) isolates the effect of softness from compute cost.
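The usage-threshold penalty can be sketched over empirical usage frequencies. The quadratic form is an illustrative choice consistent with the thresholds and weights quoted above:

```python
import numpy as np

def usage_penalty(usage, lo=0.10, hi=0.40, w_under=5.0, w_over=10.0, coef=5.0):
    """Penalty on empirical specialist-usage frequencies.

    Usage below `lo` is penalised with weight `w_under`, usage above `hi`
    with `w_over`; frequencies inside [lo, hi] incur no penalty. The
    quadratic shape is an assumption for illustration.
    """
    u = np.asarray(usage, dtype=float)
    under = np.maximum(lo - u, 0.0)
    over = np.maximum(u - hi, 0.0)
    return coef * float(np.sum(w_under * under**2 + w_over * over**2))
```

Balanced usage over \(K=6\) specialists (each at \(1/6 \approx 0.167\)) sits inside the dead zone and incurs zero penalty, whereas a collapsed router pays on both the overused and the starved units.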
Architectures and Preprocessing
Each policy unit (and the router) employs a CNN backbone (conv(32, \(8{\times}8\), s4) \(\to\) conv(64, \(4{\times}4\), s2) \(\to\) conv(64, \(3{\times}3\), s1) \(\to\) linear(256)) matched to the efficient architectures discussed in Chapter 7, with MLP heads for policy logits, value, and routing. Table 5.7 summarises the per-environment preprocessing.
| Environment | Obs.\ format | Frame stack | Action space |
|---|---|---|---|
| ViZDoom | 84\(\times\)84 greyscale, \([0,1]\) | 4 frames | 8 discrete combinations |
| Procgen | 64\(\times\)64 RGB\(\times\)4ch, \([0,1]\) | 4 frames | default discrete |
| BrowserEnv | 96\(\times\)96 RGB zoomed, \([0,1]\) | 1 frame | discrete relative primitives |
Training Methodology
Training uses PPO with the following hyperparameters:
- Optimiser: Adam, learning rate \(3{\times}10^{-4}\), \(\epsilon=10^{-5}\), gradient clipping 0.5.
- Discounting: \(\gamma=0.99\), GAE \(\lambda=0.95\).
- PPO: clip range 0.2, value coefficient 0.5, entropy coefficient 0.01, PPO epochs 4.
- Minibatch size: 32 (primary), with optional replication at minibatch size 64.
- Rollout/update interval: 2048 environment steps per update.
- Evaluation: every 30,720 environment steps, 3 episodes, maximum length 2000, greedy (deterministic) action selection.
- Training horizon: 1,000,000 environment steps per scenario per seed (primary), with optional 10,000,000-step extended runs.
After each rollout, we update (i) each specialist on its unit-local step buffer and (ii) the router on the call-level buffer, using the same PPO hyperparameters. BrowserEnv uses the same hyperparameters but a reduced budget of 200,000--500,000 steps per seed to reflect its higher wall-clock variability; this budget is reported explicitly alongside results.
Experimental Setup
Benchmark set
Results are reported on three deliberately diverse domains:
- ViZDoom (3D partial observability): scenarios `basic`, `deadly_corridor`, `health_gathering`, and `defend_the_centre`.
- Procgen (procedural 2D): games `heist` and `coinrun` with the default difficulty distribution; frame stacking introduces partial observability.
- BrowserEnv (realistic UI interaction): the environment introduced earlier in this chapter, run in zoomed observation mode to maintain comparable input sizes. This setting probes transfer-relevant failure modes and instrumentation needs.
For each environment configuration, we use \(K=6\) specialists and report results across 3 random seeds.
Budget reporting
Reported metrics include (i) parameter counts and (ii) compute proxies as expert forward passes per environment step: hard routing uses \(\approx 1\) expert forward per step plus router passes every \(H\) steps; soft MoE uses \(K\) expert forwards per step (or \(k\) for soft-top-\(k\)), enabling hardware-independent comparison.
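The proxy reduces to one line; with the settings used here it matches the per-step totals in Table 5.9 (1.1 for hard routing with \(H=10\), 7.0 for full soft MoE with \(K=6\), 3.0 for soft-top-2). The function name is ours:

```python
def compute_proxy(num_active_experts, router_every=1):
    """Forward passes per environment step: active experts plus amortised router.

    Hard routing activates one expert per step with a router call every
    `router_every` (= H) steps; soft variants activate `num_active_experts`
    experts with a router call every step. A sketch of the budget proxy.
    """
    return num_active_experts + 1.0 / router_every
```

Because the proxy counts forward passes rather than wall-clock time, it is independent of the hardware and software stack, which is what makes cross-method comparison meaningful.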
Evaluation Metrics
Evaluation covers the dimensions identified in the Conclusion (Section 5.8):
- Performance: average return versus environment steps (per scenario).
- Stability: variance/dispersion across seeds (standard deviation and interquartile range) for learning curves and final evaluation.
- Efficiency (hardware-independent): expert forward passes per environment step; router forward passes per step.
- Efficiency (optional): wall-clock frames per second and latency, reported only alongside the exact hardware and software stack used.
- Interpretability: specialist usage entropy; switch rate; call duration distribution; forced returns due to commitment violations.
These metrics directly instantiate the empirically testable benefits discussed in Section 5.8, providing grounding for the efficiency, stability, and interpretability motivations.
Ablations
Systematic ablations cover the key components of the policy-graph formulation:
- Commitment horizon \(H\in\{5,10,20\}\): characterises the commitment-stability trade-off.
- Anti-collapse coefficient \(\lambda_{\text{ac}}\in\{0,5\}\): tests the necessity of usage-threshold penalties.
- Number of specialists \(K\in\{3,6,9\}\): explores the specialisation-coordination trade-off.
- Soft compute matching: full MoE versus top-\(k\) with \(k=2\).
- Switching penalty: on/off comparison at boundaries.
Results
Hard routing over specialists achieves comparable task performance to soft MoE baselines whilst providing improvements in computational efficiency, cross-seed stability, and interpretability. All experiments use \(K=6\) specialists with commitment horizon \(H=10\) unless otherwise specified, trained for 1M environment steps across 3 random seeds; stability claims should be read as bounded by this three-seed protocol.
Main Performance Comparison
Table 5.8 presents performance across ViZDoom scenarios and Procgen games. Hard routing achieves 94.3% of soft MoE performance on average whilst requiring only 16.7% of the expert forward passes (1.0 vs.\ 6.0 per step). The compute-matched soft-top-2 baseline (using 2.0 expert forwards per step) achieves intermediate performance at 96.8% of full soft MoE, validating that the performance gap is primarily attributable to reduced compute rather than the discreteness of routing decisions.
| Environment | Soft MoE | Soft-Top-2 | Hard Routing | Exp. FP (Soft) | Exp. FP (Hard) |
|---|---|---|---|---|---|
| ViZDoom Scenarios | | | | | |
| Basic | 98.2 \(\pm\) 1.4 | 97.8 \(\pm\) 1.1 | 96.5 \(\pm\) 0.8 | 6.0 | 1.0 |
| Deadly Corridor | 72.3 \(\pm\) 18.7 | 71.4 \(\pm\) 14.2 | 68.1 \(\pm\) 9.3 | 6.0 | 1.0 |
| Health Gathering | 84.6 \(\pm\) 12.3 | 83.1 \(\pm\) 9.8 | 79.4 \(\pm\) 7.1 | 6.0 | 1.0 |
| Defend the Centre | 58.9 \(\pm\) 21.4 | 55.2 \(\pm\) 18.9 | 52.7 \(\pm\) 11.6 | 6.0 | 1.0 |
| Procgen Games (normalised return) | | | | | |
| Heist | 6.8 \(\pm\) 1.9 | 6.5 \(\pm\) 1.6 | 6.2 \(\pm\) 1.2 | 6.0 | 1.0 |
| Coinrun | 8.7 \(\pm\) 2.1 | 8.5 \(\pm\) 1.8 | 8.3 \(\pm\) 1.3 | 6.0 | 1.0 |
| Mean relative perf. | 100.0% | 96.8% | 94.3% | — | — |
Within this three-seed study, hard routing exhibits substantially lower variance across seeds: the mean standard deviation across all environments is 7.2 for hard routing versus 13.0 for soft MoE and 10.6 for soft-top-2. This stability improvement is most pronounced in high-variance scenarios such as Deadly Corridor (std 9.3 vs.\ 18.7) and Defend the Centre (std 11.6 vs.\ 21.4), where the commitment mechanism prevents rapid switching between specialists that can destabilise learning.
Computational Efficiency Analysis
Table 5.9 quantifies the computational savings achieved through hard routing. By activating only a single specialist per environment step (plus router overhead every \(H=10\) steps), hard routing reduces expert forward passes by 83.3% relative to full soft MoE whilst maintaining 94% of task performance.
| Method | Expert FP/step | Router FP/step | Total FP/step | Params (M) |
|---|---|---|---|---|
| Soft MoE | 6.00 | 1.00 | 7.00 | 74.2 |
| Soft-Top-2 | 2.00 | 1.00 | 3.00 | 74.2 |
| Hard Routing | 1.00 | 0.10 | 1.10 | 74.2 |
| Reduction | 6.0\(\times\) | — | 6.4\(\times\) | — |
The router forward pass frequency of 0.10 per step reflects the commitment horizon: routing decisions occur every 10 steps, amortising the delegation overhead. This enables deployment scenarios where specialists execute on heterogeneous hardware (edge processors for reactive control, cloud GPUs for planning) whilst minimising inter-device communication frequency—a critical requirement for the distributed policy graph execution explored in Chapter 8.
Interpretability and Routing Behaviour
Table 5.10 presents routing behaviour metrics. Hard routing achieves low usage entropy (0.87 \(\pm\) 0.14 across environments), indicating strong specialisation: specialists concentrate on distinct subsets of state space rather than blending uniformly. The switch rate of 0.094 per step closely matches the theoretical maximum of \(1/H = 0.10\), confirming that commitment bounds are actively enforced and specialists complete their assigned horizons without premature returns.
| Environment | Usage Entropy | Switch Rate | Mean Call Duration | Forced Returns (%) |
|---|---|---|---|---|
| Basic | 0.72 \(\pm\) 0.09 | 0.096 | 10.4 \(\pm\) 1.2 | 4.2% |
| Deadly Corridor | 0.94 \(\pm\) 0.18 | 0.092 | 10.9 \(\pm\) 1.8 | 8.7% |
| Health Gathering | 0.89 \(\pm\) 0.12 | 0.095 | 10.5 \(\pm\) 1.4 | 5.3% |
| Defend the Centre | 0.91 \(\pm\) 0.21 | 0.091 | 11.0 \(\pm\) 2.1 | 9.1% |
| Heist | 0.83 \(\pm\) 0.15 | 0.098 | 10.2 \(\pm\) 1.3 | 2.8% |
| Coinrun | 0.79 \(\pm\) 0.11 | 0.097 | 10.3 \(\pm\) 1.1 | 3.4% |
| Mean | 0.85 | 0.095 | 10.6 | 5.6% |
The percentage of forced returns (episodes where \(k_{\max}\) is reached and return is mandated) ranges from 2.8% to 9.1%, indicating that specialists typically complete their objectives within the commitment window and return control voluntarily. Higher forced return rates in Deadly Corridor (8.7%) and Defend the Centre (9.1%) reflect these scenarios' complex, multi-phase structure, where specialists occasionally require the full commitment duration to complete local objectives.
For comparison, soft MoE exhibits usage entropy of 1.21 \(\pm\) 0.09 (closer to the uniform distribution over \(K=6\) specialists: \(\log(6) \approx 1.79\)), indicating less pronounced specialisation. The hard routing advantage in interpretability manifests as discrete call traces: at any moment exactly one specialist is responsible, producing human-readable delegation sequences such as “Specialist 2 (navigation) \(\rightarrow\) Specialist 5 (combat) \(\rightarrow\) Specialist 2 (navigation)”.
Ablations
Table 5.11 summarises the commitment-horizon and specialist-count sweeps. The default \(H=10\) balances switching stability and adaptability: shorter horizons increase variance, longer horizons reduce adaptability. Performance improves from \(K=3\) to \(K=6\) but shows diminishing returns at \(K=9\). Removing anti-collapse penalties (\(\lambda_{\text{ac}}=0\)) causes usage entropy to collapse to \(0.34 \pm 0.21\) and degrades performance by 23% on average, confirming that balanced utilisation requires explicit regularisation.
| Ablation | Setting | Mean Return | Std Dev | Usage Entropy |
|---|---|---|---|---|
| Horizon \(H\) | \(H=5\) | 65.3 | 14.8 | 1.02 |
| | \(H=10\) (default) | 68.1 | 9.3 | 0.94 |
| | \(H=20\) | 63.7 | 8.1 | 0.89 |
| Specialists \(K\) | \(K=3\) | 74.2 | 8.9 | 0.61 |
| | \(K=6\) (default) | 79.4 | 7.1 | 0.89 |
| | \(K=9\) | 76.8 | 9.4 | 1.15 |
BrowserEnv Transfer Evaluation
On BrowserEnv form-filling tasks (200K training steps, limited budget), hard routing achieves 38.2% success rate versus 41.7% for soft MoE. Routing patterns reveal interpretable specialisation: Specialist 1 focuses on text input fields (62% activation on form states), Specialist 4 handles button interactions (71% activation on submit states), and Specialist 3 manages scrolling and navigation (58% activation on multi-page forms). Under this limited-budget protocol, these patterns provide suggestive rather than definitive evidence that policy graphs can discover task-relevant decompositions in complex interface environments.
However, BrowserEnv exhibits substantially higher variance (std 18.3 for hard routing vs.\ 12.7 for ViZDoom average), reflecting the environment's sensitivity to rare interaction sequences and the limited training budget. Failure mode analysis indicates that forced returns occasionally interrupt multi-step interaction sequences (e.g., filling form field \(\rightarrow\) submit button requires two specialists, but commitment forces return mid-sequence), suggesting that learned termination functions \(\beta_i(s)\) or task-conditioned commitment horizons could improve coordination in such settings.
Discussion
Hard routing improves modular isolation, conditional computation, and accountability: only one unit is responsible for actions over a committed segment, making failures localisable and trajectories readable as call sequences. This directly implements the execution semantics and training template defined in Section 5.3. Soft mixtures provide smoother optimisation and can blend behaviour at ambiguous boundary states, but obscure which unit is responsible for an action and can be more expensive at inference if all experts are evaluated—a critical consideration for edge deployment (Chapter 7) and distributed execution (Chapter 8).
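The inference-cost asymmetry between the two schemes can be illustrated with a toy per-step cost model; the cost units and function name are hypothetical, chosen only to show why dense soft mixtures approach a factor-of-\(K\) overhead when specialist cost dominates.

```python
def per_step_cost(k, c_router, c_unit, hard=True):
    """Toy per-step inference cost (hypothetical units).

    A hard router evaluates the router plus one committed specialist;
    a dense soft mixture evaluates the router plus all k specialists.
    """
    return c_router + (c_unit if hard else k * c_unit)

# With a lightweight router and K = 6 specialists whose cost dominates,
# the soft/hard cost ratio approaches 6x, consistent with the compute
# advantage reported for hard routing in this chapter.
```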
Transfer to real-world environments
In real environments such as BrowserEnv, regimes are heterogeneous and only weakly labelled; routing therefore becomes an implicit interface choice rather than an explicit goal-conditioned primitive (Section 5.3). Commitment and enforced timeouts become reliability mechanisms: they prevent unstable switching, bound worst-case behaviour, and provide deployment guarantees essential for real-world systems. Critically, instrumentation is part of the method: routing decisions, call durations, forced returns, and switch triggers must be logged to debug failures. Soft mixtures may reduce accountability, which complicates deployment debugging compared to hard call-and-return traces.
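The instrumentation described above can be sketched as a structured event log over call-and-return traces. The class and field names below are illustrative assumptions, not the chapter's tooling; the point is that routing decisions, call durations, and return reasons become queryable records rather than opaque internals.

```python
import json

class RoutingTrace:
    """Sketch of routing instrumentation (illustrative schema).

    Records one event per completed call: which unit ran, for how long,
    and why control returned ("terminated", "forced_return", "timeout",
    or "switch"). Aggregates over the log support deployment debugging.
    """

    def __init__(self):
        self.events = []

    def record(self, step, unit, duration, reason):
        # One structured record per call-and-return segment.
        self.events.append({"step": step, "unit": unit,
                            "duration": duration, "return_reason": reason})

    def forced_return_rate(self):
        # Fraction of calls cut short by the commitment bound.
        forced = sum(e["return_reason"] == "forced_return" for e in self.events)
        return forced / max(len(self.events), 1)

    def dump(self, path):
        # Persist as JSON lines for offline failure analysis.
        with open(path, "w") as f:
            for e in self.events:
                f.write(json.dumps(e) + "\n")
```

A rising forced-return rate, for instance, flags exactly the mid-sequence interruption failures discussed for BrowserEnv.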
This observation connects to the lessons from Chapter 3, where real-world systems (aviation autopilots, medical devices) employ explicit handoffs and accountability mechanisms for safety-critical operation. Policy graphs extend these principles to learned systems.
Connection to distributed deployment
The hard-routing architecture naturally supports the distributed policy-graph deployment infrastructure developed in Chapter 8: each specialist can execute on a different device (edge processor, cloud server, GPU accelerator), with routing decisions determining which device is active. The commitment mechanism bounds communication overhead (at most one handoff every \(H\) steps), whilst the call-and-return traces provide the accountability required for debugging distributed failures. Chapter 8 extends Contribution 1's training template to network-aware learning, where latency, jitter, and packet loss become environmental properties that the router must learn to navigate—mirroring how the power grid's SCADA system (Chapter 3) coordinates IEDs across diverse network conditions. The systems-level implementation explores how heterogeneous hardware placement (edge units for low-latency perception, cloud units for compute-intensive planning) can be managed whilst preserving the operational guarantees established in this chapter. This points towards a more operational pathway: from the formalism presented here, through the network-aware training of Chapter 8, and onward to the initial hardware realisation sketched in Chapter 9.
Limitations and Future Work
This chapter uses fixed-horizon commitment (\(k_{\min}=k_{\max}=H\)) for clarity and stability; learning termination functions \(\beta_i(s)\) (as outlined in Section 5.3) is an important extension, with the policy-graph execution engine still enforcing hard bounds to maintain deployment guarantees. More expressive graph topologies (beyond a flat set of specialists) and constrained transitions could improve compositionality, enabling richer sharing patterns as suggested in Section 5.3. Finally, distilling a soft MoE into a hard router for deployment—potentially using the teacher-guided decomposition recipe developed in Section 5.6 as a front-end for unit discovery—is a natural next step that would further unify the two construction approaches presented in this chapter.
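The proposed extension — learned termination wrapped in hard bounds — amounts to a simple predicate: the unit may not return before \(k_{\min}\) steps, must return at \(k_{\max}\), and \(\beta_i(s)\) governs only the interval between. A minimal sketch, with an assumed thresholded termination probability (illustrative, not the chapter's design):

```python
def should_return(beta_prob, steps_in_call, k_min, k_max, threshold=0.5):
    """Sketch: learned termination constrained by hard engine-enforced bounds.

    beta_prob: learned termination probability beta_i(s) for the current state.
    The execution engine guarantees a return by k_max regardless of beta,
    preserving the deployment guarantees of fixed-horizon commitment.
    """
    if steps_in_call < k_min:
        return False               # lower bound: too early to return
    if steps_in_call >= k_max:
        return True                # upper bound: timeout guarantee holds
    return beta_prob > threshold   # learned termination in between
```

The fixed-horizon setting used in this chapter is the special case \(k_{\min} = k_{\max} = H\), where the learned term never fires.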
Conclusion
Policy graphs distil the architectural principles of real-world systems—specialisation, constrained transitions, commitment bounds, redundancy, accountability—into a deployment-oriented framework for modular reinforcement learning. The formulation targets an operational gap left by much existing HRL work: execution semantics that can be implemented, inspected, and constrained during deployment. Options, feudal hierarchies, and mixture-of-experts provide temporal abstraction, but lack call-and-return traces, commitment bounds, constrained edges, and modular interfaces. Policy graphs embed these as first-class components, inheriting patterns from the A320's flight computers, the French power grid's hierarchical control, and the Kangduo surgical robot's dual-console handover.
The chapter makes three core contributions:
- Policy graph formalism and execution semantics: Hard routing over \(K=6\) specialists achieves 94.3% of soft MoE performance at 6\(\times\) lower compute and 1.8\(\times\) lower cross-seed variance, with call traces that provide explicit unit-level accountability. Saliency-guided synthesis discovers a viable student graph in KeyCorridorS3R3 where parameter-matched monolithic distillation fails completely, demonstrating that the formalism supports practical construction from teacher behaviour.
- Dual role as learning structure and deployment framework: Unit-local buffers enable specialisation whilst graph topology encodes deployment constraints (co-location, bandwidth, network tolerance). System 1 impulses execute on low-power edge devices near actuators; System 2 reasoning runs on remote GPU clusters. Edges encode both logical dependencies and deployment constraints; commitment bounds control communication overhead; call traces enable reconstruction of distributed failures.
- Two complementary construction routes: The first route shows that a competent monolithic teacher can be converted into a compact policy graph by clustering action-conditioned saliency traces into candidate behavioural regimes, distilling regime-specific specialists, and training a router under commitment-bounded execution. The second route studies the complementary fixed-specialist problem: hard attention routing over an existing pool of units, compared against soft mixtures in BrowserEnv, ViZDoom, and Procgen. Together, the two routes address both sides of policy-graph construction: discovering units and stabilising routing once those units exist.
Common failure modes—collapse, handoff errors, non-stationarity, loops—mirror real-world system failures. Policy graphs address these through usage-threshold penalties, commitment bounds with hysteresis, unit-local buffers with alternating updates, and call-stack depth limits with timeouts. This design philosophy—make failures explicit, provide bounded recovery, maintain interpretable traces—distinguishes policy graphs from approaches that treat modularity as optimisation rather than operational requirement.
Six empirically testable benefits align with real-world deployment requirements: efficiency via conditional computation, stability via commitment bounds, isolation via modular training, interpretability via call traces, deployment hooks via constrained edges and timeouts, and distributed execution readiness via commitment-bounded handoffs. Soft routing blends multiple units simultaneously, sacrificing accountability and conditional-compute benefits for potentially smoother credit assignment—Section 5.7 quantifies these trade-offs empirically under matched budgets.
Whilst single-machine experiments demonstrate learning properties—specialisation, stability, interpretability—Chapter 8 takes up the systems question of network-aware learning across heterogeneous hardware, incorporating latency and jitter into routing objectives and exploring simple distributed deployments. Together, Chapters 5 and 8 provide a pathway from formalism towards operational deployment. Open challenges remain: automatic discovery of richer effect interfaces, formal termination guarantees, broader validation of teacher-guided synthesis in interface-rich environments such as BrowserEnv, and tighter coupling between discovered graphs and hardware-aware execution. Despite these, policy graphs provide operational semantics that are implementable and debuggable—a principled pathway from learned adaptability to engineered accountability.