Principles
Adam Smith, in his seminal work The Wealth of Nations (1776), chose pin-making as his primary exemplar of the division of labour. Smith described how a worker alone, even when employing “his utmost industry'', could make “one pin in a day'' and certainly, Smith posited, could “not make twenty''. By dividing labour across ten workers, each with their own speciality and familiarity with certain machines, forty-eight thousand pins could be produced in a day. “The greatest improvement in the productive powers of labour'', Smith surmised, “seem to have been the effects of the division of labour''.
This thesis asks how a simple and enduring idea, the division of labour, can help make reinforcement learning systems more deployable in the real world. It begins with the first principles of automation: its history, its philosophical foundations, and the path by which reinforcement learning reached its current form. It then considers the present state of the art, the reasons existing approaches still struggle in real deployment settings, and the methods introduced here to address those gaps.
We begin with labour.
Labour
In Plato's Phaedrus (265E), Socrates describes the process of carving nature “without trying to shatter a single part by going about it like a bad butcher … on the basis of Forms [and] according to its natural joints''. Dividing the labour of pin-making, in the style of Smith, similarly involves carving up the task by its natural joints.
In the case of pin-making, Smith states that eighteen “distinct operations'' were used in producing pins, with some factories employing different people for each and others where employees performed two or three tasks each. Whether Smith truly visited any pin factories to come to these conclusions is unknown, in part due to his request that his contemporaneous notes be burned before his death. Because of the details he mentions—including his belief that specifically eighteen steps are used to produce pins—it is likely that he was borrowing heavily from Denis Diderot's Encyclopédie.
The Encyclopédie includes an article detailing the eighteen purported steps used by Parisian pin makers, written by Alexandre Deleyre. An overview of the steps is shown in Figure 2.1. Deleyre, a literature specialist, was chosen to obscure the plagiarism of the earlier pin-making article after Jesuits accused Diderot of copying twenty-two articles from the French Academy of Sciences, including the original pin-making article. The separation of pin-making into eighteen operations appears to have been a literary invention of Deleyre's to avoid the charge of plagiarism, with original sources suggesting that there may have been closer to six distinct skills used within a factory. In addition to revealing the fabrication of the number of skills, original sources also raise questions about the methodology and conclusions detailed by Smith.
Whatever the specific truth of Smith's claims, it is clear that the separation of skills along natural joints can significantly improve productivity. The best metrics for productivity, and the best ways to incentivise work, are the questions to which we turn next.
Reward
Since antiquity, philosophers have recognised pleasure and pain as behavioural motivators. From Epicurus's observation that pleasure is “our first and kindred good'' to Jeremy Bentham's formalisation of this insight in Introduction to the Principles of Morals and Legislation as the `felicific calculus'—an approach to quantifying utility in units of pleasure (hedons) and pain (dolors)—the same basic premise has structured thinking about motivation across centuries. This framework acknowledged that actions might yield different utility across time and context.
As the field of behaviourism developed at the end of the 19th century, Edward Thorndike proposed his Law of Effect, formalising notions of reward for the first time. The Law of Effect states that behaviours followed by a reward are more likely to be repeated in the future. Ivan Pavlov demonstrated `classical conditioning' through experiments with dogs, showing that involuntary responses could be conditioned to neutral stimuli. B.F. Skinner distinguished this `respondent behaviour' from `operant behaviour', in which animals learn to associate conscious actions with rewards.
In Behavior of Organisms (1938), Skinner formally defined reinforcement as “the presentation of a certain kind of stimulus in a temporal relation with either a stimulus or response''. He introduced `intermittent reward', demonstrating that whilst continuous reward enables faster learning, intermittent reward produces more robust behaviour less susceptible to `extinction' when reinforcement ceases. Skinner also proposed `reward shaping', in which difficult tasks are decomposed into smaller, more achievable units, each eliciting incremental reward.
Clark Leonard Hull articulated how a reinforcing stimulus could take the form of a `reduction of a drive'. A drive, such as hunger, is alleviated by eating food; thus the provision of food to a hungry dog has reinforcing effect.
Parallel to the works of Thorndike, Pavlov, Skinner and Hull, neuroscience would establish the biological substrate underlying these behavioural principles. The works of Kathleen Montagu and Arvid Carlsson identified dopamine as the neurotransmitter regulating reward processing and behaviour. Crucially, neuroscientific investigation revealed that dopamine does not simply signal the presence of reward; rather, it encodes reward prediction error—the difference between received and expected reward. When an outcome is better than anticipated, dopamine activity spikes; when an outcome disappoints, it falls below baseline. This mechanism provides a biological precedent for temporal-difference learning, in which an agent learns to predict future reward and updates those predictions based on observed discrepancies between expectation and outcome. The bridge from Skinner’s reinforcing stimulus to the update rules of modern reinforcement learning runs directly through this neurobiological discovery.
This research forms an important foundation for reinforcement learning, where an artificial notion of `reward' is used to guide the training of an artificial system. To move from neuroscience to artificial intelligence, it helps first to understand human behaviour in terms of systems.
Systems I
Human behaviour ranges from pushing buttons to complex gymnastics. So far, we have considered the ways in which behaviour is learned: classical conditioning to learn simple reflexive tasks and operant conditioning for more complex incentive-driven behaviour. We have also discussed the role of dopamine as the biological substrate of reward prediction error. However, it is hard to conceive of the ways in which, say, learning to keep balance on two feet could relate to learning a gymnastic routine. One thing we can say for sure is that many of the things we learn through conditioning can eventually be executed without thinking carefully about them. Balancing on a bike might take work at first, but eventually it becomes easier to balance with less thinking rather than more. There is some mechanism that takes smaller actions, such as the subtle movements involved in balancing, and groups them in a way that makes their execution feel automatic.
Cognitive dichotomies distinguishing reflexive from deliberative processes recur throughout intellectual history, from Descartes' mind-body distinction to modern theories of dual-process cognition. Herbert Simon formalised this in his Theory of Bounded Rationality (1957), arguing that human decision-making operates within cognitive and informational constraints. He identified two principal modes of thought: heuristic-driven processes, which rely on rules of thumb and mental shortcuts to make quick decisions, and rational processes, which require deliberate, logical analysis.
In more recent years, the notion of a systematic separation of aspects of human cognition has been brought to popular fame through Kahneman's Thinking, Fast and Slow (2011). The System 1/System 2 dichotomy was introduced by Stanovich and West and popularised by Kahneman, who writes that System 1 “operates automatically and quickly, with little or no effort and no sense of voluntary control'' whereas System 2 is used for “effortful mental activities'' and is associated with “agency, choice, and concentration''. He describes the relationship between System 1 and System 2 as a “division of labor''.
This cognitive dichotomy informs our approach to distributing computation across heterogeneous hardware: reactive control on resource-constrained edge devices (analogous to System 1), and deliberative planning on remote servers with greater computational capacity (analogous to System 2).
We turn now to the history of automation that made these computational frameworks necessary.
Automation
Automation has fascinated thinkers throughout history. In Plato's Meno (97d), Socrates describes Daedalus' mythical living statues as “play[ing] truant and run[ning] away'', “if they are not fastened''. In his work Politics (Book 1), Aristotle considers the set of `tools' that exist as comprising two parts: `living' and `lifeless'. Living tools include human assistants and lifeless ones include implements like the rudder of a ship. He supposes that “if every tool could perform its own work when ordered, or by seeing what to do in advance, like the statues of Daedalus in the story'', then “master-craftsmen would have no need of assistants''.
The Mechanical Turk, constructed in 1770, appeared to play chess autonomously, defeating opponents including Benjamin Franklin; Napoleon Bonaparte famously lost to it in 1809. In reality, it was an elaborate fraud: expert chess players concealed themselves within the desk, operating the mannequin from hidden compartments using an ingenious system of sliding seats and mirrors. Despite its deception, the Turk profoundly influenced subsequent work in automation, including that of Charles Babbage, who lost to it twice during its 1819 European tour.
Babbage, lamenting errors in hand-calculated logarithm tables, conceived the Difference Engine (for automating calculations) and later the Analytical Engine (for general computation). His collaborator Ada Lovelace wrote what is widely acknowledged as the `first ever algorithm' for the Analytical Engine. Babbage, likely influenced by the Mechanical Turk, believed his Analytical Engine could play chess competently.
Leonardo Torres Quevedo was a great admirer of Babbage's work. In his 1914 essay Ensayos sobre automática, he credits Babbage's “mechanical genius'' (genio mecánico) and describes him as a “distinguished mathematician'' (matemático distinguido). He wished to extend the theoretical work of Babbage, who was never able to finish constructing his theorised computer. Torres was extremely optimistic about automation, and his work serves, for our purposes, as the bridge from the first principles of automation to modern-day algorithmic artificial intelligence.
In a supplement to the November 1915 issue of Scientific American, Torres' “Remarkable Automatic Devices'' are profiled, alongside the claim that Torres “Would Substitute Machinery for the Human Mind''. Indeed, what follows reads strikingly like an early manifesto for modern artificial intelligence1:
When it comes to an apparatus in which the number of combinations makes a very complex system, analogous in a small degree to what goes on in the human brain, it is not generally admitted that a practical device is possible. On the contrary, M. Torres claims that he can make an automatic machine which will “decide'' from among a great number of possible movements to be made, and he conceives such devices, which if properly carried out, would produce some astonishing results. Interesting even in theory, the subject becomes of great practical utility, especially in the present progress of the industries, it being characterised, in fact, by the continual substitution of machine for man; and he wishes to prove that there is scarcely any limit to which automatic apparatus may not be applied, and that at least in theory, most or all of the operations of a large establishment could be done by machine, even those which are supposed to need the intervention of a considerable intellectual capacity.
The single most notable “remarkable automatic device'' in the Scientific American profile of Torres was El Ajedrecista: one of the first chess-playing automata to operate through genuinely algorithmic means rather than hidden human operators. Playing a simplified endgame (king and rook versus king), it demonstrated that machines could exhibit strategic reasoning through programmed rules. The automaton would later famously play against chess Grandmaster Savielly Tartakower in 1951.
Torres, in his 1914 essay, explains the importance of machines having what he calls discernment (discernimiento). “It is necessary'', he says “and this is the main object of Automation'' (y éste es el principal objeto de la Automática), that they can choose the correct action by “taking into account the impressions they receive, and also, sometimes, those they have received previously''. He points towards a distinction between the types of machines that people generally believe possible. On one hand, machines which respond continuously to stimulus input are agreed to be easy to make, whereas those which “[weigh] the circumstances surrounding [them] … in determining ... actions'' (pese las circunstancias que le rodean) are “generally thought'' to be “[achievable] in very simple cases''. He claims that “this distinction is worthless'' (esta distinción carece de valor), and that “it is always possible to build an automaton whose acts, all of them, depend on certain more or less numerous circumstances, obeying rules that can be arbitrarily imposed at the time of construction''.
The 1951 Festival of Britain featured Nimrod (not the 19th-century self-styled prophet Nimrod Murphree, whose claims to unaided flight did not pan out), which played the game nim and drew such crowds during its European tour that special police were required for crowd control.
In 1948, Alan Turing and David Champernowne developed TuroChamp2, a powerful chess-playing algorithm. The algorithm carried an awkward caveat of its era: no `computer' yet existed that could run it. The computation for each move had to be carried out with pen and paper.
Amongst Turing’s several questions about what it might mean to “make a machine to play chess’’, the most prescient was whether a machine could “improve its play, game by game, profiting from its experience’’—question four of six, and the one he could not yet answer with confidence. He detailed instead an algorithm for question three: a machine that would indicate a passably good legal move. The algorithm operates by assigning each position a value derived from material and positional considerations, then selecting the move leading to the highest-valued position reachable within a limited search depth. The conceptual move from Torres’s discernimiento to Turing’s position value is the crucial one: it establishes that a machine can encode preferences over states as numerical quantities, and act so as to maximise them.
Across these episodes, automation moves from myth and spectacle towards useful computation. The Mechanical Turk traded on deception, but it still helped sustain public fascination with machine intelligence; Babbage, Torres, and Turing then redirected that fascination towards genuine mechanism and calculation. Turing's notion of value marks an especially important step towards what we now call reinforcement learning: behaviour guided by numerical preferences over future states. The developments that made this possible—feedback control, adaptive systems, and ultimately learning from interaction—emerged during the mid-twentieth century under the banner of cybernetics.
Heuristics
Three recurring heuristics structured how automation progressed through these episodes, and they recur throughout the RL methods that follow. Bootstrapping is the process of using something to improve itself: an initial capability becomes the basis for acquiring a more refined one, without external supervision. The term derives from the nineteenth-century expression for lifting oneself off the ground by pulling on one's own bootstraps—an impossibility in the physical sense, yet in machine learning it describes a genuinely productive loop, from temporal-difference value estimation to self-play. Evolution operates by iterative selection across a population: variation is introduced, fitness is evaluated, and successful variants propagate. In reinforcement learning this manifests in population-based training and neuroevolution, where diversity guards against premature convergence. Co-evolution extends this by making the selection pressure itself adaptive: as predator and prey evolve in mutual response, competing or co-operating agents drive one another's development. Self-play and adversarial curriculum generation are its direct expressions in deep RL. These three heuristics are named here so that they can be recognised when they reappear.
How to Train Your Machine
The mid-20th century saw the emergence of cybernetics, a discipline that formalised the role of feedback in control systems. Influenced by biological and neurological models, cybernetic approaches emphasise homeostatic regulation and adaptive response mechanisms. This was demonstrated by wartime innovations such as torpedo guidance systems and the Homeostat, an early self-regulating machine developed by W. Ross Ashby. As electronic computing matured, analogue control systems were widely adopted in industrial processes and aerospace applications, facilitating real-time automation. The transition from analogue to digital control in the 1960s and 1970s further improved precision, enabling the development of programmable controllers for manufacturing, as well as early forms of computerised decision-making in robotics and avionics.
Rule-based expert systems and fuzzy logic controllers extended this automation further, but remained dependent on hand-engineered knowledge and could not learn.
Reinforcement learning emerged as a response to these limitations, providing a framework in which control policies are learned through interaction with an environment rather than being explicitly programmed. Rooted in behavioural psychology and dynamic programming, RL enables machines to optimise decision-making by maximising cumulative rewards over time. This approach is particularly effective in scenarios where system dynamics are complex or only partially known, making it well-suited to robotics, adaptive automation, and real-time decision systems.
Deep Learning Foundations
The control systems described in the previous section—from cybernetic feedback loops to fuzzy logic controllers—rely on explicit modelling of system dynamics. Whilst effective in well-understood domains, these approaches struggle when confronted with high-dimensional observations. A robot navigating an indoor environment receives camera images: even a modest 84×84 pixel RGB frame corresponds to a 21,168-dimensional input. Enumerating rules or value tables at this scale becomes impractical. Real-world reinforcement learning demands function approximation—learning to generalise from observed states to unobserved ones.
Deep neural networks provide the expressive capacity required. Stacked layers of parameterised transformations learn hierarchical representations: early layers detect edges and textures; later layers compose these into semantically meaningful features. Gradient-based training—specifically backpropagation with adaptive optimisers such as Adam—makes this practical at scale.
For visual observations, convolutional neural networks (CNNs) are particularly well-suited. Rather than treating an image as a flat vector, convolutional layers apply learned filters that slide across the spatial extent of the image. This parameter sharing drastically reduces the number of trainable weights, and local connectivity reflects the natural structure of images: edges, textures and objects are spatially localised, and a useful detector for an edge in one region of the image is equally useful in another. Stacked convolutional layers followed by fully-connected output heads form the backbone of visual RL architectures and are used extensively in this thesis—most directly in Chapter 7, which develops compact CNN encoders for deployment on resource-constrained edge hardware.
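To make the shape of such an encoder concrete, the following sketch shows a minimal convolutional encoder in PyTorch of the kind commonly used for 84×84 visual inputs; the layer sizes and feature dimension are illustrative assumptions, not the architecture developed later in this thesis.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Minimal convolutional encoder for 84x84 RGB observations (illustrative sizes)."""

    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9   -> 7x7
        )
        self.head = nn.Linear(64 * 7 * 7, feature_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, 3, 84, 84), scaled to [0, 1]
        x = self.conv(obs)
        return self.head(x.flatten(start_dim=1))

features = ConvEncoder()(torch.rand(1, 3, 84, 84))  # -> shape (1, 256)
```

The same filters are applied at every spatial location, which is what keeps the parameter count small relative to a fully-connected mapping from 21,168 inputs.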
Combining neural networks with reinforcement learning introduces training challenges that do not arise in supervised settings. Reinforcement learning generates data through interaction: consecutive observations are temporally correlated, and the data distribution shifts as the policy improves. The deadly triad—combining function approximation, bootstrapping from value estimates, and off-policy learning—creates instability that naive gradient descent cannot resolve. Section 2.8.4 describes how Deep Q-Networks resolved these instabilities and established deep reinforcement learning as a practical methodology. The chapters that follow build throughout on the foundations introduced there.
Reinforcement Learning
We now reach the formal conceptual introduction of modern reinforcement learning. We will build the framework systematically, starting from first principles and progressively introducing the extensions—deep function approximation, policy gradient methods, hierarchical decomposition—that underpin the contributions in this thesis.
The World as Will and Environment
Let us consider Turing's third question about chess again:
Could one make a machine which would play a reasonably good game of chess, i.e. which, confronted with an ordinary (that is, not particularly unusual) chess position, would after two or three minutes of calculation, indicate a passably good legal move? (Emphasis added.)
In order for a machine to make a “passably good legal move'', it needs some way to interpret the “chess position''. For Turing's chess algorithm, this state of play is represented by a full comprehension of the chess board. This is part of the reason Turing's algorithm could be computed only by pen and paper: the computers that existed3 at the time were too limited to interpret such a complex state and carry out the necessary number of computations. For Torres, automata should act by “taking into account the impressions they receive, and also, sometimes, those they have received previously'', raising the possibility of distilling the state which the machine receives down to only its most relevant components.
Here, we introduce the core abstract paradigm of RL through its three components:
- The agent: the entity that carries out actions
- The environment: a representation of the world on which the agent acts
- The policy: the rulebook that the agent follows when deciding which action to take next in the environment
RL is the process by which the policy is learned and it is carried out by recording details of interactions that occur between the agent and the environment. The RL paradigm is abstract: both the environment and the agent conform to specific constraints of form which distinguish them from the real-world problem they represent. To formalise this, four key assumptions are made:
- Time is composed of discrete `steps', rather than being continuous
- The agent can take actions at each step, but not between steps
- Positive behaviour is indicated by the environment in the form of a numerical reward, although no constraints are made on the regularity or scale of the reward, just that it must eventually be provided
- The state of the environment at each step is sufficient to choose the optimal action
It is important to note that, for any real-world problem, there can be multiple different environments which suitably represent it, as well as many which appear to represent it, but which fail to adhere properly to the assumptions above. The final assumption is sometimes called the Markov assumption: the state must be sufficient to choose the optimal next action, so any information critical to the decision must be included in the state representation. We formalise this as a Markov Decision Process (MDP).
Markov Decision Processes
A Markov Decision Process (MDP) is a structure \(\mathrm{MDP}(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})\) where:
- \(\mathcal{S}\) is a set of states.
- \(\mathcal{A}\) is a set of actions.
- \(\mathcal{T}(s'|s, a) = \mathbb{P}(s_{t+1}=s'|s_t=s, a_t=a)\): The probability of a transition to state \(s'\) given current state \(s\) and action \(a\).
- \(\mathcal{R}(s, a, s') = \mathbb{E}(r_t|s_t=s, a_t=a, s_{t+1}=s')\): The expected reward gained when the system transitions from state \(s\) to \(s'\) under action \(a\).
MDPs exhibit the Markov Property.
Property (Markov). For any \(\mathrm{MDP}(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})\) and any \(t\) having \(a_{t-1}, \ldots, a_0 \in \mathcal{A}\) and \(s_t, \ldots, s_0 \in \mathcal{S}\):
\[
\mathbb{P}(s_t \mid s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, \ldots, s_0, a_0) = \mathbb{P}(s_t \mid s_{t-1}, a_{t-1})
\]
There may also be two sets \(S_\mathrm{start} \subseteq \mathcal{S}\) and \(S_\mathrm{term} \subseteq \mathcal{S}\), containing the start states and terminal states respectively.
A set of transitions from a state in \(S_\mathrm{start}\) to one in \(S_\mathrm{term}\) constitutes an episode. An example MDP is shown in Figure 2.2, and the agent's view of its interactions with the system is shown in Figure fig:prep:agent1.
If an action \(a\) causes a transition from a state \(s\) to \(s^{\prime}\), then the tuple \((s, a, s^{\prime})\) is described as an experience. A trajectory (of length \(n\)) is an ordered set of experiences that describes a series of actions taken by the agent to move the environment from some state \(s_0\) to a state \(s_n\):
\[
\tau = \big((s_0, a_0, s_1), (s_1, a_1, s_2), \ldots, (s_{n-1}, a_{n-1}, s_n)\big)
\]
Note that if \(s_n = s_\mathrm{term}\), then \(\tau\) describes an episode.
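As a purely illustrative example, the following snippet encodes a toy MDP as Python dictionaries and rolls out one episode under a uniform-random policy, collecting the \((s, a, s')\) experiences that form a trajectory; the states, actions, transition probabilities and rewards are invented for illustration.

```python
import random

# A toy MDP: transition probabilities T[s][a] -> {s': p}, rewards R[(s, a, s')].
ACTIONS = ["stay", "advance"]
T = {
    "start": {"stay": {"start": 1.0}, "advance": {"mid": 0.9, "start": 0.1}},
    "mid":   {"stay": {"mid": 1.0},   "advance": {"term": 0.8, "mid": 0.2}},
}
R = {("mid", "advance", "term"): 1.0}   # unlisted transitions yield zero reward

def step(s, a):
    """Sample s' from T(.|s, a) and look up the associated reward."""
    next_states, probs = zip(*T[s][a].items())
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R.get((s, a, s_next), 0.0)

trajectory = []            # ordered experiences (s, a, s')
s = "start"
while s != "term":
    a = random.choice(ACTIONS)          # placeholder uniform-random policy
    s_next, r = step(s, a)
    trajectory.append((s, a, s_next))
    s = s_next
```

Because the episode ends in the terminal state, the collected trajectory is itself an episode in the sense defined above.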
The Trial (and Error)
The goal of RL is to choose the policy which maximises the accumulated reward that an agent achieves through its interactions with the environment. There are two broad ways of thinking about how to learn a policy using reinforcement. If we know something about how the environment works, then we can employ what is known as model-based learning; otherwise, we use model-free learning.
Model-free learning4 is referred to as tabula rasa (clean slate) learning, since such methods approach the environment with no pre-existing knowledge, building their understanding entirely from successful and unsuccessful interactions with it. For this reason, model-free methods are the most versatile, and constitute the majority of recent research in deep reinforcement learning.
Two principal families of model-free learning are considered in this thesis: value-based and policy-based methods.
Value-based Methods
To understand value-based methods, we can consider Turing's chess algorithm. At each point, Turing calculates a “position value'', comprising the “material value'' and the “position-play value''. The next action is chosen as the one which leads to the highest position value. Although Turing's approach worked well for chess, it was far from unbeatable; it was also specific to chess and did not improve with additional experience. Instead, we will consider the value of a state as the expected, discounted value of the future reward.
For the policy \(\pi\), we can formally define the state value, \(V(s)\), at each time \(t\) in terms of the state \(s_t\):
\[
V(s_t) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k}\right]
\]
where \(0 \leq \gamma < 1\) is known as the discount factor.
The value of \(\gamma\) is chosen to control how short-sighted the agent is. Values that are low represent a policy that cares only about immediate rewards, whereas values that are high represent policies that are far-sighted5.
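A few lines are enough to see how the discount factor trades immediate against future reward; the reward sequence below is an arbitrary illustration, assuming a single unit of reward arriving after three empty steps.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^k * r_{t+k} over a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 1.0]            # reward arrives only at the final step
print(discounted_return(rewards, 0.99))   # ~0.970: far-sighted, final reward barely discounted
print(discounted_return(rewards, 0.50))   # 0.125: short-sighted, final reward heavily discounted
```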
The value function tells us how desirable a state is, but we also need to determine the desirability of an action given a state. We do that by introducing the idea of quality. For any action \(a \in \mathcal{A}\) at a state \(s \in \mathcal{S}\), we can determine the quality of that action in that state:
\[
Q(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s,\ a_t = a\right]
\]
Given a policy \(\pi\), actions are chosen such that:
\[
\pi(s) = \argmax_{a \in \mathcal{A}} Q(s, a)
\]
Applying the Markov property to Definition 2.6 yields the Bellman recursion:
\[
Q(s, a) = \mathbb{E}\!\left[r_t + \gamma \max_{a' \in \mathcal{A}} Q(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a\right]
\]
From this, we can deduce the value of \(Q(s, a)\) using an iterative approach: when an MDP in state \(s\) transitions to \(s'\) as a result of action \(a\) with reward \(r\), determine the so-called temporal-difference error (or TD-error):
\[
\delta_i = r + \gamma \max_{a' \in \mathcal{A}} Q_i(s', a') - Q_i(s, a)
\]
Then, we can update the estimate for \(Q(s, a)\):
\[
Q_{i+1}(s, a) = Q_i(s, a) + \alpha\,\delta_i
\]
where \(\alpha\) is the learning rate, controlling the size of updates in stochastic systems.
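Putting the TD-error and update together yields the familiar tabular Q-learning step. The sketch below assumes a dictionary-backed Q-table and an ε-greedy behaviour policy; the hyperparameter values are illustrative.

```python
from collections import defaultdict
import random

Q = defaultdict(float)                  # Q[(s, a)], initialised to zero
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1  # illustrative hyperparameters

def td_update(s, a, r, s_next, actions):
    """One temporal-difference update of Q(s, a) from a single experience (s, a, r, s')."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    td_error = r + GAMMA * best_next - Q[(s, a)]
    Q[(s, a)] += ALPHA * td_error

def epsilon_greedy(s, actions):
    """Behaviour policy: mostly exploit current Q estimates, occasionally explore."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```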
This temporal-difference formulation directly parallels the biological reward prediction error encoded by phasic dopamine discussed in Section 2.2. Just as dopamine spikes when outcomes exceed expectations and dips when outcomes disappoint, TD error quantifies the discrepancy between predicted value \(Q_i(s, a)\) and observed return \(r + \gamma \max_{a'} Q(s', a')\). This neurobiological correspondence provides both inspiration and validation for TD-based learning algorithms: the same mechanism that evolution refined for adaptive behaviour in biological agents turns out to be a principled and effective basis for learning in artificial ones.
Deep Reinforcement Learning
The value-based framework developed above assumes tractable state spaces. Q-learning with tables stores \(Q(s,a)\) for every state-action pair—feasible for gridworlds but not for real-world problems. Robotic control confronts continuous joint-space observations; visual tasks confront images with millions of possible configurations. Function approximation with neural networks becomes necessary. As discussed in the previous section, the central challenge is the deadly triad: combining function approximation, bootstrapping, and off-policy learning creates training instability that naive gradient descent cannot handle.
Mnih et al. resolved this with Deep Q-Networks (DQN). By combining convolutional encoders with two stabilising techniques—experience replay (storing past transitions in a buffer and sampling them randomly to break temporal correlation) and target networks (a periodically frozen copy of the value network that provides stable TD targets)—DQN achieved human-level performance across 49 Atari games, learning directly from pixel observations with a single architecture and no game-specific engineering. This result established deep reinforcement learning as a practical methodology. DQN's architectural patterns—CNN encoders, replay buffers, target networks—underpin the methods used throughout this thesis. Its principal limitation is that the \(\argmax_a Q(s,a)\) operation requires discrete, enumerable actions; continuous control demands a different approach.
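Both stabilising mechanisms are simple to express. The sketch below shows a minimal uniform replay buffer and a hard target-network synchronisation in PyTorch; the buffer capacity, toy network sizes and synchronisation schedule are illustrative assumptions rather than DQN's published hyperparameters.

```python
import random
from collections import deque

import torch.nn as nn

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions, sampled uniformly."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the temporal correlation of consecutive steps.
        return random.sample(self.buffer, batch_size)

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # online network (toy sizes)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # frozen copy for TD targets

def sync_target():
    """Periodically copy online weights into the target network to keep TD targets stable."""
    target_net.load_state_dict(q_net.state_dict())
```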
Policy Gradient Methods
Value-based methods derive policies implicitly via \(\pi(s) = \argmax_a Q(s,a)\), which requires enumerating all actions. This is tractable for discrete choices but intractable for continuous action spaces—robot joint torques, vehicle steering—where actions vary smoothly over \(\mathbb{R}^n\). Policy gradient methods address this by parameterising the policy directly as a distribution \(\pi(a|s;\theta)\) and optimising \(\theta\) to maximise expected cumulative reward. The policy gradient theorem shows that the gradient of expected return is:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi}\!\left[\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, A_t\right]
\]
where \(A_t\) is the advantage—how much better the chosen action was than the average for that state—estimated by a learned critic in actor-critic architectures . This gradient can be computed without a model of environment dynamics, requiring only sampled trajectories.
Proximal Policy Optimisation (PPO) is the de facto standard on-policy method. Its central challenge—that large gradient steps can catastrophically degrade policy performance—is addressed by clipping the probability ratio between the new and old policy to a narrow interval, naturally bounding the size of each update without expensive second-order constraints. PPO is simple to implement, stable across diverse tasks, and is the primary algorithm used in Chapters 5 and 7 of this thesis.
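The clipped surrogate objective at the heart of PPO fits in a few lines. The sketch below assumes that log-probabilities and advantage estimates have already been computed for a batch of transitions; the clip parameter of 0.2 is a common but illustrative choice.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps: float = 0.2):
    """Clipped surrogate loss: overly large policy updates are cut off by clamping the ratio."""
    ratio = torch.exp(log_prob_new - log_prob_old)            # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()              # negated because optimisers minimise
```

Taking the elementwise minimum of the clipped and unclipped terms means the objective never rewards moving the ratio further outside the trust interval, which is what bounds each update.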
Soft Actor-Critic (SAC) is the leading off-policy method. It augments the standard RL objective with a policy entropy term, rewarding diverse behaviour and improving exploration. Off-policy learning—reusing past experiences via a replay buffer—dramatically reduces the number of environment interactions required, an important property when sample collection is expensive or slow. SAC's combination of sample efficiency, entropy-regularised exploration, and stability makes it a suitable algorithm for continuous control when interaction cost is a constraint, as in some experimental settings in Chapter 8.
Sisyphus Plays Atari
Reward regimes which apply credit only upon completion of a specific task are known as sparse. A drone that receives reward only when it successfully navigates the forest must discover, through random interaction, the entire sequence of actions that leads there. Sparse rewards exacerbate the credit assignment problem: it is difficult to determine which prior actions were responsible for a success that may have occurred hundreds of steps earlier. Classical examples include Montezuma’s Revenge, an Atari game where meaningful rewards are obtained only after solving multi-step puzzles, and dexterous manipulation tasks where reward is withheld until grasp success. In such settings, unguided exploration rarely yields the necessary action sequences, motivating the techniques discussed below.
Reward Shaping
When environment rewards are sparse or delayed, agents may require prohibitively many interactions before discovering rewarding trajectories. Reward shaping augments environment rewards with supplementary signals designed to guide learning towards desirable behaviours. A shaped reward function \(\mathcal{R}'(s, a, s') = \mathcal{R}(s, a, s') + F(s, a, s')\) adds a shaping term \(F\) to the original environment reward \(\mathcal{R}\).
Not all shaping functions preserve the optimal policy. Ad-hoc shaping—adding arbitrary bonuses for visiting certain states or taking particular actions—risks inducing policies that maximise shaped reward whilst failing to solve the original task. Potential-based shaping addresses this concern by constraining \(F\) to the form \(F(s, a, s') = \gamma \Phi(s') - \Phi(s)\) for some potential function \(\Phi: \mathcal{S} \to \mathbb{R}\). This telescoping structure ensures that cumulative shaped reward across any trajectory equals the difference in potential between terminal and initial states, guaranteeing that optimal policies under shaped reward remain optimal under the original reward function.
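Potential-based shaping is straightforward to express as a wrapper around the environment reward. In the sketch below, the potential function \(\Phi\) (negative distance to a goal, read from a hypothetical state dictionary) is an illustrative choice rather than a prescription.

```python
GAMMA = 0.99

def potential(state):
    """Illustrative potential: closer to the goal means higher potential."""
    return -abs(state["distance_to_goal"])

def shaped_reward(env_reward, state, next_state, gamma=GAMMA):
    """R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s); optimal policies are preserved."""
    return env_reward + gamma * potential(next_state) - potential(state)
```

Summed over a trajectory, the shaping terms telescope to a difference of potentials at the endpoints, which is why the original optimal policy is unaffected.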
The principle extends naturally to modular policy architectures. In distributed policy graphs, different units may receive divergent reward signals tailored to their specialised roles—one unit shaped towards “stabilise angle'', another towards “minimise energy consumption''. Careful reward design ensures units develop complementary skills rather than conflicting objectives. The potential-based guarantee provides a foundation for principled reward routing: shaping signals can guide unit specialisation during training without distorting the global objective that the complete graph must optimise.
Intrinsic Motivation
In their 2004 paper, Singh, Barto and Chentanez bring the psychological notions of intrinsic and extrinsic motivation into reinforcement learning. As they describe it, “extrinsic motivation … means being moved to do something because of some specific rewarding outcome, and intrinsic motivation … refers to being moved to do something because it is inherently enjoyable''. Developing reward metrics that rely on something other than environmental feedback makes training tractable in sparse reward settings.
Intrinsic motivation mechanisms encourage exploration by rewarding novelty, uncertainty reduction, or state visitation. Count-based methods reward visiting states inversely proportional to prior visits, encouraging agents to explore unfamiliar regions of state space. Curiosity-driven approaches reward prediction error: when an agent's internal model fails to predict the consequences of its actions, that unpredictability itself becomes rewarding, driving exploration towards surprising outcomes. Random Network Distillation provides a computationally efficient approximation: a fixed random network serves as a target, and a second network is trained to match its outputs. Prediction error is high for novel states the predictor has not yet encountered, making it a reliable proxy for novelty.
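The simplest of these mechanisms, a count-based bonus, can be written directly. In the sketch below, the \(1/\sqrt{n}\) bonus form and the weighting coefficient are common but illustrative choices.

```python
import math
from collections import Counter

visit_counts = Counter()
BONUS_WEIGHT = 0.1   # illustrative trade-off between extrinsic and intrinsic reward

def intrinsic_bonus(state_key):
    """Reward visiting rarely seen states: the bonus shrinks as the visit count grows."""
    visit_counts[state_key] += 1
    return BONUS_WEIGHT / math.sqrt(visit_counts[state_key])

def combined_reward(extrinsic, state_key):
    """Total training reward: task reward plus novelty bonus."""
    return extrinsic + intrinsic_bonus(state_key)
```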
These mechanisms address a fundamental tension in reward-driven learning. Environmental rewards reflect task objectives but may be too sparse to guide learning; intrinsic rewards are dense but may not align with task objectives. Combining extrinsic and intrinsic signals—weighting task reward against exploration bonuses—enables agents to learn efficiently in sparse domains whilst ultimately optimising for environmental objectives. The balance between these signals, and how that balance should evolve during training, remains an active research question with direct implications for distributed policy graphs: should individual units receive intrinsic motivation signals, and if so, how should unit-level exploration coordinate with graph-level task objectives?
Directed and Undirected Reward
Reward signals differ not only in sparsity but in their relationship to task objectives. Directed rewards explicitly encode goal achievement: reaching a target location, solving a puzzle, completing a manipulation task. These signals provide clear learning objectives but may induce specification gaming—agents that maximise reward through unintended means. Just as dopamine encodes incentive salience rather than genuine wellbeing, a reward function encodes the designer's proxy for success rather than success itself. Agents trained with poorly specified directed rewards may therefore develop pathological optimisation behaviours, maximising the reward signal through unintended strategies that fail to achieve the task designer's actual intent.
Undirected rewards, by contrast, encourage broad competence without specifying particular goals. Entropy maximisation rewards diverse behaviour; empowerment rewards states from which the agent can exert maximal influence over future states; skill discovery rewards the acquisition of distinguishable behavioural primitives. These approaches develop general capabilities that may transfer across tasks, but provide weaker guarantees about solving any specific objective.
The distinction matters for deployment. Directed rewards enable focused training on specific tasks but risk brittleness: agents may fail when task conditions differ from training. Undirected rewards develop flexible capabilities but may never achieve specific objectives reliably. Policy graphs offer a structural solution: directed rewards train specialised units for specific subtasks (“navigate to waypoint'', “grasp object''), whilst graph-level coordination ensures these specialised capabilities compose into complete task solutions. The modular structure bounds specification gaming: even if an individual unit develops an unintended behaviour, explicit routing constraints limit how that behaviour propagates through the system.
Gödel, Escher, Bot: Emergent Behaviour and Bootstrapping
The heuristics introduced earlier in this chapter—bootstrapping, evolution, co-evolution—find concrete expression in reinforcement learning through mechanisms that generate training signal from the learning process itself rather than from external supervision. Self-play and curriculum learning exemplify this principle: agents improve by competing against past versions of themselves or by progressively tackling increasingly difficult tasks, creating virtuous cycles where capability improvements unlock access to new training experiences that drive further improvement.
Self-Play
When suitable opponents or training partners are unavailable, agents can learn by interacting with copies of themselves. Self-play embodies the co-evolutionary heuristic: as the agent improves, so does its opponent, maintaining appropriate challenge throughout training. The agent's current capabilities enable training experiences that develop new capabilities, which in turn unlock further improvement—a bootstrapping loop that can discover strategies beyond human conception, as demonstrated by AlphaGo's novel Go moves and OpenAI Five's Dota 2 team coordination.
However, self-play also risks pathological dynamics. Agents may develop strategies that exploit weaknesses in their current opponent (a past self) rather than developing generally robust capabilities. Cycling between brittle strategies—rock-paper-scissors dynamics—can prevent convergence to stable, high-quality policies. Maintaining population diversity and carefully managing the distribution of training opponents addresses these concerns, ensuring that self-play drives genuine capability improvement rather than narrow exploitation.
Curricula
Complex tasks may be intractable when attempted directly but learnable through careful sequencing of intermediate objectives. Curriculum learning presents tasks in order of increasing difficulty, allowing agents to develop foundational skills before confronting full task complexity. The principle mirrors human education: children learn arithmetic before calculus, basic motor skills before complex athletics.
Automatic curriculum generation extends this idea by adapting task difficulty to agent capabilities: rather than hand-designing task sequences, the curriculum itself becomes a learning problem. Approaches range from simple heuristics—training on tasks where the agent achieves intermediate success rates—to meta-learning methods that explicitly optimise curriculum parameters.
Domain randomisation represents a complementary approach: rather than sequencing tasks by difficulty, training exposes agents to diverse variations of the same task. Randomising physics parameters, visual appearances, or environmental configurations encourages policies that generalise across conditions rather than overfitting to specific settings.
Chapter 6 extends these ideas through procedural environment generation. Rather than manually designing curriculum stages or randomisation distributions, the EnvCraft system generates thousands of validated training environments from natural-language specifications. This enables systematic study of how training diversity affects generalisation—a question central to deploying reinforcement learning beyond narrow benchmark distributions.
Hierarchical RL
The foundational concepts developed earlier in this chapter—division of labour from pin factories, reward prediction error from dopamine neuroscience, System 1/System 2 cognitive dichotomies—all point towards a common architectural principle: complex adaptive systems benefit from modular decomposition along functional boundaries. Reinforcement learning research has explored this principle through hierarchical frameworks that decompose policies into reusable, composable units.
Options extend the action space to include temporally extended actions—policies that execute over multiple timesteps until a termination condition is met. An option consists of three components: an initiation set (states where the option can start), a policy (what actions to take), and a termination condition (when to return control). This formalism enables agents to learn at multiple temporal scales: low-level options learn motor skills (“grasp object'', “move to location''), whilst high-level policies learn to compose these skills into task solutions. Options align with the chunking mechanisms discussed in Section 2.3: just as practiced skills become automatic routines, learned options become reusable behavioural primitives.
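The three components of an option map directly onto a small data structure. The sketch below is a schematic rendering of the formalism, with plain callables standing in for learned policies and termination conditions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any
Action = Any

@dataclass
class Option:
    """A temporally extended action: where it may start, what it does, and when it stops."""
    initiation_set: Set[State]              # states in which the option may be invoked
    policy: Callable[[State], Action]       # low-level behaviour while the option is active
    termination: Callable[[State], bool]    # returns True when control should be handed back

def run_option(option: Option, state: State, env_step: Callable[[Action], State]) -> State:
    """Execute an option until its termination condition fires, returning the resulting state."""
    assert state in option.initiation_set, "option invoked outside its initiation set"
    while not option.termination(state):
        state = env_step(option.policy(state))
    return state
```

A high-level policy then chooses among options rather than primitive actions, which is what gives learning its second, slower timescale.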
Feudal RL introduced explicit hierarchical control through manager-worker relationships. Managers operate at slower timescales, setting subgoals and decomposing tasks; workers execute low-level policies to achieve these subgoals. This mirrors the division of labour in pin factories: managers coordinate specialisation, workers execute specific skills. Feudal RL demonstrates that hierarchical value function decomposition—where managers learn to evaluate subgoal achievement and workers learn skill execution—can improve learning efficiency in complex domains.
MAXQ formalises hierarchical decomposition through recursive task decomposition. A task graph defines subtasks and their constraints; each node in the graph has a value function decomposed into completion value (reward for completing this subtask) and continuation value (reward from parent tasks after completion). This decomposition enables state abstraction: subtask policies need only observe state features relevant to their specific objective, reducing the effective state space. MAXQ exemplifies “carving nature at the joints'' (Section 2.1): task decomposition should align with natural problem structure rather than arbitrary boundaries.
HAM (Hierarchical Abstract Machines) represents hierarchical policies as partially-specified finite state machines. Each machine defines legal action sequences through states, transitions, and choice points where learning occurs. Non-choice states execute deterministically; choice states invoke learned policies or sub-machines. HAM provides stronger constraints than options or MAXQ: the hierarchy itself encodes domain knowledge about valid action sequences, reducing the space of behaviours the agent must explore. This connects to Torres's conception of automation (Section 2.4): constraints imposed at construction time enable efficient operation within bounded domains.
Despite their conceptual elegance, these hierarchical frameworks face deployment challenges that limit real-world applicability:
- Co-location assumption: Prior work assumes hierarchy components share a process and memory space. Options, feudal managers, MAXQ subtasks, and HAM machines all presume instantaneous communication and shared state access. This precludes physical distribution across heterogeneous hardware—exactly what System 1/System 2 dichotomies suggest (reactive edge control, deliberative cloud reasoning).
- No network/communication model: Existing frameworks lack explicit models of inter-component communication. Delegation, return, and reward routing occur implicitly through shared memory. Real deployment confronts latency, jitter, packet loss, and partial failures—conditions these frameworks do not address.
- Limited interpretability and accountability: Soft attention mechanisms and implicit routing make it difficult to trace which component made which decision. When a policy fails, identifying the responsible unit requires analysis of learned attention weights rather than explicit call traces. This undermines the debugging and auditing requirements for safety-critical deployment.
- Training complexity: End-to-end training of hierarchical policies requires differentiating through routing mechanisms and managing credit assignment across temporal abstractions. This couples learning across components, preventing independent development and deployment of specialised units.
Policy graphs, introduced in Chapter 5, extend hierarchical RL to address these four deployment gaps. A directed graph \(G=(V,E)\) of callable policy units uses hard routing and call-and-return semantics: exactly one unit is active at any moment, commitment bounds \((k_{\min}, k_{\max})\) prevent unstable switching, and explicit call traces provide accountability that soft-attention hierarchies cannot. Units execute as physically distributed networked services—reactive control on low-power edge devices, deliberative reasoning on remote hardware—operationalising the System 1/System 2 distribution described in Section 2.3. Chapter 8 extends this further through CALF, treating network conditions as first-class training objectives so that policies learn to tolerate the latency and packet loss they will encounter at deployment.
Systems II
Having established the algorithmic foundations of reinforcement learning, we turn to the distributed systems context in which policy graphs must operate.
Remote Procedure Call (RPC) systems enable function invocation across network boundaries, presenting remote execution with the appearance of local function calls. Traditional RPC assumes reliable, low-latency networks—this holds within data centres but fails across Wi-Fi, cellular, and satellite links, where variable latency, jitter, packet loss, and partial failure are normal. Distributed systems must anticipate component failures and asynchronous delivery. Fault domains define boundaries within which failures correlate; placing time-critical policy units locally and computationally intensive deliberation remotely ensures that network failures degrade gracefully rather than causing total system collapse.
Observability and traceability become critical when distributed systems fail. When a deployed policy fails, operators must identify which unit made which decision, under what observations, and whether the cause was learned behaviour, network failure, or hardware fault. Policy graphs' hard routing and call-and-return semantics provide this traceability: execution traces explicitly record which units were active and when delegations occurred. This accountability distinguishes policy graphs from soft-attention hierarchies where responsibility diffuses across learned weights and cannot be inspected.
Containerisation packages code and runtime dependencies into portable units that execute consistently across diverse hardware. This enables deployment parity: the same policy code executes in pure simulation, simulation with network models, and real hardware, eliminating discrepancies between training and production environments. Chapter 8's CALF framework leverages containers for exactly this purpose.
Where pin factories achieved productivity through specialisation—eighteen workers performing distinct operations—policy graphs achieve deployability through modular accountability: eighteen policy units performing distinct behaviours, each traceable, testable, and independently deployable. The architectural patterns from engineered systems reviewed in Chapter 3—A320 flight computers distributing responsibility across ELACs and SECs, power grids coordinating IEDs at substations with SCADA at national scale—inform this design: reliability emerges from constrained transitions between well-defined components, not from monolithic optimisation of opaque end-to-end systems.
This chapter has assembled the conceptual vocabulary on which the remainder of the thesis depends. Beginning with Adam Smith's observation that dividing labour along natural joints dramatically multiplies productive capacity, and following the thread through Skinner's reward schedules, dopamine's encoding of prediction error, Kahneman's dual-process cognition, Torres's discernimiento, and Turing's notion of positional value, we have arrived at an account of why reinforcement learning is structured as it is and what it still lacks for real deployment. The frameworks of Options, Feudal RL, MAXQ and HAM are elegant, but they were designed for co-located, monolithic execution; the distributed, heterogeneous, failure-prone deployment environments of the real world demand something more explicit. The following chapters address that gap: the Lessons chapter grounds these abstractions in three engineering systems where distribution and failure have proved consequential; the Works chapter surveys the state of real-world RL deployment; and the research chapters introduce policy graphs, compact edge-deployable encoders, communication-aware training, and procedural environment generation as components of a practical answer to the question Turing could not yet resolve.
- In a now plainly discreditable sign of the period, the profile appears shortly after material endorsing eugenic research.
- A portmanteau of their surnames.
- The Ferranti Mark I (1951), an improved version of the Manchester Mark I, was the first commercially available general-purpose computer. Turing attempted to run his chess algorithm on the Ferranti but was unsuccessful in his lifetime.
- As is the case for some model-based methods.
- Values of \(\gamma\) close to \(1\) can cause instability in infinite-horizon settings: changes in \(V(s)\) anywhere in the state space propagate globally, making convergence slow and sensitive to initialisation.