Reinforcement learning has proven itself in a wide array of simulated domains, yet deploying it in real-world systems exposes a cluster of persistent obstacles: sample scarcity, system constraints, partial observability, reward misspecification, the need for offline training, interpretability requirements, high-dimensional spaces, and—most pertinently for distributed systems—latency and actuator delays. This chapter surveys those challenges, illustrates them through three case studies in sepsis treatment, robotic manipulation, and telesurgery, and synthesises the recurring deployment gaps that motivated the technical contributions developed in subsequent chapters: policy graphs for interpretable modular control, EnvCraft for generalisation benchmarking, MiniConv for edge-optimised inference, and CALF for communication-aware training.
Foundations
Dulac-Arnold et al. discuss several of the challenges involved in real-world applications of RL, namely:
- Being able to learn on live systems from limited samples.
- Reasoning about system constraints that should never or rarely be violated.
- Interacting with systems that are partially observable, which can alternatively be viewed as systems that are non-stationary or stochastic.
- Learning from multi-objective or poorly specified reward functions.
- Training offline from the fixed logs of an external behaviour policy.
- Providing system operators with explainable policies.
- Learning and acting in high-dimensional state and action spaces.
- Being able to provide actions quickly, especially for systems requiring low latencies.
- Dealing with unknown and potentially large delays in the system actuators, sensors, or rewards.
Limited Samples
Learning high-quality policies using only a limited number of experiences is a common problem in RL, often referred to as a problem of sample efficiency. The problem is most acute when training policies in the real world, since the acquisition of experiences can often be costly. In robotics, steps can typically only be carried out sequentially and in real time, so the collection of large amounts of training data can take a prohibitively long time. This limitation necessitates algorithms that can generalise effectively from sparse interactions, in contrast to traditional simulation-based RL, where samples are abundant and essentially cost-free.
Research has tackled this challenge through several approaches. Model-based RL blends learned dynamics with model-free updates to reduce real-interaction requirements, whilst meta-learning algorithms such as MAML accelerate adaptation to new tasks by leveraging prior experience. Off-policy methods further improve efficiency by reusing past transitions via experience replay. Later chapters return to this pressure through reusable policy units and environment-generation pipelines intended to extract more value from limited real interaction.
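The experience-replay mechanism mentioned above can be sketched in a few lines of Python; the buffer layout and transition tuple here are illustrative, not taken from any cited implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of past transitions for off-policy reuse."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive experiences.
        return self.rng.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Toy usage: 150 transitions, of which only the most recent 100 are kept.
buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.add(t, t % 4, 1.0, t + 1, False)
batch = buf.sample(32)
```

Reusing each stored transition many times, rather than discarding it after one gradient step, is precisely how off-policy methods squeeze more learning out of each costly real interaction.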
System Constraints
System constraints in RL, such as safety boundaries or operational limits, are often a critical requirement in real-world applications like autonomous driving, robotics, and healthcare. Unlike unconstrained settings where exploration is unbounded, real-world systems demand that agents avoid violating rules—such as the collision avoidance in vehicles or dosage limits in medicine—during both learning and deployment. This challenge requires balancing exploration with adherence to hard or soft constraints, often conflicting with reward maximisation objectives.
In practice, these constraints are often intertwined with where computation is placed. For example, Neurosurgeon partitions deep neural network inference between mobile devices and the cloud, explicitly optimising for end-to-end latency and energy by deciding which layers run where. This kind of computation offloading highlights that latency is not merely an algorithmic property but a system-level design choice: a policy's physical location and the communication topology can drastically change the effective feedback delay observed by an RL agent.
Constrained optimisation, as in Achiam et al.'s CPO, formulates RL to maximise rewards within explicit cost limits, whilst shielding employs runtime monitors to block constraint violations outright. Both approaches trade some learning efficiency for a safety guarantee, a tension that reappears in the thesis through constrained routing, bounded commitment, and explicit fallback structure.
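A minimal sketch of the shielding idea, assuming a hypothetical `is_safe` monitor and a known-safe fallback action (none of these names come from a published implementation):

```python
def make_shielded_policy(policy, is_safe, fallback_action):
    """Wrap a policy so that unsafe proposals are replaced at runtime.

    policy:          state -> action
    is_safe:         (state, action) -> bool, the runtime monitor
    fallback_action: state -> known-safe action
    """
    def shielded(state):
        action = policy(state)
        if is_safe(state, action):
            return action
        return fallback_action(state)  # block the violation outright
    return shielded

# Toy example: speed control where actions above a limit are unsafe.
raw_policy = lambda s: s * 2.0   # aggressive proposal
monitor = lambda s, a: a <= 5.0  # hard constraint: action must stay <= 5
safe = make_shielded_policy(raw_policy, monitor, lambda s: 5.0)
```

The wrapper guarantees constraint satisfaction at every step, but at the cost of learning efficiency: the agent never observes the consequences of the blocked actions.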
Large-scale RL systems make similar trade-offs. Sample Factory demonstrates a single-machine architecture capable of exceeding 100,000 environment frames per second by aggressively parallelising simulation and learning, whilst Isaac Gym keeps physics and policies on the GPU to avoid CPU--GPU communication bottlenecks. Both systems illustrate that choices about simulation architecture can introduce staleness and latency between data collection and policy updates, even when communication occurs on a single physical node.
Partial Observability
Partial observability, where agents cannot fully perceive the environment's state, is a pervasive challenge in real-world RL. A robotic manipulator working in clutter must infer object pose from occluded views and sensor history; a clinical policy must act on noisy, missing, or delayed measurements rather than complete physiological data. These situations can be modelled formally as POMDPs, requiring agents to maintain and act upon belief states rather than exact state estimates.
Recurrent networks approximate belief states by tracking observation sequences over time, whilst model-based approaches such as Hafner et al.'s latent dynamics model learn predictive world models from which hidden states can be inferred. The issue matters directly for this thesis because later systems must cope with stale observations, sparse interface cues, and network-mediated state information rather than perfectly exposed simulator state.
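Short of a full recurrent network, the belief-state idea can be approximated by conditioning the policy on a fixed window of recent observations; a minimal sketch (window size and padding value are arbitrary choices):

```python
from collections import deque

class HistoryBelief:
    """Approximate a belief state by a window of the last k observations.

    A stand-in for the recurrent-network idea: the policy conditions on
    recent observation history rather than the (unobservable) true state.
    """

    def __init__(self, k, padding=0.0):
        self.window = deque([padding] * k, maxlen=k)

    def update(self, observation):
        self.window.append(observation)
        return tuple(self.window)  # hashable summary of recent history

belief = HistoryBelief(k=3)
summaries = [belief.update(o) for o in (1.0, 2.0, 3.0, 4.0)]
</```

A window of raw observations is a crude belief state, but it illustrates the key move: the policy's input is a function of history, not of a single (partial) observation.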
Reward Functions
Multi-objective or poorly specified reward functions challenge RL by introducing conflicting goals or ambiguous success criteria. In navigation robotics, objectives might be specified as a combination of path accuracy, speed, duration, or other measures. In healthcare, different optimisation criteria routinely conflict: reduced patient mortality might come at a higher financial cost, or a reduced infection rate might come at the expense of patient satisfaction. Agents must balance these objectives or infer intended rewards, as misaligned or vague specifications can lead to suboptimal or unintended behaviours. This complexity requires robust reward design or adaptive learning to align policies with real-world intent.
The literature offers a range of approaches to this issue. Multi-objective RL employs scalarisation or Pareto optimisation to balance goals, inverse RL infers rewards from expert demonstrations, and reward shaping adjusts rewards to guide behaviour—though misdesign in any case risks unintended outcomes. For the present thesis, this is one reason to prefer architectures that expose intermediate decisions and operational traces rather than leaving all reward interpretation buried inside a monolithic policy.
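The scalarisation approach can be sketched directly; the objectives and weight vectors below are hypothetical navigation trade-offs, not values from any cited system:

```python
def scalarise(rewards, weights):
    """Collapse a reward vector into one scalar via a weighted sum.

    The simplest multi-objective strategy: each objective (e.g. path
    accuracy, speed, energy saved) gets a weight expressing its priority.
    Different weight vectors correspond to different trade-off policies.
    """
    assert len(rewards) == len(weights)
    return sum(r * w for r, w in zip(rewards, weights))

# Hypothetical per-step reward vector: (accuracy, speed, energy saved).
step_rewards = (0.8, 0.5, 0.2)
accuracy_first = scalarise(step_rewards, (0.7, 0.2, 0.1))
speed_first = scalarise(step_rewards, (0.1, 0.8, 0.1))
```

The fragility the text describes lives entirely in the weight vector: small changes to it can reorder which behaviours the agent prefers, which is exactly the misspecification risk that motivates more transparent architectures.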
Offline Training
As shown in Figure 4.1, online training uses bootstrapping: a random policy first explores easily accessible states; as those experiences improve the policy, it reaches progressively harder states, creating a virtuous cycle of data collection and learning. Training an RL policy offline loses this feedback loop—fixed logs of experiences are used to train a policy that, had it acted in the environment, would have maximised reward. An iterative variant, known as batch RL, alternates experience collection and offline training; Figure 4.2 illustrates all three regimes.
Offline training is especially prevalent in healthcare, finance, and robotics, where real-world data collection is expensive and online exploration is unsafe or impractical. Recent surveys, such as Fu et al. (2023), show how widely offline learning is now used in robotics, healthcare, recommender systems, and autonomous driving. Foundational methods such as Batch-Constrained Q-Learning (BCQ) helped establish the core problem of distributional shift: a learned policy may choose actions that are poorly supported by the fixed dataset on which it was trained. Benchmarks such as D4RL subsequently gave the field a common evaluation language, and methods such as Conservative Q-Learning (CQL) were developed specifically to reduce overestimation on out-of-distribution actions.
In robotics, offline RL is attractive because collecting real interaction data is expensive and slow. Chen et al.'s Batch Exploration with Examples (BEE) addresses this by using a small amount of human guidance to steer exploratory data collection towards task-relevant regions before offline training. Zentner et al. likewise show that transfer structure across related tasks can reduce the amount of new data required. These examples matter because they show that offline RL is most useful when the data-collection process itself is engineered rather than treated as an afterthought.
Beyond robotics, offline RL has been used to identify high-risk treatment patterns in healthcare without trial-and-error on live patients, and to support policy learning from pre-collected driving logs where online exploration would be unsafe.
Methodologically, the field now spans straightforward batch adaptations of value learning, off-policy evaluation techniques based on importance sampling, conservative regularisation, model-based variants such as MOReL, and sequence-modelling approaches such as Decision Transformer. Fujimoto et al. further showed that a relatively simple modification of TD3—TD3+BC, which adds a behaviour-cloning term and state normalisation—can achieve competitive D4RL results with low implementation complexity. The common limitation remains data quality: offline RL is powerful when logs have adequate coverage and sensible support, but brittle when the dataset omits critical states, actions, or failure modes. This is why later chapters emphasise inspectable execution structure and controlled proxy environments: in deployment settings, the dataset is rarely rich enough to let opacity be harmless.
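As a concrete illustration of the TD3+BC idea, the actor objective combines a value-maximisation term with a behaviour-cloning penalty; the sketch below assumes scalar actions and mirrors the published form only loosely:

```python
def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """Sketch of a TD3+BC-style actor objective (assumed form, scalar actions).

    Combines value maximisation with a behaviour-cloning penalty that keeps
    the policy close to actions supported by the offline dataset. The factor
    lam = alpha / mean(|Q|) normalises the scales of the two terms.
    """
    n = len(q_values)
    lam = alpha / (sum(abs(q) for q in q_values) / n)
    q_term = -lam * sum(q_values) / n  # maximise Q under the policy
    bc_term = sum((p - a) ** 2
                  for p, a in zip(policy_actions, dataset_actions)) / n
    return q_term + bc_term  # lower is better: high Q, small deviation
```

The behaviour-cloning term is what makes the method conservative: straying far from dataset actions raises the loss even when the (possibly overestimated) Q-values look attractive.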
Explainable Policies
Explainable policies are essential in real-world RL for trust and accountability, particularly in life-critical applications like healthcare and autonomous driving, where opaque decisions undermine adoption. In deployment, it is not enough to know that a policy works on average: operators need to inspect why a recommendation was made, what evidence it relied upon, and how the system behaved when it failed. Interpretable model architectures and post-hoc explanation methods such as LIME address part of this need, as do saliency visualisations.
A useful distinction for the chapters that follow is between feature-level explanation and execution-level explanation: saliency maps may show what a policy attended to, but deployed systems also require readable traces of which unit acted, when control changed hands, and what fallback occurred. That need recurs in the sepsis and telesurgery case studies below, and later motivates the modular execution structures developed in this thesis.
High-dimensional State and Action Spaces
High-dimensional state and action spaces strain traditional algorithms designed for low-dimensional, discrete settings. States may include raw sensory data such as images or multi-variable financial indicators, whilst actions can span continuous ranges such as motor controls, exponentially increasing computational demands and sample requirements.
Deep RL has been the primary response: DQN demonstrated that discrete action control from images is tractable, whilst DDPG extended this to continuous control. These results motivate the thesis's focus on compact encoders, conditional computation, and modular decomposition rather than ever-larger monoliths.
Latency
Low-latency action provision is essential in real-world RL systems like robotics, autonomous driving, and high-frequency trading, where delays can compromise both safety and efficacy. Computing actions with low latency—on the order of milliseconds—requires balancing policy complexity with execution speed, a departure from offline RL where latency is less critical. This challenge requires efficient algorithms and infrastructure to ensure real-time responsiveness in dynamic environments.
Three recurring responses appear in the latency literature. Policy distillation compresses large models into smaller ones that can act more quickly at deployment. Hardware acceleration uses GPUs, TPUs, or specialist inference hardware to reduce wall-clock decision time. Real-time planning trades precomputed reactive behaviour for structured online search in domains where the planning horizon remains manageable.
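The policy-distillation response can be sketched as minimising the divergence between teacher and student action distributions; this is a generic KL formulation, not a specific published loss:

```python
import math

def distillation_loss(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over a discrete action distribution.

    Policy distillation trains a small, fast student to reproduce the
    action distribution of a large teacher, so that only the student's
    (cheaper) forward pass is needed at deployment time.
    """
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher_probs, student_probs))

teacher = [0.7, 0.2, 0.1]
perfect_student = [0.7, 0.2, 0.1]  # zero divergence
poor_student = [0.1, 0.2, 0.7]    # large divergence
```

At deployment only the student runs, which is how distillation converts an offline compute budget into a lower per-decision latency.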
Beyond purely algorithmic techniques, recent work has begun to treat delay as a first-class design parameter in deployment settings such as teleoperation. Bataduwaarachchi (2024) proposes deterministic delay-aware reinforcement learning for teleoperated robotic systems, explicitly modelling end-to-end communication delays between operator, agent and environment. By adjusting how observations and actions are scheduled, these methods show that system-level design and delay-aware learning rules can be combined to maintain control performance under realistic network conditions.
The system-level techniques discussed in the System Constraints subsection—collaborative cloud--edge inference and high-throughput simulators—further underscore that where and how computation is performed is inseparable from latency considerations in real-world RL. That observation leads directly to later chapters on compact edge models, commitment-bounded policy graphs, and network-aware training.
Dealing with Delays
Delays in actuators, sensors, or rewards disrupt the immediate feedback assumption central to traditional RL, complicating policy optimisation in real-world settings. Such delays—whether from mechanical lags in robotics, network latency in distributed systems, or delayed physiological responses in healthcare—introduce temporal misalignment between actions and their consequences, undermining standard Markovian assumptions. This challenge is particularly acute in systems where delays are variable or unknown, requiring RL agents to adapt dynamically to maintain performance.
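A minimal sketch of how a fixed actuator delay changes the interaction loop: actions are queued and only take effect after a fixed number of steps (the wrapper interface here is hypothetical, not from any RL library):

```python
from collections import deque

class ActionDelayWrapper:
    """Apply each chosen action only after a fixed k-step actuator delay.

    Illustrates why delays break the Markov assumption: the outcome at
    time t reflects an action chosen k steps earlier, so credit
    assignment over (state, action) pairs is misaligned.
    """

    def __init__(self, step_fn, delay, noop_action):
        self.step_fn = step_fn  # underlying env step: action -> (obs, reward)
        self.queue = deque([noop_action] * delay)  # actions still in flight

    def step(self, action):
        self.queue.append(action)
        executed = self.queue.popleft()  # action chosen `delay` steps ago
        return self.step_fn(executed), executed

# Toy environment whose reward simply echoes the executed action.
env = ActionDelayWrapper(lambda a: (None, a), delay=2, noop_action=0)
executed_actions = [env.step(a)[1] for a in (1, 2, 3, 4)]
```

With a delay of two, the first two chosen actions have no immediate effect at all; this is exactly the temporal misalignment that delay-aware algorithms must correct for.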
MDPs that include a delay have been the subject of research interest for some time, but are now attracting renewed interest as RL is applied to real-world problems. Work by Brooks and Leondes (1972) first discusses the issue of so-called 'state-information lag', in which the effect of actions is only seen after one timestep. Further early theoretical results involving MDPs with small constant delays are presented by Kim (1985), Kim and Jeong (1987), Altman and Nain (1992), and Bander and White (1999). Similar problems have also been considered in the context of dynamic programming and congestion control in high-speed networks.
Recent work extends these formulations to deep reinforcement learning with random, time-varying delays. Bouteiller et al. (2021) analyse environments with stochastic action and observation delays and introduce Delay-Correcting Actor--Critic (DCAC), which relabels trajectories in hindsight so that multi-step off-policy value estimates remain correct. Wang et al. (2024) similarly formalise signal delay in continuous-control tasks and propose delay-aware actor--critic variants that achieve performance close to non-delayed baselines by carefully correcting for the misalignment between actions, observations and rewards.
Katsikopoulos and Engelbrecht (2003) observe that delays in action execution ("action delay") and delays in state observation ("observation delay") pose equivalent problems from the position of the agent. They discuss formalisations for the Constant Delayed MDP (CDMDP) and the Stochastic Delayed MDP (SDMDP) and show how both can be reduced to problems dealing only with a single constructed MDP. This result is important since it shows how the problem of finding an optimal policy for delayed MDPs can be solved using RL, and it gives an indication of the increased complexity involved in optimally solving each problem in the general case.
Using these formulations, Katsikopoulos and Engelbrecht (2003) show that the problem of finding optimal policies for CDMDPs is NP-Hard. Trivially, this is also true for SDMDPs. The implication of this result is that the development of a computationally feasible algorithm for solving delayed MDPs is extremely unlikely. In line with this finding, some authors provide concrete examples of where heuristic-driven techniques are necessarily sub-optimal.
The effects of delays on the performance of naively applying existing algorithms have also been quantified. The performance of IMPALA on a delayed environment degrades monotonically with the length of the delay. A waiting agent, which simply waits for the delay to elapse before acting, has also been shown to perform poorly. The more recent algorithms of Bouteiller et al. (2021) and Wang et al. (2024) can be interpreted as principled alternatives to such naive strategies: rather than waiting, they reconstruct or relabel the effective sequence of state--action--reward tuples, preserving the Markov structure needed by standard actor--critic methods whilst explicitly accounting for delayed execution.
A parallel line of work in networked control systems (NCS) studies similar phenomena from a control-theoretic perspective. Hespanha et al. (2007) survey results on stability and performance of feedback loops in which sensors and actuators communicate over shared, lossy networks. This literature emphasises how packet loss, bounded or unbounded delays, and scheduling policies interact with closed-loop stability—issues that are increasingly relevant as RL controllers are deployed over the same kinds of shared communication infrastructure.
Several authors propose algorithmic solutions to MDPs with constant delays. Walsh et al. (2007) introduce Model-Based Simulation (MBS), which uses a model to predict the most likely underlying (unobserved) MDP state and uses the result as an input to an RL training algorithm. Schuitema et al. (2010) introduce modifications to the SARSA and Q-learning algorithms to account for a constant known delay.
Firoiu et al. (2018) revisit the technique of using a predictive model to account for delay. They implement a human-like predictive model using a GRU and show how doing so significantly improves performance on the game Super Smash Bros. However, the success of this approach assumes that the state representation is semantically meaningful, which may not be the case in end-to-end systems.
Subsequent work has found some success in training RL algorithms using recent action buffers and simple state prediction. Liotet et al. (2021) train a transformer network to generate a belief representation as a function of previous states and actions, then train RL algorithms on this representation as normal.
One particularly impressive approach uses imitation learning to train agents to copy an expert trained on the non-delayed MDP. However, it assumes knowledge of an underlying non-delayed environment, which may not be available in most real-world scenarios.
A related work addresses stochastic observation delays in the operation of a PD controller and discusses the real-world problems faced when controlling devices operated at a distance, such as medical and space equipment.
Almost all of the existing work on training policies on delayed MDPs considers only constant delays, specifically of known value. This is an assumption unlikely to hold in many real-world systems. Many of the methods that do train on environments with stochastic delay still rely on assumptions that may fail in practice, such as knowing a small upper bound on the maximum delay. Even more recent delay-correcting deep RL algorithms typically assume centralised training with full access to delay statistics and relatively clean interfaces between sensing, actuation, and computation, whereas real deployments must cope with heterogeneous hardware, network-induced variability, and partial observability on top of latency.
Furthermore, almost no existing work considers the difficult problem of non-integer delays, in which the delay period may elapse between two MDP time steps. Schuitema et al. (2010) consider this problem using linear combinations of actions. Liotet et al. (2022) propose that some non-integer delays may be treated as the combination of two interleaved MDPs.
Existing work has shown that the problem of delays in MDPs is provably hard and that only heuristically guided approximations are currently available. Despite this, some methods perform well under ideal conditions where delays are constant and known. There is still a long way to go: non-integer delays and stochastic delays remain largely unexplored, despite their relevance in real-world settings. Furthermore, there is no standardised methodology for training agents on delayed versions of environments, and current work reflects this by often demonstrating results on only one or a very small number of evaluation environments. The lack of a reusable systems methodology is one of the clearest motivations for the CALF infrastructure developed later in the thesis.
Similar issues arise in multi-agent settings, where agents must communicate over shared, noisy channels. Mao et al. (2020) study multi-agent communication under limited bandwidth, introducing a gating mechanism that prunes redundant messages to respect communication budgets. Chen et al. (2020) formalise Delay-Aware Markov Games and propose algorithms that mitigate the impact of action and observation delays across multiple agents. These works reinforce the view that delays and communication constraints are structural properties of many real-world control problems, not just incidental details of individual deployments.
Applications
Robotics
RL has been physically deployed across manipulation, locomotion, and navigation. Sim-to-real transfer is the central challenge: domain randomisation and dynamics randomisation train policies in simulation under a distribution over physical parameters so that they generalise to unknown real dynamics. Tan et al.’s quadruped system provides a concrete example of treating latency as a first-class simulator design parameter: actuator latency is explicitly modelled and randomised alongside physical properties, so that the deployed policy is already adapted to realistic feedback delays.
Healthcare
Healthcare applications of RL confront stringent constraints: patient safety precludes exploratory learning on live subjects, regulatory requirements demand interpretability, and clinical datasets are often incomplete or biased by historical treatment protocols. Despite these obstacles, RL has been applied to treatment optimisation, drug dosing, and resource allocation.
Sepsis management has received significant attention, with policies trained offline on intensive care datasets to optimise fluid and vasopressor administration. Such systems promise personalised treatment but face deployment barriers: clinicians require transparent decision traces, yet learned policies typically provide opaque recommendations. Section 4.3.1 examines this case in detail. Beyond sepsis, RL has been explored for chemotherapy scheduling, insulin delivery in diabetes management, and ventilator weaning protocols. These applications share a common challenge: offline learning from historical data introduces distributional shift, where policies encounter states absent from training logs, potentially yielding unsafe actions.
The requirement for offline learning stems from practical and ethical constraints. Randomised trials are expensive and slow; observational data is abundant but reflects clinician behaviour rather than optimal policy. Methods like Conservative Q-Learning address this by penalising out-of-distribution actions, whilst batch-constrained approaches prevent policy divergence from demonstrated behaviour. However, conservatism trades safety for performance: policies may underperform human experts by avoiding beneficial but rarely observed actions. Interpretability remains the critical deployment gap. Clinicians will not adopt systems that cannot explain why withholding treatment is recommended for a deteriorating patient, regardless of aggregate performance metrics.
Autonomous Systems
Autonomous vehicles represent RL's highest-visibility deployment domain, with substantial industry investment in perception, planning, and control systems. End-to-end learning approaches train policies directly from sensor inputs to control outputs, bypassing hand-engineered perception pipelines. Whilst compelling in simulation, such systems confront severe generalisation challenges: training distributions cannot enumerate the long tail of edge cases encountered in deployment.
Waymo's autonomous vehicles employ layered architectures combining learned perception with rule-based planners, reflecting pragmatic deployment constraints. Perception failures—misclassified pedestrians, undetected obstacles, degraded sensor performance in adverse weather—require failsafe mechanisms and human oversight. RL has been applied to specific subproblems: lane-keeping, adaptive cruise control, and parking manoeuvres, where constrained operational domains permit reliable learning.
Network partitions and communication failures introduce additional failure modes. Vehicle-to-infrastructure systems assume reliable connectivity, yet cellular networks exhibit variable latency and packet loss. Distributed decision-making under network uncertainty motivates the communication-aware training methodology developed in Chapter 8. Beyond ground vehicles, unmanned aerial systems confront similar challenges: long-range operation requires tolerating communication delays, whilst safety-critical manoeuvres demand low-latency response. The Chapter 3 case studies of the Kangduo surgical robot and distributed power grids illustrate how engineered systems manage latency through explicit handover semantics and hierarchical control—principles applicable to autonomous vehicle fleets coordinating under network constraints.
Finance and Industrial Control
RL has been explored for algorithmic trading, portfolio optimisation, and market-making, where high-dimensional action spaces and non-stationary dynamics challenge traditional methods. High-frequency trading systems require sub-millisecond decision latency, constraining policy complexity and favouring compact representations. Sample efficiency is critical: financial markets permit no exploratory losses, necessitating offline training on historical data with careful validation under realistic market conditions.
Industrial process control presents complementary challenges. DeepMind's data centre cooling system achieved substantial energy savings through learned control policies, demonstrating RL's applicability to complex multi-variate optimisation. Unlike healthcare or autonomous driving, industrial settings permit controlled experimentation: policies can be validated in simulation or shadow mode before deployment, mitigating safety risks.
Case Studies
The following three case studies examine in depth the deployment constraints most directly relevant to this thesis's contributions.
Sepsis Treatment in ICU
Summary & Methodology
Sepsis is a life-threatening dysregulation of the immune response to infection; left untreated it progresses to septic shock and multi-organ failure. Standard clinical management centres on intravenous fluid administration and vasopressor therapy to restore haemodynamic stability, with clinician judgement determining the dose and timing of each intervention.
The study "The Artificial Intelligence Clinician Learns Optimal Treatment Strategies for Sepsis in Intensive Care" by Komorowski et al. trains an RL policy offline using data from over 17,000 patients from the Medical Information Mart for Intensive Care (MIMIC-III) dataset. Patient state is encoded as follows:
- The current state of the patient is represented by 48 clinical variables (including vital signs, laboratory values, and treatment history).
- The state is further discretised using k-means++ clustering, resulting in 750 discrete states across all 17,000 patients.
- State is measured in four-hour intervals.
- Data variables with multiple measurements within a 4-hour time step are summarised by averaging (for example, with heart rate) or summing (in the case of urine output).
- Missing data is handled using a sample-and-hold approach, where missing values are carried forward from the most recent available measurement.
- Two absorbing states are defined for discharge and death.
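The sample-and-hold imputation described in the list above amounts to a forward fill over each variable's time series; a minimal sketch (the example values are illustrative, not from the dataset):

```python
def sample_and_hold(series, initial=None):
    """Forward-fill missing values (None) with the most recent measurement.

    Mirrors the sample-and-hold imputation: a missing clinical variable
    in a 4-hour bin is carried forward from the last time it was
    observed. `initial` fills any leading gap before the first measurement.
    """
    filled, last = [], initial
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Illustrative heart-rate series with two missing 4-hour bins.
heart_rate = [82, None, None, 90, None]
```

Forward filling is simple but hides how stale a measurement is: a value carried forward for many bins is treated identically to one measured moments ago.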
Notably, states are assumed to be homogeneous within each cluster, meaning that variations within a cluster are not explicitly accounted for when making decisions. Additionally, unlike human clinicians, the choice to model this scenario as a first-order MDP means that the agent cannot take previous state directly into account when making treatment decisions. The mapping from actions to physiological responses also does not account for patient-specific pharmacokinetics or pharmacodynamics, treating the effect of each action as identical across all patients in the same state.
Actions are simplified as follows:
- The action space is discretised into a fixed set of 25 possible actions, representing combinations of intravenous (IV) fluid and vasopressor dosages.
- IV fluid and vasopressor doses are each categorised into five discrete bins, where the lowest bin represents no drug administration, and the remaining nonzero doses are divided into quartiles.
- Rarely observed treatment decisions (defined as those occurring fewer than five times in the dataset) are excluded from the action space, potentially limiting the exploration of less common but effective interventions.
- The AI Clinician is constrained to learning on actions observed in the dataset, meaning it cannot learn anything about novel treatment strategies beyond those previously administered by clinicians.
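The five-bin dose discretisation described above can be sketched as follows; the quartile edges used here are hypothetical, not values from the study:

```python
def dose_to_bin(dose, quartile_edges):
    """Map a continuous dose to one of five discrete bins.

    Bin 0 means no drug administered; bins 1-4 split the observed
    nonzero doses at their quartile boundaries (`quartile_edges` holds
    the 25th, 50th, and 75th percentiles of nonzero doses).
    """
    if dose == 0:
        return 0
    for bin_index, edge in enumerate(quartile_edges, start=1):
        if dose <= edge:
            return bin_index
    return 4  # above the 75th percentile

# Hypothetical nonzero-dose quartiles for an IV fluid (mL per 4 h).
edges = (100.0, 250.0, 500.0)
```

Crossing the IV-fluid bins with the vasopressor bins yields the 5 x 5 = 25 discrete actions described above, at the cost of erasing within-bin dose differences.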
The researchers have chosen to consider only two treatment modalities (IV fluids and vasopressors), excluding other relevant interventions such as antibiotics, corticosteroids, or mechanical ventilation, which may be used by a human clinician. Additionally, the policy cannot consider adjustments to treatment frequency or infusion rates, only the total dose administered in each 4-hour time step. The model also assumes that, beyond the effect represented in a patient's state encoding, past treatment decisions do not influence future ones; as long as the patient's state does not represent any ill effect, a very large cumulative dose of IV fluids might be administered over several timesteps, beyond what a human clinician would typically allow.
The agent is validated on the eICU Research Institute (eRI) dataset, which includes data on over 79,000 admissions. Offline evaluation is conducted using importance sampling and bootstrapping to compare the agent's decisions with those of real clinicians.
Results & Analysis
To assess policy performance, the study employs off-policy evaluation using weighted importance sampling (WIS) and bootstrapping. Across 500 trained models, the agent’s policy appears to consistently outperform clinician policies, achieving a 95% confidence lower bound that exceeds the upper bound of clinician performance. However, the model is limited by the constraints of retrospective data and potential confounders in the highly discretised and relatively narrow state representation.
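The WIS estimator weights each observed trajectory's return by the likelihood ratio between the evaluated and the behaviour policy, then self-normalises. A minimal sketch, in which `pi_e` and `pi_b` are hypothetical callables returning action probabilities under the evaluation and behaviour policies:

```python
def trajectory_weight(traj, pi_e, pi_b):
    """Product of per-step likelihood ratios pi_e(a|s) / pi_b(a|s)."""
    w = 1.0
    for s, a in traj:
        w *= pi_e(s, a) / pi_b(s, a)
    return w

def wis_estimate(trajectories, returns, pi_e, pi_b):
    """Weighted importance sampling: self-normalised weighted mean of returns.
    Trajectories poorly supported by the behaviour policy get small weights,
    which is what makes the estimate sensitive to narrow state representations."""
    weights = [trajectory_weight(t, pi_e, pi_b) for t in trajectories]
    return sum(w * r for w, r in zip(weights, returns)) / sum(weights)
```

Self-normalisation trades a small bias for much lower variance than ordinary importance sampling, which is why the study pairs it with bootstrapping to obtain confidence bounds.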
Jeter et al. provide a rigorous critique. They point out that the AI model’s transition dynamics were not adequately validated, leading to questionable treatment recommendations, and that the use of four-hour data aggregation bins obscured rapid patient deterioration. The agent’s performance degraded significantly on the external validation dataset, suggesting poor generalisability. Most strikingly, the agent sometimes chose non-intervention as the optimal strategy when a patient’s Mean Arterial Pressure dropped below the recommended threshold—learning to associate intervention with patients already in stable conditions rather than as a necessary response to deterioration. The absence of publicly available code made independent verification impossible, and the critique emphasises the need for transparency, reproducibility, and careful evaluation before AI can be safely integrated into medical decision-making. This case also illustrates the quality--latency trade-off shown in Figure 4.3: the most clinically meaningful decisions require expensive offline optimisation, not fast look-up.
Batch Exploration for Robotic Manipulation
Summary
Robotic manipulation in unstructured environments confronts a fundamental challenge: acquiring diverse training data without exhaustive human supervision. Random exploration in high-dimensional visual observation spaces is prohibitively sample-inefficient; task-specific demonstrations are expensive to collect at scale. Chen et al.'s Batch Exploration with Examples (BEE) framework addresses this by using minimal human guidance to direct autonomous data collection, then training policies offline on the resulting dataset.
The approach targets vision-based manipulation tasks where the agent observes only pixel inputs and must learn control policies for object interaction. Unlike methods that require dense human demonstrations for every target task, BEE collects a single batch of exploratory data guided by a small set of example trajectories, then extracts task-specific policies through offline RL. This separates data collection (task-agnostic exploration) from policy learning (task-specific optimisation), enabling reuse of collected data across multiple downstream objectives.
Methodology
BEE operates in three phases. First, a relevance discriminator is trained on a small set (10--100) of human demonstrations to distinguish task-relevant states from random interactions, providing a reward signal that biases autonomous exploration towards useful regions without specifying the task goal. Second, model-based planning generates exploratory trajectories that visit relevant regions whilst maintaining diversity. Third, offline RL trains task-specific policies on the resulting dataset, using hindsight relabelling and conservative value estimation to handle distributional shift. In high-dimensional spaces, this guidance addresses a critical failure mode: random action sequences rarely produce meaningful object interactions, yet dense demonstrations are too expensive to collect at scale.
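The first phase's relevance discriminator can be sketched as a simple binary classifier whose output acts as an exploration bonus. The version below is a heavily simplified stand-in (a logistic discriminator over scalar state features, trained with plain SGD), whereas BEE operates on image observations; all names are illustrative:

```python
import math
import random

def train_discriminator(relevant, background, lr=0.5, epochs=200):
    """Logistic discriminator on scalar state features: label 1 for
    human-marked relevant states, 0 for states from random interaction.
    Returns a scoring function used as an exploration reward."""
    w, b = 0.0, 0.0
    data = [(x, 1.0) for x in relevant] + [(x, 0.0) for x in background]
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w += lr * (y - p) * x  # gradient ascent on the log-likelihood
            b += lr * (y - p)
    return lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))
```

During autonomous collection, states scored near 1 are treated as task-relevant and steer the planner towards them, without ever specifying the task goal itself.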
Impact
Experiments on vision-based pushing and grasping tasks demonstrate that BEE substantially outperforms baseline offline RL methods trained on randomly collected data. Policies trained on BEE datasets achieve success rates exceeding 70% on held-out object configurations, compared to \(<\)20% for random exploration baselines under matched data budgets. Critically, BEE's data collection is task-agnostic: the same exploratory dataset supports training policies for multiple objectives (e.g. pushing objects to different target locations) without re-collecting demonstrations.
The work highlights a recurring deployment theme: sample efficiency through structured exploration. Pure offline RL assumes access to high-quality datasets; pure online RL assumes cheap interaction. Real robotic systems permit neither: data collection is expensive, yet available historical data may not cover task-relevant states. BEE's hybrid approach—autonomous exploration guided by minimal human input—provides a pragmatic middle ground. However, the method inherits offline RL's brittleness: policies fail when deployed states deviate from the training distribution. The relevance discriminator also introduces a potential failure mode: if human examples inadequately represent the task, exploration may focus on irrelevant behaviours.
This case study connects to subsequent thesis contributions. The need for sample-efficient skill acquisition motivates policy graphs' modular training interfaces (Chapter 5), where low-level units learn reusable primitives from simple feedback whilst higher-level components compose them into task-specific behaviours. BEE's reliance on offline learning underscores the value of diverse training distributions, motivating EnvCraft's environment generation methodology (Chapter 6). The vision-based observation space and associated computational demands exemplify the edge deployment challenges addressed through MiniConv's compact encoders (Chapter 7).
Telesurgery and Latency Predictability
The Kangduo KD-SR-01 telesurgical system, examined in detail in Chapter 3, provides the clearest evidence of the latency predictability principle. Experimental deployments over links of 80 km and 6 km established that consistent latency of 130--271 ms supports safe telesurgery, whilst higher jitter at equivalent mean delay causes positional error and surgeon confusion. The system's architectural response—dedicated leased lines for bounded worst-case delay, dual-console redundancy for local fallback, and a sub-three-second handover mechanism—instantiates the principle that predictable moderate latency outperforms unpredictable low latency. Chapter 8 operationalises this insight through CALF's network-aware training, which exposes policies to realistic latency distributions during simulation, yielding robustness that zero-latency training cannot provide.
Synthesis: Recurring Deployment Challenges
The foundational challenges, application domains, and case studies examined above reveal persistent obstacles to real-world RL deployment. These obstacles transcend specific application areas, appearing across healthcare, robotics, autonomous systems, and industrial control. This section synthesises common themes and identifies deployment gaps that motivate the technical contributions in subsequent chapters.
Interpretability and Accountability
The sepsis treatment case study exemplifies a critical deployment barrier: clinicians will not adopt systems they cannot interpret. Komorowski et al.'s AI Clinician achieves superior aggregate performance metrics through offline RL, yet Jeter et al.'s critique reveals that the policy sometimes recommends withholding treatment for deteriorating patients without providing explanatory traces. Clinicians require decision rationales—which observations triggered which actions, and why—not merely confidence scores or aggregate survival rates.
This interpretability requirement extends beyond healthcare. Financial regulators demand audit trails for algorithmic trading decisions; industrial operators require explanations when process control policies deviate from established procedures; autonomous vehicles must justify emergency manoeuvres to accident investigators. Black-box policies, regardless of performance, fail to meet these accountability standards. Traditional RL produces monolithic neural networks mapping observations to actions with no intermediate structure; post-hoc explanation methods provide approximations but cannot guarantee faithful decision traces.
Chapter 5 addresses this through policy graphs: a modular architecture where decision-making decomposes into units communicating through explicit interfaces, with hard-routing ensuring deterministic call-and-return traces. Unlike ensemble methods or hierarchical RL, policy graphs provide accountability by construction: each decision is attributable to a specific unit, and execution paths are observable through routing tables, meeting the regulatory and clinical requirements that monolithic policies cannot satisfy.
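The hard-routing execution model can be sketched as follows. The two-unit graph and its routing table are hypothetical placeholders, not Chapter 5's actual interface:

```python
def run_policy_graph(units, routing, entry, obs):
    """Execute a hard-routed policy graph: each unit returns (result, label),
    the routing table maps (unit, label) to the next unit, and the full
    call-and-return trace is recorded, making every decision attributable."""
    trace = []
    current = entry
    while current is not None:
        result, label = units[current](obs)
        trace.append((current, label))
        current = routing.get((current, label))  # None terminates execution
    return result, trace

# Hypothetical two-unit graph: a perception unit classifies the observation,
# then routes deterministically to one of two controllers.
units = {
    "perceive": lambda o: (o, "near" if o < 1.0 else "far"),
    "grasp":    lambda o: ("close_gripper", "done"),
    "approach": lambda o: ("move_forward", "done"),
}
routing = {("perceive", "near"): "grasp", ("perceive", "far"): "approach"}
```

Because routing is a lookup rather than a stochastic gate, the recorded trace is an exact account of which unit produced which decision and why it was invoked.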
Sample Efficiency and Offline Learning
All three case studies confront sample scarcity. Sepsis treatment cannot permit exploratory administration of harmful drugs; robotic manipulation systems cannot afford thousands of hours of real-world interaction; telesurgery systems must operate reliably from initial deployment. Consequently, real-world RL relies heavily on offline learning: training policies from fixed datasets collected under historical behaviour.
However, offline learning introduces distributional shift: policies encounter states absent from training data, yielding extrapolation errors that online learning avoids through bootstrapping. Conservative methods mitigate this by restricting policies to demonstrated behaviours, but conservatism sacrifices performance—policies cannot discover better-than-human strategies if constrained to imitate historical data. The BEE framework partially addresses this through guided exploration, collecting diverse data without task-specific supervision, but still depends on the relevance discriminator adequately representing task requirements.
Policy graphs improve sample efficiency through modular training. Rather than learning monolithic end-to-end policies requiring comprehensive datasets, individual units learn narrow sub-tasks with simple reward signals. A vision encoder learns from self-supervised reconstruction; a low-level motor controller learns from position tracking errors; a high-level planner learns from sparse task completion. Each unit's training data requirements are modest compared to end-to-end alternatives, and units can be pre-trained on diverse tasks then composed for new objectives without full retraining. This compositional reuse reduces the sample burden that offline learning imposes.
Latency Predictability vs. Sporadic Low Latency
The telesurgery case study establishes a counterintuitive principle: predictable moderate latency outperforms unpredictable low latency. Fan et al.'s deployments succeed under consistent 130--271 ms delays but would fail under variable 0--500 ms latency with the same mean, because human operators (and, by extension, learned policies) can adapt to consistent delays but cannot compensate for unpredictable jitter.
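The operative variable is jitter, not mean delay. A worked numeric illustration with two hypothetical delay distributions of near-identical mean (the specific ranges are chosen for the example, not measured values):

```python
import random
import statistics

random.seed(0)
# Two delay regimes with near-identical mean latency (~200 ms):
consistent = [random.uniform(130, 271) for _ in range(10_000)]  # bounded jitter
erratic = [random.uniform(0, 400) for _ in range(10_000)]       # same mean, wide jitter

mean_gap = abs(statistics.mean(consistent) - statistics.mean(erratic))
jitter_ratio = statistics.stdev(erratic) / statistics.stdev(consistent)
```

An operator experiencing the first regime can build a stable internal model of the delay; under the second, with roughly three times the standard deviation at the same mean, no such model exists to learn.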
This insight contradicts common RL training assumptions. Standard benchmarks execute policies in lockstep with simulation: action \(a_t\) immediately produces next state \(s_{t+1}\) with zero delay. Deployed systems exhibit variable latency: network communication, sensor processing, actuator dynamics, and policy inference all introduce delays that fluctuate based on computational load and network conditions. Policies trained under zero-latency assumptions cannot adapt to deployment latencies, whilst policies trained under realistic latency distributions learn compensatory strategies—predictive control, delayed action execution, or conservative behaviour when latency exceeds thresholds.
Chapter 8 operationalises this through CALF: Communication-Aware Learning Framework. Rather than training policies in idealised zero-latency simulation, CALF exposes agents to realistic network models during training, including variable transmission delays, packet loss, and bandwidth constraints. Policies learn to tolerate communication failures, execute time-critical components locally whilst offloading computation when network permits, and gracefully degrade when latency exceeds bounds. This yields robustness that zero-latency training cannot provide.
The foundations section's discussion of actuator delays reinforces this theme. Robotic systems exhibit non-integer, stochastic delays between commanded actions and physical effects. Existing work demonstrates that constant known delays can be handled heuristically, but variable delays remain largely unexplored despite their prevalence. The distributed power grid and telesurgery examples from Chapter 3 illustrate how deployed systems manage latency through architectural choices: explicit handover semantics, local fallback mechanisms, and bounded worst-case guarantees. These design patterns inform policy graphs' deployment model: safety-critical units execute locally with deterministic latency, whilst optimisation-oriented units execute remotely and tolerate variable delays.
Generalisation Beyond Training Distributions
All surveyed application domains confront generalisation failures. Autonomous vehicles trained on sunny highway driving crash in snow; robotic policies optimised in simulation fail on real hardware; healthcare policies trained on one hospital's patient population underperform at institutions with different demographics. The sim-to-real gap exemplifies this: domain randomisation and dynamics randomisation improve transfer, but policies still encounter deployment states outside their training distribution.
The BEE case study demonstrates that task-agnostic exploration improves generalisation by collecting diverse data, but the approach still depends on the relevance discriminator identifying appropriate state coverage. If training environments inadequately represent deployment diversity, policies fail. This motivates systematic environment generation rather than manual dataset curation.
Chapter 6 addresses this through EnvCraft: a procedural environment generation system producing diverse task variants covering broad state distributions. Rather than training on fixed benchmark suites or manually designed levels, policies train on programmatically generated environments spanning parameter ranges, obstacle configurations, and reward structures. EnvCraft's diversity metrics quantify environment coverage, enabling principled evaluation: does the policy generalise to held-out environment parameters, or merely overfit to training instances?
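Seeded parameter sampling over declared ranges is the essence of this approach. A minimal sketch (the parameter names and ranges are hypothetical, and EnvCraft's actual generation pipeline is described in Chapter 6):

```python
import random

def generate_env_params(seed, param_ranges):
    """Deterministically sample one environment variant from declared ranges.
    Seeding makes every variant reproducible, so train and held-out splits
    are fixed sets of environments rather than ephemeral samples."""
    rng = random.Random(seed)
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in param_ranges.items()}

ranges = {  # hypothetical parameter ranges
    "friction": (0.2, 1.0),
    "goal_distance": (1.0, 5.0),
    "obstacle_density": (0.0, 0.8),
}
train_envs = [generate_env_params(s, ranges) for s in range(100)]
heldout_envs = [generate_env_params(s, ranges) for s in range(100, 120)]
```

Evaluating on `heldout_envs` answers precisely the question posed above: generalisation to unseen parameter draws, as opposed to memorisation of training instances.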
Edge Deployment and Computational Constraints
The foundations section's discussion of system constraints highlights computational trade-offs: placing computation on-device reduces communication latency but constrains model capacity; offloading to cloud permits larger models but introduces network delays. Neurosurgeon's DNN partitioning exemplifies this, whilst Sample Factory and Isaac Gym demonstrate that even single-machine systems confront latency-throughput trade-offs based on simulation architecture.
Robotic manipulation and autonomous vehicles require real-time inference on resource-constrained hardware: mobile robots carry limited battery and compute; embedded automotive systems face strict power budgets; surgical robots demand deterministic latency incompatible with cloud offloading. These constraints necessitate compact policy representations compatible with edge deployment.
Chapter 7 addresses this through MiniConv: compact convolutional encoders enabling vision-based policies to execute on edge hardware. Rather than deploying large ResNet or Vision Transformer encoders requiring GPU acceleration, MiniConv provides parameter-efficient architectures achieving competitive performance within embedded system constraints. This enables the deployment model that telesurgery and robotics require: time-critical perception and control execute locally, whilst high-level planning may offload to remote infrastructure when network permits.
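The scale of the parameter budget at stake can be made concrete with simple arithmetic. The three-layer stack below is a generic illustrative example, not MiniConv's actual architecture (which Chapter 7 details):

```python
def conv_params(c_in, c_out, k):
    """Parameters of one conv layer: k*k*c_in*c_out weights plus c_out biases."""
    return k * k * c_in * c_out + c_out

# Hypothetical compact stack: three 3x3 conv layers on RGB input.
small = (conv_params(3, 16, 3)
         + conv_params(16, 32, 3)
         + conv_params(32, 32, 3))
# ~14k parameters, versus roughly 11.7 million for a ResNet-18 backbone:
# three orders of magnitude, which is the difference between fitting in
# an embedded accelerator's memory budget and requiring a discrete GPU.
```

Parameter count is only a proxy (activation memory and per-frame FLOPs matter as much for real-time inference), but it captures why off-the-shelf backbones are a poor fit for edge deployment.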
Policy graphs integrate with edge deployment through distributed execution: different units deploy on different hardware based on computational requirements and latency tolerances. A lightweight MiniConv vision encoder executes on-device; a heavyweight world model executes remotely; routing logic determines active execution paths based on current network conditions. This mirrors the dual-console telesurgery architecture: the local operator maintains control when the remote connection degrades. The same modular structure also addresses multi-objective constraints: safety-critical units enforce hard limits whilst optimisation-oriented units pursue performance objectives, separating concerns that monolithic reward shaping conflates.
From Gaps to Contributions
Table 4.1 maps each recurring deployment gap identified in this chapter to the contributing chapter and the key technique it employs.
| Deployment Gap | Contributing Chapter | Key Technique |
|---|---|---|
| Interpretability and accountability | Chapter 5 | Policy graphs: hard-routing call-and-return traces |
| Sample efficiency and offline learning | Chapter 5 | Modular unit training; compositional reuse |
| Latency tolerance and network variability | Chapter 8 | CALF: communication-aware training with latency injection |
| Generalisation beyond training distributions | Chapter 6 | EnvCraft: procedural environment generation |
| Edge deployment and computation constraints | Chapter 7 | MiniConv: compact convolutional encoders |
| Physical realisation | Chapter 9 | Hardware integration and bench-level deployment |
Real-world RL deployment requires simultaneously addressing interpretability, sample efficiency, latency tolerance, generalisation, and computational constraints. The technical contributions in Chapters 5 through 9 provide integrated solutions to these persistent challenges, grounded in the deployment gaps that this chapter's survey has identified.