Systems
Policy graphs—introduced theoretically in Chapter 5—decompose reinforcement learning policies into modular units organised in a directed graph structure, enabling hierarchy, skill reuse, and division of labour across heterogeneous hardware. This chapter addresses the systems challenges of deploying policy graphs in real-world distributed settings. When policy units execute on different devices (edge processors, cloud servers) communicating over real networks, latency, jitter, and packet loss emerge as critical factors affecting performance. Yet sim-to-real transfer research focuses primarily on physics and visual domain gaps, largely overlooking network-induced mismatches that arise in distributed deployment. This chapter introduces CALF (Communication-Aware Learning Framework), infrastructure for distributed policy graph execution. CALF implements policy units as networked services, supports flexible deployment topologies from single-machine simulation to multi-device edge-cloud deployments, and provides transparent network impairment injection via NetworkShim middleware. This architecture enables a key insight: network conditions constitute an orthogonal axis of the reality gap, alongside physics and visual domain randomisation. Systematic experiments on CartPole and MiniGrid demonstrate that realistic network conditions cause severe performance degradation (40--80% drop) in baseline policies, whilst network-aware training—exposing flat policies to realistic latency, jitter, and packet loss during training—substantially closes this gap (reducing degradation by 4\(\times\) for CartPole and approximately 3\(\times\) for MiniGrid). Ablations reveal that stochastic jitter and packet loss are more detrimental than constant latency. 
CALF is then illustrated through small hierarchical policy deployments across Raspberry Pi and desktop hardware, showing that the infrastructure can execute distributed policy graphs successfully when network effects are explicitly addressed. CALF serves as systems infrastructure within the thesis, connecting particularly to Chapter 7 (efficient edge models), Chapter 5 (policy-graph formalism and hard routing), and Chapter 9 (embodied hardware control).
Introduction
From Policy Graph Theory to Distributed Implementation
Chapter 5 introduces policy graphs, a framework for decomposing reinforcement learning policies into modular units organised in a directed graph structure. Policy graphs enable hierarchy, skill reuse, and division of labour—concepts grounded in the principles explored in Chapter 2. However, the theoretical framework presented in Chapter 5 assumes that policy units can communicate instantaneously, with zero latency and perfect reliability. When policy graphs are deployed across distributed hardware—with policy units executing on different devices such as edge processors, cloud servers, and embedded systems—this assumption fails.
Reinforcement learning is increasingly deployed in distributed settings where policy and environment are not co-located: remote-controlled robots, edge devices transmitting to cloud policies, and multi-device systems such as drone swarms. In these cases, network communication mediates both the perception-action loop between environment and policy, and the coordination between policy units in a policy graph. This introduces latency, jitter, packet loss, and bandwidth constraints that alter the temporal structure of the MDP and affect inter-unit communication.
Yet mainstream RL training assumes synchronous, zero-latency interaction. Standard benchmarks (ALE, DeepMind Control Suite, OpenAI Gym) presuppose instant observation delivery and immediate action effects. Distributed training systems (IMPALA, SEED RL) optimise worker-learner communication but abstract away agent-environment communication as an implementation detail handled by ROS or gRPC.
In deployment, these assumptions fail. Observations arrive late or out-of-order; actions are delayed or dropped; jitter creates unpredictable timing. A policy that perfectly balances an inverted pendulum in simulation may fail with 100 ms Wi-Fi latency, even with perfect physics modelling. The policy learned under instantaneous feedback; it has no mechanism to compensate for temporal desynchronisation.
Sim-to-real transfer has made substantial progress addressing physics mismatch through domain randomisation over friction, masses, and contact models, and visual mismatch through randomisation of textures and lighting. These techniques have enabled remarkable achievements in locomotion and manipulation. However, network-induced mismatch—the temporal and stochastic properties of communication in distributed systems—receives minimal attention. Hwangbo et al. found that accurate modelling of actuator dynamics was central to closing the sim-to-real gap for quadruped robots, but such experiences have not been synthesised into general methodology or reusable infrastructure. This gap is particularly critical for policy graphs: when policy units are distributed across hardware, network conditions directly affect both environment-to-policy and inter-unit communication.
Network conditions constitute an orthogonal axis of the reality gap. Just as domain randomisation exposes policies to variations in friction and lighting, network-aware training should expose policy graphs to latency distributions, jitter patterns, and packet loss rates characteristic of deployment networks. For distributed policy graphs, this becomes an important design consideration rather than a background detail: a policy graph trained assuming instantaneous inter-unit communication may fail catastrophically when deployed across edge devices communicating over Wi-Fi with 100 ms latency and 10% packet loss. This chapter presents network-aware training as a core systems requirement for distributed policy graph deployment.
Research Questions
This chapter addresses three research questions concerning distributed policy graph deployment:
RQ1 (Network Impact on Policy Graphs): How severely do realistic network conditions—including latency, jitter, and packet loss—degrade the performance of policy graphs when trained in idealised, synchronous simulations but deployed over real networks with distributed policy units?
RQ2 (Network-Aware Training for Policy Graphs): Can training policy graphs under realistic network conditions during simulation (“network-aware training”) close this performance gap? Which network phenomena (latency versus jitter versus loss) are most critical to model when preparing policy graphs for distributed deployment?
RQ3 (Infrastructure for Distributed Policy Graphs): What systems infrastructure is needed to enable reproducible, scalable deployment of policy graphs across heterogeneous edge devices and real networks?
Contributions
This chapter makes three main contributions. First, CALF (Communication-Aware Learning Framework), infrastructure for deploying and training policy graphs across distributed hardware: policy units run as networked services, and NetworkShim middleware injects configurable latency, jitter, loss, and bandwidth limits on graph edges without modifying policy code, whilst deployment parity ensures the same policy graph runs from pure simulation to real edge-cloud hardware. Second, systematic empirical evidence that network-aware training—exposing distributed policy graphs to realistic communication conditions during simulation—reduces deployment degradation by \(4\times\) for CartPole and approximately \(3\times\) for MiniGrid, with stochastic jitter and packet loss proving more detrimental than constant latency. Third, illustrative deployment of hierarchical two-level policy graphs across Raspberry Pi edge devices and desktop cloud servers, providing initial validation that CALF's progressive deployment modes can execute distributed policy graphs successfully when network effects are explicitly addressed.
Related Work and Positioning
This section positions CALF within multiple research communities: RL theory and algorithms (delayed MDPs, network-aware methods), control theory (networked control systems), sim-to-real transfer (domain randomisation), distributed systems (actor-learner architectures, edge computing), and hierarchical RL (policy graphs from Chapter 5). Each subsection reviews relevant prior work, identifies specific gaps or limitations, and explicitly connects to CALF's design or contributions for distributed policy graph deployment.
Delays and Network Effects in RL and Control
Early work extended the MDP framework to include action and observation delays. Katsikopoulos & Engelbrecht showed that fixed \(k\)-step delays can be transformed into an equivalent Markov process by augmenting the state with the last \(k\) actions or observations, though this causes the state space to grow exponentially with \(k\). Walsh et al. proved an exponential lower bound: no algorithm can circumvent this blow-up in the worst case. With stochastic delays, optimal policies must use full history, becoming POMDP-like, motivating practical approaches such as frame stacking and recurrent policies. Delay-aware Q-learning (dQ) and SARSA update Q-values against delayed next states for constant delays; Delay-Correcting Actor-Critic (DCAC) resamples and relabels trajectories to correct for random delay distortions. A consistent finding is that unmitigated latency severely degrades performance, but training under delays yields robustness.
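The state-augmentation construction can be written out explicitly for a constant action delay of \(k\) steps (a standard reformulation; the notation follows the chapter's MDP conventions):

```latex
\tilde{s}_t = \bigl(s_t,\; a_{t-k},\, a_{t-k+1},\, \ldots,\, a_{t-1}\bigr) \in S \times A^k
```

At time \(t\) the environment applies \(a_{t-k}\), so transitions over \(\tilde{s}_t\) are Markov; the price is an augmented state space of size \(|S| \cdot |A|^k\), which is precisely the exponential growth in \(k\) that the lower bound of Walsh et al. shows cannot be avoided in the worst case.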
The control theory community has extensively studied networked control systems (NCS), deriving compensation strategies (zero-order hold, Smith predictors, event-triggered control) and stability conditions under bounded delay and dropout. However, NCS analysis applies to linear or simple nonlinear controllers with analytical models; deep RL policies are black-box functions for which no equivalent guarantees exist, and the systematic application of NCS insights to deep RL remains limited.
Sim-to-Real Transfer: The Missing Network Axis
Sim-to-real RL focuses overwhelmingly on physics and visual domain randomisation, with minimal attention to network-induced mismatch. Network conditions constitute an orthogonal axis of sim-to-real transfer; CALF extends the domain randomisation toolkit to network parameters.
Domain randomisation randomises simulator properties so the real world appears as another random variant, enabling zero-shot transfer for manipulation and locomotion. Hwangbo et al. found that accurately modelling Series Elastic Actuator dynamics was the dominant factor in closing the sim-to-real gap for the ANYmal quadruped. However, all these works assume perfect timing—either the policy and environment are co-located, or network effects are unmodelled. Network conditions constitute an independent axis of variation, orthogonal to physics and vision. Some practices incidentally touch on network effects (lower control frequencies, frame skip), but deliberately addressing network domain shift remains absent from prior sim-to-real methodology. CALF makes this network axis explicit and controllable.
Distributed RL Systems: A Contrasting Philosophy
Large-scale distributed RL frameworks treat network communication as a cost to minimise or hide, not as an object of study. These systems optimise away network effects in training infrastructure; CALF foregrounds network conditions as part of the agent-environment interaction.
Modern deep RL often uses distributed architectures for training efficiency. IMPALA separates actors (which generate experience) and learners (which update the model), with V-trace off-policy correction to handle the policy lag between when experience is collected and when it is used for learning. SEED RL decouples inference on TPUs with fast transport protocols to minimise network overhead. Sample Factory keeps everything on one machine, using threads to avoid network communication entirely. The design philosophy is to ensure agents experience an ideal MDP during training, despite asynchronous collection. Network communication between actors and learners is an engineering challenge to solve, not a phenomenon to study.
There is a fundamental difference: IMPALA and SEED RL address network lag between actor and learner (in the training infrastructure), whereas CALF addresses network lag between agent and environment (in the control loop itself). These are different problems. IMPALA ensures policy updates are not stale; CALF trains policies that work when observations and actions are stale.
Edge Computing and Resource Constraints
Edge machine learning research focuses primarily on computation and energy constraints, with less attention to communication constraints. CALF addresses the communication side, motivated by edge-cloud deployments where not all computation fits on-device.
There is growing interest in running RL policies on microcontrollers, Raspberry Pi, and Jetson devices. Techniques include model compression, quantisation, and distillation to fit policies in limited memory and compute. The TinyML movement targets extremely compact policies for microcontrollers with kilobytes of memory. The trade-off is that smaller networks can run in real time but may have less representational capacity. Additionally, computational latency becomes a concern when large neural networks cannot compute actions fast enough, leading to proposals for asynchronous or parallel policy architectures.
However, complex policies—especially vision-based—will not fit on tiny embedded devices. Some splitting or offloading is necessary. Neurosurgeon automatically partitions deep neural networks between edge devices and cloud to minimise latency and energy: convolutional layers execute on the edge (near sensors), fully connected layers execute on a server, and intermediate features (smaller than raw images) are sent over the network. This achieves \(3\times\) lower latency and energy consumption compared to all-cloud or all-device execution. This approach could be applied to RL by splitting policy networks similarly—e.g., visual encoder on robot, decision MLP on server—reducing bandwidth and latency through parallel processing.
Multi-Agent RL and Other Network-Aware Contexts
Network effects appear in other machine learning contexts (multi-agent communication, federated learning), but no focused infrastructure exists for single-agent control RL. CALF addresses this gap.
Multi-agent RL research studies how agents learn to communicate under bandwidth limits or delays. Work on emergent communication includes learned continuous communication protocols and communication minimisation via information-theoretic regularisation. A consistent finding is that naïve MARL degrades with delays, but training under delays yields robustness. However, MARL focuses on agent-to-agent delays, whilst agent-to-environment delays in single-agent control RL remain less explored.
Hierarchical RL and Distributed Policy Execution
Hierarchical RL provides methods to decompose behaviour into subskills. Chapter 5 introduces policy graphs, which generalise hierarchical approaches by organising policy units in directed graph structures. CALF provides the execution infrastructure where these policy graphs can be physically distributed across heterogeneous devices, addressing a gap in prior work.
The Options framework, Hierarchies of Machines, and MAXQ introduced temporally extended actions and hierarchical decomposition of behaviour, enabling higher-level decision-making at slower timescales. If an option runs autonomously for 10 steps, the high-level policy only needs to communicate every 10 steps—naturally more robust to moderate network latency, as the low-level skill continues even if communication is temporarily delayed. Modern variants include Option-Critic for end-to-end option learning, and two-level hierarchies such as FeUdal Networks and HIRO, where managers set goals and workers execute them. However, prior work assumes hierarchy components are co-located (same process or machine), whilst policy graphs explicitly enable distributed deployment.
Network Emulation Tools
Mature network emulation tools exist but are not integrated into RL training loops. CALF builds on these tools but integrates them directly into the RL workflow.
Available tools include Linux tc netem (kernel-level delay, loss, bandwidth limits with configurable distributions: normal, Pareto, etc., and Markov loss models), Mininet (virtual networks on a single machine for network protocol research), and Mahimahi (record and replay real network traces, especially cellular). These are occasionally used in federated learning or video streaming RL, but rarely in robotics or control RL.
Summary: CALF's Position
Prior work addresses network effects through algorithm modification (delay-aware Q-learning, DCAC), control-theoretic compensation (Smith predictors, zero-order hold), or distributed training infrastructure (IMPALA, SEED RL). CALF takes a complementary approach: rather than modifying algorithms or optimising training infrastructure, the training and deployment environment is modified to expose realistic network behaviour. This environmental approach is algorithm-agnostic and extends naturally to heterogeneous edge deployment of policy graphs. CALF implements Chapter 5's policy graph framework whilst making network conditions explicit: policy units become networked services, and network impairments are transparently injected on the communication channels between units. The infrastructure can be combined with algorithmic innovations (e.g., DCAC within CALF's framework) and complements existing domain randomisation practices by adding network parameters to the randomisation distribution. Together, these strands of work suggest two requirements for progress: training must experience the same communication pathologies as deployment, and the infrastructure must allow controlled, reproducible manipulation of latency, jitter, and loss across real hardware. CALF is designed to meet both.
CALF: A Framework for Network-Aware Reinforcement Learning
This section describes CALF's architecture and implementation at a level sufficient to understand the experimental methodology and results. Complete implementation specifications, including byte-level protocol details, serialisation algorithms, and service lifecycle management, are provided in Appendix B.
To enable network-aware training for distributed policy graphs, CALF decomposes RL workloads into networked services, injects realistic network behaviours at specific communication links, and runs the same configuration across deployment modes from pure simulation to real hardware with real networks. This section details these capabilities and connects design choices to the experimental requirements of network-aware training for policy graphs.
Design Goals and Requirements
CALF is designed around four primary goals, each motivated by network-aware RL research needs:
G1: Network Realism. RL training loops must incorporate realistic latency, jitter, packet loss, and bandwidth constraints. CALF supports both synthetic models (parametric distributions such as \(\mathcal{N}(\mu, \sigma^2)\) for latency) and trace-based replay (recorded from real deployments). Network conditions must be configurable, loggable, and reproducible for scientific experiments.
G2: Deployment Parity. The same policy code should run in pure simulation (baseline, no network), simulation with simulated network (network-aware training), and real edge hardware with real networks (final deployment). Platform-specific code should be minimised—agents should not need to know whether they are in simulation or on real hardware.
G3: Reproducibility. Network conditions must be loggable during real deployments and re-playable in simulation for debugging and ablation. Experiments must be reproducible across platforms via containerisation and module versioning.
G4: Device Heterogeneity. CALF supports cheap edge devices (Raspberry Pi 4, Jetson Nano) as environment or policy hosts, enables policy splitting across devices (e.g., hierarchical agents with components on edge and cloud), and handles heterogeneous compute (CPU-only on Pi, GPU on desktop).
An additional principle is algorithm agnosticism: CALF is infrastructure, not an RL algorithm. It works with any RL library (Stable-Baselines3, RLlib, custom implementations) without modification. Table 8.1 summarises how each goal connects to the research questions.
| Goal | Capability | Enables |
|---|---|---|
| Network Realism | Synthetic + trace-based network models | Controlled ablations (RQ2), realistic training |
| Deployment Parity | Same code across simulation/hardware | Fair comparison of network effects (RQ1) |
| Reproducibility | Deterministic seeds, versioning | Scientific rigour, exact replication |
| Heterogeneity | Edge devices to cloud servers | Realistic distributed settings (RQ3) |
Architecture Overview
Policy Graphs as Networked Services
CALF implements Chapter 5's policy graph framework by treating policy units and environments as networked services communicating via a standardised protocol. This provides spatial distribution (policy units execute on different machines/containers, enabling edge-cloud deployment), transparent network injection (NetworkShim services insert delays on graph edges without modifying policy implementations), temporal distribution (policy units can be dynamically loaded without restarting the system), and reproducibility (containerised services with versioned modules).
In the policy graph framework, policy units are abstract computational entities that receive observations and produce actions. CALF realises this abstraction through Agent Services: each Agent Service is a running instance of a policy unit that can be deployed on any hardware platform. Multiple Agent Services communicate to form the nodes of a policy graph, with communication channels forming the directed edges. High-level policy units (managers) send goals or subgoals to low-level policy units (workers), implementing the hierarchical structure described in Chapter 5.
In contrast to traditional RL, where obs = env.step(action) is a function call in the same process with zero latency, CALF implements environment and policy units as separate services where step() becomes message passing over potentially slow, lossy networks. This is not gratuitous distribution—it is necessary both to study how policies behave when deployed across real networks and to enable the physical distribution of policy graph nodes across heterogeneous hardware.
Three-Layer Hierarchy
CALF's architecture comprises three layers. Layer 1 (NEXUS) is an optional global hub enabling communication across hosts on different networks (NAT traversal). NEXUS maintains a central routing table and implements RSA challenge-response authentication. For our experiments, it allows a Raspberry Pi on home Wi-Fi to communicate with a desktop in the university lab without VPN or port forwarding. Layer 2 (HOST) manages the lifecycle of Services on a single machine: module installation, Service creation (launch in a Python venv or Docker container), local routing (forwarding packets between Services via Unix sockets), and a web UI for monitoring and interactive policy-graph configuration (Figure fig:calf_ui). Layer 3 (SERVICES) comprises the Services that execute RL logic: Environment Services run Gym environments and send observations; Agent Services run policies and send actions; NetworkShim Services inject network impairments; utility Services log metrics.
A typical communication flow, with the CartPole environment on the Pi and both the policy and NetworkShim on the Desktop, is: Environment (102) sends observation → NetworkShim (900) delays it by a sampled latency → Agent (201) computes action → NetworkShim delays the action → Environment applies the action. The three-layer separation enables CALF's progressive deployment modes (Section 8.3.5): the same code runs in local simulation (Layer 3 only), simulation with network (Layer 3 with shims), and on real hardware (all three layers).
Mapping CALF Services to Policy Graph Concepts
To clarify the relationship between CALF's implementation and Chapter 5's policy graph framework, Table 8.2 provides an explicit mapping:
| Policy Graph Concept | CALF Implementation |
|---|---|
| Policy unit (node) | Agent Service instance |
| Policy graph (structure) | Set of Agent Services + routing configuration |
| Edge (communication channel) | Network connection between services |
| Manager (high-level policy) | Agent Service sending goals/subgoals |
| Worker (low-level policy) | Agent Service receiving goals, executing skills |
| Distributed execution | Services on different hardware (Pi, Desktop, Cloud) |
| Network delay on edge | NetworkShim Service on communication channel |
| Environment | Environment Service |
In Chapter 5's terminology, each Agent Service is a policy unit. When multiple Agent Services are deployed with a routing configuration specifying their connections, they form a policy graph. NetworkShim Services sit on the edges of this graph, enabling controlled study of network effects on distributed policy execution.
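As a concrete illustration, this mapping might be expressed in a routing configuration such as the following. The schema and field names are hypothetical, not CALF's actual configuration format; the service IDs follow the CartPole example earlier in the chapter (Environment 102, Agent 201, NetworkShim 900).

```python
# Hypothetical routing configuration for a two-node policy graph with a shim
# on each edge: observations travel Environment -> NetworkShim -> Agent, and
# actions travel back through the same shim.
ROUTING = {
    "services": {
        102: {"type": "environment",  "host": "raspberry-pi"},
        201: {"type": "agent",        "host": "desktop"},
        900: {"type": "network_shim", "host": "desktop",
              "model": "wifi-normal"},  # 30 ms +/- 10 ms, 2% loss
    },
    # Directed edges of the policy graph, each passing through a shim.
    "edges": [
        {"src": 102, "dst": 201, "via": 900},  # observations
        {"src": 201, "dst": 102, "via": 900},  # actions
    ],
}

def validate(routing: dict) -> bool:
    """Check that every edge references declared services."""
    services = set(routing["services"])
    return all(e["src"] in services and e["dst"] in services
               and e["via"] in services for e in routing["edges"])
```

A hierarchical graph would add further Agent Services (manager and workers) and extra edges, with a shim on any edge whose endpoints sit on different machines.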
Complete architectural specifications, including port allocations, routing protocols, and process management, are provided in the technical specification appendix (Appendix B, Sections 2--3).
Communication Protocol
CALF uses a low-latency, type-safe binary protocol with a 5-byte header and seven packet types. Type 2 Data Packets carry timestamps that enable precise end-to-end latency measurement (\(\text{latency}_{\text{ms}} = t_{\text{receive}} - t_{\text{send}}\)), which NetworkShim uses to schedule delayed delivery. Complete protocol specifications—byte layouts, serialisation algorithms, and API details—are provided in Appendix B, Sections 3--5.
NetworkShim: The Core Mechanism
NetworkShim is CALF's primary mechanism for injecting network impairments into the RL loop. It acts as a transparent middlebox (“bump in the wire”) sitting between Environment and Agent. The routing configuration specifies that observations and actions pass through NetworkShim, which delays or drops packets according to configured network models.
When NetworkShim receives a packet, it first simulates packet loss (drop with probability \(p_{\text{loss}}\)). If not dropped, it samples a delay from the configured distribution: for jittery networks, delay \(\sim \max(0, \mathcal{N}(\mu_{\text{latency}}, \sigma_{\text{jitter}}^2))\); for constant latency, delay is fixed. NetworkShim then schedules forwarding by placing the packet in a priority queue sorted by delivery time. A background thread continuously checks the queue and forwards packets when their delays expire.
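The receive-drop-delay-forward loop just described can be sketched as follows. This is a minimal illustration, not CALF's actual implementation; the class and parameter names are invented.

```python
import heapq
import itertools
import random
import threading
import time

class NetworkShim:
    """Sketch of the shim loop: drop with probability p_loss, otherwise delay
    by a clipped-normal latency sample, then forward once the delay expires."""

    def __init__(self, forward, mu_ms=30.0, sigma_ms=10.0, p_loss=0.02, seed=0):
        self.forward = forward              # downstream delivery callback
        self.mu_ms, self.sigma_ms, self.p_loss = mu_ms, sigma_ms, p_loss
        self.rng = random.Random(seed)      # fixed seed -> reproducible delays
        self.queue = []                     # priority queue keyed by delivery time
        self.counter = itertools.count()    # tie-breaker preserves FIFO order
        self.cond = threading.Condition()
        threading.Thread(target=self._pump, daemon=True).start()

    def receive(self, packet):
        if self.rng.random() < self.p_loss:
            return                          # simulated packet loss: drop silently
        delay_s = max(0.0, self.rng.gauss(self.mu_ms, self.sigma_ms)) / 1000.0
        with self.cond:
            heapq.heappush(self.queue,
                           (time.monotonic() + delay_s, next(self.counter), packet))
            self.cond.notify()

    def _pump(self):
        # Background thread: forward packets once their scheduled delay expires.
        while True:
            with self.cond:
                while not self.queue:
                    self.cond.wait()
                due, _, packet = self.queue[0]
                remaining = due - time.monotonic()
                if remaining > 0:
                    self.cond.wait(timeout=remaining)
                    continue
                heapq.heappop(self.queue)
            self.forward(packet)
```

The priority queue means a later packet with a shorter sampled delay can overtake an earlier one, which is exactly the out-of-order delivery that jittery networks produce.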
Network Models
Synthetic Models define parametric distributions matching our evaluation conditions: Ethernet-clean (2 ms \(\pm\) 0.5 ms, 0% loss), Wi-Fi-normal (30 ms \(\pm\) 10 ms, 2% loss), and Wi-Fi-degraded (80 ms \(\pm\) 40 ms, 10% loss). Latency is sampled from normal distributions (clipped at 0), loss from Bernoulli(\(p\)).
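These three conditions can be captured as a small parameter table together with the sampling rule just stated (the names are illustrative, not CALF's configuration schema):

```python
import random

# Evaluation conditions from the text: mean latency (ms), jitter std (ms), loss prob.
SYNTHETIC_MODELS = {
    "ethernet-clean": {"mu_ms": 2.0,  "sigma_ms": 0.5,  "p_loss": 0.00},
    "wifi-normal":    {"mu_ms": 30.0, "sigma_ms": 10.0, "p_loss": 0.02},
    "wifi-degraded":  {"mu_ms": 80.0, "sigma_ms": 40.0, "p_loss": 0.10},
}

def sample_event(model: str, rng: random.Random):
    """Return (dropped, delay_ms): Bernoulli(p) loss, then clipped-normal latency."""
    params = SYNTHETIC_MODELS[model]
    if rng.random() < params["p_loss"]:
        return True, None
    return False, max(0.0, rng.gauss(params["mu_ms"], params["sigma_ms"]))
```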
Trace-Based Models enable replay of recorded conditions. A LatencyTracer Service calculates actual latency from packet timestamps (\(\text{latency}_{\text{ms}} = t_{\text{receive}} - t_{\text{send}}\)) during Real-Wi-Fi evaluation and logs traces. NetworkShim can then replay these traces during training, sampling delays from the empirical distribution. This allows policies trained on synthetic Wi-Fi-normal to be refined using real Wi-Fi traces, or enables controlled experiments comparing “Real-Wi-Fi-Home” versus “Real-Wi-Fi-Campus” conditions.
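Sampling from the empirical distribution reduces to drawing from the recorded latencies; a minimal sketch (the trace representation and class name are invented for illustration):

```python
import random

class TraceReplayModel:
    """Replays a recorded latency trace by sampling i.i.d. from its
    empirical distribution (illustrative sketch only)."""

    def __init__(self, trace_ms, seed=0):
        if not trace_ms:
            raise ValueError("empty latency trace")
        self.trace_ms = list(trace_ms)      # recorded per-packet latencies (ms)
        self.rng = random.Random(seed)      # fixed seed for reproducible replay

    def sample_delay_ms(self):
        # Each sampled delay is one of the observed latencies.
        return self.rng.choice(self.trace_ms)
```

Note that i.i.d. resampling discards the temporal correlation present in real traces (latency bursts); replaying the trace in recorded order would preserve it, at the cost of fixing the episode-to-trace alignment.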
Critically, Environment and Agent are unaware of NetworkShim's existence—they simply experience delayed messages. This transparency enables network-aware training without modifying RL algorithms.
Complete NetworkShim implementation details, including delay queue algorithms, statistics collection, and trace replay mechanisms, are provided in Appendix B, Section 6.
Progressive Deployment Modes
A key CALF feature is that the same policy and environment code run across a continuum of deployment scenarios (Figure 8.3):
Mode 1: Local Sim (Baseline). Environment and policy in the same process with direct function calls, no network. Used for fast prototyping and baseline comparison (RQ1). Achieves approximately 100K steps/hour (CartPole on Desktop).
Mode 2: Sim + Simulated Network. Environment and policy are separate Services with CALF NetworkShim between them and a synthetic network model (e.g., Wi-Fi-normal: 30 ms \(\pm\) 10 ms, 2% loss). Used for network-aware training (RQ2). Achieves approximately 50K steps/hour (slower due to delays).
Mode 3: Edge Sim (Real Hardware, Simulated Environment). Environment Service on Raspberry Pi or Jetson, policy Service on Desktop, communicating over real network (Ethernet or Wi-Fi). Used for hardware validation and measuring real network distributions. Achieves approximately 20K steps/hour (network and Pi CPU limit throughput).
Progressive modes de-risk deployment: develop policy in Mode 1 (fast iteration), train with network-awareness in Mode 2 (expose delays), validate on real hardware in Mode 3 (catch hardware-specific issues).
Containerisation and Modules
CALF supports both Python virtual environments (lightweight, fast startup, easy debugging) and Docker containers (complete isolation, system dependencies, reproducibility). Each CALF module is a packaged RL component with Python code, dependencies (requirements.txt), and metadata (info.json: name, version, build ID, container requirements). Modules can be installed from a repository or locally.
Reproducibility features include build ID (timestamp ensuring exact version matching), Docker image hash (bit-for-bit reproducibility), version control (repository tracks all versions), and deterministic network seeds (NetworkShim uses fixed RNG seeds for reproducible delays). These mechanisms enable future thesis chapters to reuse CALF modules, support reproducible experiments (exact module versions can be downloaded and re-run), and enable heterogeneous execution (same module runs on Pi and Desktop via Docker).
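An info.json for a module might look like the following. The field values and exact schema are hypothetical; the text specifies only that name, version, build ID, and container requirements are recorded.

```json
{
  "name": "cartpole-env",
  "version": "1.0.2",
  "build_id": "20250114T101500",
  "container": {
    "required": false,
    "image": "python:3.10-slim"
  }
}
```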
Complete module system specifications, including installation workflows, execution mode selection, and distribution mechanisms, are provided in Appendix B, Section 7.
CALF is uniquely suited for distributed policy graph research because it treats network conditions as first-class objects (configurable, loggable, and replayable rather than hidden implementation details), ensures deployment parity (the same policy graph runs from pure simulation to real edge hardware), is algorithm-agnostic (works with any RL training approach), and provides reproducibility (module versioning, containerisation, network seeds). With CALF's capabilities established, the following section describes the network-aware training methodology employed for distributed policy graph deployment.
Network-Aware Training Methodology
This section describes our RL training protocol and experimental methodology for answering RQ1 (how severely do network conditions degrade performance of distributed policy graphs?) and RQ2 (does network-aware training enable successful policy graph deployment?).
Problem Formulation: Delayed MDPs
In a standard Markov Decision Process \((S, A, T, R, \gamma)\), an agent observes state \(s_t\), takes action \(a_t\), receives reward \(r_t\) and next state \(s_{t+1}\), and a policy \(\pi(a | s)\) maximises expected return. In a delayed MDP (informal), the agent selects \(a_t\) based on a delayed observation \(o_{t-d_{\text{obs}}}\), where \(d_{\text{obs}}\) is the observation delay, and action \(a_t\) takes effect with delay \(d_{\text{act}}\), such that the environment applies \(a_{t-d_{\text{act}}}\). Delays may be constant or stochastic (jitter). With packet loss, some observations or actions never arrive.
With stochastic delays, the true state \(s_t\) is unobserved; the agent must infer from observation history \(h = \{o_{t-k}, o_{t-k+1}, \ldots, o_{t}\}\), making the problem a Partially Observable MDP.
CALF treats the delayed environment as an MDP with augmented state: \((s, h_{\text{obs}}, h_{\text{act}})\), where the policy learns \(\pi(a | h_{\text{obs}}, h_{\text{act}})\). Implementation options include frame stacking (feed policy last \(k\) observations), recurrent policy (LSTM, where hidden state implicitly maintains belief), and action history (append recent actions to input, representing actions “in flight''). See Section 8.2 for delay MDP theory. For distributed policy graphs, each policy unit must handle delays on its incoming edges independently. Our experiments use practical deep RL with frame stacking and LSTM, not optimal state augmentation.
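The delay, jitter, and loss behaviour described above can be sketched as a step-based queue. This is a minimal illustration of the mechanism, not CALF's actual NetworkShim API; the class and method names are assumptions. Observations are released only after their sampled delay elapses, lost packets never arrive, and the consumer falls back to its last (stale) observation.

```python
import random

class DelayedObservationQueue:
    """Minimal sketch of observation delay with jitter and loss.

    Delays are measured in environment steps for simplicity; a fixed
    RNG seed makes the injected delays reproducible, mirroring the
    deterministic-seed feature described for NetworkShim.
    """

    def __init__(self, mean_delay=3, jitter=1, loss_rate=0.02, seed=0):
        self.rng = random.Random(seed)
        self.mean_delay = mean_delay
        self.jitter = jitter
        self.loss_rate = loss_rate
        self.in_flight = []   # (arrival_step, observation) pairs
        self.last_obs = None
        self.t = 0

    def send(self, obs):
        if self.rng.random() < self.loss_rate:
            return  # packet dropped: this observation never arrives
        delay = max(0, round(self.rng.gauss(self.mean_delay, self.jitter)))
        self.in_flight.append((self.t + delay, obs))

    def receive(self):
        """Advance one step; return the freshest arrived observation
        (or the stale previous one if nothing has arrived)."""
        self.t += 1
        arrived = [o for (arr, o) in self.in_flight if arr <= self.t]
        self.in_flight = [(arr, o) for (arr, o) in self.in_flight if arr > self.t]
        if arrived:
            self.last_obs = arrived[-1]
        return self.last_obs
```

With zero jitter and zero loss the queue degenerates to a constant delay, recovering the delay-only training regime as a special case.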
Training Regimes: Comparing Network-Awareness
Our experimental design trains policies under three regimes and evaluates all policies on all deployment modes, enabling systematic comparison of network-agnostic versus network-aware training for distributed policy graph deployment.
Baseline: No Network Awareness
Setup: Mode 1 (local sim) with environment and policy in the same process. No artificial delays, jitter, or loss. Standard Gym loop: synchronous, zero-latency. This represents training that ignores network conditions, corresponding to traditional RL where policy units are assumed co-located.
Delay-Only Training
Setup: Mode 2 with separate Services and NetworkShim. Fixed latency (e.g., 50 ms), no jitter, no loss. This represents awareness of constant delays but not stochastic network effects.
Full Network-Aware Training
Setup: Mode 2 with separate Services and NetworkShim. Realistic distribution: latency + jitter + loss (fitted to Wi-Fi-normal: mean 30 ms, jitter 10 ms, loss 2%). This represents full awareness of network conditions expected during distributed deployment.
Distribution fitting: Real network statistics are measured during pilot runs using LatencyTracer; a normal distribution is fitted to latency \(\mathcal{N}(\mu, \sigma^2)\), and packet loss rate is estimated from dropped packets.
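The fitting step above amounts to estimating a mean and standard deviation over latency samples plus an empirical loss rate. A sketch follows; the function name and input format are illustrative (the real inputs would come from LatencyTracer pilot-run logs), not LatencyTracer's API.

```python
import statistics

def fit_network_model(latencies_ms, packets_sent, packets_received):
    """Fit the text's simple network model: latency ~ N(mu, sigma^2)
    plus an empirical packet-loss rate (illustrative helper)."""
    mu = statistics.mean(latencies_ms)
    sigma = statistics.stdev(latencies_ms)
    loss_rate = 1.0 - packets_received / packets_sent
    return {"latency_mean_ms": mu, "latency_std_ms": sigma, "loss_rate": loss_rate}

# Hypothetical pilot-run samples: 980 of 1000 probe packets returned.
model = fit_network_model([28.0, 31.0, 30.0, 33.0, 29.0, 35.0], 1000, 980)
```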
RL Algorithm: PPO
Proximal Policy Optimization is used via Stable-Baselines3 with standard hyperparameters: learning rate \(3 \times 10^{-4}\), discount \(\gamma = 0.99\), GAE \(\lambda = 0.95\), batch size 64 (CartPole) or 256 (MiniGrid). PPO is chosen for its stability (clipped objective), generality across discrete and continuous action spaces, and compatibility with recurrent architectures needed for partial observability under delays.
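The hyperparameters above, collected as keyword arguments whose keys match Stable-Baselines3's PPO constructor names. Passing them (e.g. `PPO("MlpPolicy", env, **ppo_kwargs)`) assumes Stable-Baselines3 is installed; this is a configuration sketch, not the chapter's exact training script.

```python
# Hyperparameters from the text, keyed by Stable-Baselines3 PPO
# constructor argument names.
ppo_kwargs = {
    "learning_rate": 3e-4,
    "gamma": 0.99,        # discount factor
    "gae_lambda": 0.95,   # GAE lambda
    "batch_size": 64,     # 256 for MiniGrid
    "seed": 0,            # one of 10 seeds per training regime
}
```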
State Representation for Delay Robustness
Policy units must infer current state from delayed observations. Three strategies are employed:
Strategy 1: Frame Stacking (CartPole). Stack last \(k\) observations: \([o_{t-k}, o_{t-k+1}, \ldots, o_t]\). For CartPole with delay \(d\), \(k = d+1\) frames are used. Intuition: Multiple snapshots allow velocity inference.
Strategy 2: Recurrent Policy (MiniGrid). LSTM policy \(a_t \sim \pi(\cdot \mid o_t, h_{t-1})\), where \(h_t\) is the hidden state. Advantages: automatically maintains a belief state over history and handles variable delays. Disadvantage: slower training (recurrence breaks parallelisation).
Strategy 3: Action History (Ablation). Append last \(k\) actions to observation. Intuition: Know which actions are “in flight''. Finding (preliminary): Modest improvement (approximately 5%) over observation-only.
Our experiments use frame stacking for CartPole (simpler, sufficient) and LSTM for MiniGrid (necessary for partial observability combined with delays). For hierarchical policy graphs, low-level policy units may use frame stacking whilst high-level units use recurrent architectures to track long-horizon goals.
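Strategies 1 and 3 reduce to assembling a fixed-size policy input from recent observations and actions. The sketch below shows this input construction under stated assumptions (class name, flat-list observations, and zero-padding at episode start are illustrative choices, not the chapter's exact implementation).

```python
from collections import deque

class DelayRobustInput:
    """Assemble a policy input from the last k observations (frame
    stacking) plus the last k actions ("in flight") -- a sketch of
    Strategies 1 and 3, padded with zeros at episode start."""

    def __init__(self, k, obs_dim, pad_action=0):
        self.obs_hist = deque([[0.0] * obs_dim] * k, maxlen=k)
        self.act_hist = deque([pad_action] * k, maxlen=k)

    def update(self, obs, last_action):
        self.obs_hist.append(list(obs))
        self.act_hist.append(last_action)

    def vector(self):
        # Flatten stacked observations, then append the action history.
        flat = [x for o in self.obs_hist for x in o]
        return flat + list(self.act_hist)

# CartPole-style usage: k=3 stacked 4-dimensional observations.
inp = DelayRobustInput(k=3, obs_dim=4)
inp.update([0.1, 0.0, 0.02, 0.0], last_action=1)
```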
Evaluation Protocol
Each trained policy (each seed, each training regime) is evaluated on five deployment modes:
- Sim-Clean (Mode 1): Local sim, no network
- Sim+Network (Mode 2): Desktop only, NetworkShim with Wi-Fi-normal model
- Real-Ethernet (Mode 3): Environment on Pi, policy on Desktop, Ethernet connection
- Real-Wi-Fi-Normal (Mode 3): Environment on Pi, policy on Desktop, Wi-Fi
- Real-Wi-Fi-Degraded (Mode 3): Environment on Pi, policy on Desktop, Wi-Fi + tc netem impairments
Per mode, 50 episodes are run and episodic return (CartPole: survival time), success rate (CartPole: return \(\geq\) 475; MiniGrid: goal reached), and end-to-end latency are recorded. Statistical rigour is ensured via 10 random seeds per training regime, with paired \(t\)-tests comparing full network-aware versus baseline at \(\alpha = 0.05\).
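The paired \(t\)-test above compares per-seed returns of the two regimes on the same deployment mode. A from-scratch sketch of the statistic follows (illustrative data, not the chapter's results; in practice scipy.stats.ttest_rel gives the same statistic plus a \(p\)-value).

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t-test statistic and degrees of freedom for matched
    per-seed samples: t = mean(d) / (sd(d) / sqrt(n)), df = n - 1."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Illustrative per-seed returns for 10 seeds (not the paper's data):
net_aware = [470, 480, 465, 475, 472, 468, 478, 471, 474, 469]
baseline = [180, 150, 120, 210, 160, 175, 140, 190, 165, 155]
t_stat, df = paired_t(net_aware, baseline)
```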
Experimental Setup
This section specifies environments, agents, hardware platforms, and evaluation metrics for complete reproducibility (G3).
Environments
Environments are selected for diverse timing sensitivity, community familiarity as benchmarks, and tractability on modest hardware.
CartPole-v1
Classic inverted pendulum: balance a pole on a movable cart. State is 4-dimensional (cart position/velocity, pole angle/angular velocity), action is discrete \(\{\)left, right\(\}\), termination when \(|x| > 2.4\) or \(|\theta| > 12\degree\) or 500 steps. Reward is +1 per step (maximum 500). CartPole is highly timing-sensitive—unstable dynamics require fast reactions, and 100 ms delays can halve survival time—making it a stringent test of network-aware training.
MiniGrid DoorKey-8x8
Gridworld navigation: find a key, unlock a door, reach the goal. Observation is a \(7 \times 7\) egocentric view (partial observability), action is discrete (move/turn/pick up/toggle), success reward +1 with \(-0.01\) per step. MiniGrid's subgoal structure (key \(\to\) door \(\to\) goal) provides a natural two-level hierarchy and tests a less timing-critical regime where delays cause overshooting rather than catastrophic instability.
Agent Architectures
Flat Policies (Primary Experiments)
CartPole: Multi-layer perceptron with two hidden layers of 64 units each (ReLU) and 2-dimensional action logits. Input is the 4-dimensional observation (or \(4 \times k\)-dimensional if stacked).
MiniGrid: Convolutional neural network: \(7 \times 7 \times 3\) input, Conv(16 filters, \(3 \times 3\)), Conv(32 filters, \(3 \times 3\)), Flatten, LSTM(128 units), Fully Connected(128 units), action logits (5-dimensional).
Training: PPO with 10 random seeds per regime.
Policy Graphs (Distributed Deployment Illustration)
Two-level hierarchical policy graphs are used to illustrate CALF's distributed deployment capabilities. Policy units are trained separately in Mode 1 (local sim) and then deployed across Pi and Desktop in Mode 3 with NetworkShim on inter-unit communication channels. Full topology specifications are described alongside results in Section 8.6.3.
Hardware and Network Conditions
Hardware
Desktop (Policy Host):
- CPU: Intel i7-10700K (8 cores, 3.8 GHz)
- RAM: 32 GB
- GPU: NVIDIA RTX 3070 (optional, PPO runs on CPU)
- OS: Ubuntu 22.04, Python 3.8
Raspberry Pi 4 Model B (Environment Host):
- CPU: Quad-core ARM Cortex-A72 (1.5 GHz)
- RAM: 4 GB
- OS: Raspberry Pi OS, Python 3.9
Network Configurations
Ethernet-Clean: Physical Ethernet cable between Desktop and Pi. Observed latency: mean 2 ms, jitter 0.5 ms, loss 0.0%. Bandwidth: 1 Gbps (link capacity).
Wi-Fi-Normal: Desktop and Pi on same Wi-Fi network (802.11ac, 5 GHz). Observed latency: mean 30 ms, jitter 10 ms, loss 2%. Bandwidth: approximately 50 Mbps (measured throughput).
Wi-Fi-Degraded: Wi-Fi-Normal + tc netem impairments on Desktop interface to simulate congested network. Configuration: tc qdisc add dev wlan0 root netem delay 50ms 30ms loss 5%. Observed latency: mean 80 ms, jitter 40 ms, loss 10%.
All network statistics (latency, jitter, loss) are measured using LatencyTracer during pilot runs, verified across 1000 packet samples, and logged for reproducibility.
Evaluation Metrics
Primary metrics are episodic return (CartPole survival time, max 500; MiniGrid goal reward minus step penalties), success rate (CartPole: return \(\geq 475\); MiniGrid: goal reached), and sim-to-real gap (\(\text{Gap} = \frac{\text{Perf}_{\text{Sim-Clean}} - \text{Perf}_{\text{Real-Wi-Fi}}}{\text{Perf}_{\text{Sim-Clean}}} \times 100\%\)). Network metrics are end-to-end latency (from packet timestamps, mean/median/p95), throughput (episodes per hour), and packet loss rate. Results are reported as mean \(\pm\) standard deviation across 10 seeds; significance assessed by paired \(t\)-test (\(\alpha = 0.05\)) with Cohen's \(d\) effect size.
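The sim-to-real gap metric translates directly into code; the short helper below follows the formula in the text exactly (only the function name is an illustrative choice).

```python
def sim_to_real_gap(perf_sim_clean, perf_real_wifi):
    """Gap = (Perf_Sim-Clean - Perf_Real-Wi-Fi) / Perf_Sim-Clean * 100%."""
    return (perf_sim_clean - perf_real_wifi) / perf_sim_clean * 100.0

# E.g. a return of 400 on real Wi-Fi against 500 in clean sim:
gap = sim_to_real_gap(500.0, 400.0)  # 20.0 (% degradation)
```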
Results
This section presents empirical findings demonstrating that (1) network-aware training substantially improves real deployment performance for distributed policy graphs (RQ2), (2) different network pathologies have distinct impacts on performance with stochastic jitter and packet loss proving more detrimental than constant latency (RQ2 refined), (3) small policy graphs can be deployed across heterogeneous devices whilst maintaining competitive performance on simple tasks (distributed deployment illustration), and (4) systems measurements support CALF's practical viability for edge-cloud deployments (RQ3).
All experiments were conducted following the methodology specified in Section 8.4, with 10 random seeds per training regime to ensure statistical rigour. Results are presented as mean \(\pm\) standard deviation across seeds unless otherwise stated. Statistical significance is assessed using paired \(t\)-tests (\(\alpha = 0.05\)).
Network-Aware Training Improves Real Deployment Performance
CartPole Results
Table 8.3 presents mean episode return across 10 seeds per training regime and deployment mode.
| Training Regime | Sim-Clean | Sim+Net | Real-Eth | Wi-Fi-N | Wi-Fi-D |
|---|---|---|---|---|---|
| Baseline | 495 \(\pm\) 7 | 310 \(\pm\) 48 | 288 \(\pm\) 62 | 173 \(\pm\) 71 | 92 \(\pm\) 54 |
| Delay-Only | 482 \(\pm\) 11 | 468 \(\pm\) 16 | 425 \(\pm\) 32 | 348 \(\pm\) 49 | 218 \(\pm\) 58 |
| Full Net-Aware | 476 \(\pm\) 9 | 472 \(\pm\) 13 | 458 \(\pm\) 22 | 442 \(\pm\) 27 | 378 \(\pm\) 41 |
The baseline collapses to 92 \(\pm\) 54 under Wi-Fi-Degraded—an 81.4% performance drop—because policies predicated on instantaneous feedback fail when observations arrive 80 ms late. Full network-aware training achieves 378 \(\pm\) 41 in Wi-Fi-D, a 3.95\(\times\) reduction in the sim-to-real gap (\(t(9) = 12.7\), \(p < 0.001\), Cohen's \(d = 2.31\)). Real-Ethernet performance (458 \(\pm\) 22) closely matches Sim+Network (472 \(\pm\) 13), confirming that Mode 2 synthetic models accurately represent real Mode 3 conditions. Figure 8.4 visualises the degradation trajectories.
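The 3.95\(\times\) figure follows directly from the table's mean returns and the gap metric defined under Evaluation Metrics; the arithmetic can be checked in a few lines.

```python
# Gap reduction from Table 8.3's mean returns (Sim-Clean vs Wi-Fi-D).
baseline_gap = (495 - 92) / 495      # ~0.814: the 81.4% drop
net_aware_gap = (476 - 378) / 476    # ~0.206: a 20.6% drop
reduction = baseline_gap / net_aware_gap  # ~3.95x smaller gap
```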
MiniGrid Results
Table 8.4 presents success rate (percentage of episodes reaching the goal) for MiniGrid DoorKey-8x8. Unlike CartPole's continuous survival metric, MiniGrid provides a binary success signal, making results directly interpretable as task completion reliability.
| Training Regime | Sim-Clean | Sim+Net | Real-Eth | Wi-Fi-N | Wi-Fi-D |
|---|---|---|---|---|---|
| Baseline | 94 \(\pm\) 4 | 76 \(\pm\) 9 | 73 \(\pm\) 11 | 61 \(\pm\) 13 | 44 \(\pm\) 16 |
| Delay-Only | 91 \(\pm\) 5 | 87 \(\pm\) 6 | 84 \(\pm\) 7 | 77 \(\pm\) 9 | 64 \(\pm\) 11 |
| Full Net-Aware | 89 \(\pm\) 4 | 87 \(\pm\) 5 | 85 \(\pm\) 6 | 81 \(\pm\) 7 | 74 \(\pm\) 9 |
Baseline training achieves 94% success in Sim-Clean but drops to 44% in Wi-Fi-D—a 53.2% degradation. Full network-aware training achieves 74% in Wi-Fi-D (17.0% drop from Sim-Clean), a 3.13\(\times\) reduction in the deployment gap (\(t(9) = 8.4\), \(p < 0.001\), Cohen's \(d = 1.87\)). The smaller absolute effect than CartPole is consistent with MiniGrid's reduced timing sensitivity: delayed actions cause overshooting rather than catastrophic instability. Delay-only training provides partial robustness (64% in Wi-Fi-D), confirming that stochastic network phenomena require explicit modelling.
Impact of Different Network Pathologies
An ablation study trains CartPole policies under four conditions—latency-only (constant 50 ms), stochastic additional delay (\(\Delta t \sim \max(0, \mathcal{N}(0, 40^2))\) ms), loss-only (10% dropout, zero delay), and combined (full network model)—and evaluates all on Real-Wi-Fi-Degraded. Table 8.5 presents the results.
| Training Regime | Real-Wi-Fi-Degraded |
|---|---|
| Baseline (none) | 92 \(\pm\) 54 |
| Latency-Only (50 ms) | 275 \(\pm\) 52 |
| Stochastic Add. Delay (\(\sigma = 40\) ms) | 315 \(\pm\) 47 |
| Loss-Only (10%) | 308 \(\pm\) 49 |
| Combined (full model) | 378 \(\pm\) 41 |
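The stochastic additional-delay condition samples \(\max(0, \mathcal{N}(0, 40^2))\) ms per packet; a minimal sampler (function name assumed) makes the truncation-at-zero explicit.

```python
import random

def sample_additional_delay_ms(rng, sigma=40.0):
    """Sample the ablation's stochastic additional delay:
    max(0, N(0, sigma^2)) milliseconds -- negative draws clip to 0,
    so roughly half of all packets incur no extra delay."""
    return max(0.0, rng.gauss(0.0, sigma))

rng = random.Random(0)
draws = [sample_additional_delay_ms(rng) for _ in range(1000)]
```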
The ablation reveals a clear hierarchy of network pathology severity. Stochastic additional delay training (315 \(\pm\) 47) outperforms latency-only (275 \(\pm\) 52), despite both conditions having similar average delay magnitudes. This counterintuitive result has important implications: constant delays allow policies to learn fixed-horizon predictive models (“the state I observe now reflects what happened 50 ms ago; I should plan 50 ms ahead''), whereas stochastic additional delay forces policies to maintain uncertainty estimates over observation freshness. Training under stochastic delay therefore induces more conservative, robust control strategies that hedge against worst-case timing.
Packet loss (308 \(\pm\) 49) proves similarly detrimental to jitter. When 10% of observations are dropped, policies must infer missing state information or defer actions until fresh observations arrive. Policies trained without loss awareness assume all observations are fresh and trustworthy; when deployed under loss, they act on stale or interpolated observations, leading to control failures. Loss-trained policies learn to detect observation staleness (e.g., via action-observation consistency checks) and adopt conservative strategies when observations are missing.
The combined training regime (378 \(\pm\) 41) significantly outperforms any single-factor training (pairwise \(t\)-tests: all \(p < 0.01\)), demonstrating non-additive interactions between network pathologies. Latency, jitter, and packet loss compound: jittery latency with occasional packet loss creates scenarios where the policy must handle simultaneous timing uncertainty and information gaps. Training under the full joint distribution enables policies to develop integrated coping strategies (e.g., maintaining belief states over delayed, noisy, and incomplete observations) that single-factor training cannot discover.
Distributed Policy Graph Deployment
Two-level hierarchical architectures for CartPole and MiniGrid illustrate CALF's distributed deployment capabilities. These experiments show that policy graphs trained in simulation can transfer to edge-cloud hardware, and that simple decompositions with time-critical units on edge devices achieve competitive performance whilst exercising the commitment mechanisms of Chapter 5.
CartPole Hierarchical Policy Graph
A two-level CartPole graph decomposes control into an Angle Stabiliser (Unit A, reward \(r_A = -|\theta| - 0.1|x|\), deployed on Pi) and a Recentring unit (Unit B, reward \(r_B = -|x| - 0.05|\theta| - 0.1|\Delta a|\), deployed on Desktop), with a rule-based manager delegating to Unit A when \(|\theta| > 5\degree\). Table 8.6 shows the distributed deployment achieves 465 \(\pm\) 24—intermediate between flat-on-Pi (472 \(\pm\) 21) and flat-on-Desktop (448 \(\pm\) 28)—whilst achieving 22 ms median latency by keeping time-critical control local. The modest gap relative to flat-on-Pi reflects inter-unit handoff costs, exactly the overhead that the commitment mechanisms of Chapter 5 are designed to amortise.
| Deployment Configuration | Episode Return | E2E Latency (p50/p95) |
|---|---|---|
| Flat (Desktop) | 448 \(\pm\) 28 | 38 ms / 62 ms |
| Flat (Pi) | 472 \(\pm\) 21 | 6 ms / 11 ms |
| Hierarchical (Distributed) | 465 \(\pm\) 24 | 22 ms / 45 ms |
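The rule-based manager's routing decision (delegate to the Angle Stabiliser whenever \(|\theta| > 5\degree\)) reduces to a one-line predicate; the string labels below are illustrative stand-ins for the units' service identifiers.

```python
def manager_route(theta_deg: float) -> str:
    """Rule-based manager for the two-level CartPole graph:
    route to Unit A (Angle Stabiliser, on the Pi) when the pole
    tilts past 5 degrees, otherwise to Unit B (Recentring, on
    the Desktop)."""
    return "unit_A" if abs(theta_deg) > 5.0 else "unit_B"

# Pole tipping: stabilise locally on the edge device.
route = manager_route(7.2)
```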
MiniGrid Hierarchical Policy Graph
MiniGrid's natural subgoal structure (find key \(\rightarrow\) unlock door \(\rightarrow\) reach goal) defines two specialist units: Unit K (key policy, deployed on Pi) and Unit G (goal policy, deployed on Desktop), with a rule-based manager switching on has_key. The hierarchical deployment achieves 79% success—close to flat-on-Pi (82%) and above flat-on-Desktop (77%)—with the 3-point gap relative to flat-on-Pi not statistically significant (\(p = 0.18\)). Deploying the time-sensitive key-collection unit locally avoids network round-trips during interactive item manipulation, whilst the goal-navigation unit on Desktop tolerates moderate latency.
| Deployment Configuration | Success Rate (%) |
|---|---|
| Flat (Desktop) | 77 \(\pm\) 9 |
| Flat (Pi) | 82 \(\pm\) 7 |
| Hierarchical (Distributed) | 79 \(\pm\) 8 |
These results illustrate the distributed policy graph execution model from Chapter 5 and provide initial evidence that CALF can deploy hierarchical policies across edge-cloud infrastructure.
Systems Measurements and Infrastructure Validation
End-to-end latency, throughput, and resource utilisation are measured during distributed policy graph execution to assess CALF's practical feasibility (RQ3). Results indicate that CALF's architecture supports responsive control on resource-constrained edge devices whilst maintaining efficient utilisation of heterogeneous hardware.
End-to-End Latency
Table 8.8 presents latency measurements across network configurations. Latency is measured from environment observation emission to policy action receipt, capturing the full round-trip communication delay.
| Configuration | Latency p50 (ms) | Latency p95 (ms) |
|---|---|---|
| Local (Pi only) | 5.2 | 9.8 |
| Ethernet (Pi \(\leftrightarrow\) Desktop) | 8.7 | 14.3 |
| Wi-Fi-Normal | 34.5 | 68.2 |
| Wi-Fi-Degraded | 82.1 | 152.7 |
Local execution on Raspberry Pi achieves sub-10 ms latency at p95, validating that edge devices can support responsive control loops. Ethernet deployment adds minimal overhead (8.7 ms median versus 5.2 ms local), reflecting the low latency and near-zero packet loss of wired connections. Wi-Fi-Normal introduces substantial variability (34.5 ms median, 68.2 ms p95), with p95 latency exceeding median by \(2\times\) due to jitter and occasional retransmissions. Wi-Fi-Degraded exhibits severe tail latency (152.7 ms p95), demonstrating the worst-case conditions against which network-aware training must be robust.
These measurements validate the network models used in Mode 2 training. Our synthetic Wi-Fi-Normal model (\(\mathcal{N}(30, 10^2)\) ms latency, 2% loss) closely matches measured Wi-Fi-Normal (34.5 ms median, implying fitted mean \(\approx 34\) ms). This alignment confirms that policies trained in Mode 2 experience representative network conditions, enabling successful transfer to Mode 3 deployment.
Throughput and Resource Utilisation
Table 8.9 reports CPU and memory usage during distributed policy graph execution, demonstrating that CALF's architecture enables balanced workload distribution across heterogeneous hardware.
| Device | CPU (%) | Memory (MB) | Throughput (episodes/hour) |
|---|---|---|---|
| Pi | 52 | 310 | — |
| Desktop | 18 | 420 | — |
| System | — | — | 1840 |
The Raspberry Pi operates at 52% average CPU utilisation, indicating headroom for additional workloads or more complex policy networks. Memory usage (310 MB) remains well within the Pi's 4 GB capacity, validating that CALF's binary protocol and efficient serialisation avoid memory bloat. Desktop CPU utilisation is low (18%), reflecting that Manager and Unit B execute lightweight policies; this headroom could be exploited by deploying multiple policy graphs or running compute-intensive strategic planning (e.g., tree search, model-based lookahead) on the cloud server whilst edge devices handle real-time control.
System throughput (1840 episodes/hour) demonstrates that CALF supports high-frequency experimentation. At this rate, evaluating a trained policy over 50 episodes (typical experimental protocol) requires \(<2\) minutes, enabling rapid iteration during development. For comparison, frameworks that require environment-policy co-location (e.g., Gym running locally) achieve similar throughput but cannot exploit distributed deployment; frameworks that rely on heavyweight RPCs (e.g., gRPC without optimisation) often suffer \(5{-}10\times\) throughput degradation due to serialisation overhead. CALF's custom binary protocol achieves deployment flexibility without sacrificing performance.
Discussion
Network as an Orthogonal Axis of Sim-to-Real Transfer
Network conditions constitute an independent dimension of domain randomisation, orthogonal to physics and visual randomisation. The analogy is direct: just as physics randomisation samples friction \(\sim U(0.3, 0.7)\) to make policies robust to uncertain surfaces, network randomisation samples latency \(\sim \mathcal{N}(30, 10^2)\) ms to make policies robust to uncertain networks. Both expose the agent to a distribution during training, yielding robustness at deployment. A policy trained with perfect timing may fail catastrophically on a real system with 100 ms lag even if physics are perfectly modelled; the two axes are conceptually and empirically distinct.
The ablation (Section 8.6.2) extends this analogy. Training under constant delay is analogous to sampling friction from a point mass rather than a distribution: the policy adapts to the mean but remains brittle to deviations. Training under stochastic delay forces policies to hedge across a distribution of timing perturbations. The implication is direct: even when the mean latency is known, the variance must be included in training. CALF provides infrastructure to make network-aware training systematic and reproducible, treating prior delay-aware fixes (e.g., Hwangbo et al. modelling actuator dynamics) as a domain-agnostic methodology rather than robot-specific engineering.
CALF as a Platform for Future Work
Within this thesis, CALF serves as deployment substrate for the distributed-policy work that follows: Chapter 7's efficient edge models address running policy units on resource-constrained hardware; Chapter 5 provides the policy-graph abstraction CALF executes; and Chapter 9's purpose-built USB hardware path relies on the same networking infrastructure. For the research community, natural extensions include trace-based training using recorded real-world network logs, multi-agent settings where inter-agent messages pass through NetworkShim, dynamic computation offloading based on current network state, and integration with delay-correcting algorithms such as DCAC.
Limitations
1. Simulated environments. CartPole and MiniGrid are simulated, not physical robots. This allows isolation of network effects but limits ecological validity; future work should validate CALF on physical systems where sensor noise, actuation dynamics, and safety constraints are present.
2. Limited network scenarios. Experiments cover LAN-like conditions only (Ethernet, Wi-Fi within one building). WAN, cellular, and adversarial conditions—each with distinct latency asymmetries and jitter profiles—are not evaluated and may require different training strategies.
3. Simple policy graphs. Distributed deployments use 2-unit decompositions with rule-based managers. Deeper hierarchies (3+ levels), learned option discovery, and end-to-end policy graph training under network constraints remain unexplored; Chapter 5 provides the formalism that such work would require.
4. Offline training and single-agent focus. Policies are trained in simulation then deployed without online adaptation, and CALF currently targets single-agent RL. Online adaptation under deployment-time network conditions and extension to multi-agent coordination under delays are natural next steps.
The core finding—network-aware training reduces the network reality gap by \(3\)--\(4\times\)—is orthogonal to physics fidelity and plausibly generalises to physical robots, though empirical testing is necessary.
Future Directions
1. Richer environments and modalities. Extending CALF to continuous control (MuJoCo locomotion, manipulation) and vision-based tasks would test network-aware training where bandwidth becomes a first-order constraint alongside latency.
2. Advanced network models. Time-varying conditions (diurnal patterns, congestion), adversarial networks, and trace-based training using cellular or campus Wi-Fi logs would enable policies tuned to specific deployment environments.
3. End-to-end policy graph training. Investigating whether Option-Critic or HIRO-style hierarchies discover temporally extended options naturally robust to communication delays, and learning optimal unit-placement based on communication requirements, remain open problems.
4. Multi-agent and adaptive deployment. Extending CALF to MARL—where inter-agent messages traverse NetworkShim—and developing policies that dynamically offload computation between edge and cloud based on observed network state, represent two directions that would increase practical scope.
Conclusion
This chapter introduced CALF (Communication-Aware Learning Framework), infrastructure that extends the policy graph framework from Chapter 5 to network-aware distributed execution across heterogeneous hardware. Where Chapter 5 established policy graphs as directed graph structures enabling modular decomposition—with policy units coordinating through hard routing and commitment bounds—this chapter addressed the systems challenge of deploying those policy units across real networks where latency, jitter, and packet loss emerge as first-order constraints.
CALF realises policy graphs as networked services with NetworkShim middleware transparently injecting impairments on graph edges, enabling network-aware training that reduces deployment degradation by \(4\times\) (CartPole) and approximately \(3\times\) (MiniGrid). Stochastic network phenomena—jitter and packet loss—prove more detrimental than constant latency, challenging the fixed-delay focus of prior delay-aware RL. Illustrative distributed deployments across Raspberry Pi and desktop hardware demonstrate that hierarchical architectures with time-critical units executing locally maintain competitive performance under network constraints.
These findings establish network conditions as an orthogonal axis of sim-to-real transfer, complementing the physics and visual domain randomisation reviewed in Chapter 4. The architectural patterns of Chapter 3—A320 flight computers distributing responsibility across ELACs and SECs, power grids coordinating IEDs with SCADA—motivate CALF's design: just as engineered systems achieve reliability through hierarchical specialisation, distributed policy graphs partition computation across edge and cloud with accountability through commitment mechanisms. Chapter 7's efficient edge models and Chapter 9's purpose-built hardware path build on this infrastructure, and CALF's progressive deployment modes—pure simulation, simulation with network models, real hardware—stage the path from theory to deployment by treating communication constraints as tractable training objectives rather than deployment obstacles.