Chapter 7

Models

Distributed policy graphs—grounded in the division-of-labour principles discussed in Chapter 2 and formalised in Chapter 5—require efficient edge models to make real-world deployment practical. When policy units are distributed across heterogeneous hardware, the computational cost and communication overhead of transmitting high-dimensional visual observations can dominate decision latency, particularly on resource-constrained edge devices. This chapter introduces MiniConv, a library of small convolutional encoders designed to compile cleanly to OpenGL fragment shaders for broad embedded GPU support. A split-policy architecture is realised in which a lightweight on-device encoder extracts compact visual features that are transmitted to a remote policy head, reducing decision latency in bandwidth-limited settings and lowering server-side compute per request. Across three visual control tasks trained with PPO, SAC, and DDPG, MiniConv encoders remain competitive with the chapter's Full-CNN baselines under pixel observations in the reported fixed-seed runs, whilst enabling practical deployment on devices ranging from the Raspberry Pi Zero 2 W to the NVIDIA Jetson Nano. The infrastructure developed here directly supports the distributed policy graph deployment discussed in Chapter 8.



7.1

Introduction

Policy graphs—formalised in Chapter 5—embody the division of labour identified in Chapter 2: specialist policy units coordinate through hard routing and commitment bounds, inheriting the architectural patterns that enable A320 flight computers and power grid controllers to achieve reliability through modularity. Chapter 5 motivates a deployment picture in which rapid, reactive components execute on low-power edge devices near actuators, whilst more deliberate reasoning can run on remote GPU clusters. This distributed execution exploits heterogeneous hardware—edge processors handle time-critical perception, cloud servers handle optimisation—whilst commitment bounds limit handoff frequency to control communication overhead.

However, the practical viability of this architecture depends critically on edge efficiency. When a policy unit processes high-dimensional visual observations on resource-constrained hardware—a Raspberry Pi Zero 2 W with 512 MB RAM and an embedded Broadcom GPU, for instance—two bottlenecks emerge: the computational cost of feature extraction and the communication cost of transmitting observations. A conventional deployment transmitting full RGB frames to a remote server incurs substantial decision latency in bandwidth-limited settings and concentrates compute load centrally. Conversely, an encoder small enough to run efficiently on-device reduces both communication overhead and server-side load, enabling the edge-to-cloud division of labour that policy graphs require.

This chapter introduces MiniConv, a library of compact convolutional encoders designed for this deployment context, and evaluates the resulting split-policy architecture across learning performance, on-device execution, decision latency, and server scalability. The findings establish that lightweight visual encoders can serve as components of distributed policy graphs—supporting the edge-to-cloud architectures explored in Chapter 8.

7.2

Related Work

Approaches to on-device neural network inference range from specialist hardware accelerators, to architectures such as MobileNet that achieve favourable accuracy–efficiency trade-offs through depthwise separable convolutions, to post-hoc compression methods such as pruning, quantisation, and knowledge distillation.

More directly related to split-policy execution, several systems partition deep neural network inference between end devices and the edge or cloud to optimise latency and resource usage under bandwidth constraints. Neurosurgeon selects partition points in DNNs to balance device computation against transmission cost, whilst Edge Intelligence explores on-demand co-inference with device–edge synergy. Teerapittayanon et al. consider distributed DNN execution across end devices, edge servers, and the cloud. MiniConv is complementary: it applies a similar division of labour to RL policies, emphasising wide hardware support through OpenGL shader execution and transmitting compact feature representations rather than raw observations. This work evaluates the resulting trade-offs in decision latency, scalability, and device resource pressure.

7.3

Implementation

The MiniConv library provides small, composable encoder blocks designed to compile cleanly to OpenGL fragment shaders, respecting practical constraints such as texture binding and sampling limits. MiniConv encoders are instantiated here with \(K\) output channels (specifically \(K=4\) and \(K=16\)) and trained end-to-end together with a downstream policy in PyTorch. At deployment, only the MiniConv encoder runs on-device (via OpenGL), producing a \(K\)-channel feature tensor per frame; only this tensor is transmitted to the server-side policy head. MiniConv is a library rather than a single fixed architecture: \(K\) and block compositions can be varied to meet device and bandwidth constraints.
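To make the encoder shape concrete, the following is a minimal PyTorch sketch of a \(K\)-channel encoder of the kind described above. The class name, block composition (3\(\times\)3 stride-two convolutions with ReLU), and parameter names are illustrative assumptions, not the exact MiniConv blocks.

```python
import torch
import torch.nn as nn

class MiniConvEncoder(nn.Module):
    """Illustrative K-channel encoder (hypothetical composition, not the
    exact MiniConv blocks): n stride-two conv layers, so an X-by-X input
    yields a K x (X/2^n) x (X/2^n) feature tensor for transmission."""

    def __init__(self, in_channels: int = 3, k: int = 4, n_stride2: int = 3):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(n_stride2):
            layers += [nn.Conv2d(c, k, kernel_size=3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            c = k
        self.blocks = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(x)

# An 84x84 RGB frame maps to a K x 11 x 11 feature map with n = 3
# (84 -> 42 -> 21 -> 11 under padding-1, stride-2 convolutions).
enc = MiniConvEncoder(in_channels=3, k=4, n_stride2=3)
feat = enc(torch.zeros(1, 3, 84, 84))
```

Only a module of this kind would run on-device at deployment; the downstream policy head consumes `feat` server-side.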

The on-device encoder is deployed using OpenGL fragment shaders, which compute each output pixel as a function of one or more input textures and are widely supported across embedded GPUs. This execution model maps naturally to convolution and pooling: a shader samples a neighbourhood of an input texture and writes an output texture, as illustrated in Figure 7.1. MiniConv exploits this mapping whilst respecting the practical limits of low-cost devices. For example, on the Raspberry Pi Zero 2 W, fragment shaders can sample from a maximum of eight bound textures, and each shader is subject to a finite sampling budget (64 texture samples in our deployment). Since each shader pass outputs four channels (RGBA), encoders with larger \(K\) are implemented via multiple passes. These constraints inform the choice of kernel sizes, channel packing, and layer compositions used by MiniConv.
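The feasibility arithmetic implied by these constraints can be sketched as follows. The per-texture sampling-cost model (one sample per kernel position per bound texture) is an assumption for illustration; the constants come from the deployment described above.

```python
import math

MAX_BOUND_TEXTURES = 8       # Pi Zero 2 W fragment-shader binding limit
MAX_SAMPLES_PER_SHADER = 64  # sampling budget used in this deployment
CHANNELS_PER_PASS = 4        # one RGBA output texture per shader pass

def shader_passes(k_out: int) -> int:
    """Each pass writes 4 channels, so K output channels need ceil(K/4) passes."""
    return math.ceil(k_out / CHANNELS_PER_PASS)

def conv_fits_in_one_pass(kernel: int, in_textures: int) -> bool:
    """Assumed cost model: a kernel x kernel convolution reading
    `in_textures` bound textures costs kernel**2 samples per texture."""
    return (in_textures <= MAX_BOUND_TEXTURES and
            kernel * kernel * in_textures <= MAX_SAMPLES_PER_SHADER)

# K = 16 output channels require 4 RGBA passes per layer.
assert shader_passes(16) == 4
# Under this model, a 3x3 kernel over 7 input textures costs 63 samples
# (fits the budget); over 8 textures it costs 72 samples (exceeds it).
assert conv_fits_in_one_pass(3, 7) and not conv_fits_in_one_pass(3, 8)
```

This kind of budget check is what drives the kernel-size and channel-packing choices mentioned above.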

Figure 7.1: Fragment shader input/output, and the mapping of CNN layers to shader passes.
7.4

Evaluation

Deploying split-policy RL on edge devices requires that the on-device encoder preserves policy performance whilst respecting strict compute, memory, and power constraints. The evaluation is organised around eight practical questions:

Q1. Does a split-policy architecture match the learning performance of a conventional Full-CNN baseline under visual observations?
Q2. Does the compressed on-device representation retain sufficient task-relevant information to support high-return behaviour?
Q3. How do per-frame inference latency and variability change under sustained on-device execution?
Q4. What memory footprint does on-device inference impose, and how much RAM headroom remains for other tasks?
Q5. What is the effect of sustained inference on device thermal state and throttling behaviour?
Q6. At what link bandwidth does split inference reduce end-to-end decision latency relative to transmitting full observations?
Q7. On low-power devices, how does OpenGL shader execution compare to a CPU implementation in throughput and stability?
Q8. How do power limits and power consumption affect inference throughput and stability?

These questions are addressed through learning experiments on visual control tasks, on-device execution benchmarks, and end-to-end measurements of decision latency and server scalability under bandwidth constraints.

7.4.1

Learning

MiniConv encoders are evaluated on two MuJoCo locomotion tasks (Walker2d, Hopper) and the classic control Pendulum task under visual observations. Walker2d is trained with PPO, Hopper with SAC, and Pendulum with DDPG, selected based on preliminary stability under pixel observations and standard practice in Stable-Baselines3 for the respective tasks. Unless otherwise stated, Walker2d and Hopper are trained for 2,000 episodes and Pendulum for 1,000 episodes. Because algorithms differ across tasks, cross-task comparisons are not meaningful; the focus is on within-task comparisons between encoders. Results are reported for a single run per condition (fixed seed), and variance across seeds is not yet characterised.

Algorithms and baselines. Table 7.1 summarises the learning algorithm used for each task.

Task       Algorithm
Walker2d   PPO
Hopper     SAC
Pendulum   DDPG
Table 7.1: Algorithms used for each visual control task.

For each task, the Full-CNN baseline corresponds to the default convolutional feature extractor used by Stable-Baselines3 for image observations (CnnPolicy). The MiniConv conditions replace only this observation encoder (with \(K \in \{4,16\}\) output channels); the downstream policy, value networks, and all other training settings are unchanged across encoder variants. The split-policy architecture does not assume a particular RL algorithm; results should be interpreted as within-task evidence that encoder partitioning can be compatible with learning under multiple common RL algorithms.

All experiments use 84\(\times\)84 RGB pixel observations stacked over three frames, processed through SB3's default image normalisation. Environments use Gymnasium: Walker2d-v4 and Hopper-v4 via MuJoCo, and Pendulum-v1 (Classic Control).

These experiments test whether replacing the standard image encoder with MiniConv preserves the ability to learn high-return behaviour under pixel observations. Within each task, MiniConv remains competitive with the Full-CNN baseline, but summary statistics exhibit task- and representation-size-dependent trade-offs between final and mean return. Each condition reports Best (maximum episodic return observed), Mean (average episodic return over training), and Final (mean episodic return over the final 100 episodes). These findings address Q1–Q2. Given that each condition is evaluated in a single fixed-seed run, the reported differences should be interpreted as indicative rather than statistically characterised.

Walker2d (PPO). MiniConv \(K=4\) achieves slightly higher final return than Full-CNN (3360 vs 3296), whilst Full-CNN attains higher mean return over training (2800 vs 2680). \(K=16\) reaches the highest single episode (3800) but exhibits lower sustained performance, suggesting less consistent behaviour under pixel observations (Table 7.2).

Architecture               Best   Final   Mean   Episodes
MiniConv encoder (K=4)     3640   3360    2680   2000
MiniConv encoder (K=16)    3800   3184    2320   2000
Full-CNN                   3600   3296    2800   2000
Table 7.2: Walker2d (PPO): episodic return statistics over 2,000 episodes (single fixed-seed run).

Hopper (SAC). MiniConv \(K=4\) yields the strongest final return on Hopper (2360 vs 2240 for Full-CNN), whilst Full-CNN attains higher mean return (1720 vs 1680). The gap between best and final return across all encoders indicates substantial variability in sustained performance under pixel observations in these single-seed runs (Table 7.3).

Architecture               Best   Final   Mean   Episodes
MiniConv encoder (K=4)     2680   2360    1680   2000
MiniConv encoder (K=16)    2640   2200    1600   2000
Full-CNN                   2656   2240    1720   2000
Table 7.3: Hopper (SAC): episodic return statistics over 2,000 episodes (single fixed-seed run).

Pendulum (DDPG). Both MiniConv encoders outperform Full-CNN on Pendulum final return (\(K=16\): \(-180\) vs \(-248\) for Full-CNN), consistent with this task's sensitivity to smooth, consistent control (Table 7.4). The improvement of \(K=16\) over \(K=4\) suggests that a richer transmitted representation benefits tasks where representation quality affects stability.

Architecture               Best    Final   Mean    Episodes
MiniConv encoder (K=4)     -140    -192    -244    1000
MiniConv encoder (K=16)    -136    -180    -232    1000
Full-CNN                   -142    -248    -288    1000
Table 7.4: Pendulum (DDPG): episodic return statistics over 1,000 episodes (single fixed-seed run).

Taken together, these results suggest that MiniConv encoders can remain competitive with a conventional Full-CNN baseline under visual observations, but do not uniformly dominate across summary statistics. Encoder-\(4\) achieves slightly higher final return on Walker2d and Hopper, whilst Full-CNN attains the higher mean return in both tasks; encoder-\(16\) is less effective on the locomotion tasks but performs best on Pendulum. This pattern indicates that the appropriate representation size is task-dependent and should be selected alongside device compute and bandwidth constraints.

7.4.2

Execution Performance

Per-frame inference time is characterised as a function of input size and device class; drift under sustained load is evaluated; and CPU temperature, RAM utilisation, and power consumption are recorded. These experiments address Q3–Q5, Q7, and Q8. The computation–communication trade-off underpinning split inference is then analysed to address Q6.

In addition to task-scale inputs, a high-resolution stress test (up to 3000\(\times\)3000) is included to expose throttling and power-limit behaviour under sustained load, particularly on the Jetson Nano.

Figure 7.2 summarises per-frame processing time across devices as the input size varies. As the input size increases, frame processing time increases on the Raspberry Pi platforms, whilst the Jetson Nano exhibits substantially lower times across the tested range. On the Pi Zero 2 W, maintaining a frame rate of five frames per second requires keeping the input size below roughly 500 pixels per side (that is, below \(500\times500\)).

Figure 7.2: Per-frame processing time as input size varies, on the Raspberry Pi Zero 2 W, Raspberry Pi 4B, and NVIDIA Jetson Nano.

Sustained inference time is measured over extended runs (Figure 7.3). The Jetson Nano exhibits a marked increase in per-frame time after an initial period, and power limits alter this behaviour. For the Pi Zero 2 W, GPU (OpenGL) inference is substantially faster and more stable than CPU (PyTorch) inference over the same horizon.

Figure 7.3: Sustained inference time on the Jetson Nano (5 W limit vs no limit) and the Pi Zero 2 W (moving average).

To characterise the resource pressures associated with sustained inference, Figure 7.4 reports CPU temperature and RAM utilisation on the Pi Zero 2 W (CPU vs GPU execution), and power usage and memory pressure on the Jetson Nano (5W cap vs no limit). Across these experiments, RAM utilisation remains comparatively stable, whilst temperature and power reflect the expected constraints of sustained on-device execution.

Figure 7.4: Resource pressure under sustained inference on the Pi Zero 2 W (CPU and GPU execution) and the Jetson Nano (5 W limit and no power limit).

Ultimately, the utility of split-policy execution depends on the balance between computation and communication. Figure 7.5 illustrates the decision-latency components that vary between a server-only pipeline and the split-policy pipeline.

Figure 7.5: A breakdown of the steps involved in each decision that contribute to decision latency.

A simplified bandwidth model considers \(B\) as link bandwidth in bits per second, \(X\) the input width and height, \(n\) the number of stride-two layers in the on-device encoder (so the transmitted feature map has spatial size \((X/2^n)\times(X/2^n)\)), and \(j\) the per-frame on-device processing time. Both raw observations and encoded features are transmitted as uncompressed uint8 buffers: a full RGBA frame requires \(4X^2\) bytes, whilst a \(K\)-channel feature map requires \(K(X/2^n)^2\) bytes (\(K=4\) for the latency experiments). Image compression would shift the break-even point and is left to future work. Server-side compute is excluded to isolate the communication break-even point; server-side compute reductions are evaluated separately in the scalability experiment. Under these assumptions, split-policy inference yields a lower decision latency than a server-only pipeline when:

\[B < \frac{32X^2\left(1 - \frac{K}{4\cdot 2^{2n}}\right)}{j}.\]
Equation 7.1

For the Pi Zero 2 W configuration in Figure 7.2 (\(X=400\), \(n=3\), \(j \approx 0.1\,\mathrm{s}\), \(K=4\)), this yields a break-even bandwidth of approximately \(50.4\,\mathrm{Mb\,s^{-1}}\).
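Equation 7.1 is easy to evaluate directly; the following sketch reproduces the worked figure above (function name hypothetical):

```python
def breakeven_bandwidth_bps(x: int, n: int, j: float, k: int = 4) -> float:
    """Equation 7.1: split inference wins when link bandwidth B (bits/s)
    is below 32*X^2*(1 - K/(4*2^(2n))) / j, comparing an uncompressed
    4*X^2-byte RGBA frame against a K*(X/2^n)^2-byte feature map plus
    j seconds of on-device encoding."""
    return 32 * x**2 * (1 - k / (4 * 2**(2 * n))) / j

# Pi Zero 2 W configuration from the text: X=400, n=3, j ~ 0.1 s, K=4.
b_star = breakeven_bandwidth_bps(400, 3, 0.1, 4)  # ~50.4 Mb/s
```

Faster on-device encoding (smaller \(j\)) or a smaller transmitted representation (larger \(n\), smaller \(K\)) both raise the break-even bandwidth, widening the regime in which the split pipeline wins.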

7.4.3

End-to-End Decision Latency

To address Q6 empirically, end-to-end decision latency is measured as the median wall-clock time (over 1,000 decisions per setting) from the availability of an observation on the client device to the receipt of an action from the server. A conventional client–server pipeline transmitting the full RGBA observation is compared against the split-policy pipeline, where the on-device encoder produces a spatially smaller \(K=4\) representation and only this representation is transmitted.

Table 7.5 summarises results under bandwidth shaping. At low bandwidth, the split-policy pipeline substantially reduces decision latency, as transmission dominates the decision loop. As bandwidth increases, the benefit diminishes and a crossover occurs, after which the additional on-device compute cost dominates.

Bandwidth                      Server-only latency (ms)   Split-policy latency (ms)
\(10\,\mathrm{Mb\,s^{-1}}\)    540                        145
\(25\,\mathrm{Mb\,s^{-1}}\)    240                        140
\(50\,\mathrm{Mb\,s^{-1}}\)    140                        138
\(100\,\mathrm{Mb\,s^{-1}}\)   90                         137
Table 7.5: End-to-end decision latency under bandwidth shaping.

Consistent with the break-even analysis, the split-policy pipeline provides the largest reduction in decision latency at \(10\)–\(25\,\mathrm{Mb\,s^{-1}}\), is approximately neutral around \(50\,\mathrm{Mb\,s^{-1}}\), and becomes compute-bound on the client at higher bandwidth.
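The crossover can be reproduced from the transmission-plus-encoding model alone. The sketch below assumes the Pi Zero 2 W configuration used in the break-even analysis (\(X=400\), \(n=3\), \(j \approx 0.1\,\mathrm{s}\), \(K=4\)) and omits server compute and fixed network overheads, so the numbers indicate ordering rather than the measured values in Table 7.5.

```python
def transmit_seconds(n_bytes: int, bandwidth_bps: float) -> float:
    """Time to send an uncompressed buffer over the link."""
    return 8 * n_bytes / bandwidth_bps

def modelled_latencies(bandwidth_bps: float, x: int = 400, n: int = 3,
                       j: float = 0.1, k: int = 4):
    """Transmission-plus-encoding model behind the crossover in
    Table 7.5 (server compute and fixed overheads omitted)."""
    server_only = transmit_seconds(4 * x * x, bandwidth_bps)           # full RGBA frame
    split = j + transmit_seconds(k * (x // 2**n) ** 2, bandwidth_bps)  # n-times-halved features
    return server_only, split

# Split is modelled faster below the ~50 Mb/s break-even point
# and slower above it, matching the measured ordering.
lo = modelled_latencies(10e6)   # server-only 0.512 s vs split 0.108 s
hi = modelled_latencies(100e6)  # server-only 0.0512 s vs split 0.1008 s
```

The gap between modelled and measured values (e.g. 108 ms modelled vs 145 ms measured at \(10\,\mathrm{Mb\,s^{-1}}\)) is attributable to the server compute and fixed overheads the model excludes.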

7.4.4

Server Scalability

A second practical motivation for the split-policy approach is to reduce the server-side compute cost per decision by moving the early visual feature extraction to the edge device. A simple multi-client setting is considered in which a single server processes requests from multiple concurrent clients, each operating at a fixed decision rate. Experiments are performed on a suitably powerful server with an Intel CPU and an NVIDIA GPU. Table 7.6 reports the maximum number of concurrent clients that can be supported at 10 Hz whilst maintaining a p95 decision latency budget of 100 ms.

Constraint                                   Server-only   Split-policy
10 Hz per client, p95 latency \(<100\) ms    12 clients    36 clients
Table 7.6: Server scalability at a fixed decision rate.

Under this simple setting, split-policy inference increases the number of concurrently served clients by approximately threefold under the same latency budget, reflecting the reduction in server-side compute per request. These figures reflect the specific testbed; real-world scaling will depend on batching, asynchronous I/O, and server hardware.

7.5

Discussion

7.5.1

MiniConv in the Context of Distributed Policy Graphs

The split-policy architecture evaluated in this chapter realises a simple two-unit policy graph: an on-device encoder unit and a remote policy-head unit, connected by a network edge. This configuration directly instantiates the division of labour advocated in Chapter 2: the encoder unit performs compute-intensive visual feature extraction on-device, whilst the policy-head unit performs high-level decision-making on a remote server with greater computational resources. The communication trade-off—quantified by the bandwidth break-even analysis—reflects the cost of the network edge between these two units.

Viewed through the policy graph lens developed in Chapter 5, the MiniConv encoder can be understood as a low-level perception unit that processes raw sensory input and outputs a compact feature representation to a higher-level decision unit. The infrastructure developed in Chapter 8 generalises this pattern, enabling arbitrary compositions of policy units distributed across edge, fog, and cloud tiers. The results presented here—showing that compact on-device encoders can preserve task performance whilst reducing decision latency and server load in the reported settings—suggest that such division of labour can be viable even on resource-constrained hardware.

A limitation of the current work is that the encoder and policy head are trained jointly end-to-end and deployed as a fixed partition. Future work could explore dynamic partitioning strategies in which the split point adapts to runtime bandwidth and compute availability, or hierarchical compositions in which multiple edge devices contribute complementary sensory encodings to a shared policy unit—patterns directly supported by the policy graph formalism.

7.5.2

Privacy and Systems Considerations

By performing initial visual processing on-device, split-policy execution reduces the need to transmit raw frames, which can reduce exposure of sensitive information in camera and screen-based applications; however, compact feature representations can still leak information in principle, and standard transport encryption (e.g., TLS) remains necessary to protect transmitted features from third-party interception.

7.6

Conclusion

This chapter introduced MiniConv, a library of small convolutional encoders designed to compile cleanly to OpenGL fragment shaders, enabling a split-policy RL architecture in which early visual feature extraction is performed on-device. Across three visual control tasks (PPO, SAC, DDPG), MiniConv encoders appear competitive with a conventional Full-CNN baseline under pixel observations in these fixed-seed runs, with representation size exhibiting task-dependent trade-offs between final and mean return. The systems evaluation shows that the split-policy approach can substantially reduce end-to-end decision latency in bandwidth-limited settings (e.g., 540 ms to 145 ms at \(10\,\mathrm{Mb\,s^{-1}}\)) and improve server scalability under a fixed latency budget (12 to 36 concurrent clients at 10 Hz, p95 \(<100\) ms in the testbed); benefits increase as bandwidth decreases and as the transmitted representation is made smaller, but additional on-device computation can dominate at higher bandwidth. The infrastructure and findings presented here flow directly into Chapter 8, which addresses the systems challenges of deploying policy graphs under realistic network conditions—including variable latency, jitter, and packet loss.