year: 2024/10
paper: https://arxiv.org/pdf/2410.14606
website: https://x.com/RichardSSutton/status/1860818651953463542
code: https://github.com/mohmdelsayed/streaming-drl
connections: AMII, streaming RL, online learning
Extended Abstract (copypasta)
Although the prevalence of batch deep RL is often attributed to its sample efficiency, a more critical reason for the absence of streaming deep RL is its frequent instability and failure to learn, which we refer to as stream barrier. This paper introduces the stream-x algorithms, the first class of deep RL algorithms to overcome stream barrier for both prediction and control and match sample efficiency of batch RL
The agents, under the streaming reinforcement learning problem, are required to process one sample at a time without storing any samples for future reuse. Such requirements create additional hurdles compared to batch deep reinforcement learning, even though both learn from a non-stationary stream of data.
Streaming learning methods mainly require CPUs instead of GPUs since no batch updates are used, unless a very large neural network is used, in which case they might benefit from GPUs. In that case, the overhead of context switching between CPU and GPU might be negligible compared to the GPU computation required for the forward and backward passes in very large networks.
Issues that hinder learning:
- learning instability due to occasional large updates,
- learning instability due to activation nonstationarity, and
- improper scaling of data
These issues are already present in batch methods, causing several detrimental effects such as a drop in performance, high variance, or an inability to improve performance. However, they are exacerbated in streaming learning, where updates can fluctuate more from one step to the next due to non-i.i.d. sampling. For example, streaming learning is more prone to instability because successive per-sample gradients can point in different directions, making it difficult to choose a single working step size. In contrast, batch methods mitigate this issue by averaging gradients from an i.i.d.-sampled batch drawn from a large pool.
3.1 Sample efficiency with sparse initialization and eligibility traces
Algorithm 1: SparseInit
In words: Initialize all weights and biases using LeCun Initialization, then set a random subset of those to 0.
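A minimal PyTorch sketch of what this might look like (my own code; `sparsity` is an assumed name for the zeroed fraction, and I haven't double-checked whether the paper uses the normal or uniform LeCun variant):

```python
import torch

def sparse_init_(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    """Sketch of SparseInit: LeCun-style init, then zero a random fraction."""
    fan_in = weight.shape[1] if weight.dim() > 1 else weight.numel()
    with torch.no_grad():
        # LeCun initialization: zero mean, variance 1 / fan_in.
        weight.normal_(0.0, (1.0 / fan_in) ** 0.5)
        # Set a random subset of the entries to zero.
        mask = torch.rand_like(weight) < sparsity
        weight[mask] = 0.0
    return weight

layer = torch.nn.Linear(64, 64)
sparse_init_(layer.weight, sparsity=0.9)
```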
3.2 Adjusting step sizes for maintaining update stability
In the streaming case, it is not clear if choosing a step size that reduces the error in the current sample is the best strategy. A more pertinent goal in streaming learning is to de-emphasize an update if it is too large, for example, if the update overshoots the target on a single sample.
Effective Step Size
Let $\delta$ be the error on a sample before making an update (pre-update error). For TD-learning, this might be $\delta = R_{t+1} + \gamma\, v(S_{t+1}; \boldsymbol{w}) - v(S_t; \boldsymbol{w})$. Let $\bar{\delta}$ be the error on the same sample after making an update (post-update error).
The effective step size measures the relative amount of error correction:
\text{effective step size} \doteq \frac{\delta - \bar{\delta}}{\delta}
This quantity indicates how much of the original error was corrected by the update:
- $\frac{\delta - \bar{\delta}}{\delta} > 1$, overshooting: The update was too aggressive, possibly changing the error's sign
- $\frac{\delta - \bar{\delta}}{\delta} = 1$, full correction: The update perfectly corrected the error
- $0 < \frac{\delta - \bar{\delta}}{\delta} < 1$, partial correction: The update made progress but did not overshoot
- $\frac{\delta - \bar{\delta}}{\delta} = 0$, no correction: The update had no effect on the error
- $\frac{\delta - \bar{\delta}}{\delta} < 0$, negative correction: The update made the error worse

E.g.:
$\delta = 2,\ \bar{\delta} = 0.5 \Rightarrow \frac{\delta - \bar{\delta}}{\delta} = 0.75$ (partial correction)
$\delta = 2,\ \bar{\delta} = -1 \Rightarrow \frac{\delta - \bar{\delta}}{\delta} = 1.5$ (overshoot)
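A tiny numeric sketch of the quantity above (plain Python, names are mine):

```python
def effective_step_size(delta_pre, delta_post):
    """(pre-update error - post-update error) / pre-update error."""
    return (delta_pre - delta_post) / delta_pre

effective_step_size(2.0, 0.5)   # 0.75 -> partial correction
effective_step_size(2.0, -1.0)  # 1.5  -> overshoot
```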
Algorithm 2: Bounding Effective Step Size with Backtracking (idealized optimizer)
The algorithm prevents overshooting by adaptively reducing the step size until the effective step size falls below a chosen maximum threshold.
A small threshold yields conservative updates; a larger one allows more aggressive updates.
Backtracking line search is an optimization technique that starts with an initial step size and progressively reduces it by a factor until certain conditions are met. In this case, we keep reducing the step size until the effective step size falls below the threshold. This is similar to classical line search methods, but here we're focused on preventing overshooting on individual samples rather than optimizing a global objective.
For supervised learning, the eligibility trace simplifies to the gradient of the current prediction, $\nabla_{\boldsymbol{w}} f(x; \boldsymbol{w})$, because there's no temporal component - we're just trying to minimize error on individual input-output pairs. In reinforcement learning, eligibility traces serve a broader purpose by maintaining a decaying history of state visitations, helping to assign credit for rewards to earlier states.
The main drawback of this approach is that it requires an additional forward pass to compute the post-update error for each new candidate step size. This can be expensive when many reductions are needed before we find a step size that satisfies the criterion.
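A minimal sketch of the backtracking idea (my own naming and structure, not the paper's Algorithm 2; `compute_error` re-evaluates the error on the same sample, which is where the extra forward passes come from):

```python
def backtracking_update(params, trace, delta, compute_error,
                        alpha0=1.0, beta=0.5, max_effective_step=0.9):
    """Shrink the step size until the effective step size is below the threshold."""
    alpha = alpha0
    while True:
        # Candidate parameters after an update of w + alpha * delta * z.
        candidate = [p + alpha * delta * z for p, z in zip(params, trace)]
        delta_post = compute_error(candidate)          # extra forward pass
        if (delta - delta_post) / delta <= max_effective_step:
            return candidate, alpha
        alpha *= beta                                  # reduce and retry
```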
Approximate Effective Step Size
To avoid the costly multiple forward passes required by the exact backtracking search, the authors use a first-order Taylor approximation, assuming local linearity (holds approximately when the updates are small, which is partly what we are aiming to achieve).
The post-update prediction for input $x$ can then be written as:
f(x; \boldsymbol{w}') \approx f(x; \boldsymbol{w}) + \nabla_{\boldsymbol{w}} f(x; \boldsymbol{w})^\top \Delta\boldsymbol{w}
where $\boldsymbol{w}' = \boldsymbol{w} + \Delta\boldsymbol{w}$ and $\boldsymbol{w}$ are the parameters after and before the update, and $\Delta\boldsymbol{w}$ is the update vector.
For TD(λ), $\Delta\boldsymbol{w} = \alpha\,\delta\,\boldsymbol{z}$, where $\alpha$ is the step size, $\delta$ the TD error, and $\boldsymbol{z}$ the eligibility trace vector.
For supervised learning, $\Delta\boldsymbol{w} = \alpha\,\delta\,\nabla_{\boldsymbol{w}} f(x; \boldsymbol{w})$. We use the already-computed gradients to compute the directional derivative, estimating the change in the prediction due to a small weight update.
Local Linearity
f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x
This leaves us with the following effective step size for TD(λ) under nonlinear function approximation:
\frac{\delta - \bar{\delta}}{\delta} \approx \alpha\,\big(\nabla_{\boldsymbol{w}} v(S_t; \boldsymbol{w}) - \gamma\,\nabla_{\boldsymbol{w}} v(S_{t+1}; \boldsymbol{w})\big)^\top \boldsymbol{z}
We compute the TD-error on a sample before and after the update, and divide the difference by the original error to get the effective step size.
This would still require an additional backward pass for the value function at the next state $S_{t+1}$, so we further approximate the effective step size:
[!note] Approximate Effective Step Size
To avoid multiple forward passes in the exact backtracking method, the paper uses a first-order Taylor approximation under a local linearity assumption. For small updates, we can estimate how much the network output changes using only the gradient at the current weights. This is cheaper than performing another forward or backward pass.
When we update the weights from $\boldsymbol{w}$ to $\boldsymbol{w}' = \boldsymbol{w} + \Delta\boldsymbol{w}$ by an update vector $\Delta\boldsymbol{w}$, local linearity tells us
f(x; \boldsymbol{w}') \approx f(x; \boldsymbol{w}) + \nabla_{\boldsymbol{w}} f(x; \boldsymbol{w})^\top \Delta\boldsymbol{w}
For TD(λ), $\Delta\boldsymbol{w} = \alpha\,\delta\,\boldsymbol{z}$, where $\alpha$ is the step size, $\delta$ the TD error, and $\boldsymbol{z}$ the eligibility trace vector.
For supervised learning, $\Delta\boldsymbol{w} = \alpha\,\delta\,\nabla_{\boldsymbol{w}} f(x; \boldsymbol{w})$.
This approximation avoids re-running a full forward pass at the new weights, because the directional derivative in the equation above only needs the existing gradient. The local linearity assumption, and hence the approximation, is most accurate when the update is small, which aligns with our goal of preventing large, destabilizing steps.
Local Linearity: we treat $f$ as if it were linear in a small neighborhood around $\boldsymbol{w}$. Formally, for small $\Delta\boldsymbol{w}$:
f(x; \boldsymbol{w} + \Delta \boldsymbol{w}) \approx f(x; \boldsymbol{w}) + \nabla_{\boldsymbol{w}} f(x; \boldsymbol{w})^\top \Delta \boldsymbol{w}
In TD-learning, the error depends on both the current state $S_t$ and the next state $S_{t+1}$:
\delta = R_{t+1} + \gamma\, v(S_{t+1}; \boldsymbol{w}) - v(S_t; \boldsymbol{w})
After the update, the new error $\bar{\delta}$ depends on the updated weights $\boldsymbol{w}' = \boldsymbol{w} + \alpha\,\delta\,\boldsymbol{z}$. The effective step size is defined as
\frac{\delta - \bar{\delta}}{\delta}
Computing $\bar{\delta}$ exactly would require an extra forward pass with the updated weights $\boldsymbol{w}'$. Instead, the paper uses local linearity to approximate how $v$ changes at $S_t$ and $S_{t+1}$, yielding
\frac{\delta - \bar{\delta}}{\delta} \approx \alpha\,\big(\nabla_{\boldsymbol{w}} v(S_t; \boldsymbol{w}) - \gamma\,\nabla_{\boldsymbol{w}} v(S_{t+1}; \boldsymbol{w})\big)^\top \boldsymbol{z}
Even this approximation requires computing a new gradient at $S_{t+1}$ (an extra backward pass). We can avoid this by deriving a bound: the difference between gradients at nearby states can't be too large (Lipschitz continuity), and for nearby states with a discount factor close to 1, this difference is at most 1.
This lets us bound the effective step size using just the L1 norm of the eligibility trace:
\frac{\delta - \bar{\delta}}{\delta} \lesssim \alpha\, \|\boldsymbol{z}\|_1
This bound tells us that, to prevent overshooting, we should scale down the step size when the eligibility trace has a large magnitude or the error $\delta$ is large.
Extra notes on bounding: Lipschitz continuity of the gradients implies that nearby states have similar gradients, which is what lets the gradient-difference term be bounded. The $\max(|\delta|, 1)$ factor exists so that the bound does not become trivial when $\delta$ is very small; $\kappa$ exists as a safety factor for the nonlinearity (its exact role is still unclear to me).
Algorithm 3: Overshooting-bounded Gradient Descent (ObGD)
This algorithm avoids the need for multiple forward passes by directly scaling the step size based on the theoretical bound of the effective step size. The scaling ensures updates remain controlled without requiring backtracking.
The key idea is to scale down the step size based on:
- How large the current error is ($\max(|\delta|, 1)$)
- How "steep" the current update direction is ($\|\boldsymbol{z}\|_1$)
- A safety factor $\kappa$ that accounts for potential nonlinearity
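A hedged sketch of the scaling rule as I understand it (my own code; the default values of `alpha` and `kappa` are assumptions, not taken from the paper's pseudocode):

```python
import torch

def obgd_step(params, traces, delta, alpha=1.0, kappa=2.0):
    """ObGD-style update: shrink the step size whenever the bound
    alpha * kappa * max(|delta|, 1) * ||z||_1 exceeds 1."""
    z_l1 = sum(z.abs().sum().item() for z in traces)    # L1 norm of the full trace
    m = alpha * kappa * max(abs(delta), 1.0) * z_l1
    step = alpha / m if m > 1.0 else alpha
    with torch.no_grad():
        for p, z in zip(params, traces):
            p.add_(step * delta * z)                     # w <- w + step * delta * z
```

No backtracking loop is needed: the scaling is computed once per sample from quantities we already have.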
3.3 Stabilizing activation distribution under non-stationarity
→ They use layer normalization, applied to the pre-activations of each layer (i.e., before applying the activation function).
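For concreteness, a minimal sketch of where the normalization sits in the network (illustrative sizes and activation, not the paper's exact architecture):

```python
import torch.nn as nn

obs_dim = 4  # illustrative
net = nn.Sequential(
    nn.Linear(obs_dim, 128),
    nn.LayerNorm(128),   # normalize the pre-activations
    nn.LeakyReLU(),
    nn.Linear(128, 128),
    nn.LayerNorm(128),
    nn.LeakyReLU(),
    nn.Linear(128, 1),
)
```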
3.4 Proper scaling of data
They use Welford's online algorithm to compute running statistics for scaling observations and rewards.
Algorithm 4: SampleMeanVar
Algorithm 5: Scale Reward
Algorithm 6: NormalizeObservation
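A hedged sketch of what these might look like (my own names; the paper's reward-scaling rule is more involved, so only observation normalization is shown):

```python
class RunningMeanVar:
    """Welford's online algorithm for a running mean and variance."""
    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.count += 1
        d = x - self.mean
        self.mean += d / self.count
        self.m2 += d * (x - self.mean)

    @property
    def var(self):
        return self.m2 / self.count if self.count > 1 else 1.0


obs_stats = RunningMeanVar()

def normalize_observation(obs):
    """Standardize an observation with the running statistics (elementwise for vectors)."""
    obs_stats.update(obs)
    return (obs - obs_stats.mean) / ((obs_stats.var + 1e-8) ** 0.5)
```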
3.5 Stable streaming deep reinforcement learning methods
$\pi(a \mid s; \boldsymbol{\theta})$ is the actor, $v(s; \boldsymbol{w})$ is the critic.
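To tie the pieces together, a rough sketch of a streaming TD(λ) critic update in the same spirit (my own code, not the paper's stream AC(λ) pseudocode; the control algorithm would additionally keep a trace and a scaled update for the actor $\pi$). It reuses `obgd_step` from the sketch above:

```python
import torch
import torch.nn as nn

obs_dim = 4                                   # illustrative
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.LayerNorm(64),
                          nn.LeakyReLU(), nn.Linear(64, 1))
traces = [torch.zeros_like(p) for p in value_net.parameters()]
gamma, lam = 0.99, 0.8

def stream_td_step(obs, reward, next_obs, done):
    """One streaming TD(λ) update: no replay buffer, no target network, no batch."""
    v = value_net(torch.as_tensor(obs, dtype=torch.float32)).squeeze()
    with torch.no_grad():
        v_next = 0.0 if done else value_net(
            torch.as_tensor(next_obs, dtype=torch.float32)).item()
    delta = reward + gamma * v_next - v.item()

    # Decay the traces and accumulate the gradient of the current prediction.
    value_net.zero_grad()
    v.backward()
    for z, p in zip(traces, value_net.parameters()):
        z.mul_(gamma * lam).add_(p.grad)

    # Overshooting-bounded update, then reset traces at episode boundaries.
    obgd_step(list(value_net.parameters()), traces, delta)
    if done:
        for z in traces:
            z.zero_()
```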