Streaming learning is an evolution of online learning that does not store past experience to train in mini-batches.

Why store past experience?

  • It works stably for offline learning
  • Memory replay occurs in natural intelligence (albeit not with raw experience)
  • Replaying samples multiple times extracts more information / is more sample efficient
  • Stability

“Stream barrier”: Stability was a blocker for streaming RL in the pre-deep-RL era (and in the deep-RL era, until it was solved).

Why solve streaming deep RL?


Traditional deep RL methods fail to overcome the stream barrier (below 1k average return on Hopper, nothing useful is learned).

Causes of the stream barrier

Learning with gradient descent can be unstable under nonstationarity (see catastrophic forgetting, loss of plasticity).
Mini-batch gradient descent somewhat mitigates this instability through i.i.d. sampling, but since by definition streaming learning cannot do that, it faces the most extreme form of nonstationarity.
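To make the contrast concrete, here is a minimal sketch (the function names and structure are illustrative, not taken from the talk or the paper): a replay-based learner de-correlates updates by sampling i.i.d. from a buffer, while a streaming learner consumes each transition once, in order, and then discards it.

```python
import random

def minibatch_update(buffer, batch_size, update_fn):
    # Replay: i.i.d. sampling from stored experience de-correlates consecutive
    # updates and approximates a stationary data distribution.
    batch = random.sample(buffer, batch_size)
    update_fn(batch)

def streaming_update(transition, update_fn):
    # Streaming: the latest transition is used exactly once and then discarded,
    # so consecutive updates are strongly correlated and highly nonstationary.
    update_fn([transition])
```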

Pathologies under nonstationary deep learning

Remedies:

  • Weight initialization, re-initialization, perturbation/search 1 2
  • Sparse representation and norm regularization
  • Normalization techniques
  • Skip connections, gradient clipping, modern activations
  • Adaptive optimizers
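As one concrete instance of the re-initialization/perturbation remedy, here is a shrink-and-perturb-style sketch (the shrink factor and noise scale are illustrative assumptions, not values from the cited papers): shrinking weights and injecting small noise restores plasticity without fully discarding what has been learned.

```python
import torch

def shrink_and_perturb_(params, shrink=0.8, noise_std=0.01):
    # Shrink weights toward zero and add small Gaussian noise; both constants
    # are illustrative and would need tuning per task.
    with torch.no_grad():
        for p in params:
            p.mul_(shrink).add_(torch.randn_like(p) * noise_std)

# Usage: shrink_and_perturb_(model.parameters()) at some chosen interval.
```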

Stream-X algorithms overcome the stream barrier by applying a handful of techniques on top of classic RL algorithms with eligibility traces (sketched below):

  • sparse initialization
  • LayerNorm
  • observation and reward normalization
  • a new optimizer that bounds the step size based on the update size (most important)
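A rough sketch of the first three ingredients, assuming PyTorch (the sparsity level, init scheme, and normalization details are my assumptions, not lifted from the paper); the fourth item, the optimizer, is sketched further below with ObGD.

```python
import torch
import torch.nn as nn

def sparse_init_(weight: torch.Tensor, sparsity: float = 0.9) -> None:
    # LeCun-style uniform init, then zero out a random fraction of the weights
    # (the 90% sparsity level here is an assumption).
    fan_in = weight.shape[1]
    bound = (1.0 / fan_in) ** 0.5
    nn.init.uniform_(weight, -bound, bound)
    with torch.no_grad():
        weight[torch.rand_like(weight) < sparsity] = 0.0

class StreamingMLP(nn.Module):
    # Small value/policy network with LayerNorm to keep activations well-scaled
    # under a nonstationary input stream.
    def __init__(self, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.ln1 = nn.LayerNorm(hidden)
        self.fc2 = nn.Linear(hidden, out_dim)
        sparse_init_(self.fc1.weight)
        sparse_init_(self.fc2.weight)

    def forward(self, x):
        return self.fc2(torch.relu(self.ln1(self.fc1(x))))

class RunningNorm:
    # Welford-style running mean/variance for normalizing observations (and,
    # with minor changes, for scaling rewards) one sample at a time.
    def __init__(self, dim: int):
        self.n, self.mean, self.m2 = 0, torch.zeros(dim), torch.zeros(dim)

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        self.n += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.n
        self.m2 = self.m2 + delta * (x - self.mean)
        std = (self.m2 / max(self.n - 1, 1)).sqrt().clamp_min(1e-8)
        return (x - self.mean) / std
```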

The main idea is to prevent overshooting updates on a single example. See the note for context and details.

Algorithm 3: Overshooting-bounded Gradient Descent (ObGD)

This algorithm avoids the need for multiple forward passes by directly scaling the step size according to a theoretical bound on the effective step size. The scaling keeps updates controlled without requiring backtracking.

The key idea is to scale down the step size based on:

  • How large the current error is (the TD error δ)
  • How “steep” the current update direction is (the magnitude of the eligibility trace, ‖z‖₁)
  • A safety factor (κ) that accounts for potential nonlinearity
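A minimal sketch of that idea for a single weight vector, based on my reading of the paper’s notation (δ = TD error, z = eligibility trace, κ = safety factor); the exact form of the bound below is a reconstruction, not a verbatim transcription:

```python
import torch

def obgd_step(w: torch.Tensor, z: torch.Tensor, delta: float,
              alpha: float = 1.0, kappa: float = 2.0) -> torch.Tensor:
    # Effective-step-size bound: large errors and large traces both shrink the step.
    delta_bar = max(abs(delta), 1.0)                       # error term, floored at 1
    m = alpha * kappa * delta_bar * z.abs().sum().item()   # proxy for the effective step size
    alpha_eff = alpha / max(m, 1.0)                        # only scale down, never up
    return w + alpha_eff * delta * z                       # TD(lambda)-style update
```

Because the step size only ever shrinks, a single outlier transition cannot produce an overshooting update, which is exactly the failure mode described above.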

Stream AC uses only a single sample per step, yet matches the performance of replay-based baselines with orders of magnitude less compute.

I think this freed-up compute is best spent on learning things in parallel (meta-learning). The continual-learning properties are perfect for training OMNI-EPIC agents.

Carmack asks: “The CL task looks very daunting … if you make a billion sequential updates, you are almost guaranteed to destroy the earlier things … thoughts …?”

My two cents: gradients update the meta-learned weights, while task-specific / contextual information from an agent’s lifetime is stored in the fast weights / context; the agent needs to learn to construct its own memory.

References

Rupam Mahmood, Streaming Deep RL, Upper Bound 2025
Streaming Deep Reinforcement Learning Finally Works

Footnotes

  1. Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning

  2. Loss of plasticity in deep continual learning