March 30, 2026 · 2 min read

When the World Changes Without Warning: Statistical Context Detection in Lifelong Reinforcement Learning

The promise of lifelong reinforcement learning rests on an assumption so fundamental it often goes unquestioned. We assume that when an agent encounters a new task, it knows that the task has changed. This assumption manifests in the form of task labels, environment identifiers, or explicit curriculum signals provided by human designers. Yet in truly autonomous systems operating in open worlds, no such oracle announces when the transition dynamics have shifted or when the reward function has been rewritten. The agent must detect these boundaries from the stream of its own experience, a problem far more subtle than the catastrophic forgetting that dominates the literature.

The paper Statistical Context Detection for Deep Lifelong Reinforcement Learning addresses this blind spot directly, proposing a framework where agents infer task boundaries through statistical hypothesis testing in latent representational spaces. The authors' approach leverages optimal transport theory and Wasserstein distances to detect changes in environment dynamics without requiring pretraining on labeled task distributions or assuming finite observation spaces. The result is a mechanism for online context detection that operates in tandem with policy optimization, enabling genuine lifelong learning without human supervision of task boundaries.
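To make the idea concrete, here is a minimal sliding-window sketch of Wasserstein-based change detection over latent samples. The paper's actual test statistic, learned encoder, and calibrated thresholds are not reproduced here; the window size, threshold, and 1-D latent features below are illustrative assumptions, using SciPy's `wasserstein_distance`.

```python
# Hedged sketch of statistical context detection: compare a reference window
# of latent features against the most recent window and flag a context change
# when the 1-D Wasserstein distance exceeds a threshold. The window size and
# threshold are invented for illustration, not taken from the paper.
import numpy as np
from scipy.stats import wasserstein_distance

def detect_context_change(latents, window=50, threshold=0.5):
    """Return the distance between the oldest and newest windows of latent
    samples, and whether it crosses the (assumed) detection threshold."""
    reference = latents[:window]
    recent = latents[-window:]
    dist = wasserstein_distance(reference, recent)
    return dist, dist > threshold

rng = np.random.default_rng(0)
# Simulated latents before a task switch (mean 0) and after it (mean 2).
before = rng.normal(0.0, 1.0, 100)
after = rng.normal(2.0, 1.0, 50)
dist, changed = detect_context_change(np.concatenate([before, after]))
print(changed)  # True: the shifted recent window triggers detection
```

In practice the threshold would be calibrated from the null distribution of distances within a single task, which is what makes the procedure a hypothesis test rather than an ad hoc heuristic.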

The Detection Problem in Dynamical Systems

Most existing approaches to context detection in reinforcement learning rely on monitoring shifts in the observation distribution. These methods assume that when a task changes, the visual or sensory input changes accordingly. However, this assumption fails in the most interesting cases. A robotic arm may encounter a new object with identical visual appearance but different mass distribution. A navigation agent may enter a region with identical landmarks but different physical constraints. In these scenarios, the transition function or reward structure changes while the observation distribution remains statistically similar.
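The failure mode can be shown with a toy pair of tasks: observations are drawn from the same distribution in both, while a hidden mass parameter changes the reward dynamics. Everything below (the environment, the masses, the tests used) is an invented illustration, not the paper's experimental setup. A two-sample KS test on observations sees nothing, while a distance on action-conditioned rewards separates the tasks cleanly.

```python
# Hedged illustration: two tasks with statistically identical observations but
# different hidden dynamics. Observation-only detection fails; a statistic on
# the action-reward stream succeeds. All quantities here are made up.
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(42)

def rollout(mass, n=500):
    obs = rng.normal(0.0, 1.0, n)       # identical "appearance" in both tasks
    actions = rng.uniform(-1.0, 1.0, n)
    rewards = -np.abs(actions * mass)   # dynamics depend on the hidden mass
    return obs, rewards

obs_a, rew_a = rollout(mass=1.0)
obs_b, rew_b = rollout(mass=5.0)        # same look, five times the mass

_, p_obs = ks_2samp(obs_a, obs_b)       # high p-value: observations look alike
rew_shift = wasserstein_distance(rew_a, rew_b)  # large: dynamics differ
print(p_obs, rew_shift)
```

The observation test cannot distinguish the tasks because the change is only visible through the consequences of the agent's own actions, which is precisely the regime the paper targets.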

The authors identify this as the core challenge. Changes in transition or reward functions can only be detected through their interaction with the agent's actions. Unlike supervised continual learning, where the data distribution is exogenous, reinforcement learning involves a closed loop where the policy itself determines which state transitions are sampled. This coupling creates a chicken-and-egg problem: to detect that the environment has changed, the agent needs to behave differently, but to behave differently, it needs to know the environment has changed.

The proposed solution moves the detection problem from the high dimensional observation space to a compressed latent action reward space. By encoding sequences of actions and their resulting rewards into a lower dimensional representation, the method captures the essential dynamics of the environment rather than its superficial appearance. This sidesteps the curse of dimensionality while maintaining sensitivity to changes in the underlying dynamics.
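A minimal sketch of that compression step, with one loud caveat: the paper presumably learns its encoder, whereas here a fixed random projection stands in for it, and the window size and latent dimension are arbitrary choices made purely for illustration.

```python
# Hedged sketch: stack (action, reward) pairs into fixed-length windows and
# project each window into a small latent vector. The random projection is a
# stand-in for a learned encoder and is not the paper's method.
import numpy as np

def encode_windows(actions, rewards, window=10, latent_dim=4, seed=0):
    """Compress an action-reward stream into per-window latent vectors."""
    pairs = np.stack([actions, rewards], axis=1)          # shape (T, 2)
    n = len(pairs) // window
    windows = pairs[: n * window].reshape(n, window * 2)  # (n, 2*window)
    proj = np.random.default_rng(seed).normal(size=(window * 2, latent_dim))
    return windows @ proj / np.sqrt(window * 2)           # (n, latent_dim)

acts = np.random.default_rng(1).uniform(-1, 1, 200)
rews = -np.abs(acts)
latents = encode_windows(acts, rews)
print(latents.shape)  # (20, 4)
```

The latent vectors produced this way are what the distance test would operate on, so the detector compares distributions of dynamics, not distributions of pixels.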