year: 2025/10
paper: toward-agents-that-reason-about-their-computation
website:
code: https://github.com/AdrianOrenstein/toward-agents-reasoning-about-compute
connections: DQN, AMII, options framework, compute efficiency, big world hypothesis
Motivation
Long-lived agents in complex environments can’t afford fixed computational processes; they need to adapt resource usage to the situation.
Classical RL agents don’t get more compute-efficient as they improve. Their computational processes (sensing, acting, learning, planning) are fixed at design time, and the agent never sees the cost. A DQN runs the same forward pass every 5 frames whether the screen is empty or chaotic. Humans don’t work this way: a chess beginner deliberates every move, a grandmaster coasts on pattern recognition and only thinks hard at critical moments.
This paper asks: what if you made it part of the RL objective instead?
Coding agents arguably already have this: verbosity penalties at train time, plus in-context learning (ICL) at test time.
Takeaway
Frame compute allocation as part of the objective and the agent figures it out from experience.
The fact that performance improves with fewer decisions shows fixed-rate computation was wasteful to begin with.
"Compute DQN"
Extend DQN with the options framework: the agent picks from A × D, where D is a set of repeat durations. So instead of just choosing an action, it chooses “action + how many steps to repeat it before thinking again”. Standard DQN acts every 5 frames (12 Hz). Compute DQN can go as low as every 40 frames (1.5 Hz).
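A minimal sketch of the augmented option space: each base action is paired with a repeat duration, so the Q-head outputs |A| × |D| values. The 18-action set and the intermediate durations {10, 20} are assumptions; the source only states the 5-frame and 40-frame endpoints.

```python
# Hypothetical option space for Compute DQN: (action, duration) pairs.
from itertools import product

ACTIONS = list(range(18))        # full Atari action set (assumption)
DURATIONS = [5, 10, 20, 40]      # frames to repeat; endpoints from the paper,
                                 # intermediate values are assumptions

# Flat option list; a Q-network would output one value per entry.
OPTIONS = list(product(ACTIONS, DURATIONS))

def decode(option_index: int) -> tuple[int, int]:
    """Map a flat Q-head index back to (action, duration)."""
    return OPTIONS[option_index]
```

Selecting `decode(argmax(q_values))` then repeats the chosen action for the chosen number of frames before the next decision.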
A per-decision compute cost c_t is subtracted from the reward whenever the agent makes a decision. During option execution (repeating the action), c_t = 0. The reward becomes r_t − c_t, framing compute usage as part of the optimization problem.
c is set per-game to the average reward per step achieved by standard DQN, so the tradeoff between acting and saving compute is on a similar scale to the task reward. Both agents are trained with the same compute budget: 40M decisions.
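The shaping above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the per-game cost value shown is a made-up placeholder standing in for "average reward per step of a trained standard DQN".

```python
# Compute-cost reward shaping: charge c only on decision steps,
# nothing while an option is still executing.

def shaped_reward(env_reward: float, is_decision_step: bool, c: float) -> float:
    """Return r_t - c_t, where c_t = c on decision steps and 0 otherwise."""
    cost = c if is_decision_step else 0.0
    return env_reward - cost

# Hypothetical per-game cost (placeholder value, not from the paper).
c_game = 0.004
r_decide = shaped_reward(1.0, is_decision_step=True, c=c_game)   # pays the cost
r_repeat = shaped_reward(1.0, is_decision_step=False, c=c_game)  # free repeat
```

Because repeating an action is free while deciding is not, longer durations become attractive exactly when frequent re-decisions don't buy extra task reward.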
Results
Compute DQN outperforms DQN on 75% of 46 Atari games under equal training compute, while using 3.4x fewer decisions on average (3.6 Hz vs 12 Hz).
The learned strategies are game-specific and situation-adaptive
Pong: Coasts when the ball is heading toward the opponent, ramps up decisions right before hitting the ball.
Breakout: Low decision rate early on when the ball is slow, increases as ball speed and difficulty ramp up. Drops when the ball is stuck behind the wall.
Asterix: Spikes during dense collectible waves, drops between waves. Decision rate scales with object speed.
Varying c traces out a Pareto frontier: higher cost → fewer decisions with some performance loss; lower cost → the agent converges toward DQN's 12 Hz rate.