year: 2025/10
paper: toward-agents-that-reason-about-their-computation
website:
code: https://github.com/AdrianOrenstein/toward-agents-reasoning-about-compute
connections: DQN, AMII, options framework, compute efficiency, big world hypothesis
Motivation
Long-lived agents in complex environments can’t afford fixed computational processes; they need to adapt resource usage to the situation.
Classical RL agents don’t get more compute-efficient as they improve. Their computational processes (sensing, acting, learning, planning) are fixed at design time, and the agent never sees the cost. A DQN runs the same forward pass every 5 frames whether the screen is empty or chaotic. Humans don’t work this way: a chess beginner deliberates every move, a grandmaster coasts on pattern recognition and only thinks hard at critical moments.
This paper asks: what if you made it part of the RL objective instead?
Coding agents arguably already have this: verbosity penalties at train time, plus in-context learning (ICL) at test time.
Takeaway
Frame compute allocation as part of the objective and the agent figures it out from experience.
The fact that performance improves with fewer decisions shows fixed-rate computation was wasteful to begin with.
"Compute DQN"
Extend DQN with the options framework: the agent picks from A × D, where D is a set of repeat durations. So instead of just choosing an action, it chooses “action + how many steps to repeat it before thinking again”. Standard DQN acts every 5 frames (12 Hz). Compute DQN can go as low as every 40 frames (1.5 Hz).
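A minimal sketch of the augmented option space: each base action is paired with a repeat duration, so the Q-head outputs |A| × |D| values. The 18-action set and the intermediate durations {10, 20} are assumptions; the source only states the 5-frame and 40-frame endpoints.

```python
# Hypothetical option space for Compute DQN: (action, duration) pairs.
from itertools import product

ACTIONS = list(range(18))        # full Atari action set (assumption)
DURATIONS = [5, 10, 20, 40]      # frames to repeat; endpoints from the paper,
                                 # intermediate values are assumptions

# Flat option list; a Q-network would output one value per entry.
OPTIONS = list(product(ACTIONS, DURATIONS))

def decode(option_index: int) -> tuple[int, int]:
    """Map a flat Q-head index back to (action, duration)."""
    return OPTIONS[option_index]
```

Selecting `decode(argmax(q_values))` then repeats the chosen action for the chosen number of frames before the next decision.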
A per-decision compute cost c_t is subtracted from the reward whenever the agent makes a decision. During option execution (repeating the action), c_t = 0. The reward becomes r_t − c_t, framing compute usage as part of the optimization problem.
c is set per-game to the average reward per step achieved by standard DQN, so the tradeoff between acting and saving compute is on a similar scale to the task reward. Both agents are trained with the same compute budget: 40M decisions.
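The shaping above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the per-game cost value shown is a made-up placeholder standing in for "average reward per step of a trained standard DQN".

```python
# Compute-cost reward shaping: charge c only on decision steps,
# nothing while an option is still executing.

def shaped_reward(env_reward: float, is_decision_step: bool, c: float) -> float:
    """Return r_t - c_t, where c_t = c on decision steps and 0 otherwise."""
    cost = c if is_decision_step else 0.0
    return env_reward - cost

# Hypothetical per-game cost (placeholder value, not from the paper).
c_game = 0.004
r_decide = shaped_reward(1.0, is_decision_step=True, c=c_game)   # pays the cost
r_repeat = shaped_reward(1.0, is_decision_step=False, c=c_game)  # free repeat
```

Because repeating an action is free while deciding is not, longer durations become attractive exactly when frequent re-decisions don't buy extra task reward.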
Results
Compute DQN outperforms DQN on 75% of 46 Atari games under equal training compute, while using 3.4x fewer decisions on average (3.6 Hz vs 12 Hz).
The learned strategies are game-specific and situation-adaptive
Pong: Coasts when the ball is heading toward the opponent, ramps up decisions right before hitting the ball.
Breakout: Low decision rate early on when the ball is slow, increases as ball speed and difficulty ramp up. Drops when the ball is stuck behind the wall.
Asterix: Spikes during dense collectible waves, drops between waves. Decision rate scales with object speed.
Varying c traces out a Pareto frontier: higher cost → fewer decisions with some performance loss; lower cost → the agent converges toward DQN's 12 Hz rate.