Main

Humans and many other animals can adapt rapidly to new information and changing environments. Such adaptation often involves spending extended and variable periods of time contemplating possible futures before taking an action1,2. For example, as we prepare to go to work, temporary roadworks might require us to adapt and mentally review the available routes. Because thinking does not involve the acquisition of new information or interactions with the environment, its ubiquity for human decision-making is perhaps surprising. However, thinking allows us to perform more computations with limited information, which can improve performance on downstream tasks3. Because physically interacting with the environment can incur unnecessary risk or consume time and other resources, the benefits of planning often make up for the time spent on the planning process itself.

Despite a wealth of cognitive science research on the algorithmic underpinnings of planning1,4,5,6, little is known about the underlying neural mechanisms. This question has been difficult to address because of a scarcity of intracortical recordings during planning and contextual adaptation. However, recent work includes large-scale neural recordings during increasingly complex behaviors from the hippocampus and prefrontal cortex (PFC), regions known to be important for memory, decision-making and adaptation7,8,9,10,11,12,13. These studies have demonstrated the importance of the PFC for generalizing abstract task structure across contexts10,11. Additionally, it has been suggested that planning could be mediated by hippocampal forward replays5,7,8,14,15,16,17. Despite these preliminary theories, it is unclear how hippocampal replays could be integrated within the dynamics of downstream circuits to implement planning18.

While prevailing theories of learning from replays generally rely on dopamine-mediated synaptic plasticity5,19,20, it is unclear whether this process could operate sufficiently quickly to also inform online decision-making. It has recently been suggested that some forms of fast adaptation could result from recurrent meta-reinforcement learning (meta-RL)10,21,22, where adaptation to new tasks is directly implemented by the recurrent dynamics of the prefrontal network. The dynamics themselves are learned through gradual changes in synaptic weights, which are modified over many different environments and tasks in a slow process of RL. Such recurrent neural network (RNN)-based agents can rapidly adapt to a new task or environment with fixed weights after training by integrating their experiences into the hidden state of the RNN10,21,22,23,24. However, previous models are generally only capable of making instantaneous decisions and cannot improve their choices by ‘thinking’ before taking an action.

In this work, we propose a model that similarly combines slow synaptic learning with fast adaptation through recurrent dynamics in the prefrontal network. In contrast to previous work, however, this recurrent meta-learner can choose to momentarily forgo physical interactions with the environment and instead think (refs. 25,26). This process of thinking is formalized as the simulation of sequences of imagined actions, sampled from the policy of the agent itself, which we refer to as ‘rollouts’ (Fig. 1a). We introduce a flexible maze navigation task to study the relationship between the behavior of such RL agents and that of humans (Fig. 1b). RL agents trained on this task learn to use rollouts to improve their policy and selectively trigger rollouts in situations where humans also spend more time deliberating.

Fig. 1: Task and model schematics.
figure 1

a, The RL agent consisted of an RNN, which received information about the environment and executed actions in response. The primary output of the agent was a policy from which the next action was sampled. This action could be to either move in the environment in a given direction (up, down, left or right) or think by using an internal world model to simulate a possible future trajectory (a rollout). The agent was trained to maximize its average reward per episode and to predict (1) the upcoming state; (2) the current goal location; and (3) the value of the current state. When the agent decided to plan, the first two predictors were used in an open-loop planning process, where the agent iteratively sampled imagined actions and predicted what the resulting state would be and whether the goal had been (virtually) reached. The output of this planning process was appended to the agent’s input on the subsequent time step (details in text). A physical action was assumed to take 400 ms and a rollout was assumed to take 120 ms (ref. 36). b, Schematic illustrating the dynamic maze task. In each episode lasting T = 20 s, a maze and a goal location were randomly sampled. Each time the goal was reached, the subject received a reward and was subsequently teleported to a new random location, from which it could return to the goal to receive more reward. The maze had periodic boundaries, meaning that subjects could exit one side of the maze to appear at the opposite side. c, Schematic illustrating how policy rollouts can improve performance by altering the momentary policy. An agent might perform a policy rollout leading to low value (top; black), which would decrease the probability of physically performing the corresponding sequence of actions. Conversely, a rollout leading to high value (bottom; orange) would increase the probability of the corresponding action sequence. Notably, these policy changes occur at the level of network dynamics rather than parameter updates (Supplementary Note 1).

Source data

We draw explicit parallels between the model rollouts and hippocampal replays through reanalyses of recent hippocampal recordings from rats performing a similar maze task7, where the content and behavioral correlates of hippocampal replays have a striking resemblance to the policy rollouts in our computational model. Our work, thus, addresses two key questions from previous studies of hippocampal replay and planning. First, we show that a recurrent network can meta-learn when to plan instead of having to precompute a ‘plan’ to decide whether to use it5 (Methods).

In this setting, the agent’s probability of choosing to perform a rollout when encountering a new state, π(rollout), was a monotonically increasing function of human thinking time in the same situation (Fig. 2e). The Pearson correlation between these two quantities was r = 0.186 ± 0.007 (mean ± s.e.m. across participants), which was significantly higher than expected by chance (Fig. 2f; chance level, r = 0 ± 0.004). An above-chance correlation between thinking times and π(rollout) of r = 0.070 ± 0.006 persisted after conditioning on the momentary distance to goal (Fig. 2f, ‘Residual’), which was also correlated with thinking times (r = 0.272 ± 0.006). The similarity between planning in humans and RL agents thus extends beyond this salient feature of the task, including an increased tendency to plan on the first step of a trial (Extended Data Fig. 3).

In addition to the similarities during the exploitation phase, a significant correlation was observed between human thinking time and π(rollout) during exploration (r = 0.098 ± 0.008). In this phase, both humans and RL agents spent more time thinking during later stages of exploration (Extended Data Fig. 7). Model rollouts during exploration correspond to planning toward an imagined goal from the posterior over goal locations, which becomes narrower as more states are explored (Extended Data Fig. 7). This finding suggests that humans may similarly engage in increasingly goal-directed behavior as the goal posterior becomes narrower over the course of exploration. In summary, a meta-RL agent, endowed with the ability to perform rollouts, learned to do so in situations similar to when humans appear to plan. This provides a putative normative explanation for the variability in human thinking times observed in the dynamic maze task.

Rollouts improve the policy of the RL agent

In the previous section, we saw that an RL agent can learn to use policy rollouts as part of its decision-making process and that the timing and number of rollouts correlate with variability in human thinking times. We next aimed to understand why the agent chooses to perform rollouts and how they guide behavior. We considered the agent right after it first located the goal in each episode (that is, at the first iteration of trial 2; Fig. 1b) and forced it to perform a predefined number of rollouts, which we varied. We then counted the number of actions needed to return to the goal while preventing any further rollouts during this return phase (Methods).

The average number of actions needed to reach the goal decreased monotonically as the number of forced rollouts increased up to at least 15 rollouts (Fig. 3a). To confirm that this performance improvement depended on the information contained in the policy rollouts rather than being driven by additional iterations of recurrent network dynamics, we repeated the analysis with no feedback from the rollout to the RNN and found a much weaker effect (Fig. 3a, dashed gray line). The increase in performance with rollout number was accompanied by a concomitant decrease in policy entropy (Fig. 3b). Thus, performing more rollouts both improved performance and reduced uncertainty (Methods). These findings confirm that the agent successfully learned to use policy rollouts to optimize its future behavior. However, the question remains whether this policy improvement is appropriately balanced against the temporal opportunity cost of performing a rollout. In general, rollouts are beneficial in situations where the policy improvement resulting from a rollout is greater than the 120-ms temporal cost of performing the rollout. Explicitly forbidding rollouts (Methods) impaired the performance of the agent (Fig. 3c), suggesting that it had successfully learned to trade off the costs and benefits of rollouts14,25,26. Randomizing the timing of the rollouts while preserving their number also led to a performance drop (Fig. 3c), confirming that the RL agent used rollouts specifically when they improved performance.

Fig. 3: Rollouts improve the network policy.
figure 3

a, Average trial 2 performance as a function of the number of rollouts enforced at the beginning of the trial. Performance was quantified as the number of steps needed to reach the goal in the absence of further rollouts. The solid gray line (‘Optimal’) indicates optimal performance and the dashed gray line (‘Ctrl’) indicates a control simulation where the indicated number of rollouts was performed but with the feedback to the RNN from the rollout channels set to zero. The performance gap from the nonperturbed agent confirms that the performance improvement with increasing numbers of rollouts is dependent on the information contained in the rollouts and not just additional iterations of recurrent network dynamics. b, Policy entropy as a function of the number of rollouts enforced at the beginning of trial 2. The entropy was computed after renormalizing the policy over the four physical actions. The horizontal gray line indicates the entropy of a uniform policy. c, Left, original performance of the RL agent. Center, performance when renormalizing the policy over physical actions to prevent any rollouts. Right, performance after shuffling the timing of the rollouts while keeping the number of rollouts constant. Performance was quantified as the average number of rewards collected per episode. The dashed lines indicate the five individual RL agents and the solid line indicates the mean and s.e.m. across agents. Avg., average. d, Schematic showing an example of a successful (dark blue) and an unsuccessful (light blue) rollout from the same physical location (blue circle). The black cross indicates the goal location (not visible to the agent or human participants). e, Probability of taking the first simulated action of the rollout, \({\hat{a}}_{1}\), before (\({\pi }^{{{{\rm{pre}}}}}({\hat{a}}_{1})\)) and after (\({\pi }^{{{{\rm{post}}}}}({\hat{a}}_{1})\)) the rollout. This was evaluated separately for successful (left) and unsuccessful (right) rollouts. \({\pi }^{{{{\rm{pre}}}}}({\hat{a}}_{1})\) was above chance (gray line) in both cases and increased for successful rollouts, while it decreased for unsuccessful rollouts. Bars and error bars indicate the mean and s.e.m. across five agents (gray dots). The magnitude of the change in \(\pi ({\hat{a}}_{1})\) for successful (Succ.) and unsuccessful (Unsucc.) rollouts depended on the planning horizon (Extended Data Fig. 4).

Source data

To further dissect the effect of rollouts on agent behavior, we classified each rollout, \(\hat{\tau }\) (a sequence \(\{{\hat{a}}_{1},{\hat{a}}_{2},\ldots \}\) of rolled-out actions), as being either ‘successful’ if it reached the goal according to the agent’s internal world model or ‘unsuccessful’ if it did not (Fig. 3d). We hypothesized that the policy improvement observed in Fig. 3a could arise from upregulating the probability of following a successful rollout and downregulating the probability of following an unsuccessful rollout. To test this hypothesis, we enforced a single rollout after the agent first found the reward and analyzed the effect of this rollout on the policy, separating the analysis by successful and unsuccessful rollouts. Importantly, we could compare the causal effect of rollout success by matching the history of the agent and performing rejection sampling from the rollout process until both a successful and an unsuccessful rollout had occurred (Methods). Specifically, we asked how a rollout affected the probability of taking the first rolled-out action, \({\hat{a}}_{1}\), by comparing the value of this probability before (\({\pi }^{{{{\rm{pre}}}}}({\hat{a}}_{1})\)) and after (\({\pi }^{{{{\rm{post}}}}}({\hat{a}}_{1})\)) the rollout. \({\pi }^{{{{\rm{pre}}}}}({\hat{a}}_{1})\) was slightly higher for successful rollouts than for unsuccessful rollouts, with both types of rollouts exhibiting a substantially higher-than-chance probability—a consequence of the model rollouts being ‘on-policy’ (Fig. 3e). However, while successful rollouts increased \(\pi ({\hat{a}}_{1})\), unsuccessful rollouts decreased \(\pi ({\hat{a}}_{1})\) (Fig. 3e). This finding demonstrates that the agent combines the spatial information of a rollout with knowledge about its consequences, based on its internal world model, to guide future behavior (Supplementary Note 1).

Hippocampal replays resemble policy rollouts

In our computational model, we designed policy rollouts to take the form of spatial trajectories that the agent could subsequently follow and to occur only when the agent was stationary. These two properties are also important signatures of forward hippocampal replays—patterns of neural activity observed using electrophysiological recordings from rodents during spatial navigation7,8,9. We, therefore, investigated whether forward replay in biological agents could serve a similar function during decision-making to policy rollouts in the RL agent.

To this end, we reanalyzed a recently published hippocampal dataset from rats navigating a dynamic maze similar to the task in Fig. 1b (ref. 7). Animals had to repeatedly return to an initially unknown ‘home’ location, akin to the goal in our task (Extended Data Fig. 8). Both this home location and the configuration of the maze changed between sessions. The rats could not be ‘teleported’ between trials as in our task; instead, they had to navigate to an unknown rewarded ‘away’ location selected at random after each home trial. These away trials served as a useful control because the animals did not know the location of the rewarded well at the beginning of the trial. Unlike the human data (Fig. 2c), we found no correlation between the initial distance to goal of the animal and time spent at the previously rewarded location (Extended Data Fig. 9). We hypothesize that this is because (1) the animals had to spend time consuming reward before they could continue and (2) a delay was imposed between reward consumption and the next reward becoming available. These periods could potentially be used for planning without incurring a substantial temporal opportunity cost, unlike the human task that explicitly enforced a trade-off between the time spent thinking and acting.

We, thus, focused on the spatiotemporal content of hippocampal replays following previous hypotheses that they could form a neural substrate of planning7,8,15. We studied replay events detected in hippocampal recordings made with tetrode drives during the maze task (n ∈ [187, 333] simultaneously recorded neurons per session; Extended Data Fig. 8c). To detect replays, we followed Widloski and Foster7 and first trained a Bayesian decoder to estimate the animal’s position on a discretized grid from the neural data during epochs when the animal was moving. We then applied this decoder during epochs when the animal was stationary at a reward location before initiating a new trial and defined replays as consecutive sequences of at least three adjacent decoded grid locations (Fig. 4a and Extended Data Fig. 8; see Methods for details).

Fig. 4: Hippocampal replays resemble model rollouts.
figure 4

a, Illustration of experimental task structure and example replays7. Each episode had a different wall configuration and a randomly sampled home location (cross). Between each home trial, the animal had to move to an away location, which was sampled anew on each trial (black circles). Colored lines indicate example replay trajectories originating at the blue dots. Replays were detected during the stationary periods at the away locations before returning to the home location and classified according to whether they reached the home location (dark-blue versus light-blue lines). b, Fraction of replay transitions that pass through a wall in the experimental (black) and model (blue) data. Control values indicate the fraction of wall crossings in resampled environments with different wall configurations. Dashed lines indicate individual biological sessions (n = 37) or RL agents (n = 5) and solid lines indicate the mean and s.e.m. across sessions or RL agents. c, Fraction of replays that pass through the goal location in experimental (black) and model (blue) data. Control values indicate the average fraction of replays passing through a randomly sampled nongoal location (Methods). Dashed and solid lines are as in b. There was no effect for the away trials, where the goal was unknown (Extended Data Fig. 9). d, Probability of taking the first replayed action, \(P({a}_{1}={\hat{a}}_{1})\), for successful (Succ) and unsuccessful (Un) replays during home trials (left; black) and away trials (center; gray) and in the RL agent (right; blue). Bars and error bars indicate the mean and s.e.m. across sessions or RL agents (gray dots; n = 37 and n = 5, respectively). e, Over-representation of successful replays during trials with at least three replays in the experimental data (left) and RL agents (right). The over-representation increased with replay number, an effect not seen in the away trials (Extended Data Fig. 9). Over-representation was computed by dividing the success frequency by a reference frequency computed for randomly sampled alternative hypothetical goal locations. Bars and error bars indicate the mean and s.e.m. across replays pooled from all animals (left) or standard error across five RL agents (right; dashed lines).

Source data

Similar to previous work7, we found that the hippocampal replays avoided passing through walls to a greater extent than expected by chance (Fig. 4b; P < 0.001, permutation test). This finding suggests that hippocampal replays are shaped by a rapidly updating internal model of the environment, similar to how forward rollouts in the RL agent are shaped by its internal world model (Fig. 1a). Additionally, the goal location was over-represented in the hippocampal replays, consistent with the assumption of on-policy rollouts in the RL agent (Fig. 4c; P < 0.001, permutation test)7.

Inspired by our findings in the RL agent, we investigated whether a replayed action was more likely to be taken by the animal if the replay was successful than if it was unsuccessful. Here, we defined a successful replay as one that reached the goal location without passing through a wall (Fig. 4a). Consistent with the RL model, the first simulated action in biological replays agreed with the next physical action more often for successful replays than for unsuccessful replays (Fig. 4d, black; P < 0.001, permutation test). Such an effect was not observed in the away trials (Fig. 4d, gray; P = 0.129, permutation test), where the animals had no knowledge of the reward location and therefore could not know what constituted a successful replay. These findings are consistent with the hypothesis that successful replays should increase the probability of taking the replayed action, while unsuccessful replays should decrease this probability.

In the RL agent, we had direct access to the momentary policy and could quantify the causal effect of a replay on behavior (Fig. 3e). In the biological circuit, it is unknown whether the increased probability of following the first action of a successful replay is because the replay altered the policy (as in the RL agent) or because the baseline policy was already more likely to reach the goal before the replay. To circumvent this confound, we analyzed consecutive replays while the animal remained stationary. If our hypotheses hold that (1) hippocampal replays resemble on-policy rollouts of an imagined action sequence and (2) performing a replay improves the policy, then consecutive replays should become increasingly successful even in the absence of any behavior between the replays.

To test this prediction, we considered trials where the animal performed a sequence of at least three replays at the away location before moving to the home location. We then quantified the fraction of replays that were successful as a function of the replay index within the sequence, after regressing out the effect of time (Methods)39. We expressed this quantity as the degree to which the true goal was over-represented in the replay events by dividing the fraction of successful replays by a baseline calculated from the remaining nongoal locations, such that an over-representation of 1 implies that a replay was no more likely to be successful than expected by chance. This over-representation increased with each consecutive replay during the home trials (Fig. 4e, left) and both the second and third replays exhibited substantially higher over-representation than the first replay (P = 0.068 and P = 0.009, respectively; permutation test; Methods). Such an effect was not seen during the away trials, where the rewarded location was not known to the animal (Extended Data Fig. 9).

These findings are consistent with a theory in which replays represent on-policy rollouts that are used to iteratively refine the agent’s policy, which in turn improves the quality of future replays—a phenomenon also observed in the RL agent (Fig. 4e, right). In the RL agent, this effect could arise in part because the agent is less likely to perform an additional rollout after a successful rollout than after an unsuccessful rollout (Extended Data Fig. 10). To eliminate this confound, we drew two samples from the policy each time the agent chose to perform a rollout and we used one sample to update the hidden state of the agent, while the second sample was used to compute the goal over-representation (Methods). Such decoupling is not feasible in the experimental data because we cannot read out the ‘policy’ of the animal. This leaves open the possibility that the increased goal over-representation with consecutive biological replays is in part because of a reduced probability of performing an additional replay after a successful replay. However, we note that (1) the rodent task was not a ‘reaction time task’, because a delay of 5–15 s was imposed between the end of reward consumption and the next reward becoming available, which makes a causal effect of replay success on the total number of replays less likely; and (2) even if such an effect did exist, it would still be consistent with a theory in which hippocampal replays guide planning.

Discussion

We developed a theory of planning in the prefrontal–hippocampal network, implemented as an RNN model and instantiated in a spatial navigation task requiring multistep planning (Fig. 1). This model consists of a recurrent meta-RL agent augmented with the ability to plan using policy rollouts and it explains the structure observed in human behavior (Fig. 2). Our results suggest that mental rollouts could play a major role in the striking human ability to adapt rapidly to new tasks by facilitating behavioral optimization without the potential cost of executing suboptimal actions. Because mental simulation is generally faster and less risky than executing physical actions40, this can improve overall performance despite the temporal opportunity cost of planning (Fig. 3)14,25.

Our theory also suggests a role for hippocampal replays during sequential decision-making. A reanalysis of rat hippocampal recordings during a navigation task showed that patterns of hippocampal replays and their relationship to behavior resembled those of rollouts in our model (Fig. 4). These results suggest that hippocampal forward replays could be a core component of planning and that the mechanistic insights derived from our model could generalize to biological circuits. In particular, we hypothesize that forward replays should affect subsequent behavior differently depending on whether they lead to high-value or low-value states (Fig. 3)13, consistent with previous models where replays update state-action values to improve future behavior5. We suggest that forward replays could implement planning through feedback to the PFC, which drives a ‘hidden state optimization’ reminiscent of recent models of motor preparation (Supplementary Note 1)41. This model-based policy refinement differs from prior work that posited an arbitration between model-free and model-based policies computed separately42,43. Instead, we hypothesize that model-based computations iteratively update a single policy that can be used for decision-making at different stages of refinement. This is consistent with previous work proposing that model-based computations can iteratively refine values or world models learned through model-free mechanisms5,44,45.

Neural mechanisms of planning and decision-making

Our model raises several hypotheses about neural dynamics in the hippocampus and PFC and how these dynamics affect behavior. One is that hippocampal replays should causally affect animal behavior, as also suggested in previous work7,8,15. This has been difficult to test experimentally due to the confound of how the behavioral intentions and history of an animal affect replay content15. Perhaps more interestingly, we predict that hippocampal replays should directly affect PFC representations, consistent with previous work showing coordinated activity between the hippocampus and PFC during sharp-wave ripples12. Specifically, PFC activity should change to make replayed actions more likely if the replayed trajectory is better than expected and less likely if worse than expected, reminiscent of actor–critic algorithms in the RL literature (Supplementary Note 1). These predictions can be investigated in experiments that record neural activity simultaneously from the hippocampus and PFC, where both the timing and the qualitative change in PFC representations can be related to hippocampal replays. This would be most natural in rodent experiments with electrophysiology, although human experiments using magnetoencephalography for replay detection could also investigate the effect of replays on cortical representations and behavior35,36,46.

While we propose a role of hippocampal replays in shaping immediate behavior through recurrent network dynamics, this is compatible with replays also having other functions over longer timescales, such as memory consolidation47,48 or dopamine-driven synaptic plasticity19,20. Additionally, we considered only the case of local forward replays and showed that they can be used to drive improved decision-making. These replays will have high ‘need’ according to the theory of Mattar and Daw5 because they start at the current agent location and visit likely upcoming states. They should similarly have a high ‘gain’ because rollouts lead to an increase in expected future reward (Fig. 3a). However, our choice to focus on local replays in this work does not imply that nonlocal or reverse replays could not play a similar role. Backward planning from a goal location is, for example, more efficient in environments where the branching factor is larger in the forward than the reverse direction and branching rollouts were shown to improve performance in previous RL models26.

Hippocampal replay and theta sequences

This work focuses on the trade-off between thinking and acting, investigating internal computations that can improve decision-making without additional physical experience. This is the phenomenon we investigated in human behavior, where the analyses focused on stopping times. It is also an explicit feature of the RL agent, which chooses between acting and performing a rollout rather than doing both simultaneously. A putative neural correlate of such planning in the absence of behavior is hippocampal replay, given the ubiquitous finding that it occurs primarily when animals are stationary15 and its hypothesized role in decision-making5,7,8,15. While some have challenged these ideas9,49,50, our analyses show that a replay-like mechanism could, in principle, improve decision-making in a manner consistent with human behavior.

Another phenomenon suggested to play a role in decision-making is that of hippocampal theta sequences17,51,52. Theta sequences typically represent states in front of the animal17 and are affected by the current goal location52, similar to our analyses of hippocampal replays (Fig. 4). However, because theta sequences predominantly occur during active behavior, they are potentially less relevant than hippocampal replays for the trade-off between acting and thinking. Nonetheless, our RL model also suggests a potential mechanism by which short theta sequences could guide behavior by providing recurrent feedback to cortical decision-making systems about immediately upcoming states and decision points. Under this hypothesis, hippocampal replays could support longer-term planning during stationary periods while theta sequences would update these plans on the go through short-term predictions, with both types of sequences operating through recurrent feedback to cortex.

Why do we spend time thinking?

Despite our results showing that humans and RL agents make extensive use of planning, mental simulation does not generate fundamentally new information. In theory, it should therefore be possible to make equally good ‘reflexive’ decisions given enough experience and computational power. However, previous work showed that imagination can affect human choices53 and that having more time to process the available information can improve decisions54. This raises questions about the computational mechanisms underlying this process and why decision-making often takes time instead of being instantaneous. One reason could be that our decision-making system is capacity limited and lacks the computational power to instantly generate the optimal policy. Nonetheless, both humans and RL agents adapted rapidly to each new maze in our task (Fig. 2), and such rapid learning could result from the use of generic planning algorithms as a form of ‘canonical computation’ that generalizes across tasks. When combined with a new task-specific transition function learned from relatively little experience or inferred from sensory inputs, planning would facilitate data-efficient RL by trading off processing time for a better policy55. This is in contrast to our current model, which had to learn from scratch both the structure of the environment and how to use rollouts to shape its behavior. Importantly, planning as a canonical computation could be generalized not only to other navigation tasks but also to other domains, such as compositional reasoning and sequence learning, where replay was recently demonstrated in humans35,56.

Methods

Software

All models were trained in Julia version 1.7 using Flux and Zygote for automatic differentiation57. Human behavioral experiments were written in OCaml 5.0, with the front end transpiled to JavaScript for running in the participants’ browsers. All analyses of the models and human data were performed in Julia version 1.8. All analyses of hippocampal replay data were performed in Python 3.8.

Statistics

Unless otherwise stated, all plots are reported as the mean and s.e.m. across human participants (n = 94), independently trained RL agents (n = 5) or experimental sessions in rodents (n = 37).

Environment

We generated mazes using Algorithm 1.

Algorithm 1

Maze-generating algorithm

Mazes without periodic boundaries (Extended Data Fig. 1) were generated in the same way, except that states were not considered neighbors across a boundary, and four walls were removed instead of three walls in the last step of the algorithm to approximately match the distributions of shortest paths between pairs of states (Extended Data Fig. 1h).
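The full pseudocode of Algorithm 1 is provided with the original publication. For illustration, the following is a minimal Python sketch of one way to implement the procedure described above: a depth-first-search spanning tree on a 4 × 4 grid with periodic boundaries, followed by the removal of three additional walls. The use of depth-first search and the helper names are assumptions; only the grid size, the boundary conditions and the wall-removal step are taken from the text.

```python
import random

def generate_maze(side=4, n_extra_removals=3, seed=None):
    """Illustrative maze generator (not the published Algorithm 1): build a
    spanning tree by depth-first search on a periodic grid, then remove a few
    extra walls (three for periodic mazes, four otherwise, per the text)."""
    rng = random.Random(seed)
    n = side * side

    def neighbors(s):
        x, y = s % side, s // side
        return [((x + 1) % side) + y * side,
                ((x - 1) % side) + y * side,
                x + ((y + 1) % side) * side,
                x + ((y - 1) % side) * side]

    # Start with every wall between adjacent states present.
    walls = {frozenset((s, t)) for s in range(n) for t in neighbors(s)}

    # Depth-first search: knock down a wall whenever an unvisited state is entered.
    visited, stack = {0}, [0]
    while stack:
        s = stack[-1]
        unvisited = [t for t in neighbors(s) if t not in visited]
        if unvisited:
            t = rng.choice(unvisited)
            walls.discard(frozenset((s, t)))
            visited.add(t)
            stack.append(t)
        else:
            stack.pop()

    # Last step of the algorithm: remove additional walls to create loops.
    remaining = sorted(tuple(sorted(w)) for w in walls)
    for w in rng.sample(remaining, n_extra_removals):
        walls.discard(frozenset(w))
    return walls  # set of frozenset({s, t}) pairs, one per remaining wall
```

For mazes without periodic boundaries, the neighbor function would omit the wrap-around transitions and n_extra_removals would be set to four, as described above.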

For each environment, a goal location was sampled uniformly at random. When subjects took an action leading to the goal, they transitioned to this location before being teleported to a random location. In the computational model, this was achieved by feeding the agent an input at this location before teleporting the agent to the new location. The policy of the agent at this iteration of the network dynamics was ignored because the agent was teleported rather than taking an action.

RL model

We trained our agent to maximize the expected reward, with the expectation taken both over environments \({{{\mathcal{E}}}}\) and the agent’s policy π:

$${{{\mathcal{U}}}}={{\mathbb{E}}}_{{{{\mathcal{E}}}}}\left[J(\theta )\right]$$
$$={{\mathbb{E}}}_{{{{\mathcal{E}}}}}\left[{{\mathbb{E}}}_{\pi }\left(\mathop{\sum}\limits_{k=1}^{K}{r}_{\rm{k}}\right)\right]$$

Here, \({{{\mathcal{U}}}}\) is the utility function, k indicates the iteration within an episode and rk indicates the instantaneous reward at each iteration. We additionally introduced the following auxiliary losses:

$${{{{\mathcal{L}}}}}_{\rm{V}}=0.5{({V}_{\rm{k}}-{R}_{\rm{k}})}^{2}\,\,{{{\rm{value}}}}\,{{{\rm{function}}}}$$
$${{{{\mathcal{L}}}}}_{\rm{H}}={{\mathbb{E}}}_{\pi }\log \pi \,\,{{{\rm{entropy}}}}\,{{{\rm{regularization}}}}$$
$${{{{\mathcal{L}}}}}_{P}=-\sum\limits_{i}\left[{s}_{k+1}^{(i)}\log {\hat{s}}_{k+1}^{(i)}+{g}^{(i)}\log {\hat{g}}_{k}^{(i)}\right]\,\,{{{\rm{internal}}}}\,{{{\rm{world}}}}\,{{{\rm{model}}}}.$$

Here, \({\hat{g}}_{\rm{k}}\) and \({\hat{s}}_{{{\rm{k}}+1}}\) are additional network outputs containing the agent’s estimate of the current reward location and upcoming state, represented as categorical distributions. g and sk+1 are the corresponding ground-truth quantities, represented as one-hot vectors. \({R}_{\rm{k}}:= \mathop{\sum }\nolimits_{\rm{k}^{{\prime} } = k}^{K}{r}_{\rm{k}^{{\prime} }}\) is the empirical cumulative future reward from iteration k onward and Vk is the value function of the agent.

To maximize the utility and minimize the losses, we trained the RL agent on-policy using a policy gradient algorithm with a baseline33,34 and parameter updates of the form

$${{{{\Delta}}}} \theta \propto \sum\limits_{{ {\rm{a}}_{\rm{k}} \sim \pi}}\left[( \underbrace{\nabla_\theta \log {\pi}_k({\rm{a}_{\rm{k}})}}_{{{\rm{actor}}}} + \underbrace{\beta_v \nabla_{\theta} V_k}_{{{\rm{critic}}}} ) \delta_k - \beta_e \nabla_{\theta} \sum\limits_a \underbrace{\pi_{\rm{k,a}} \log \pi_{\rm{k,a}}}_{{{\rm{entropy}}}} + \underbrace{\beta_p {{{{\Delta}}}} \theta_{\rm{p}}}_{{{\rm{predictive}}}}\right]$$

Here, δk ≔ Rk − Vk is the ‘advantage function’ and \({{\Delta }}{\theta }_{\rm{p}}={\nabla }_{\theta }{{{{\mathcal{L}}}}}_{\rm{P}}\) is the derivative of the predictive loss \({{{{\mathcal{L}}}}}_{P}\), which was used to train the ‘internal model’ of the agent. βp = 0.5, βv = 0.05 and βe = 0.05 are hyperparameters controlling the importance of the three auxiliary losses. While we use the predictive model explicitly in the planning loop, similar auxiliary losses are also commonly used to speed up training by encouraging the learning of useful representations58.
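For concreteness, the following sketch assembles the corresponding scalar training objective for a single episode. It is an illustrative NumPy translation of the equations above rather than the published (Julia) implementation; as is standard for actor–critic methods, the advantage δk is treated as a constant when differentiating the actor term.

```python
import numpy as np

def training_objective(logp_taken, values, returns, policies, pred_nll,
                       beta_v=0.05, beta_e=0.05, beta_p=0.5):
    """Scalar objective for one episode whose minimization corresponds to the
    parameter update above (illustrative sketch).

    logp_taken : log pi_k(a_k) of the sampled actions, shape (K,)
    values     : value estimates V_k, shape (K,)
    returns    : empirical rewards-to-go R_k, shape (K,)
    policies   : full action distributions pi_k, shape (K, n_actions)
    pred_nll   : per-iteration cross-entropies of the state and goal predictions
    """
    delta = returns - values                                    # advantage (held constant)
    actor = -np.sum(logp_taken * delta)                         # policy-gradient term
    critic = beta_v * np.sum(0.5 * (values - returns) ** 2)     # value loss L_V
    entropy_reg = beta_e * np.sum(policies * np.log(np.clip(policies, 1e-12, 1.0)))  # L_H
    world_model = beta_p * np.sum(pred_nll)                     # internal world model, L_P
    return actor + critic + entropy_reg + world_model
```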

Our model consisted of a GRU network with 100 hidden units32 (Supplementary Note 2). The policy was computed as a linear function of the hidden state followed by a softmax normalization. The value function was computed as a linear function of the hidden state. The predictions of the next state and reward location were computed with a neural network that received as input a concatenation of the current hidden state hk and the action ak sampled from the policy (as a one-hot representation). The output layer of this feedforward network was split into a part that encoded a distribution over the predicted next state (a vector of 16 grid locations with softmax normalization) and a part that encoded the predicted reward location in the same way. This network had a single hidden layer with 33 units and a rectified linear nonlinearity.
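To make this architecture concrete, the following NumPy sketch implements the forward computations described above (a 100-unit GRU, linear policy and value heads and a 33-unit ReLU prediction network with softmax outputs over the 16 grid locations). The GRU parameterization, the weight initialization and the unspecified input dimension n_in are illustrative assumptions; the published model was implemented with Flux in Julia.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AgentSketch:
    """Illustrative forward pass of the agent (not the published Julia code)."""

    def __init__(self, n_in, n_h=100, n_actions=5, n_states=16, n_pred=33, seed=0):
        rng = np.random.default_rng(seed)
        g = lambda *shape: rng.normal(0.0, 0.1, shape)
        # GRU parameters (update gate z, reset gate r, candidate state c)
        self.Wz, self.Uz, self.bz = g(n_h, n_in), g(n_h, n_h), np.zeros(n_h)
        self.Wr, self.Ur, self.br = g(n_h, n_in), g(n_h, n_h), np.zeros(n_h)
        self.Wc, self.Uc, self.bc = g(n_h, n_in), g(n_h, n_h), np.zeros(n_h)
        # Linear heads: policy over four moves plus the rollout action, and a scalar value
        self.W_pi, self.w_v = g(n_actions, n_h), g(n_h)
        # Prediction ('internal world model') network with one hidden layer
        self.W1, self.W2 = g(n_pred, n_h + n_actions), g(2 * n_states, n_pred)

    def gru(self, x, h):
        """One step of the recurrent dynamics (a standard GRU parameterization)."""
        z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)
        r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)
        c = np.tanh(self.Wc @ x + self.Uc @ (r * h) + self.bc)
        return (1.0 - z) * h + z * c

    def policy(self, h):
        return softmax(self.W_pi @ h)       # linear readout followed by softmax

    def value(self, h):
        return float(self.w_v @ h)          # linear readout

    def predict(self, h, a_onehot):
        """Predicted next-state and goal-location distributions given (h, a)."""
        hidden = np.maximum(0.0, self.W1 @ np.concatenate([h, a_onehot]))
        out = self.W2 @ hidden
        n = out.size // 2
        return softmax(out[:n]), softmax(out[n:])
```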

The model was trained using Adam59 on 200,000 batches, each consisting of 40 episodes, for a total of 8 × 106 training episodes. These episodes were sampled independently from a total task space of (273 ± 13) × 106 tasks (mean ± s.e.m.). The total task space was estimated by sampling 50,000 wall configurations and computing the fraction of the resulting 1.25 × 109 pairwise comparisons that were identical, divided by 16 to account for the possible reward locations. This process was repeated ten times to estimate a mean and confidence interval. These considerations suggest that the task coverage during training was ~2.9%, which confirms that the majority of tasks seen at test time are novel (although we do not enforce this explicitly).
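The following sketch illustrates this collision-based estimate of the size of the task space, reusing the generate_maze sketch above; hashing wall configurations to count identical pairs is an implementation choice.

```python
import numpy as np

def estimate_n_tasks(n_samples=50_000, n_goals=16, seed=0):
    """Estimate the number of distinct tasks from the collision rate of
    sampled wall configurations (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    counts = {}
    for _ in range(n_samples):
        walls = generate_maze(seed=int(rng.integers(2**31)))
        key = frozenset(walls)                        # hashable maze identity
        counts[key] = counts.get(key, 0) + 1
    n_pairs = n_samples * (n_samples - 1) / 2         # ~1.25e9 pairwise comparisons
    n_identical = sum(c * (c - 1) / 2 for c in counts.values())
    frac = (n_identical / n_pairs) / n_goals          # fraction identical, divided by 16
    return 1.0 / frac if frac > 0 else float('inf')   # estimated number of tasks
```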

For all evaluations of the model, actions were sampled greedily rather than on-policy unless otherwise stated. This was done because the primary motivation for using a stochastic policy is to explore the space of policies to improve learning. Performance was better under the greedy policy at test time.

Planning

Our implementation of planning in the form of policy rollouts is described in Algorithm 2. This routine was invoked whenever a rollout was sampled from the policy instead of a physical action.

Algorithm 2

Planning routine for the RL agent

For the network update following a rollout, the input xk+1 was augmented with an additional ‘rollout input’ consisting of (1) the sequence of simulated actions, each as a one-hot vector, and (2) a binary input indicating whether the imagined sequence of states reached the imagined goal location. Additionally, the time within the session was updated by only 120 ms after a rollout in contrast to the 400-ms update after a physical action or teleportation step. For the analyses with a variable temporal opportunity cost of rollouts (Extended Data Fig. 6), we incremented time by l ⋅ 24 ms after a rollout, where l is the number of simulated actions. In Algorithm 2, we assume access to a function \(O({\tilde{s}}_{\rm{k+l+1}},\tilde{g})\), which returns imagined inputs \({\tilde{{{x}}}}_{\rm{k+l+1}}\). This function is the same as that used to generate inputs from the environment, which means that we assume that the ‘predicted’ input to the RNN during a rollout takes the same form as the ‘sensory’ input from the real environment following an action.

While both an imagined ‘physical state’ \({\tilde{s}}_{\rm{k}}\) and ‘hidden state’ \({\tilde{{{h}}}}_{\rm{k}}\) are updated during the rollout, the agent continues from the original location sk and hidden state hk after the rollout but with an augmented input. Additionally, gradients were not propagated through the rollout process, which was considered part of the ‘environment’. This means that there was no explicit gradient signal that encouraged the policy to drive useful or informative rollouts. Instead, the rollout process simply relied on the utility of the base policy optimized for acting in the environment.
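The full pseudocode of Algorithm 2 is provided with the original publication. The sketch below illustrates the planning loop described above, reusing the AgentSketch class from the architecture sketch. The cap on the number of simulated actions, the argmax decoding of the predicted state and goal and the exact layout of the feedback vector are assumptions; only the overall structure (imagined actions sampled from the agent’s own policy, world-model predictions, imagined inputs driving the same recurrent dynamics and feedback consisting of the simulated action sequence plus a success flag) follows the text.

```python
import numpy as np

N_PHYSICAL = 4       # up, down, left, right; the fifth policy output triggers a rollout
MAX_PLAN_STEPS = 8   # assumed cap on the number of simulated actions per rollout

def plan(agent, h, observe, rng=None):
    """Illustrative planning routine. observe(state_idx, goal_idx) plays the
    role of O(s, g) in the text: it returns the full input vector for that
    imagined situation, with the rollout-feedback slots (assumed to occupy
    the final entries of the input) set to zero."""
    rng = rng or np.random.default_rng()
    h_imag, actions, goal_imag, reached = h.copy(), [], None, False
    while len(actions) < MAX_PLAN_STEPS and not reached:
        # Sample an imagined physical action from the agent's own (on-policy) policy.
        p = agent.policy(h_imag)[:N_PHYSICAL]
        a = int(rng.choice(N_PHYSICAL, p=p / p.sum()))
        actions.append(a)
        # Internal world model: predicted next state and goal location.
        a_onehot = np.eye(agent.W_pi.shape[0])[a]
        state_dist, goal_dist = agent.predict(h_imag, a_onehot)
        s_imag = int(np.argmax(state_dist))
        if goal_imag is None:
            goal_imag = int(np.argmax(goal_dist))   # imagined goal, fixed for this rollout
        reached = (s_imag == goal_imag)
        # The imagined input drives the same recurrent dynamics (no gradients during training).
        h_imag = agent.gru(observe(s_imag, goal_imag), h_imag)
    # Feedback appended to the next real input: one-hot simulated actions plus a success flag.
    feedback = np.zeros(N_PHYSICAL * MAX_PLAN_STEPS + 1)
    for i, a in enumerate(actions):
        feedback[i * N_PHYSICAL + a] = 1.0
    feedback[-1] = float(reached)
    return actions, reached, feedback

def with_feedback(x, feedback):
    """Write the rollout feedback into the (assumed) trailing slots of an input vector."""
    x = np.array(x, dtype=float)
    x[-feedback.size:] = feedback
    return x
```

After the rollout, the agent itself continues from the original hidden state h; only the feedback returned here is appended to its next input, as described above.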

Performance by number of rollouts

To quantify the performance as a function of the number of planning steps in the RL agent (Fig. 3a), we simulated each agent in 1,000 different mazes until it first found the goal and was teleported to a random location. We then proceeded to enforce a particular number of rollouts before the agent was released in trial 2. During this release phase, no more rollouts were allowed; in other words, the policy was renormalized over the physical actions and the probability of performing a rollout was set to zero. Performance was then quantified as the average number of steps needed to reach the goal during this test phase. For the control without feedback, we repeated this analysis with all feedback from the rollouts set to zero, while the recurrent dynamics were allowed to proceed as usual. The optimal reference value was computed as the average optimal path length for the trial 2 starting states.

When performing more than one sequential rollout before taking an action, the policy of the agent can continue to change through two potential mechanisms. The first is that the agent can explicitly ‘remember’ the action sequences from multiple rollouts and somehow arbitrate between them. The second is to progressively update the hidden state in a way that leads to a better expected policy with each rollout because the feedback from a rollout is incorporated into the hidden state that induces the policy used to draw the next rollout. On the basis of the analysis in Supplementary Note 1, we expect the second mechanism to be dominant, although we did not explicitly test the ability of the agent to remember multiple action sequences from sequential rollouts. For these and all other RNN analyses, the agent executed the most likely action under the policy during ‘testing’ in contrast to the sampling performed during training, where such stochasticity is necessary for exploring the space of possible actions. All results were qualitatively similar if actions were sampled during the test phase, although average performance was slightly worse.

Performance in the absence of rollouts and with shuffled rollout times

To quantify the performance of the RL agent in the absence of rollouts, we let the agent receive inputs and produce outputs as normal. However, we set the probability of performing a rollout under the policy to zero and renormalized the policy over the physical actions before choosing an action from the policy. We compared the average performance of the agent (number of rewards collected) in this setting to the performance of the default agent in the same environments.

To compare the original performance to an agent with randomized rollout times, we counted the number of rollouts performed by the default agent in each environment. We then resampled a new set of network iterations at which to perform rollouts, matching the size of this new set to the original number of rollouts performed in the corresponding environment. Finally, we let the agent interact with the environment again, while enforcing a rollout on these network iterations and preventing rollouts at all other time steps. It is worth noting that we could not predict a priori the iterations at which the agent would find the goal, at which point rollouts were not possible. If a rollout was sampled at such an iteration, we resampled this rollout from the set of remaining network iterations.

Rollouts by network size

To investigate how the frequency of rollouts depended on network size (Extended Data Fig. 2), we trained networks with 60, 80 or 100 hidden units (GRUs). Five networks were trained of each size. At regular intervals during training, we tested the networks on a series of 5,000 mazes and computed (1) the average reward per episode and (2) the fraction of actions that were rollouts rather than physical actions. We then plotted the rollout fraction as a function of average reward to see how frequently an agent of a given size performed rollouts for a particular performance.

Effect of rollouts on agent policy

To quantify the effect of rollouts on the policy of the agent, we simulated each agent in 1,000 different mazes until it first found the goal and was teleported to a random location. We then resampled rollouts until both a successful rollout and an unsuccessful rollout had been sampled. Finally, we quantified \({\pi }^{{{{\rm{pre}}}}}({\hat{a}}_{1})\) and \({\pi }^{{{{\rm{post}}}}}({\hat{a}}_{1})\) separately for the two scenarios and plotted the results in Fig. 3e. Importantly, this means that each data point in the successful analysis had a corresponding data point in the unsuccessful analysis with the exact same maze, location and hidden state. In this way, we could query the effect of rollouts on the policy without the confound of how the policy itself affects the rollouts. For this analysis, we discarded episodes where the first 100 sampled rollouts did not result in both a successful and an unsuccessful rollout.
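The sketch below illustrates this rejection-sampling procedure for a single episode, reusing AgentSketch, plan and with_feedback from the sketches above; the environment-input argument x_env and the interface itself are assumptions.

```python
import numpy as np

def matched_rollout_pair(agent, h, x_env, observe, max_tries=100, rng=None):
    """From a fixed hidden state h (matched history), resample rollouts until
    both a successful and an unsuccessful one are obtained, then compare the
    probability of the first simulated action before and after the feedback."""
    rng = rng or np.random.default_rng()
    pre = agent.policy(h)[:4]
    pre = pre / pre.sum()                       # pi^pre over the four physical actions
    out = {}                                    # success flag -> (pi^pre(a1), pi^post(a1))
    for _ in range(max_tries):
        actions, reached, feedback = plan(agent, h, observe, rng=rng)
        if reached in out:
            continue                            # keep sampling until both types are found
        h_post = agent.gru(with_feedback(x_env, feedback), h)   # same matched history h
        post = agent.policy(h_post)[:4]
        post = post / post.sum()
        a1 = actions[0]
        out[reached] = (pre[a1], post[a1])
        if len(out) == 2:
            return out
    return None                                 # discard the episode, as in the text
```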

For Extended Data Fig. 10, we used the same episodes and instead quantified π(rollout) before and after the rollout, repeating the analysis for both successful and unsuccessful rollouts.

Overlap between hidden state updates and policy gradients (Supplementary Note 1)

Using a single rollout (\(\hat{\tau }\)) to approximate the expectation over trajectories of the gradient of the expected future reward for a given episode, ∇hJfut(h), the policy gradient update in h takes the form \({{\Delta }}{{h}}\propto ({R}_{\hat{\tau }}-b){\nabla }_{\rm{{h}}}\log P(\hat{\tau })\). Here, Δh is the change in hidden state resulting from the rollout, \({R}_{\hat{\tau }}\) is the ‘reward’ of the simulated trajectory, b is a constant or state-dependent baseline and \({\nabla }_{{{h}}}\log P(\hat{\tau })\) is the gradient with respect to the hidden state of the log probability of \(\hat{\tau }\) under the policy induced by h. This implies that the derivative of the hidden state update with respect to \({R}_{\hat{\tau }}\), \({{{{\alpha }}}}^{{{{\rm{RNN}}}}}:= \frac{\partial {{\Delta }}{{h}}}{\partial {R}_{\hat{\tau }}}\), should be proportional to \({{{{\alpha }}}}^{{{{\rm{PG}}}}}:= {\nabla }_{\rm{{h}}}\log P(\hat{\tau })\).

For these analyses, we divided \(\hat{\tau }\) into its constituent actions, defining \({{{{\alpha }}}}_{\rm{k}}^{{{{\rm{PG}}}}}:= {\nabla }_{\rm{{h}}}\log p({\hat{a}}_{\rm{k}}| {\hat{a}}_{\rm{1:k-1}})\) as the derivative with respect to the hidden state of the log probability of taking the simulated action at step k, conditioned on the actions at all preceding steps (1 to k − 1) being consistent with the rollout. To compute αRNN, we also needed to take derivatives with respect to \({R}_{\hat{\tau }}\)—the reward of a rollout. A naive choice here would be to simply consider \({R}_{\hat{\tau }}\) to be the input specifying whether the rollout reached the reward. However, we hypothesized that the agent would also use information about, for example, how long the simulated trajectory was in its estimate of the ‘goodness’ of a rollout (because a shorter rollout implies that the goal was found more quickly). We, therefore, determined the direction in planning input state space that was most predictive of the time to goal of the agent. We did this by using linear regression to predict the (negative) time to next reward as a function of the planning feedback xf across episodes and rollouts. This defines the (normalized) direction \({\hat{\nu}}\) in planning input space that maximally increases the expected future reward. Finally, we defined \({R}_{\hat{\tau }}\) as the magnitude of the planning input in direction \({\hat{\nu}}\), \({R}_{\hat{\tau}}:= {{{x}}}_{\rm{f}}\cdot {\hat{\nu}}\). We could then compute αRNN with this definition of \({R}_{\hat{\tau }}\) using automatic differentiation.

In Supplementary Information, Figure S1c, we computed αRNN and \({{{{\alpha }}}}_{1}^{{{{\rm{PG}}}}}\) across 1,000 episodes and subtracted the mean across rollouts for each feature. We then performed principal component analysis on the set of \({{{{\alpha }}}}_{1}^{{{{\rm{PG}}}}}\) and projected both αRNN and \({{{{\alpha }}}}_{1}^{{{{\rm{PG}}}}}\) into the space spanned by the top three principal components (PCs). Finally, we computed the mean value of both quantities conditioned on \({\hat{a}}_{1}\) to visualize the alignment. In Supplementary Information, Figure S1d, we considered the same αRNN and \({{{{\alpha }}}}_{1}^{{{{\rm{PG}}}}}\) vectors. After mean subtraction for each feature and normalization across features for each α, we projected these into the space spanned by the top three PCs of \({{{{\alpha }}}}_{1}^{{{{\rm{PG}}}}}\). Finally, we computed the average across rollouts of the cosine similarity between the pairs of \({{{{\alpha }}}}_{1}^{{{{\rm{PG}}}}}\) and \({{{{\alpha }}}}_{1}^{{{{\rm{RNN}}}}}\) in this latent space. We performed this analysis in a low-dimensional space because we were primarily interested in changes to h within the subspace that would affect \(\log P(\hat{\tau })\). As a control, we repeated the analysis after altering the planning input xf to falsely inform the agent that it had simulated some other action \({\hat{a}}_{1,{{{\rm{ctrl}}}}}\ne {\hat{a}}_{1}\). Finally, we also repeated this analysis using \({{{{\alpha }}}}_{2}^{{{{\rm{PG}}}}}\) to characterize how the effects of the planning input propagated through the recurrent network dynamics to modulate future action probabilities.

Quantification of value functions

To quantify the error of the value function in Extended Data Fig. 5, we compared the value function computed by the agent (Vk) to the true reward-to-go (\({R}_{\rm{k}}={\sum }_{{k}^{{\prime} }\ge k}{r}_{{k}^{{\prime} }}\)). Extended Data Fig. 5a shows the distribution of errors Vk − Rk, while the ‘constant control’ shows the distribution of \({\bar{R}}_{\rm{k}}-{R}_{\rm{k}}\), where \({\bar{R}}_{\rm{k}}\) is the mean reward-to-go across all trials and iterations. These distributions were aggregated across all agents. In Extended Data Fig. 5c, we considered sequences of n consecutive rollouts and computed the average value function before the first rollout and after each rollout. Extended Data Fig. 5d further conditions on all the rollouts in a sequence being unsuccessful.

Human data collection

The human behavioral experiments used in this study were certified as exempt from institutional review board review by the University of California San Diego Human Research Protection Program. We collected data from 100 human participants (50 male and 50 female, aged 19–57) recruited on Prolific to perform the task described in Fig. 1b. All participants provided informed consent before commencing the experiment. Subjects were asked to complete six ‘guided’ episodes where the optimal path was shown explicitly, followed by 40 nonguided episodes and 12 guided episodes. The task can be found online. During data collection, a subject was deemed ‘disengaged’ and the trial was repeated if one of three conditions was met: (1) the same key was pressed five times in a row; (2) the same key pair was pressed four times in a row; or (3) no key was pressed for 7 s. Participants were paid a fixed rate of US $3 plus a performance-dependent bonus of US $0.002 for each completed trial across both guided and nonguided episodes. The experiment took approximately 22 minutes to complete and the average pay across participants was US $10.5 per hour including the performance bonus. For the experiment without periodic boundaries (Extended Data Fig. 1), we collected data from 49 human participants (25 male and 24 female). The experiment was performed as described above, with the only difference being that we used mazes where participants could not move through the boundaries.

The data from six participants with a mean response time greater than 690 ms during the guided episodes were excluded to avoid including participants who were not sufficiently engaged with the task. For the guided episodes, only the last 10 episodes were used for further analyses. For the nonguided episodes, we discarded the first two episodes and used the last 38 episodes. This gave participants two episodes to get used to the task in each of the two conditions, and the first set of guided episodes served as instruction on how to perform the task.

Performance as a function of trial number

We considered all episodes where the humans or RL agents completed at least four trials, evaluating the RL agents across 50,000 episodes. We then computed the average across these episodes of the number of steps to goal as a function of trial number separately for all subjects. Figure 2a illustrates the mean and s.e.m. across subjects (human participants or RL agents). The optimal value during the exploitation phase was computed by using dynamic programming to find the shortest path between each possible starting location and the goal location, averaged across all environments seen by the RL agent. To compute the exploration baseline, a brute-force search was used to identify the path that explored the full environment as quickly as possible. The optimal exploration performance was then computed as the expected time to first reward under this policy, averaged over all possible goal locations.
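For illustration, the shortest path length from every state to the goal can be computed with a breadth-first search on the maze graph (a simple instance of the dynamic programming referred to above), here using the wall representation of the maze-generation sketch.

```python
from collections import deque

def shortest_path_lengths(walls, goal, side=4):
    """Breadth-first search from the goal over the periodic maze graph,
    returning the shortest path length from every reachable state."""
    def neighbors(s):
        x, y = s % side, s // side
        cands = [((x + 1) % side) + y * side,
                 ((x - 1) % side) + y * side,
                 x + ((y + 1) % side) * side,
                 x + ((y - 1) % side) * side]
        return [t for t in cands if frozenset((s, t)) not in walls]

    dist, queue = {goal: 0}, deque([goal])
    while queue:
        s = queue.popleft()
        for t in neighbors(s):
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return dist   # dist[start] = optimal number of steps from 'start' to the goal
```

Averaging dist[start] over starting locations and environments gives the optimal reference values described above.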

Estimation of thinking times

In broad strokes, we assumed that, for each action, the response time tr is the sum of a thinking time tt and some perception–action delay td, both subject to independent variability:

$${t}_{{{{\rm{r}}}}}={t}_{{{{\rm{t}}}}}+{t}_{{{{\rm{d}}}}}\quad {{{\rm{with}}}}\quad {t}_{{{{\rm{t}}}}} \sim {p}_{{{{\rm{t}}}}}\quad {{{\rm{and}}}}\quad {t}_{{{{\rm{d}}}}} \sim {p}_{{{{\rm{d}}}}}.$$

Here, {tr, tt, td} ≥ 0 because elapsed time cannot be negative. We assumed that the prior distribution over perception–action delays, pd, was identical during guided and nonguided trials. For each subject, we obtained a good model of pd (see below) by considering the distribution of response times measured during guided trials. This was possible because guided trials involved no thinking by definition, such that td ≡ tr was directly observed. Finally, for any nonguided trial with observed response tr, we formed a point estimate of the thinking time by computing the mean of the posterior p(tttr):

$${\hat{t}}_{{{{\rm{t}}}}| {t}_{{{{\rm{r}}}}}}={{\mathbb{E}}}_{\rm{p}({t}_{{{{\rm{t}}}}}| {t}_{{{{\rm{r}}}}})}[{t}_{{{{\rm{t}}}}}].$$

In more detail, we took pt during nonguided trials to be uniform between 0 and 7 s—the maximum response time allowed, beyond which subjects were considered disengaged, and the trial was discarded and reset. For pd(td), we assumed a shifted log-normal distribution,

$${p}_{{{{\rm{d}}}}}({t}_{{{{\rm{d}}}}};\mu ,\sigma ,\delta )=\left\{\begin{array}{cc}\frac{1}{({t}_{{{{\rm{d}}}}}-\delta )\sigma \sqrt{2\pi }}\exp \left[-\frac{{(\log ({t}_{{{{\rm{d}}}}}-\delta )-\mu )}^{2}}{2{\sigma }^{2}}\right]&{{{\rm{if}}}}\,{t}_{\rm{d}} > \delta \\ 0&{{{\rm{otherwise}}}}\end{array}\right.$$

where parameters μ, σ and δ were obtained from a maximum-likelihood estimation based on the collection of response times tr ≡ td observed during guided trials. For a given δ, the maximum-likelihood values of μ and σ are simply given by the mean and s.d. of the logarithm of the shifted observations. Thus, to fit this shifted log-normal model, we performed a grid search over \(\delta \in [0,\min ({t}_{r}^{{{{\rm{guided}}}}})-1]\) at 1-ms resolution and selected the value under which the optimal (μ, σ) gave the largest likelihood. This range of δ was chosen to ensure that (1) only positive values of \({t}_{r}^{{{{\rm{guided}}}}}\) had positive probability and (2) all observed \({t}_{r}^{{{{\rm{guided}}}}}\) had nonzero probability. We then retained the optimal μ, σ and δ to define the prior over pd(td) on nonguided trials for each subject.
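The sketch below illustrates this fitting procedure; response times are assumed to be in milliseconds and the function name is illustrative.

```python
import numpy as np

def fit_delay_prior(t_guided_ms):
    """Maximum-likelihood fit of the shifted log-normal delay prior p_d: for
    each candidate shift delta on a 1-ms grid, mu and sigma are the mean and
    s.d. of log(t - delta), and the delta with the highest likelihood is kept."""
    t = np.asarray(t_guided_ms, dtype=float)
    best_ll, best_params = -np.inf, None
    for delta in np.arange(0.0, np.min(t) - 1.0 + 1e-9, 1.0):
        x = np.log(t - delta)
        mu, sigma = x.mean(), x.std()
        if sigma <= 0:
            continue
        # Shifted log-normal log-likelihood of the guided response times
        ll = np.sum(-x - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
                    - (x - mu) ** 2 / (2.0 * sigma ** 2))
        if ll > best_ll:
            best_ll, best_params = ll, (mu, sigma, delta)
    return best_params   # (mu, sigma, delta) defining p_d for this participant
```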

According to Bayes’ rule, the posterior is proportional to

$$p({t}_{{{{\rm{t}}}}}| {t}_{{{{\rm{r}}}}})\propto p({t}_{{{{\rm{r}}}}}| {t}_{{{{\rm{t}}}}})p({t}_{{{{\rm{t}}}}})$$

where

$$p({t}_{{{{\rm{r}}}}}| {t}_{{{{\rm{t}}}}})=\int\nolimits_{0}^{\infty }{\rm{d}}{t}_{{{{\rm{d}}}}}\,{p}_{{{{\rm{d}}}}}({t}_{{{{\rm{d}}}}})\,p({t}_{{{{\rm{r}}}}}| {t}_{{{{\rm{t}}}}},{t}_{{{{\rm{d}}}}})$$
$$=\int\nolimits_{0}^{\infty }{\rm{d}}{t}_{{{{\rm{d}}}}}\,{p}_{{{{\rm{d}}}}}({t}_{{{{\rm{d}}}}})\,\delta ({t}_{{{{\rm{d}}}}}-({t}_{{{{\rm{r}}}}}-{t}_{{{{\rm{t}}}}}))$$
$$={p}_{{{{\rm{d}}}}}({t}_{{{{\rm{r}}}}}-{t}_{{{{\rm{t}}}}})$$

Therefore, the posterior is given by

$$p({t}_{{{{\rm{t}}}}}| {t}_{{{{\rm{r}}}}})\propto \left\{\begin{array}{ll}{p}_{{{{\rm{d}}}}}({t}_{{{{\rm{r}}}}}-{t}_{{{{\rm{t}}}}})&{{{\rm{if}}}}\,{t}_{{{{\rm{t}}}}} > 0\\ 0&{{{\rm{otherwise}}}},\end{array}\right.$$

resulting in the following posterior mean:

$${\hat{t}}_{{{{\rm{t}}}}| {t}_{{{{\rm{r}}}}}}:= {{\mathbb{E}}}_{p({t}_{{{{\rm{t}}}}}| {t}_{{{{\rm{r}}}}})}[{t}_{{{{\rm{t}}}}}]={t}_{{{{\rm{r}}}}}-\int\nolimits_{\delta }^{{t}_{{{{\rm{r}}}}}}\,{t}_{{{{\rm{d}}}}}\,{p}_{{{{\rm{d}}}}}({t}_{{{{\rm{d}}}}}| {t}_{{{{\rm{d}}}}} < {t}_{{{{\rm{r}}}}};\mu ,\sigma ,\delta )\,{\rm{d}}{t}_{{{{\rm{d}}}}}.$$

Here, pd(td | td < tr) denotes pd(td) renormalized over the interval td < tr and the condition (td < tr) is equivalent to (tt > 0). We note that the integral runs from δ to tr because pd(td) = 0 for td < δ. Because δ simply shifts the distribution over td, we can rewrite this as

$${\hat{t}}_{{{{\rm{t}}}}| {t}_{{{{\rm{r}}}}}}={t}_{{{{\rm{r}}}}}-\delta -\int\nolimits_{0}^{{t}_{{{{\rm{r}}}}}-\delta }\,x\,{p}_{{{{\rm{d}}}}}(x| x < {t}_{{{{\rm{r}}}}}-\delta ;\mu ,\sigma ,\delta =0)\,{\rm{d}}x.$$

This is useful because the conditional expectation of a log-normally distributed random variable with δ = 0 is given in closed form by

$${{\mathbb{E}}}_{\mu ,\sigma }[x| x < k]=\int\nolimits_{0}^{k}\,x\,p(x| x < k;\mu ,\sigma ,\delta =0)\,{\rm{d}}x$$
$$=\exp [\mu +0.5{\sigma }^{2}]\frac{{{\Phi }}\left(\frac{\log (k)-\mu -{\sigma }^{2}}{\sigma }\right)}{{{\Phi }}\left(\frac{\log (k)-\mu }{\sigma }\right)},$$

where Φ( ⋅ ) is the cumulative distribution function of the standard Gaussian, \({{{\mathcal{N}}}}(0,1)\). This allows us to compute the posterior mean thinking time for an observed response time tr in closed form as

$${\hat{t}}_{t| {t}_{{{{\rm{r}}}}}}={t}_{{{{\rm{r}}}}}-\delta -{{\mathbb{E}}}_{\mu ,\sigma }[x| x < {t}_{{{{\rm{r}}}}}-\delta ].$$

We note that the support of pd(td | td < tr; μ, σ, δ) is td ∈ [δ, tr]. For 0.6% of the nonguided decisions, the value of tr was lower than the estimated δ for the corresponding participant, in which case p(tt | tr) was undefined. In such cases, we defined the thinking time to be \({\hat{t}}_{{{{\rm{t}}}}| {t}_{{{{\rm{r}}}}}}=0\) because the response time was shorter than our estimated minimum perception–action delay. A necessary (but not sufficient) condition for tr < δ is that tr is smaller than the smallest response time in the guided trials.
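
The resulting estimator can be written in a few lines. The sketch below (not the authors' code) uses the closed-form truncated log-normal mean and returns zero when tr ≤ δ, as described above; all times are assumed to be in the same units as the fitted parameters.

```python
import numpy as np
from scipy.stats import norm

def thinking_time(t_r, mu, sigma, delta):
    """Posterior-mean thinking time for a single response time t_r."""
    if t_r <= delta:
        return 0.0                       # response faster than the estimated minimum delay
    k = t_r - delta
    # E[x | x < k] for a log-normal(mu, sigma) with no shift, in closed form
    trunc_mean = (np.exp(mu + 0.5 * sigma ** 2)
                  * norm.cdf((np.log(k) - mu - sigma ** 2) / sigma)
                  / norm.cdf((np.log(k) - mu) / sigma))
    return t_r - delta - trunc_mean
```
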

The whole procedure of fitting and inference described above was repeated separately for actions that immediately followed a teleportation step (that is, the first action in each trial) and for all other actions. This is because we expected the first action in each trial to be associated with an additional perceptual delay compared to actions that followed a predictable transition.

While this approach partially dissociates thinking from other forms of sensorimotor processing, the thinking times reported in this work still represent only a best estimate given the available data, and we use ‘thinking’ to refer to any internal computational process guiding decision-making. This does not necessarily imply a conscious process that we can introspect, because decision-making occurs on a fast timescale of hundreds of milliseconds.

All results were qualitatively similar using other methods for estimating thinking time, including (1) a log-normal prior over td with no shift (δ = 0); (2) using the posterior mode instead of the posterior mean; (3) estimating a constant td from the guided trials; and (4) estimating a constant td as the 0.1 or 0.25 quantile of tr from the nonguided trials.

Thinking times in different situations

To investigate how the thinking time varied in different situations, we considered only exploitation trials and computed for every action (1) the minimum distance to the goal at the beginning of the corresponding trial and (2) the action number within the trial. We then computed the mean thinking time as a function of action number, separately for each initial distance to goal. This analysis was repeated across experimental subjects and the results were reported as the mean and s.e.m. across subjects.

We repeated this analysis for the RL agents, where thinking time was now defined on the basis of the average number of rollouts performed, conditioned on action within trial and initial distance to goal.
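
A minimal sketch of this conditioning (not the authors' code), assuming a hypothetical table with one row per exploitation action; the column names are placeholders and 'measure' stands for the thinking time (humans) or the number of rollouts (RL agents).

```python
import pandas as pd

def condition_on_situation(df):
    """df columns: 'subject', 'init_dist' (distance to goal at trial start),
    'action_num' (action number within trial), 'measure'."""
    per_subject = (df.groupby(['subject', 'init_dist', 'action_num'])['measure']
                     .mean().reset_index())
    # mean and s.e.m. across subjects for each (initial distance, action number) pair
    return (per_subject.groupby(['init_dist', 'action_num'])['measure']
                       .agg(['mean', 'sem']))
```
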

Comparison of human and model thinking times

For each subject and each RL agent, we clamped the trajectory of the agent to that taken by the subject (that is, we used the human actions instead of sampling from the policy). After taking an action, we recorded π(rollout) under the model on the first time step of the new state for comparison to human thinking times. We then sampled a rollout with probability π(rollout) and took an action (identical to the next human action) with probability 1 − π(rollout), repeating this process until the next state was reached. Finally, we computed the average π(rollout) across 20 iterations of each RL agent for comparison to the human thinking time in each state. Figure 2e shows the human thinking time as a function of π(rollout), with the bars and error bars illustrating the mean and s.e.m. in each bin. For this analysis, data were aggregated across all participants. Results were similar if we compared human thinking times with the average number of rollouts performed rather than the initial π(rollout).

In Fig. 2f, we computed the correlation between thinking time and various regressors on a participant-by-participant basis and reported the result as the mean and s.e.m. across participants (n = 94). For the residual correlation, we first computed the mean thinking time for each momentary distance to goal for each participant and the corresponding mean π(rollout) for the RL agents. We then subtracted the appropriate mean values from the thinking times (for human participants) and π(rollout) (for RL agents). For example, from each individual data point with a momentary distance of five steps from the goal, we subtracted the average across all situations where the momentary position was five steps from the goal, and we proceeded similarly for all other distances. Finally, we computed the correlation between the residual π(rollout) and the residual thinking times. This analysis was repeated across all participants and the result was reported as the mean and s.e.m. across participants. Note that all measures of the distance to goal refer to the shortest path to goal rather than the number of steps actually taken by the participant to reach the goal.
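
The residual correlation amounts to removing the distance-conditioned means from both quantities before correlating them. A minimal per-participant sketch (not the authors' code), assuming the three inputs are aligned per action:

```python
import numpy as np

def residual_correlation(dist, think, p_rollout):
    """Correlation between thinking time and pi(rollout) after removing, from each,
    its mean conditioned on the momentary distance to goal."""
    dist = np.asarray(dist)
    res_t = np.asarray(think, dtype=float).copy()
    res_p = np.asarray(p_rollout, dtype=float).copy()
    for d in np.unique(dist):
        mask = dist == d
        res_t[mask] -= res_t[mask].mean()      # subtract distance-conditioned means
        res_p[mask] -= res_p[mask].mean()
    return np.corrcoef(res_t, res_p)[0, 1]
```
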

Analysis of hippocampal replays

For our analyses of hippocampal replays in rats, we used data recently recorded by Widloski and Foster7. This dataset consisted of a total of 37 sessions from three rats (n = 17, 12 and 8 sessions for each rat) as they performed a dynamic maze task. This task was carried out in a square arena with nine putative reward locations. In each session, six walls were placed in the arena and a single reward location was randomly selected as the home well. The task involved alternating between moving to this home well and a randomly selected away well. Importantly, a delay of 5–15 s was imposed between the animal leaving the previously rewarded well and the reward (chocolate milk) becoming available at the next rewarded well. On the away trials, the emergence of the reward was also accompanied by a visual cue at the rewarded well, informing the animal that this was the reward location. We considered only replays that occurred at the previous well, before this visual cue appeared and the reward became available. In a given session, the animals generally performed around 80 trials (40 home trials and 40 away trials; Extended Data Fig. 8). For further task details, refer to Widloski and Foster7.

For our analyses, we included only trials that lasted less than 40 s. We did this to discard time periods where the animals were not engaged with the task. Additionally, we discarded the first home trial of each session, where the home location was unknown, because we wanted to compare the hippocampal replays with model rollouts during the exploitation phase of the maze task. For all analyses, we discretized the environment into a 5 × 5 grid (the 3 × 3 grid of wells and an additional square of states around these) to facilitate more direct comparisons with our human and RNN task. Following Widloski and Foster7, we defined ‘movement epochs’ as times where the animal had a velocity greater than 2 cm s−1 and ‘stationary epochs’ as times where the animal had a velocity less than 2 cm s−1.

Replay detection

To detect replays, we followed Widloski and Foster7 and fitted a Bayesian decoder to neural activity as a function of position during movement epochs in each session, assuming Poisson noise statistics and considering only neurons with an average firing rate of at least 0.1 Hz over the course of the session. This decoder was trained on a rolling window of neural activity spanning 75 ms and sampled at 5-ms intervals7. We then detected replays during stationary epochs by classifying each momentary hippocampal state as the maximum-likelihood state under the Bayesian decoder, again using neural activity in 75-ms windows at 5-ms intervals. Forward replays were defined as sequences of states that included two consecutive transitions to an adjacent state (that is, a temporally and spatially contiguous sequence of three or more states) and originated at the true animal location. For all animals, we analyzed only replays where the animal was at the previous reward location before it initiated the new trial (refer to Widloski and Foster7). To increase noise robustness, we allowed for short ‘lapses’ in a replay, defined as periods with a duration less than or equal to 20 ms, where the decoded location moved to a distant location before returning to the previously decoded location. These lapses were ignored for downstream analyses.
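
For intuition, maximum-likelihood decoding of a single 75-ms window under independent Poisson noise reduces to the computation below. This is a simplified sketch rather than the authors' pipeline; 'rates' stands for the place-field tuning estimated during movement epochs.

```python
import numpy as np

def decode_ml_state(counts, rates, window=0.075):
    """counts: spike counts per neuron in one 75-ms window;
    rates: (n_states, n_neurons) firing rates (Hz) fit during movement epochs.
    Returns the maximum-likelihood spatial state under independent Poisson noise."""
    lam = rates * window + 1e-12                         # expected counts per state
    log_lik = counts @ np.log(lam).T - lam.sum(axis=1)   # Poisson log likelihood up to a constant
    return int(np.argmax(log_lik))
```
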

Wall avoidance

To compute the wall avoidance of replays (Fig. 4b), we calculated the fraction of state transitions that passed through a wall. This was done across all replays preceding a home trial (that is, when the animal knew the next goal). As a control, we computed the same quantity averaged over seven control conditions, which corresponded to the remaining nonidentical rotations and reflections of the walls from the corresponding session. We repeated this analysis for all sessions and reported the results in Fig. 4 as the mean and s.e.m. across sessions. To test for significance, we randomly permuted the ‘true’ and ‘control’ labels independently for each session and computed the fraction of permutations (out of 10,000) where the difference between ‘control’ and ‘true’ was larger than the experimentally observed value. This analysis was also repeated in the RL agent, where the control value was computed with respect to 50,000 other wall configurations sampled from the maze-generating algorithm (Algorithm 1).
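
The session-level permutation test can be sketched as follows (not the authors' code): each permutation independently swaps the 'true' and 'control' values within a session, and the P value is the fraction of permutations whose control-minus-true difference exceeds the observed one.

```python
import numpy as np

def permutation_pvalue(true_vals, control_vals, n_perm=10_000, seed=None):
    """true_vals/control_vals: per-session fractions (e.g., of transitions through walls)."""
    rng = np.random.default_rng(seed)
    true_vals, control_vals = np.asarray(true_vals), np.asarray(control_vals)
    observed = np.mean(control_vals - true_vals)
    count = 0
    for _ in range(n_perm):
        flip = rng.random(true_vals.size) < 0.5          # independent label swap per session
        diff = np.where(flip, true_vals - control_vals, control_vals - true_vals)
        count += diff.mean() > observed
    return count / n_perm
```
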

Reward enrichment

To compute the reward enrichment in hippocampal replays (Fig. 4c), we computed the fraction of all replays preceding a home trial that passed through the reward location. As a control, we repeated this analysis for the remaining seven locations that were neither the reward location nor the current agent location (for each replay). Control values were reported as the average across these seven control locations across all replays. This analysis was repeated for all sessions. While this can lead to systematic differences between true and control values in individual trials, depending on how close the true reward location is to the current animal location, the distance from the current animal location is the same in expectation for the true reward location and the control locations. This is also why we did not see an effect for the away trials in Extended Data Fig. 9c.

To test for significance, we randomly permuted the ‘goal’ and ‘control’ labels independently for each session. Here, the goal and control values permuted were those computed by averaging across all trials and control locations in the session (that is, we randomly swapped the 37 pairs of data points shown in gray in Fig. 4c). We then computed the fraction of permutations (out of 10,000) where the difference between goal and control was larger than the experimentally observed value after averaging across sessions.

This analysis was also repeated in the RL agent, where the control value was computed across the 14 locations that were not the current agent location or the true goal.

Behavior by replay type

To investigate how the animal behavior depended on the type of replay (Fig. 4d), we analyzed home trials and away trials separately. We constructed a list of all the ‘first’ replayed actions \({\hat{a}}_{1}\), defined as the cardinal direction corresponding to the first state transition in each replay. We then constructed a corresponding list of the first physical actions \({a}_{1}\) following each replay, defined as the cardinal direction of the first physical state transition after the replay. Finally, we computed the overlap between these two vectors to arrive at the probability of ‘following’ a replay. This overlap was computed separately for successful and unsuccessful replays, where successful replays were defined as those that reached the goal without passing through a wall. For the unsuccessful replays, we considered the seven remaining locations that were not the current animal location or current goal. We then computed the average overlap under the assumption that each of these locations was the goal, while discarding replays that were successful for the true goal. The reason for not considering replays that were successful for the true goal in the unsuccessful setting is that we were primarily interested in the distinction between replays that were successful versus unsuccessful to the true goal and, therefore, wanted disjoint sets of replays in these two analyses. However, we did this while considering whether replays were successful toward a control location to better match the spatiotemporal statistics of replays in the two categories. The analysis was performed independently across all sessions and results were reported as the mean and s.e.m. across sessions. To test for significance, we randomly permuted the successful and unsuccessful labels independently for each session and computed the fraction of permutations (out of 10,000) where the difference between successful and unsuccessful replays was larger than the experimentally observed value.
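
Once the first replayed actions, first physical actions and success labels have been extracted, the core overlap computation is a comparison of two vectors. The sketch below (not the authors' code) omits the control-location averaging described above.

```python
import numpy as np

def p_follow(first_replay_actions, first_physical_actions, successful):
    """Probability that the first physical action matches the first replayed action,
    split by whether the replay was successful."""
    a_hat = np.asarray(first_replay_actions)
    a1 = np.asarray(first_physical_actions)
    success = np.asarray(successful, dtype=bool)
    p_succ = np.mean(a1[success] == a_hat[success])
    p_unsucc = np.mean(a1[~success] == a_hat[~success])
    return p_succ, p_unsucc
```
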

To confirm that our results were not biased by the choice to exclude replays that were successful to the true goal location from our set of unsuccessful replays, we performed an additional control analysis, where the control replays were the full set of replays that were successful toward a randomly sampled control location, concatenated across control locations. In this case, the control value for the home trials was \(P({a}_{1}={\hat{a}}_{1})=0.433\) instead of \(P({a}_{1}={\hat{a}}_{1})=0.403\) with the disjoint set of unsuccessful replays used in the main text. This is still significantly smaller than the value of \(P({a}_{1}={\hat{a}}_{1})=0.622\) for the true goal location, with our permutation test in both cases yielding P < 0.001 for the home trials and no significant effect for the away trials.

This analysis was also repeated for the RL agent, where we considered all exploitation trials together because they were not divided into home or away trials. In this case, the control was computed with respect to all 14 locations that were not the current location or current goal location.

Effect of consecutive replays

To compute how the probability of a replay being successful depended on replay number (Fig. 4e), we considered all trials where an animal performed at least three replays. We then computed a binary vector indicating whether each replay was successful. From this vector, we subtracted the expected success frequency from a linear model predicting success from (1) the time since arriving at the current well and (2) the time until departing the current well. We did this to account for any effect of time that was separate from the effect of replay number because such an effect was previously reported by Ólafsdóttir et al.39. However, this work also notes that many of what they denoted as disengaged replays were nonlocal and would automatically be filtered out by our focus on local replays. When fitting this linear model, we capped all time differences at a maximum value of ∣Δt∣ = 15 s to avoid the analysis being dominated by outliers and because Ólafsdóttir et al.39 observed an effect for time differences only in this range. Our results were not sensitive to altering or removing this threshold. We then computed the probability of success (after regressing out time) conditioned on replay number. Finally, we repeated this analysis for all seven control locations for each replay and divided the true values by control values defined as the average across replays of the average across control locations. A separate correction factor, computed by fitting a linear model to predict the average probability of successfully reaching a control location as a function of the predictors described above, was subtracted from these control values. The normalization by control locations was performed to account for changes in replay statistics that might affect the results, such as systematically increasing or decreasing replay durations with replay number. To compute the statistical significance of the increase in goal over-representation, we also performed this analysis after independently permuting the order of the replays in each trial to break any temporal structure. This permutation was performed after regressing out the effect of time. We repeated this analysis across 10,000 independent permutations and computed statistical significance as the fraction of permutations for which the increase in over-representation was greater than or equal to the experimental value.
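
A simplified sketch of the time correction (not the authors' code, and omitting the control-location normalization): success is regressed on the two capped time predictors, and the residual success is then averaged as a function of replay number.

```python
import numpy as np

def success_by_replay_number(success, t_since_arrival, t_to_departure, replay_num, cap=15.0):
    """Average success per replay number after regressing out capped time predictors."""
    y = np.asarray(success, dtype=float)
    rn = np.asarray(replay_num)
    dt1 = np.clip(np.asarray(t_since_arrival, dtype=float), None, cap)   # cap at 15 s
    dt2 = np.clip(np.asarray(t_to_departure, dtype=float), None, cap)
    X = np.column_stack([np.ones_like(dt1), dt1, dt2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # linear model: success ~ time predictors
    residual = y - X @ beta                          # success with the time effect removed
    return {int(n): residual[rn == n].mean() for n in np.unique(rn)}
```
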

For the corresponding analysis in the RL agents, we did not regress out time because there is no separability between time and replay number. Additionally, the RL agent cannot alter its policy in the absence of explicit network updates, which are tied to either a rollout or an action in our model. As noted in the main text, an increase in the probability of success with replay number in the RL agent could also arise from the fact that performing further replays is less likely after a successful replay than after an unsuccessful replay (Extended Data Fig. 10). We, therefore, performed the analysis of consecutive replays in the RL agent in a ‘cross-validated’ manner at the level of the policy. In other words, every time the agent performed a rollout, we drew two samples from the rollout generation process. The first of these samples was used as normal by the agent to update hk and drive future behavior. The second sample was not used by the agent but was instead used to compute the ‘success frequency’ for our analyses. This was done to break the correlation between the choice of performing a replay and the assessment of how good the policy was, which allowed us to compute an unbiased estimate of the quality of the policy as a function of replay number. As mentioned in the main text, such an analysis was not possible for the biological data. However, because the biological task was not a reaction time task, we expect less of a causal effect of replay success on the number of replays. Additionally, as noted in the text, if some of the effect in the biological data is in fact driven by a decreased propensity for further replays after a successful replay, that is in itself supporting evidence for a theory of replays as a form of planning.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.