Extended Data Fig. 4: Analyses of networks with different hyperparameters.

From: A recurrent network model of planning explains hippocampal replay and human behavior

To investigate the robustness of our results to network size (N) and maximum planning horizon (L), we trained five networks for each combination of N ∈ {60, 100, 140} and L ∈ {4, 8, 12}. The results in the main text are all reported for a network with N = 100 and L = 8. (a) Correlation between human response times and the mean π(rollout) across five RL agents for each set of hyperparameters (cf. Fig. 2f). x-ticks indicate network size and planning horizon as (N, L). Error bars indicate the standard error of the mean across human participants (gray dots; n = 94). (b) Improvement in the network policy after five rollouts compared to the policy in the absence of rollouts (cf. Fig. 3a). Policy improvement was quantified as the average number of steps needed to reach the goal on trial 2 in the absence of rollouts, minus the average number of steps needed with five rollouts enforced at the beginning of the trial; positive values indicate that rollouts improved the policy. Bars and error bars indicate the mean and standard error across five RL agents (gray dots). (c) For each set of hyperparameters, we computed the average change in \(\pi({\hat{a}}_{1})\) from before a rollout to after it, reported separately for successful (‘succ’) and unsuccessful (‘un’) rollouts (cf. Fig. 3e). Positive values indicate that \({\hat{a}}_{1}\) became more likely after the rollout, and negative values that it became less likely. Bars and error bars indicate the mean and standard error across five RL agents (gray dots). Networks with longer planning horizons tend to show less positive \(\Delta \pi({\hat{a}}_{1})\) for successful rollouts and more negative \(\Delta \pi({\hat{a}}_{1})\) for unsuccessful rollouts. This is consistent with a policy gradient-like algorithm (Supplementary Note 1) with a baseline that approximates the probability of success, which increases with planning horizon: because longer rollouts are more likely to reach the goal, a successful rollout is largely expected and should only weakly update the policy, whereas an unsuccessful rollout is less expected and should drive a larger policy change.
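The baseline argument can be made concrete with a minimal sketch (not the authors' code): assuming a REINFORCE-style advantage R − b, with reward R = 1 for a successful rollout, R = 0 otherwise, and baseline b ≈ P(success), the update magnitude shrinks for successes and grows for failures as the success probability rises with L. The horizon-dependent success probabilities below are purely illustrative, not values from the paper.

```python
# Minimal sketch of a policy-gradient update with a success-probability
# baseline. The advantage (R - b) sets the sign and magnitude of the
# change in pi(a_hat_1) after a rollout.

def advantage(R: float, b: float) -> float:
    """REINFORCE-style advantage: reward minus baseline."""
    return R - b

# Illustrative (assumed) success probabilities growing with horizon L.
for L, p_success in [(4, 0.4), (8, 0.6), (12, 0.8)]:
    succ = advantage(1.0, p_success)    # successful rollout (R = 1)
    unsucc = advantage(0.0, p_success)  # unsuccessful rollout (R = 0)
    print(f"L={L:2d}: successful update ∝ {succ:+.2f}, "
          f"unsuccessful update ∝ {unsucc:+.2f}")
```

As the assumed success probability increases with L, the positive update for successful rollouts shrinks while the negative update for unsuccessful rollouts grows in magnitude, matching the trend in panel (c).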
