Stimulus Selection in a Q-learning Model Using Fisher Information and Monte Carlo Simulation

Fujita, Kazuya; Okada, Kensuke; Katahira, Kentaro

doi:10.1007/s42113-022-00163-0

Original Paper
Published: 30 January 2023

Volume 6, pages 262–279, (2023)
Computational Brain & Behavior


Abstract

Reinforcement learning models have been extensively studied for decision-making tasks with reward feedback. However, in designing an experiment to collect data for Q-learning models, the quantitative effect of a presented stimulus on the estimation precision of participant parameters has generally not been considered. That is, the lack of a mathematical framework has prevented researchers from designing an optimal experiment. To tackle this problem, this study analytically derives the Fisher information. Furthermore, this study formulates a stochastic representation of the Q-learning model, which is one of the most commonly applied reinforcement learning models. With this derivation, a two-step procedure is proposed to select the optimal stimuli in terms of estimation precision, in which low-cost Fisher information evaluation and more detailed finite-sample Monte Carlo simulation are combined. The simulation studies show that reward probability reversal leads to a high estimation precision for the learning rate parameter. By contrast, for the inverse temperature parameter, a larger difference in reward probability between options leads to higher estimation precision. These results reveal that the optimal experimental design is dependent on which trait parameters of the Q-learning model are of interest to researchers. Further, it is found that the use of undesirable stimuli in terms of trait parameter precision leads to a large bias in the correlation coefficient estimate. Based on the results, the approaches to designing experiments in the Q-learning model are discussed.

Data Availability

The datasets generated and/or analyzed during the current study are available from the Open Science Framework repository at https://osf.io/msf5e/.

Code Availability

The R code used to produce the results in the five simulation studies is available from the Open Science Framework repository at https://osf.io/msf5e/.

Funding

This work was supported by grants from the Japan Society for the Promotion of Science (Grant Numbers 1920J22350, 18H03612).

Author information

Authors and Affiliations

Graduate School of Informatics, Nagoya University, Nagoya, Japan
Kazuya Fujita & Kentaro Katahira
Japan Society for the Promotion of Science, Tokyo, Japan
Kazuya Fujita
Graduate School of Education, The University of Tokyo, Tokyo, Japan
Kensuke Okada

Corresponding author

Correspondence to Kazuya Fujita.

Ethics declarations

Ethical Approval

Not applicable.

Consent to Publish

Not applicable.

Consent to Participate

Not applicable.

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 493 KB)

Appendix 1

In this appendix, we derive the Fisher information matrix for CAT. As noted previously, the key here is to regard ${r}_{t,k}$ as a random variable. Then, the log-likelihood is given by

$$\begin{array}{c}\mathrm{log}p\left({{\varvec{r}}}_{\left({\varvec{T}}\right)},\boldsymbol{ }{{\varvec{u}}}_{\left({\varvec{T}}\right)}|{\varvec{\theta}},\boldsymbol{ }{{\varvec{z}}}_{\left({\varvec{T}}\right)}\right)={\sum }_{t=1}^{T}\left\{\mathrm{log}p\left({{\varvec{u}}}_{{\varvec{t}}}|{\varvec{\theta}},\boldsymbol{ }{{\varvec{z}}}_{{\varvec{t}}-1}\right)+\mathrm{log}p\left({{\varvec{r}}}_{{\varvec{t}}}|{{\varvec{u}}}_{{\varvec{t}}},{\varvec{\theta}},\boldsymbol{ }{{\varvec{z}}}_{{\varvec{t}}-1}\right)\right\}.\end{array}$$

(A1)

However, researchers can ignore $\mathrm{log}p\left({{\varvec{r}}}_{\left({\varvec{T}}\right)}|{{\varvec{u}}}_{\left({\varvec{T}}\right)},{\varvec{\theta}},\boldsymbol{ }{{\varvec{z}}}_{\left({\varvec{T}}\right)}\right)$ during the calculation of the derivative and estimation because $\mathrm{log}p\left({{\varvec{r}}}_{\left({\varvec{T}}\right)}|{{\varvec{u}}}_{\left({\varvec{T}}\right)},{\varvec{\theta}},\boldsymbol{ }{{\varvec{z}}}_{\left({\varvec{T}}\right)}\right)$ is independent of the participant parameter ${\varvec{\theta}}$. That is, it holds that

$$\begin{array}{c}\frac{\partial }{\partial{\varvec{\theta}}}\mathrm{log}p\left({{\varvec{r}}}_{\left({\varvec{T}}\right)},\boldsymbol{ }{{\varvec{u}}}_{\left({\varvec{T}}\right)}|{\varvec{\theta}},\boldsymbol{ }{{\varvec{z}}}_{\left({\varvec{T}}\right)}\right)=\frac{\partial }{\partial{\varvec{\theta}}}\mathrm{log}p\left({{\varvec{u}}}_{\left({\varvec{T}}\right)}|{\varvec{\theta}},\boldsymbol{ }{{\varvec{z}}}_{\left({\varvec{T}}\right)}\right).\end{array}$$

(A2)

Although the data-generating process of reward ${r}_{t,k}$, $\mathrm{log}p\left({{\varvec{r}}}_{\left({\varvec{T}}\right)}|{{\varvec{u}}}_{\left({\varvec{T}}\right)},{\varvec{\theta}},\boldsymbol{ }{{\varvec{z}}}_{\left({\varvec{T}}\right)}\right)$, has often been ignored, it is modeled herein to conduct CAT (Supplement C). Considering the equations, we can derive Eqs. (17) to (20) as in the “Deriving the Fisher Information Matrix Using Obtained Observations” section.
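Equation (A2) can be illustrated numerically. The following is a minimal Python sketch, not the authors' R code, and it rests on assumptions that do not appear in this appendix: two options coded 1 and 2, binary rewards, initial values of zero, a softmax choice rule with inverse temperature `beta`, and reward probabilities `reward_probs` that are fixed by the experimenter. All function and variable names are hypothetical. Because the reward term adds a constant that does not involve the participant parameters, the finite-difference score with respect to the learning rate should be the same with and without it.

```python
import math


def q_learning_loglik(alpha, beta, choices, rewards, reward_probs=None):
    """Log-likelihood of a two-option Q-learning model (illustrative only).

    choices: list of 1 or 2 (assumed coding of u_t).
    rewards: list of 0 or 1 for the chosen option (assumed binary rewards).
    reward_probs: optional list of (RP1, RP2) per trial; when given, the
        term log p(r_t | u_t) is added, which does not depend on alpha or beta.
    """
    q = [0.0, 0.0]  # assumed initial values
    loglik = 0.0
    for t, (u, r) in enumerate(zip(choices, rewards)):
        # Softmax choice probability for option 1 (assumed choice rule).
        p1 = 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))
        loglik += math.log(p1 if u == 1 else 1.0 - p1)
        if reward_probs is not None:
            rp = reward_probs[t][u - 1]
            loglik += math.log(rp if r == 1 else 1.0 - rp)
        # Update only the chosen option.
        q[u - 1] += alpha * (r - q[u - 1])
    return loglik


def score_alpha(alpha, beta, choices, rewards, reward_probs=None, eps=1e-6):
    """Central finite-difference derivative of the log-likelihood in alpha."""
    up = q_learning_loglik(alpha + eps, beta, choices, rewards, reward_probs)
    down = q_learning_loglik(alpha - eps, beta, choices, rewards, reward_probs)
    return (up - down) / (2.0 * eps)


# Made-up data for illustration only.
choices = [1, 1, 2, 1, 2, 2, 1, 1]
rewards = [1, 0, 0, 1, 1, 0, 1, 1]
reward_probs = [(0.7, 0.3)] * len(choices)

score_joint = score_alpha(0.3, 2.0, choices, rewards, reward_probs)
score_choice_only = score_alpha(0.3, 2.0, choices, rewards)
```

Under these assumptions `score_joint` and `score_choice_only` agree up to floating-point error, while the two log-likelihood values themselves differ by the constant reward term.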

The expectations ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right)\left(\frac{\partial }{\partial \alpha }{Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)\right]$, ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right){Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]$, and ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{Q}_{t,k}\left({\varvec{\theta}}\right){Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]$ are required to calculate Eqs. (18)–(20). From Eq. (16), $\left(\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right)\left(\frac{\partial }{\partial \alpha }{Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)$ is given by

$$\begin{array}{c}\left(\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right)\left(\frac{\partial }{\partial \alpha }{Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)= \\ \left(1-\mathrm{I}\left(k\right)\alpha \right)\left(1-\mathrm{I}\left({k}^{^{\prime}}\right)\alpha \right)\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,k}\left({\varvec{\theta}}\right)\right)\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)\right]+\\ \left(1-\mathrm{I}\left(k\right)\alpha \right)\mathrm{I}\left({k}^{^{\prime}}\right)\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,k}\left({\varvec{\theta}}\right)\right)\right]{\Delta }_{t-1,{k}^{^{\prime}}}+ \\ \left(1-\mathrm{I}\left({k}^{^{\prime}}\right)\alpha \right)\mathrm{I}\left(k\right)\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)\right]{\Delta }_{t-1,k}+ \\ \mathrm{I}\left(k\right)\mathrm{I}\left({k}^{^{\prime}}\right){\Delta }_{t-1,k}{\Delta }_{t-1,{k}^{^{\prime}}} ,\end{array}$$

(A3)

where ${\Delta }_{t,k}={r}_{t,k}-{Q}_{t,k}\left({\varvec{\theta}}\right)$ and $\mathrm{I}\left(k\right)={\mathrm{I}}_{\mathrm{A}}\left(k-1+{u}_{t-1}\right)$. The factor $\left(1-\mathrm{I}\left(k\right)\alpha \right)\left(1-\mathrm{I}\left({k}^{^{\prime}}\right)\alpha \right)$ depends only on ${u}_{t-1}$, whereas, by the update equation, $\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,k}\left({\varvec{\theta}}\right)\right)\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)$ depends on ${{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{({\varvec{t}}-2)}$. The same reasoning applies to the other terms. Therefore, we can calculate the expectations separately as follows:

$$\begin{array}{c}{\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right)\left(\frac{\partial }{\partial \alpha }{Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)\right]= \\ {\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\left(1-\mathrm{I}\left(k\right)\alpha \right)\left(1-\mathrm{I}\left({k}^{^{\prime}}\right)\alpha \right)\right] {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,k}\left({\varvec{\theta}}\right)\right)\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)\right]+\\ {\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\left(1-\mathrm{I}\left(k\right)\alpha \right)\mathrm{I}\left({k}^{^{\prime}}\right)\right] {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,k}\left({\varvec{\theta}}\right)\right)\right]{\Delta }_{t-1,{k}^{^{\prime}}}\right]+ \\ {\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\left(1-\mathrm{I}\left({k}^{^{\prime}}\right)\alpha \right)\mathrm{I}\left(k\right)\right] {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)\right]{\Delta }_{t-1,k}\right]+ \\ {\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\mathrm{I}\left(k\right)\mathrm{I}\left({k}^{^{\prime}}\right)\right] {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[{\Delta }_{t-1,k}{\Delta }_{t-1,{k}^{^{\prime}}}\right],\end{array}$$

(A4)

where

$$\begin{array}{c}{\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\left(1-\mathrm{I}\left(k\right)\alpha \right)\left(1-\mathrm{I}\left({k}^{^{\prime}}\right)\alpha \right)\right]={\delta }_{k{k}^{^{\prime}}}\left\{\mathrm{E}\left[\mathrm{I}\left(k\right)\right]{\left(1-\alpha \right)}^{2}+\left(1-\mathrm{E}\left[\mathrm{I}\left(k\right)\right]\right)\right\}+\left(1-{\delta }_{k{k}^{^{\prime}}}\right)\left(1-\alpha \right),\\ {\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\left(1-\mathrm{I}\left(k\right)\alpha \right)\mathrm{I}\left({k}^{^{\prime}}\right)\right]=\left(1-{\delta }_{k{k}^{^{\prime}}}\alpha \right)\mathrm{E}\left[\mathrm{I}\left({k}^{^{\prime}}\right)\right],\\ {\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\mathrm{I}\left(k\right)\mathrm{I}\left({k}^{^{\prime}}\right)\right]={\delta }_{k{k}^{^{\prime}}}\mathrm{E}\left[\mathrm{I}\left(k\right)\right].\end{array}$$

(A5)

Note that the Kronecker delta, ${\delta }_{k{k}^{^{\prime}}}$, and $\mathrm{E}\left[\mathrm{I}\left(k\right)\right]$ are given by

$$\begin{array}{c}{\delta }_{k{k}^{^{\prime}}}=\left\{\begin{array}{c}1 , if \: k={k}^{^{\prime}}\\ 0 , if \: k\ne {k}^{^{\prime}}\end{array}\right.\\ \mathrm{E}\left[\mathrm{I}\left(k\right)\right]=\left\{\begin{array}{c}p\left({u}_{t-1}=1|{\varvec{\theta}}\right), if \: k=1\\ 1-p\left({u}_{t-1}=1|{\varvec{\theta}}\right), if \: k=2\end{array}\right.,\end{array}$$

(A6)

respectively. For example, ${\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\left(1-\mathrm{I}\left(k\right)\alpha \right)\mathrm{I}\left({k}^{^{\prime}}\right)\right]$ is $p\left({u}_{t-1}=1|{\varvec{\theta}}\right)\left(1-\alpha \right)$ if $k={k}^{^{\prime}}=1$, $1-p\left({u}_{t-1}=1|{\varvec{\theta}}\right)$ if $k=1, {k}^{^{\prime}}=2$, $p\left({u}_{t-1}=1|{\varvec{\theta}}\right)$ if $k=2, {k}^{^{\prime}}=1$, and $\left(1-p\left({u}_{t-1}=1|{\varvec{\theta}}\right)\right)\left(1-\alpha \right)$ if $k={k}^{^{\prime}}=2$. Therefore, ${\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\left(1-\mathrm{I}\left(k\right)\alpha \right)\mathrm{I}\left({k}^{^{\prime}}\right)\right]$ can be calculated using Eq. (A5). In the same way, ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right){Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]$ and ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{Q}_{t,k}\left({\varvec{\theta}}\right){Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]$ are calculated as follows:

$$\begin{array}{c}{\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right){Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]= \\ {\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[1-\mathrm{I}\left(k\right)\alpha \right] {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,k}\left({\varvec{\theta}}\right)\right){Q}_{t-1,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]+ \\ {\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\left(1-\mathrm{I}\left(k\right)\alpha \right)\mathrm{I}\left({k}^{^{\prime}}\right)\right]\alpha {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,k}\left({\varvec{\theta}}\right)\right){\Delta }_{t-1,{k}^{^{\prime}}}\right]+\\ \mathrm{E}\left[\mathrm{I}\left(k\right)\right] {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[{\Delta }_{t-1,k}{Q}_{t-1,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]+ \\ {\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\mathrm{I}\left(k\right)\mathrm{I}\left({k}^{^{\prime}}\right)\right]\alpha {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[{\Delta }_{t-1,k}{\Delta }_{t-1,{k}^{^{\prime}}}\right],\end{array}$$

(A7)

$$\begin{array}{c}{\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{Q}_{t,k}\left({\varvec{\theta}}\right){Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]= \\ {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[{Q}_{t-1,k}\left({\varvec{\theta}}\right){Q}_{t-1,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]+ \\ E\left[\mathrm{I}\left({k}^{^{\prime}}\right)\right]\alpha {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[{Q}_{t-1,k}\left({\varvec{\theta}}\right){\Delta }_{t-1,{k}^{^{\prime}}}\right]+ \\ E\left[\mathrm{I}\left(k\right)\right]\alpha {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[{Q}_{t-1,{k}^{^{\prime}}}\left({\varvec{\theta}}\right){\Delta }_{t-1,k}\right]+ \\ {\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\mathrm{I}\left(k\right)\mathrm{I}\left({k}^{^{\prime}}\right)\right]{\alpha }^{2} {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[{\Delta }_{t-1,k}{\Delta }_{t-1,{k}^{^{\prime}}}\right],\end{array}$$

(A8)

where ${\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[1-\mathrm{I}\left(k\right)\alpha \right]=\left(1-\mathrm{E}\left[\mathrm{I}\left(k\right)\right]\alpha \right)$.
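The choice-only expectations in Eq. (A5) can be checked by brute-force enumeration over the two possible values of the previous choice. The Python sketch below is illustrative and makes assumptions beyond the text: options are coded 1 and 2, `indicator(k, u_prev)` equals 1 exactly when option `k` was chosen on the previous trial, and `p1` stands for $p\left({u}_{t-1}=1|{\varvec{\theta}}\right)$. All names are hypothetical. The second closed form is written with the expectation of the indicator for ${k}^{^{\prime}}$, which is the form that reproduces the four cases listed in the example following Eq. (A6).

```python
def indicator(k, u_prev):
    """Assumed meaning of I(k): 1 if option k was chosen on the previous trial."""
    return 1.0 if u_prev == k else 0.0


def expect_over_choice(f, p1):
    """Exact expectation of f(u_prev) when u_prev is 1 w.p. p1 and 2 otherwise."""
    return p1 * f(1) + (1.0 - p1) * f(2)


def e_indicator(k, p1):
    """E[I(k)] as in Eq. (A6)."""
    return p1 if k == 1 else 1.0 - p1


def kronecker(k, kp):
    return 1.0 if k == kp else 0.0


def a5_first(k, kp, alpha, p1):
    d, e = kronecker(k, kp), e_indicator(k, p1)
    return d * (e * (1.0 - alpha) ** 2 + (1.0 - e)) + (1.0 - d) * (1.0 - alpha)


def a5_second(k, kp, alpha, p1):
    # Uses E[I(k')], matching the worked cases after Eq. (A6).
    return (1.0 - kronecker(k, kp) * alpha) * e_indicator(kp, p1)


def a5_third(k, kp, p1):
    return kronecker(k, kp) * e_indicator(k, p1)


def max_abs_error(alpha=0.3, p1=0.65):
    """Largest gap between the closed forms and direct enumeration."""
    worst = 0.0
    for k in (1, 2):
        for kp in (1, 2):
            brute_first = expect_over_choice(
                lambda u: (1 - indicator(k, u) * alpha) * (1 - indicator(kp, u) * alpha), p1)
            brute_second = expect_over_choice(
                lambda u: (1 - indicator(k, u) * alpha) * indicator(kp, u), p1)
            brute_third = expect_over_choice(
                lambda u: indicator(k, u) * indicator(kp, u), p1)
            worst = max(
                worst,
                abs(brute_first - a5_first(k, kp, alpha, p1)),
                abs(brute_second - a5_second(k, kp, alpha, p1)),
                abs(brute_third - a5_third(k, kp, p1)),
            )
    return worst


worst_gap = max_abs_error()
```

Calling `max_abs_error` with other values of `alpha` and `p1` gives a quick way to confirm that the closed forms hold for any choice probability, not only for the defaults used here.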

Further, ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right]$, ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{Q}_{t,k}\left({\varvec{\theta}}\right)\right]$, ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{r}_{t, k}\right]$, and ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{r}_{t, k}^{2}\right]$ are required to calculate Eqs. (A4), (A7), and (A8). From the update equations, Eqs. (3) and (16), ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{Q}_{t,k}\left({\varvec{\theta}}\right)\right]$ and ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right]$ are given by

$$\begin{array}{c}{\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{Q}_{t,k}\left({\varvec{\theta}}\right)\right]={\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-2\right)}\right)}\left[{Q}_{t-1,k}\left({\varvec{\theta}}\right)\right]+\mathrm{E}\left[\mathrm{I}\left(k\right)\right]\alpha {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[{\Delta }_{t-1,k}\right],\\ \begin{array}{c}{\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right]= {\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[1-\mathrm{I}\left(k\right)\alpha \right] {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-2\right)}\right)}\left[\frac{\partial }{\partial \alpha }{Q}_{t-1,k}\left({\varvec{\theta}}\right)\right]+\\ \mathrm{E}\left[\mathrm{I}\left(k\right)\right] {\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{\left({\varvec{t}}-1\right)}\right)}\left[{\Delta }_{t-1,k}\right],\end{array}\end{array}$$

(A9)

respectively. Considering the data-generating process of ${r}_{t,k}$, Eq. (6), ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{r}_{t, k}\right]$ and ${\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{r}_{t, k}^{2}\right]$ are given by

$$\begin{array}{c}{\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{r}_{t, k}\right]={Re}_{t,1}p\left({u}_{t-1}=1|{\varvec{\theta}}\right)\left(2-k\right){RP}_{t,1}+{Re}_{t,2}\left(1-p\left({u}_{t-1}=1|{\varvec{\theta}}\right)\right)\left(k-1\right){RP}_{t,2},\\ \begin{array}{c}{\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{r}_{t, k}^{2}\right]={Re}_{t,1}^{2}p\left({u}_{t-1}=1|{\varvec{\theta}}\right)\left(2-k\right)R{P}_{t,1}+\\ {Re}_{t,2}^{2}\left(1-p\left({u}_{t-1}=1|{\varvec{\theta}}\right)\right)\left(k-1\right){RP}_{t,2}, \end{array}\end{array}$$

(A10)

respectively. Our R code is based on the above equations.
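As a sanity check on the per-trial recursions that Eqs. (A3) and (A9) build on, the following Python sketch (not the authors' R code) propagates the action values and their derivatives with respect to the learning rate along one fixed sequence of choices and rewards. It assumes, as inferred from Eqs. (A3), (A8), and (A9) rather than quoted from Eqs. (3) and (16), that the chosen option is updated as $Q_{t,k}=Q_{t-1,k}+\alpha \Delta_{t-1,k}$ and its derivative as $(1-\alpha)\frac{\partial}{\partial\alpha}Q_{t-1,k}+\Delta_{t-1,k}$, while the unchosen option is left unchanged. Initial values of zero, the 1/2 coding of options, and all names are hypothetical.

```python
def run_q(alpha, choices, rewards, q0=0.0):
    """Return final Q values and dQ/dalpha for a fixed choice/reward sequence."""
    q = [q0, q0]      # action values for options 1 and 2 (assumed start)
    dq = [0.0, 0.0]   # derivatives of the action values w.r.t. alpha
    for u, r in zip(choices, rewards):
        k = u - 1
        delta = r - q[k]                        # prediction error before the update
        dq[k] = (1.0 - alpha) * dq[k] + delta   # derivative recursion, chosen option
        q[k] = q[k] + alpha * delta             # value update, chosen option
    return q, dq


def finite_difference_dq(alpha, choices, rewards, eps=1e-6):
    """Central finite-difference approximation of dQ/dalpha for both options."""
    q_up, _ = run_q(alpha + eps, choices, rewards)
    q_down, _ = run_q(alpha - eps, choices, rewards)
    return [(a - b) / (2.0 * eps) for a, b in zip(q_up, q_down)]


# Made-up data for illustration only.
choices = [1, 2, 1, 1, 2, 1, 2, 2, 1, 1]
rewards = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]

q_final, dq_analytic = run_q(0.3, choices, rewards)
dq_numeric = finite_difference_dq(0.3, choices, rewards)
max_gap = max(abs(a - n) for a, n in zip(dq_analytic, dq_numeric))
```

With choices and rewards held fixed, each action value is a polynomial in the learning rate, so the analytic and numerical derivatives should agree closely. The expectations in this appendix additionally average over the random choices and rewards, which this sketch does not attempt.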

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fujita, K., Okada, K. & Katahira, K. Stimulus Selection in a Q-learning Model Using Fisher Information and Monte Carlo Simulation. Comput Brain Behav 6, 262–279 (2023). https://doi.org/10.1007/s42113-022-00163-0

Accepted: 12 December 2022
Published: 30 January 2023
Issue Date: June 2023
DOI: https://doi.org/10.1007/s42113-022-00163-0
