Abstract
Reinforcement learning models have been extensively studied for decision-making tasks with reward feedback. However, in designing an experiment to collect data for Q-learning models, the quantitative effect of a presented stimulus on the estimation precision of participant parameters has generally not been considered. That is, the lack of a mathematical framework has prevented researchers from designing an optimal experiment. To tackle this problem, this study analytically derives the Fisher information. Furthermore, this study formulates a stochastic representation of the Q-learning model, which is one of the most commonly applied reinforcement learning models. With this derivation, a two-step procedure is proposed to select the optimal stimuli in terms of estimation precision, in which low-cost Fisher information evaluation and more detailed finite-sample Monte Carlo simulation are combined. The simulation studies show that reward probability reversal leads to a high estimation precision for the learning rate parameter. By contrast, for the inverse temperature parameter, a larger difference in reward probability between options leads to higher estimation precision. These results reveal that the optimal experimental design is dependent on which trait parameters of the Q-learning model are of interest to researchers. Further, it is found that the use of undesirable stimuli in terms of trait parameter precision leads to a large bias in the correlation coefficient estimate. Based on the results, the approaches to designing experiments in the Q-learning model are discussed.
Data Availability
The datasets generated and/or analyzed during the current study are available from the Open Science Framework repository at https://osf.io/msf5e/.
Code Availability
The R code used to produce the results in the five simulation studies is available from the Open Science Framework repository at https://osf.io/msf5e/.
Funding
This work was supported by grants from the Japan Society for the Promotion of Science (Grant Numbers 1920J22350, 18H03612).
Ethics declarations
Ethical Approval
Not applicable.
Consent to Publish
Not applicable.
Consent to Participate
Not applicable.
Competing Interests
The authors declare no competing interests.
Appendix 1
In this appendix, we derive the Fisher information matrix for CAT. As noted previously, the key here is to regard \({r}_{t,k}\) as a random variable. Then, the log-likelihood is given by
However, researchers can ignore \(\mathrm{log}p\left({{\varvec{r}}}_{\left({\varvec{T}}\right)}|{{\varvec{u}}}_{\left({\varvec{T}}\right)},{\varvec{\theta}},{{\varvec{z}}}_{\left({\varvec{T}}\right)}\right)\) when calculating the derivative and estimating the parameters because this term is independent of the participant parameter \({\varvec{\theta}}\). That is, it holds that
Although the data-generating process of the reward \({r}_{t,k}\), \(\mathrm{log}p\left({{\varvec{r}}}_{\left({\varvec{T}}\right)}|{{\varvec{u}}}_{\left({\varvec{T}}\right)},{\varvec{\theta}},{{\varvec{z}}}_{\left({\varvec{T}}\right)}\right)\), has often been ignored, it is modeled here to conduct CAT (Supplement C). Considering these equations, we can derive Eqs. (17) to (20) as in the “Deriving the Fisher Information Matrix Using Obtained Observations” section.
The expectations \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right)\left(\frac{\partial }{\partial \alpha }{Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)\right]\), \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right){Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]\), and \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{Q}_{t,k}\left({\varvec{\theta}}\right){Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]\) are required to calculate Eqs. (18)–(20). From Eq. (16), \(\left(\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right)\left(\frac{\partial }{\partial \alpha }{Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)\) is given by
where \({\Delta }_{t,k}={r}_{t,k}-{Q}_{t,k}\left({\varvec{\theta}}\right)\) and \(\mathrm{I}\left(k\right)={\mathrm{I}}_{\mathrm{A}}\left(k-1+{u}_{t-1}\right)\). The factor \(\left(1-\mathrm{I}\left(k\right)\alpha \right)\left(1-\mathrm{I}\left({k}^{^{\prime}}\right)\alpha \right)\) depends only on \({u}_{t-1}\), whereas \(\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,k}\left({\varvec{\theta}}\right)\right)\left(\frac{\partial }{\partial \alpha }{Q}_{t-1,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right)\) depends on \({{\varvec{u}}}_{\left({\varvec{t}}-2\right)},{{\varvec{r}}}_{({\varvec{t}}-2)}\) by the update equation. The other terms are treated in the same way. Therefore, we can calculate the expectations separately as follows:
where
Note that the Kronecker delta \({\delta }_{k{k}^{^{\prime}}}\) and \(\mathrm{E}\left[\mathrm{I}\left(k\right)\right]\) are given by
respectively. For example, \({\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\left(1-\mathrm{I}\left(k\right)\alpha \right)\mathrm{I}\left({k}^{^{\prime}}\right)\right]\) is \(p\left({u}_{t-1}=1|{\varvec{\theta}}\right)\left(1-\alpha \right)\) if \(k={k}^{^{\prime}}=1\), \(1-p\left({u}_{t-1}=1|{\varvec{\theta}}\right)\) if \(k=1, {k}^{^{\prime}}=2\), \(p\left({u}_{t-1}=1|{\varvec{\theta}}\right)\) if \(k=2, {k}^{^{\prime}}=1\), and \(\left(1-p\left({u}_{t-1}=1|{\varvec{\theta}}\right)\right)\left(1-\alpha \right)\) if \(k={k}^{^{\prime}}=2\). Therefore, \({\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[\left(1-\mathrm{I}\left(k\right)\alpha \right)\mathrm{I}\left({k}^{^{\prime}}\right)\right]\) can be calculated using Eq. (A5). In the same way, \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\left(\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right){Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]\) and \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{Q}_{t,k}\left({\varvec{\theta}}\right){Q}_{t,{k}^{^{\prime}}}\left({\varvec{\theta}}\right)\right]\) are calculated as follows:
where \({\mathrm{E}}_{{{\varvec{u}}}_{\left({\varvec{t}}-1\right)}}\left[1-\mathrm{I}\left(k\right)\alpha \right]=\left(1-\mathrm{E}\left[\mathrm{I}\left(k\right)\right]\alpha \right)\).
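The four cases of the worked example above can be checked by direct enumeration over \({u}_{t-1}\). The following is a minimal Python sketch, not the authors' R code. It assumes that \({u}_{t-1}\in \{0,1\}\) with \({u}_{t-1}=1\) meaning that option 1 was chosen and that \(\mathrm{A}=\{1\}\), a coding that is consistent with the four cases listed in the example; the function names are hypothetical.

```python
def indicator(k, u_prev):
    """I(k) = I_A(k - 1 + u_{t-1}), assuming A = {1} and u_{t-1} in {0, 1}.

    Under this assumed coding, I(1) = 1 when u_{t-1} = 1 (option 1 chosen)
    and I(2) = 1 when u_{t-1} = 0 (option 2 chosen).
    """
    return 1 if (k - 1 + u_prev) == 1 else 0


def expected_indicator(k, p1):
    """E[I(k)], where p1 = p(u_{t-1} = 1 | theta)."""
    return sum(prob * indicator(k, u)
               for u, prob in ((1, p1), (0, 1.0 - p1)))


def expected_cross_term(k, k_prime, p1, alpha):
    """E_u[(1 - I(k) * alpha) * I(k')] by enumerating both values of u_{t-1}."""
    return sum(prob * (1.0 - indicator(k, u) * alpha) * indicator(k_prime, u)
               for u, prob in ((1, p1), (0, 1.0 - p1)))


if __name__ == "__main__":
    p1, alpha = 0.7, 0.3  # illustrative values only
    for k in (1, 2):
        for k_prime in (1, 2):
            print(k, k_prime, expected_cross_term(k, k_prime, p1, alpha))
```

Enumeration is exact here because the indicator takes only two values, so no sampling is needed for this term.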
Further, \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right]\), \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{Q}_{t,k}\left({\varvec{\theta}}\right)\right]\), \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{r}_{t, k}\right]\), and \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{r}_{t, k}^{2}\right]\) are required to calculate Eqs. (A4), (A7), and (A8). From the update equations, Eqs. (3) and (16), \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{Q}_{t,k}\left({\varvec{\theta}}\right)\right]\) and \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[\frac{\partial }{\partial \alpha }{Q}_{t,k}\left({\varvec{\theta}}\right)\right]\) are given by
respectively. Considering the data-generating process of \({r}_{t,k}\), Eq. (6), \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{r}_{t, k}\right]\) and \({\mathrm{E}}_{\left({{\varvec{u}}}_{\left({\varvec{t}}\right)},{{\varvec{r}}}_{\left({\varvec{t}}\right)}\right)}\left[{r}_{t, k}^{2}\right]\) are given by
respectively. Our R code is based on the above equations.
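To illustrate the recursion that these expectations are built on, the following Python sketch propagates \({Q}_{t,k}\) and \(\frac{\partial }{\partial \alpha }{Q}_{t,k}\) along one fixed sequence of choices and rewards. It is not the authors' R code. It assumes the standard Q-learning update \({Q}_{t,k}={Q}_{t-1,k}+\mathrm{I}\left(k\right)\alpha {\Delta }_{t-1,k}\), whose derivative \(\frac{\partial }{\partial \alpha }{Q}_{t,k}=\left(1-\mathrm{I}\left(k\right)\alpha \right)\frac{\partial }{\partial \alpha }{Q}_{t-1,k}+\mathrm{I}\left(k\right){\Delta }_{t-1,k}\) matches the terms named in this appendix. The initial value of zero, the 0-based option index, and the function name are assumptions made for illustration.

```python
def q_and_alpha_gradient(alpha, choices, rewards, n_options=2, q_init=0.0):
    """Return (Q, dQ/dalpha) after replaying a fixed choice/reward sequence.

    choices: 0-based index of the option chosen on each trial (assumed coding).
    rewards: reward observed for the chosen option on each trial.
    q_init:  assumed initial Q value for every option.
    """
    q = [q_init] * n_options
    dq = [0.0] * n_options  # Q at t = 0 does not depend on alpha
    for k, r in zip(choices, rewards):
        delta = r - q[k]  # prediction error, computed before the update
        # Chosen option: dQ <- (1 - alpha) * dQ + delta
        dq[k] = (1.0 - alpha) * dq[k] + delta
        # Chosen option: Q <- Q + alpha * delta
        q[k] = q[k] + alpha * delta
        # Unchosen options are left unchanged (I(k) = 0).
    return q, dq


if __name__ == "__main__":
    choices = [0, 1, 0, 0, 1]            # illustrative sequence only
    rewards = [1.0, 0.0, 1.0, 0.0, 1.0]  # illustrative rewards only
    q, dq = q_and_alpha_gradient(0.3, choices, rewards)
    print("Q:", q)
    print("dQ/dalpha:", dq)
```

Averaging such trajectories over simulated choices and rewards would give a Monte Carlo approximation of the expectations, whereas the appendix obtains them analytically by taking expectations of each factor.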
About this article
Cite this article
Fujita, K., Okada, K. & Katahira, K. Stimulus Selection in a Q-learning Model Using Fisher Information and Monte Carlo Simulation. Comput Brain Behav 6, 262–279 (2023). https://doi.org/10.1007/s42113-022-00163-0
DOI: https://doi.org/10.1007/s42113-022-00163-0