Hi team, this is my first time posting -- any advice you are able to offer is greatly appreciated!
I'm working on custom Stan and R code for a bandit task, based on a few hBayesDM .stan files. I'm fitting a (model-free) temporal-difference RL model to task behavior. On each trial of the task, the agent chooses between 2 bandits drawn from a larger set (let's say N = 4 bandits), and I'm trying different approaches to coding a softmax choice function over the 2 options presented on each trial. My goal is to code the choice function so that it takes only the two options presented on each trial as inputs and leaves the bandits that are not shown out of the equation. In essence, I do not want the values of the options that are not shown on a given trial to influence the softmax choice probabilities for the options that are shown.
At present, I'm using a "choice[i, t] ~ categorical_logit( Q-values * tau[i])" line in the Model section, and correspondingly a "choice[i, t] ~ categorical_logit_lpmf( Q-values * tau[i] )" line in the Generated Quantities section, where choice[i, t] (int<lower=1, upper=4>) is the bandit chosen by participant [i] on trial [t], Q-values is an N = 4 component vector of learned action values derived from the TDRL model, and tau is a temperature parameter (strictly an inverse temperature, since it multiplies the Q-values). However, I do not want the categorical_logit model to use all 4 components of the Q-values vector, but only the Q-values of the 2 options presented on each trial (which will be some random pairing of the 4 possible bandits).
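To make the goal concrete, here is a minimal sketch of how a pair-only softmax might look, not my actual model. It assumes two hypothetical data arrays that are not in my current files: shown[i, t] holds the two bandit indices presented on the trial, and pick[i, t] is the choice recoded as a position within that pair (1 or 2). The update is a placeholder delta rule, the priors are arbitrary, and the declarations use the newer array[...] syntax.

```stan
data {
  int<lower=1> N;                                // participants
  int<lower=1> T;                                // trials per participant
  int<lower=2> K;                                // bandits in the full set (4 here)
  array[N, T, 2] int<lower=1, upper=K> shown;    // hypothetical: the two bandits presented
  array[N, T] int<lower=1, upper=2> pick;        // hypothetical: position of the choice within shown
  array[N, T] real outcome;                      // reward received on each trial
}
parameters {
  vector<lower=0, upper=1>[N] alpha;             // learning rate
  vector<lower=0>[N] tau;                        // inverse temperature
}
model {
  alpha ~ beta(1, 1);                            // placeholder priors
  tau ~ normal(0, 5);
  for (i in 1:N) {
    vector[K] Q = rep_vector(0, K);              // values for all K bandits
    for (t in 1:T) {
      int c = shown[i, t, pick[i, t]];           // bandit index actually chosen
      // Multi-indexing Q with the 2-element array yields a 2-vector,
      // so the softmax never sees the bandits that were not shown.
      pick[i, t] ~ categorical_logit(tau[i] * Q[shown[i, t]]);
      Q[c] += alpha[i] * (outcome[i, t] - Q[c]); // placeholder delta-rule update
    }
  }
}
```

In this sketch the 4-component vector is still used to carry learned values from trial to trial; only the choice rule is restricted to the presented pair.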
Is the categorical_logit model the best route for this setup? Since the choices are coded as integers 1:4, I've been using a 4-component vector to store the Q-values. Would it be more sensible to re-code the choices into something that works with a bernoulli_logit model (i.e., coding a choice of the left option as choice = 1 and a choice of the right option as choice = 0)? I think that finding a way to escape the categorical labels (i.e., integer bandit indexing) would remove the need for the 4-component vector in the choice rule, and I could instead use a 2-component vector to calculate a softmax choice between the 2 options on each trial. Could someone check my thinking on this?
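For reference, a softmax over two options reduces to a logistic function of the scaled value difference, so the bernoulli_logit version might look roughly like the sketch below. The data names opt_left, opt_right and chose_left are hypothetical, and the update is again a placeholder delta rule with arbitrary priors rather than my full TD model.

```stan
data {
  int<lower=1> N;                                  // participants
  int<lower=1> T;                                  // trials per participant
  int<lower=2> K;                                  // bandits in the full set (4 here)
  array[N, T] int<lower=1, upper=K> opt_left;      // hypothetical: bandit shown on the left
  array[N, T] int<lower=1, upper=K> opt_right;     // hypothetical: bandit shown on the right
  array[N, T] int<lower=0, upper=1> chose_left;    // hypothetical: 1 = left chosen, 0 = right chosen
  array[N, T] real outcome;                        // reward received on each trial
}
parameters {
  vector<lower=0, upper=1>[N] alpha;               // learning rate
  vector<lower=0>[N] tau;                          // inverse temperature
}
model {
  alpha ~ beta(1, 1);                              // placeholder priors
  tau ~ normal(0, 5);
  for (i in 1:N) {
    vector[K] Q = rep_vector(0, K);                // values for all K bandits
    for (t in 1:T) {
      // Map the 0/1 choice back to the bandit index for the value update.
      int c = chose_left[i, t] ? opt_left[i, t] : opt_right[i, t];
      // P(left) = inv_logit(tau * (Q_left - Q_right)), i.e. a two-option softmax.
      chose_left[i, t] ~ bernoulli_logit(tau[i] * (Q[opt_left[i, t]] - Q[opt_right[i, t]]));
      Q[c] += alpha[i] * (outcome[i, t] - Q[c]);   // placeholder delta-rule update
    }
  }
}
```

If this is right, the pointwise log-likelihood in Generated Quantities would use the function-call form with a vertical bar, e.g. bernoulli_logit_lpmf(chose_left[i, t] | ...), rather than a ~ statement.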
Thanks for reading!
Paul Sands