Coding softmax choice rule for a subset of bandits


Paul Sands

Sep 28, 2020, 2:50:00 PM
to hbayesdm-users
Hi team, this is my first time posting -- any advice you are able to offer is greatly appreciated!

I'm working on custom .stan and .R code for a bandit task, based on a few hBayesDM .stan files. I'm fitting a (model-free) temporal-difference RL model to task behavior. On each trial, the agent chooses between 2 bandits drawn from a larger set (say N = 4 bandits), and I'm trying different approaches to coding a softmax choice function over the 2 options presented on that trial. My goal is to code the choice function so that it incorporates only the two presented options as variables and leaves the unobserved bandits out of the equation. In essence, I do not want the values of the options that are not shown on a given trial to influence the softmax over the values of the options that are presented.

At present, I'm using a "choice[i, t] ~ categorical_logit( Q_values * tau[i] )" line in the model block, and correspondingly a "log_lik[i] += categorical_logit_lpmf( choice[i, t] | Q_values * tau[i] )" line in the generated quantities block (sampling statements with "~" are not allowed there), where choice[i, t] (int<lower=1, upper=4>) is the bandit chosen by participant [i] on trial [t], Q_values is an N = 4 component vector of learned action values derived from the TDRL model, and tau is an inverse-temperature parameter. However, I do not want the categorical_logit model to use all 4 entries of Q_values, but only the Q-values of the 2 options presented on each trial (which will be some random pairing of the 4 possible bandits).

Is the categorical_logit model the best route for this setup? Since the choices are coded as integers 1:4, I've been using a 4-component vector to store the Q-values. Would it be more sensible to re-code the choices to something that works with a bernoulli_logit model (e.g., the option on the left is choice = 1, and the right is choice = 0)? I think working out a way to escape the categorical labels (i.e., integer bandit indexing) would obviate the 4-component vector, and instead I could use a 2-component vector for calculating a softmax choice between the 2 options on each trial. Could someone check me on this thinking?
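For what it's worth, my understanding is that with only two options the softmax reduces algebraically to a logistic on the scaled value difference, which is exactly what a bernoulli_logit model would fit. A quick numeric check of that equivalence, with made-up Q-values (Python):

```python
import math

# Hypothetical Q-values for the two presented options, and an inverse temperature
q_left, q_right, tau = 0.8, 0.2, 1.5

# Two-option softmax probability of choosing the left option
p_soft = math.exp(tau * q_left) / (math.exp(tau * q_left) + math.exp(tau * q_right))

# Logistic (bernoulli_logit-style) probability on the value difference
p_bern = 1.0 / (1.0 + math.exp(-tau * (q_left - q_right)))

# The two are the same quantity, just parameterized differently
assert abs(p_soft - p_bern) < 1e-12
```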

Thanks for reading!
Paul Sands

Lei Zhang

Sep 29, 2020, 8:08:00 AM
to Paul Sands, hbayesdm-users
hi Paul,

I once encountered a similar situation; the idea is to create some helper variables. See below.

Take an example where options 2 and 3 are presented as stimuli and the participant chooses option 2. Then:

data {
...
  int<lower=1, upper=4> chosen[N, T];   // option 2 in this case
  int<lower=1, upper=4> unchosen[N, T]; // option 3 in this case
...
}

model {
...
  vector[4] Q_values;    // learned action values, updated by the TD rule
  vector[2] Q_presented; // values of only the two options shown on this trial

  Q_presented[1] = Q_values[chosen[i,t]];   // the 1/2 indices need to correspond to the button presses
  Q_presented[2] = Q_values[unchosen[i,t]];

  choice[i, t] ~ categorical_logit( tau[i] * Q_presented ); // choice here is recoded to 1/2
...  
}
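Numerically, this is just a softmax over the two presented Q-values. A quick sketch of the same indexing with made-up numbers (Python; the Q-values, tau, and trial layout are all hypothetical):

```python
import math

def softmax(qs, tau):
    """Softmax over only the presented options."""
    ez = [math.exp(tau * q) for q in qs]
    total = sum(ez)
    return [e / total for e in ez]

# Hypothetical learned Q-values for all 4 bandits
Q_values = [0.1, 0.7, 0.3, -0.2]
tau = 2.0

# Options 2 and 3 presented; option 2 chosen (1-based bandit indices, as in the Stan data block)
chosen, unchosen = 2, 3
Q_presented = [Q_values[chosen - 1], Q_values[unchosen - 1]]

p = softmax(Q_presented, tau)  # p[0] = probability of the chosen option
```

One caveat with ordering Q_presented as chosen/unchosen: the recoded choice is then always 1, which still gives a well-defined likelihood but loses the left/right mapping, so ordering by button press (as the comment in the code suggests) is usually cleaner.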

Hope that helps,
Lei


