In the version of the BART that I am using, reward is highly correlated with explosion probability: the reward you can gain increases as the explosion probability increases (see Table 1 of this paper).
When adding reward to the bart_ewmv model, I originally used the "total value" of a balloon at a given pump number (i.e., [0.05, 0.15, 0.25, 0.55, 0.95, 1.45, 2.05, 2.75, 3.45, 4.25, 5.15]). I have now changed that to the "reward gained" with each pump, which is the difference between successive total values (i.e., [0.05, 0.10, 0.10, 0.30, 0.40, 0.50, 0.60, 0.70, 0.70, 0.80, 0.90]). However, it might also be the case that using decimal numbers for reward makes it difficult for the model to calculate u_pump. So, a colleague suggested rescaling the rewards to whole numbers, which for these values means multiplying by 100 (i.e., [5, 10, 10, 30, 40, 50, 60, 70, 70, 80, 90]).
These two changes seem to have solved my problems, as the models are now converging.
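For anyone who wants to reproduce the two reward transformations, here is a minimal sketch in Python. The variable names (total_value, reward_cents, reward_gained) are my own for illustration and are not part of the bart_ewmv model code. It assumes the per-pump reward is just the difference between successive total values, and it converts to integer cents first so the differences are not affected by floating-point rounding. Note that the whole-number list above corresponds to a scale factor of 100 (0.05 × 100 = 5).

```python
# Cumulative "total value" of the balloon at each pump number (from the post).
total_value = [0.05, 0.15, 0.25, 0.55, 0.95, 1.45, 2.05, 2.75, 3.45, 4.25, 5.15]

# Convert to integer cents first so the subtraction below is exact.
total_cents = [round(v * 100) for v in total_value]

# "Reward gained" with each pump: the difference between successive totals,
# with the first pump measured against a starting value of 0.
previous_cents = [0] + total_cents[:-1]
reward_cents = [now - before for before, now in zip(previous_cents, total_cents)]

# The same per-pump rewards on the original decimal scale.
reward_gained = [c / 100 for c in reward_cents]

print(reward_cents)
print(reward_gained)
```

Here reward_cents is the whole-number version and reward_gained is the decimal version; summing reward_cents recovers the final total value of the balloon in cents, which is a quick sanity check that no pump was skipped.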