In my current setup, the target is either always good or always bad -- the agent simply doesn't know which. The agent wants to capture a good target, which moves randomly, for a positive reward, and wants to avoid being caught by a bad target, which moves towards the agent, because getting caught incurs a negative reward.
Since the target knows its own behavior, its movement always follows the one model that corresponds to that behavior (good or bad), and the only way the agent can update its belief about the target's behavior is by observing which way the target moves at each step.
For example, if the target moves closer to the agent, the agent may lean towards thinking the target is "bad", and vice versa. This is where I thought a Beta distribution could be used, by incrementing the appropriate pseudocount after each observation.
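Here is a minimal sketch of that pseudocount idea, just to make it concrete. The helper name `update_belief`, the distance-based "moved closer => evidence for bad" rule, and the uniform Beta(1, 1) prior are my own illustrative assumptions, not anything fixed by the setup:

```python
# Sketch: track belief that the target is "bad" with Beta pseudocounts.
# alpha counts moves consistent with a bad (pursuing) target,
# beta counts moves consistent with a good (randomly moving) target.

def update_belief(alpha, beta, dist_before, dist_after):
    """Increment the appropriate pseudocount from one observed target move."""
    if dist_after < dist_before:   # target moved closer -> looks bad
        alpha += 1
    else:                          # target moved away or held distance -> looks good
        beta += 1
    return alpha, beta

alpha, beta = 1.0, 1.0             # uniform prior over good/bad
alpha, beta = update_belief(alpha, beta, dist_before=5.0, dist_after=4.0)
p_bad = alpha / (alpha + beta)     # Beta mean as the current belief, here 2/3
print(p_bad)
```

The Beta mean `alpha / (alpha + beta)` would then serve as the agent's running estimate of the probability that the target is bad.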
So I guess it's neither, since at ALL time steps the target is either good or bad ...?