In my current setup, the target is either always good or always bad -- the agent simply doesn't know which. The agent wants to capture a good target, which moves randomly, for a positive reward, and wants to avoid being caught by a bad target, which moves towards the agent, because getting caught incurs a negative reward.
Since the target knows its own behavior, its movement always follows the one model that corresponds to that behavior (good or bad), and the only way the agent can update its belief about the target's behavior is by observing which way the target moves at each step.
For example, if the target moves closer to the agent, the agent may lean towards thinking the target is "bad", and vice versa. This is where I thought a Beta distribution could be used, by incrementing the appropriate pseudocount after each observation.
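Here is a minimal sketch of that pseudocount idea, just to make it concrete. The helper name `update_belief`, the distance-based "moved closer => evidence for bad" rule, and the uniform Beta(1, 1) prior are my own illustrative assumptions, not anything fixed by the setup:

```python
# Sketch: track belief that the target is "bad" with Beta pseudocounts.
# alpha counts moves consistent with a bad (pursuing) target,
# beta counts moves consistent with a good (randomly moving) target.

def update_belief(alpha, beta, dist_before, dist_after):
    """Increment the appropriate pseudocount from one observed target move."""
    if dist_after < dist_before:   # target moved closer -> looks bad
        alpha += 1
    else:                          # target moved away or held distance -> looks good
        beta += 1
    return alpha, beta

alpha, beta = 1.0, 1.0             # uniform prior over good/bad
alpha, beta = update_belief(alpha, beta, dist_before=5.0, dist_after=4.0)
p_bad = alpha / (alpha + beta)     # Beta mean as the current belief, here 2/3
print(p_bad)
```

The Beta mean `alpha / (alpha + beta)` would then serve as the agent's running estimate of the probability that the target is bad.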
So I guess it's neither, since at ALL time steps the target is either good or bad ...?