differences between: value, utility, reward and cost


Dailos Guerra Ramos

Nov 11, 2011, 6:21:32 AM11/11/11
to stanford...@googlegroups.com
Hi All,
I'm struggling with the difference between the concepts of state value V(s), state utility U(s), cost, and reward.
I can distinguish the policy view, where each state has an action associated with it, from the other view, where each state has a number associated with it: is this number the value, the utility, the cost, or the reward of the state? What is the difference between them?
What does a Markov Decision Process represent in this framework?
Thank you for your help.

David Weiseth

Nov 11, 2011, 9:51:16 AM11/11/11
to stanford...@googlegroups.com
Programmatically, the agent works backwards from the goal to build a map of potential net value (rewards minus costs).

The number attached to each square represents how hard it will be to reach the goal from there: the potential reward is reduced by the sum of the costs incurred on the way from that square to the goal. Working backwards (recursively) from the goal is how a program would compute this.

By knowing how hard it will be to get from a given square to the goal state, the agent can make the best decision about which path to take to maximize its chance of collecting the goal reward.

The agent is never assured of knowing the future or exactly how much reward it will receive, because the action phase is stochastic: we are not guaranteed to get the outcome we desire at each action step.

So the number is the potential reward; I say "potential" because the environment is stochastic. That reward is adjusted for the cost of getting from that square to the reward/goal state. The goal reward, and the costs that reduce it along the way, are part of the problem's givens.

The number in a square need not represent the actual overall cost the agent incurs during real execution; it only needs to allow a comparison of actions, so the agent can create the policy that represents the best decision about which action to take in each state.

This is my understanding; hope it helps. Someone else might have a more succinct or technically correct answer.
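To make the "working backwards" idea concrete, here is a minimal sketch of value iteration on a toy corridor. Everything here (the 4-state corridor, the step cost of 0.1, the goal reward of 1) is my own illustrative example, not something from the class; the transitions are deterministic to keep it short, whereas the real setting is stochastic.

```python
# Minimal value-iteration sketch on a toy 4-state corridor MDP.
# States 0..3; state 3 is the goal (reward +1); every step costs 0.1.
# All names and numbers here are illustrative, not from the class.

GOAL, STEP_COST, GAMMA = 3, -0.1, 1.0
states = range(4)
actions = {"left": -1, "right": +1}

def step(s, a):
    """Deterministic transition: move one square, clamped to the corridor."""
    return min(max(s + actions[a], 0), 3)

# Work backwards from the goal: repeatedly back up each state's value
# from its successors until the values stop changing.
V = [0.0] * 4
for _ in range(50):
    new_V = list(V)
    for s in states:
        if s == GOAL:
            new_V[s] = 1.0   # goal reward, no further cost to pay
            continue
        new_V[s] = max(STEP_COST + GAMMA * V[step(s, a)] for a in actions)
    if new_V == V:
        break
    V = new_V
```

After convergence the values decrease with distance from the goal (roughly 0.7, 0.8, 0.9, 1.0), which is exactly the "how hard is it to get to the goal from here" map described above.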



You received this message because you are subscribed to the Google Groups "Stanford AI Class" group.
To post to this group, send email to stanford...@googlegroups.com.
To unsubscribe from this group, send email to stanford-ai-cl...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/stanford-ai-class?hl=en.

Dailos Guerra Ramos

Nov 14, 2011, 5:33:00 AM11/14/11
to stanford...@googlegroups.com
Great explanation, much clearer now.

David Weiseth

Nov 14, 2011, 9:25:13 AM11/14/11
to stanford...@googlegroups.com
You are welcome.

Utility is the quantity we use to make the decision: it sums the Rewards/Costs/Values over the states on the way to the goal (keeping in mind that the process is stochastic, so probabilities are involved).

Each state has a Value/Reward/Cost: R(s).

Cost is a negative Reward.

Just also remember: the potential Reward of a particular action in a state is only a projection, used so we can compare actions and build the Policy. The Policy is what the agent actually uses to execute.

The recursive nature of the program reflects the fact that this is a rooted-tree problem: each state's Utility depends on the Utilities of the states closest to the goal, so it makes sense to work from the root out to the branches, as you would go from the hub to the spokes of a wheel.

If the problem did not have this hub-and-spoke arrangement from the agent's current state to the goal state, this would not work efficiently; with multiple goal states mixed up around the environment, this algorithm does not heuristically give us the optimum speed for creating the Policy.
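The "compare actions to build the Policy" step can be sketched in a few lines: once utilities are known, the policy just picks, in each state, the action with the highest probability-weighted utility. The function names, the transition model `T`, and the toy numbers below are all my own illustration of this idea, not code from the class.

```python
# Sketch of extracting a policy from utilities in a stochastic MDP.
# T(s, a) returns a list of (probability, next_state) pairs; U maps
# states to already-computed utilities. All names are illustrative.

def expected_utility(s, a, U, T):
    """Probability-weighted utility of taking action a in state s."""
    return sum(p * U[s2] for p, s2 in T(s, a))

def best_action(s, actions, U, T):
    """The policy picks the action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(s, a, U, T))

# Toy example: in state 0, "safe" surely reaches state 1 (U=0.5), while
# "risky" reaches state 2 (U=1.0) with prob 0.4, else state 3 (U=0.0).
U = {1: 0.5, 2: 1.0, 3: 0.0}
def T(s, a):
    return [(1.0, 1)] if a == "safe" else [(0.4, 2), (0.6, 3)]

choice = best_action(0, ["safe", "risky"], U, T)
```

Here "risky" has expected utility 0.4 and "safe" has 0.5, so the policy chooses "safe": this is the sense in which the numbers only need to support a comparison of actions, not predict the actual outcome.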

David Weiseth

Nov 14, 2011, 9:57:06 AM11/14/11
to stanford...@googlegroups.com
One correction to what I wrote:

Reward is like Cost and is a matter of just that one state.

Value I grouped incorrectly: it belongs with Utility, meaning you need to take into account the Rewards along the path to the goal state when calculating it, and also when determining the action sequence that leads to the maximum-reward goal state.
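The distinction can be shown on a tiny chain of states: Reward is local to one state, while Utility/Value accumulates the rewards along the path to the goal. The chain and the numbers below are my own toy illustration (deterministic moves, no discounting), not from the class.

```python
# Reward is local; utility sums the rewards along the path to the goal.
# Toy 3-state chain: s0 -> s1 -> s2 (goal). Illustrative numbers only.
R = {0: -0.1, 1: -0.1, 2: 1.0}   # one-step reward of being in each state
U = {2: R[2]}                     # at the goal, utility equals reward
U[1] = R[1] + U[2]                # deterministic chain, gamma = 1
U[0] = R[0] + U[1]
```

Note that R(s0) is negative (it is just that state's cost), while U(s0) is positive (about 0.8), because the utility also counts the goal reward waiting at the end of the path.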

Sorry for that mistake.  --David