454 views

Skip to first unread message

Apr 6, 2021, 5:04:31 AM4/6/21

to lcz...@googlegroups.com

Human players often would not go into too risky moves, even with better value of evaluation.

Example, player as White:

Current position: 0.1

Next position Pos1: eval 0.2 (risky / unpredictable, range -0.1 - 0.5)

Next position Pos2: eval 0.15 (less risky / more predictable, range 0.1 - 0.2)

So, it would make sense for Lc0 to take not only the middle value of position evaluation, but also to learn how big the range is, meaning how risky or unpredictable the position is.

Lc0 then could choose the best evaluation / risk ratio. This ratio would be also learned, just like in the case of human players ("I am pawn up, and it is a middle game, so I will take this eval/range over that eval/range").

It seems to me that one-dimensional evaluation is making Lc0 blind to too risky moves.

Apr 12, 2021, 1:47:14 PM4/12/21

to LCZero

There are works on estimating the variance in addition to the expected return.

Just google for: reinforcement learning risk variance

In fact, the formulas involved are not too different from the standard formulas used for computing the return.

However, it does involve training (and using) another network for the prediction of the variance, which might make it dubious when computational resources are limited.

May 28, 2021, 1:44:54 PM5/28/21

to LCZero

This idea is old. For instance, see the 1999 paper "Propagating Distributions Up Directed Acyclic Graphs" by Eric Baum and Warren D. Smith (who used to hang out in this forum).

However, I think this is fundamentally wrong. Rational decision making is achieved by maximizing the expected value of the utility of the outcome (see Von Neumann–Morgenstern utility theorem). This is a single number, so I think it is correct for engines to have a one-dimensional evaluation space. If a move is risky, it's because it increases the probability of a bad outcome, and the evaluation already contains a penalty for it.

Álvaro.

May 29, 2021, 12:58:33 AM5/29/21

to Álvaro Begué, LCZero

On 5/28/21, Álvaro Begué <alvaro...@gmail.com> wrote:

> This idea is old. For instance, see the 1999 paper "Propagating

> Distributions Up Directed Acyclic Graphs" by Eric Baum and Warren D. Smith

> (who used to hang out in this forum).

>

> However, I think this is fundamentally wrong. Rational decision making is

> achieved by maximizing the expected value of the utility of the outcome

> (see Von Neumann–Morgenstern utility theorem).

--yes.
> This idea is old. For instance, see the 1999 paper "Propagating

> Distributions Up Directed Acyclic Graphs" by Eric Baum and Warren D. Smith

> (who used to hang out in this forum).

>

> However, I think this is fundamentally wrong. Rational decision making is

> achieved by maximizing the expected value of the utility of the outcome

> (see Von Neumann–Morgenstern utility theorem).

> This is a single number,

> so I think it is correct for engines to have a one-dimensional evaluation

> space.

>If a move is risky, it's because it increases the probability of a

> bad outcome, and the evaluation already contains a penalty for it.

also possibly high negative reward.

You might find out which one if explore it deeper; but if do not, then

you do not know which.

But even without exploring deeper, you can still know "it has high variance"

versus "low" IF you have a prob-distribution-valued eval function.

And then you can (1) use that knowledge to try to guide your decisions about

which parts of the search tree to explore deeper. You then will

search more effectively.

Also, even without exploring deeper, (2) the expectation

values of internal nodes of the tree will change

if the leaves have prob-distribution-values, versus if those leaves

just had scalar values

equal to the expected utility of that prob-distribution.

Baum and Smith both invented all this, and then demonstrated experimentally that

the better internal node valuations (2) yielded stronger gameplayers,

and also that the smarter search (1) also yielded stronger players.

(I'm Smith.)

These strength advantages were very large for some games.

And chess and go looked like they would be games for which it would work well,

in fact probably better than any of the games we tried.

[But we never experimented on chess+go because we only experimented on a

variety of simpler games, basically because we were not good enough programmers

and/or not enough time. Also, my interest in chess came at the wrong

time, just before IBM did deep blue hardware. An my interest in go

also at exactly the wrong time, just before alphazero team.]

So sorry: your criticism here holds no water.

However, a different criticism which actually might pack a punch is this:

when we did those experiments, it was back before stockfish introduced

powerful selective AB search tricks (late move reduction...) which

greatly increased

the strength of chessprogs. So it might be that those tricks decrease the

advantage of BPIP search versus AB, perhaps by enough that we should not

care about BPIP anymore. That is unknown. More recently I realized

that MCTS searches in fact can be regarded as a kind of "poor man's

BPIP search" which may or may not be a better way to get BPIP-like

effects. (They kind of sample from a prob-distribution rather than

represent it explicitly, but near-same effect.) Alphazero's success

with MCTS then could be regarded, therefore, as more evidence Baum+Smith were on

right track.

The Baum+Smith work may not have gotten the exploration it deserves.

It may still have the potential to vastly increase the strength of

chess and/or go programs. Or not. Hard to say until/unless worked on.

--

Warren D. Smith

http://RangeVoting.org <-- add your endorsement (by clicking

"endorse" as 1st step)

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu