Idea: use range (evalMin - evalMax) for position evaluation

402 views
Skip to first unread message

Mirza Hadzic

unread,
Apr 6, 2021, 5:04:31 AM4/6/21
to lcz...@googlegroups.com
Human players often would not go into too risky moves, even with better value of evaluation.

Example, player as White:
Current position: 0.1
Next position Pos1: eval 0.2 (risky / unpredictable, range -0.1 - 0.5)
Next position Pos2: eval 0.15 (less risky / more predictable, range 0.1 - 0.2)

So, it would make sense for Lc0 to take not only the middle value of position evaluation, but also to learn how big the range is, meaning how risky or unpredictable the position is.

Lc0 then could choose the best evaluation / risk ratio. This ratio would be also learned, just like in the case of human players ("I am pawn up, and it is a middle game, so I will take this eval/range over that eval/range").

It seems to me that one-dimensional evaluation is making Lc0 blind to too risky moves.

shoham...@gmail.com

unread,
Apr 12, 2021, 1:47:14 PM4/12/21
to LCZero
There are works on estimating the variance in addition to the expected return.
Just google for: reinforcement learning risk variance
In fact, the formulas involved are not too different from the standard formulas used for computing the return. 

However, it does involve training (and using) another network for the prediction of the variance, which might make it dubious when computational resources are limited.

Álvaro Begué

unread,
May 28, 2021, 1:44:54 PM5/28/21
to LCZero
This idea is old. For instance, see the 1999 paper "Propagating Distributions Up Directed Acyclic Graphs" by Eric Baum and Warren D. Smith (who used to hang out in this forum).

However, I think this is fundamentally wrong. Rational decision making is achieved by maximizing the expected value of the utility of the outcome (see Von Neumann–Morgenstern utility theorem). This is a single number, so I think it is correct for engines to have a one-dimensional evaluation space. If a move is risky, it's because it increases the probability of a bad outcome, and the evaluation already contains a penalty for it.

Álvaro.

Warren D Smith

unread,
May 29, 2021, 12:58:33 AM5/29/21
to Álvaro Begué, LCZero
On 5/28/21, Álvaro Begué <alvaro...@gmail.com> wrote:
> This idea is old. For instance, see the 1999 paper "Propagating
> Distributions Up Directed Acyclic Graphs" by Eric Baum and Warren D. Smith
> (who used to hang out in this forum).
>
> However, I think this is fundamentally wrong. Rational decision making is
> achieved by maximizing the expected value of the utility of the outcome
> (see Von Neumann–Morgenstern utility theorem).

--yes.

> This is a single number,

--yes.

> so I think it is correct for engines to have a one-dimensional evaluation
> space.

--that does not follow.

>If a move is risky, it's because it increases the probability of a
> bad outcome, and the evaluation already contains a penalty for it.

--well, no: a move can have a high variance: possibly high reward but
also possibly high negative reward.
You might find out which one if explore it deeper; but if do not, then
you do not know which.
But even without exploring deeper, you can still know "it has high variance"
versus "low" IF you have a prob-distribution-valued eval function.
And then you can (1) use that knowledge to try to guide your decisions about
which parts of the search tree to explore deeper. You then will
search more effectively.

Also, even without exploring deeper, (2) the expectation
values of internal nodes of the tree will change
if the leaves have prob-distribution-values, versus if those leaves
just had scalar values
equal to the expected utility of that prob-distribution.

Baum and Smith both invented all this, and then demonstrated experimentally that
the better internal node valuations (2) yielded stronger gameplayers,
and also that the smarter search (1) also yielded stronger players.
(I'm Smith.)

These strength advantages were very large for some games.
And chess and go looked like they would be games for which it would work well,
in fact probably better than any of the games we tried.
[But we never experimented on chess+go because we only experimented on a
variety of simpler games, basically because we were not good enough programmers
and/or not enough time. Also, my interest in chess came at the wrong
time, just before IBM did deep blue hardware. An my interest in go
also at exactly the wrong time, just before alphazero team.]

So sorry: your criticism here holds no water.

However, a different criticism which actually might pack a punch is this:
when we did those experiments, it was back before stockfish introduced
powerful selective AB search tricks (late move reduction...) which
greatly increased
the strength of chessprogs. So it might be that those tricks decrease the
advantage of BPIP search versus AB, perhaps by enough that we should not
care about BPIP anymore. That is unknown. More recently I realized
that MCTS searches in fact can be regarded as a kind of "poor man's
BPIP search" which may or may not be a better way to get BPIP-like
effects. (They kind of sample from a prob-distribution rather than
represent it explicitly, but near-same effect.) Alphazero's success
with MCTS then could be regarded, therefore, as more evidence Baum+Smith were on
right track.

The Baum+Smith work may not have gotten the exploration it deserves.
It may still have the potential to vastly increase the strength of
chess and/or go programs. Or not. Hard to say until/unless worked on.

--
Warren D. Smith
http://RangeVoting.org <-- add your endorsement (by clicking
"endorse" as 1st step)
Reply all
Reply to author
Forward
0 new messages