I.e. is there more than one formula, and if so, is it possible they
may produce results that differ significantly enough to matter for
backgammon checker play analysis?
Also with regards to checker play comparison, should we lean towards
truncated rollouts with a set horizon and no variance reduction, or are
the results likely to be better when applying variance reduction to a
full rollout?
Thanks
Midas
Editor of Backgammon Bonanza
www.depreli.demon.co.uk
I can trivially answer "yes" to this but in a way that might not be
useful to you. Jellyfish uses one method (the name of the algorithm is
in the manual) and Snowie uses a method that might or might not be the
same. (I'll speculate that it isn't the same since Snowie generally
reports a higher "equivalent to X games" value for the same size
rollout, which suggests that it reduces variance faster.)
In addition there's at least one variance reduction method that isn't
used by any computer programs that I am aware of. This is the (hereby
named, although reported on this newsgroup previously) Zehr
Cube-reduction algorithm: when rolling out with a live cube and a
takeable cube is given, keep the cube at the same value and play two
games from the position where the cube was turned. (Do this on all
subsequent cube turnings too.) The purpose of this is to allow a live
cube on rollouts and yet keep a single game with a high cube value from
affecting the results too much.
>I.e. is there more than one formula, and if so, is it possible they
>may produce results that differ significantly enough to matter for
>backgammon checker play analysis?
I'm sure one can use different algorithms to get different results from
the same neural net and dice; however, one would also get different
results from different neural nets when using the same algorithm.
The basic idea of any variance reduction method is to estimate the
amount of luck each side has and factor that out. For example, if one
side has two on the bar against a 5-point board and rolls the necessary
doubles to come in that's a very lucky roll. A variance reduction
algorithm would measure that luck and change the results of the game in
some way to take account of it.
In the example above, if the neural net's evaluation of rolling the
double to enter is +.243 equity, and the evaluation of the position
before rolling was -.324, that roll was worth +.567. If the player who
rolled the lucky double goes on to win then instead of getting 1 point
for the win they might get 1 point less some fraction of the equity from
their lucky rolls. So you might deduct half of .567 (as well as half of
other lucky rolls, but also adding in half of the unlucky rolls, and
doing the same thing for the opponents).
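To make the bookkeeping concrete, here is a minimal, runnable sketch of
this kind of luck adjustment in Python, using the numbers from the example
above. The function name and the idea of passing lists of pre-computed
equities are illustrative assumptions, not any program's actual interface.

def luck_adjusted_result(raw_result, equities_before, equities_after, f=0.5):
    """Deduct a fraction f of the player's net luck from the raw result.

    equities_before[i] and equities_after[i] are the net's equity for the
    same player just before and just after his i-th roll; the difference
    is that roll's "luck".  (A full implementation would also credit the
    opponent's luck with the opposite sign, as described above.)
    """
    net_luck = sum(after - before
                   for before, after in zip(equities_before, equities_after))
    return raw_result - f * net_luck

# The example above: equity -.324 before the roll, +.243 after entering
# with the lucky double, a swing of +.567.  If that player goes on to
# win, half the swing is deducted from his 1 point:
print(luck_adjusted_result(1.0, [-0.324], [0.243]))   # 0.7165, up to rounding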
Some variations on this might include:
o vary the fraction of the adjustment (half in the example above)
o only make the adjustment if the roll was especially lucky or unlucky
and make no adjustments for rolls whose value is within some range of
the position's value before rolling
o change the adjustment based on whether the roll was close to the start
of the game or close to the end of the game
Any such algorithm is going to have results vary depending on the
evaluation used to decide what is a lucky roll. In the example above a
different program rolling out the same position with the same dice might
think that rolling the double was worth only .425 and hence make a
smaller adjustment.
An example of another kind of variance reduction is one that Snowie uses
-- you can enable bearoff database truncation. When using this a
rollout stops at the point when Snowie can look it up in its bearoff
database. This database is extremely accurate and can reduce all of the
variance from luck during the bearoff. For example, Snowie might
determine that one side has a 63% chance of winning. At that point it
can stop that game and use a result of .26 equity (cubeless, 2 * .63 - 1).
>Also with regards to checker play comparison, should we lean towards
>truncated rollouts with a set horizon and no variance reduction, or are
>the results likely to be better when applying variance reduction to a
>full rollout?
First one should specify the context a bit more. A full rollout would
almost always give a better evaluation if time were not a factor. What
one is usually interested in is the fastest way to get a reliable
answer.
For some positions a truncated rollout will get a reliable answer
fastest and for some it won't. If the position is one that the
evaluator understands well, or the position is one that will soon
simplify (for example a holding game) then a truncated rollout can get a
very reliable answer very quickly. One can think of these as
evaluations with a higher ply. (A 1296 game rollout with the initial
rolls varied, truncated after 2 rolls and played n-ply is equal to an
n+2-ply evaluation.)
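For readers who like to see that equivalence spelled out, here is a rough
sketch in Python. The engine interface is hypothetical (not JF's or
Snowie's API): successors(pos) returns each possible roll's probability
together with the best resulting position, and evaluate(pos, ply) is a
static ply-deep evaluation, all from one fixed player's viewpoint.
Truncating after an exhaustive enumeration of the first rolls and then
evaluating at n-ply is essentially what an (n + depth)-ply evaluation does.

def truncated_rollout(engine, pos, ply, depth=2):
    # Exhaustive truncated rollout: enumerate the next `depth` rolls
    # (weighted by their probabilities), then fall back to a static
    # ply-deep evaluation of the truncation point.
    if depth == 0:
        return engine.evaluate(pos, ply)
    return sum(p * truncated_rollout(engine, succ, ply, depth - 1)
               for p, succ in engine.successors(pos))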
If the evaluator doesn't understand the position well or nothing is
going to change much for a while (for example a prime vs. prime
position) then it is probably best to do the full rollout.
The best way to decide which one to do is to start with a full rollout,
compare the results to the evaluator, and if they are close then trust a
truncated rollout. Unfortunately that doesn't save any time. But with
experience one can judge which kind of rollout to do.
One final comment about computer rollouts -- in those positions in
which a truncated rollout is not accurate we might not be able to trust
the full rollout results anyway. If the evaluator doesn't understand
the position very well, it might not even understand it well enough to
do a rollout. It's easy to fall into the trap of doing a bazillion game
rollout, getting a standard deviation of .001 on the results, and
believing you've found the right answer. In reality you've found the
best answer that particular tool can give you, which isn't necessarily
the right answer.
-Michael J. Zehr
In another post I demonstrate that the effect of variance reduction depends on
the accuracy of the static evaluation function. If SW had a more accurate
neural network than JF, then SW would achieve better results from variance
reduction even with exactly the same variance reduction algorithm.
I speculate that SW and JF both use methods based upon the theory named in
JF's user manual. Of course, there may still be significant differences.
Brian Sheppard
> >The basic idea of any variance reduction method is to estimate the
> >amount of luck each side has and factor that out. For example, if one
> >side has two on the bar against a 5-point board and rolls the necessary
> >doubles to come in that's a very lucky roll. A variance reduction
> >algorithm would measure that luck and change the results of the game in
> >some way to take account of it.
> >
> >In the example above, if the neural net's evaluation of rolling the
> >double to enter is +.243 equity, and the evaluation of the position
> >before rolling was -.324, that roll was worth +.567. If the player who
> >rolled the lucky double goes on to win then instead of getting 1 point
> >for the win they might get 1 point less some fraction of the equity from
> >their lucky rolls. So you might deduct half of .567 (as well as half of
> >other lucky rolls, but also adding in half of the unlucky rolls, and
> >doing the same thing for the opponents).
Whoa! This is not "variance reduction." This is "bias introduction."
(Math phobic readers may wish to avert their eyes at this juncture.)
The algorithm you just described computes the move-by-move change in the
neural net evaluation over the course of a single trial. The final outcome
of the trial is offset against the sum of the changes. This reduces variance,
to be sure. But *how* does it reduce variance? Let's take a look.
Let E0 be the evaluation of the root position of the rollout. After 1 turn the
evaluation changes to E1, then to E2, and so on. In the final position the
outcome is V. What is the sum of the changes from turn to turn?
In the first turn the change is E0 - E1. In the second turn the change is E1 -
E2. In the third turn the change is E2 - E3, and so on. The sum is
(E0 - E1) + (E1 - E2) + (E2 - E3) + (E3 - E4) + ... + (En - V)
This type of sum is called a "telescoping series," because it resembles a
collapsible telescope. To see why, let's rearrange the parentheses a little:
E0 + (-E1 + E1) + (-E2 + E2) + (-E3 + E3) + ... + (-En + En) - V
So the sum is E0 - V! This amount will be offset against V. If the fraction we
offset against V is f, then the final answer is
f * E0 + (1 - f) * V
Well no wonder we reduce the variance! All the algorithm accomplishes is to
compute a weighted average of the true statistical answer with our initial
estimate. This is obviously the wrong answer; our goal is to eliminate our
initial estimate from the equation. The dependency on E0 is a bias.
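As a sanity check on the algebra, here is a tiny runnable Python script of
my own (the evaluation numbers are arbitrary) showing that deducting a
fraction f of every per-roll swing from the final result collapses to
exactly f * E0 + (1 - f) * V, whatever the intermediate evaluations are.

import random

random.seed(1)
f = 0.5
E = [random.uniform(-1, 1) for _ in range(8)]   # E0 .. E7, arbitrary evals
V = 1.0                                         # the final game result

# Per-roll "luck": evaluation after the roll minus evaluation before it,
# with the last step measured against the actual result V.
swings = [E[i + 1] - E[i] for i in range(len(E) - 1)] + [V - E[-1]]

adjusted = V - f * sum(swings)                  # the luck-adjusted result
weighted = f * E[0] + (1 - f) * V               # the telescoped form

print(adjusted, weighted)   # identical, up to floating-point rounding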
True variance reduction does not introduce bias into the result. I believe I
have reconstructed JF's algorithm based upon the cryptic description in JF's
user manual. Before I describe it, I want to provide some motivation for it,
so that you can understand how the algorithm reduces variance.
Let E be the true cubeless equity function, and e be our static evaluation
function (e.g. a neural net). Rollouts estimate E by playing out a number of
trial games and averaging the results. A simple rollout does not exploit the
fact that e is a very good approximation to E. Variance reduction reduces the
number of trials we need to achieve a specific level of accuracy by exploiting
the fact that e is a very good approximation of E. How does it work?
Instead of regarding the rollout as establishing a value for E, let's think of
it as establishing a value for E - e. In words, the rollout will estimate the
error in our static evaluation.
Let's begin by doing a 1-ply search, then estimate E as the average of the 1-
ply values of e. This would be completely correct if e were perfect, but since
e isn't perfect, we need to estimate the error involved. How can we do that?
Simple. Let's just pick a 1-ply position at random, and estimate the error
involved in its evaluation. Note that I have just recreated the problem that
I posed at the root (i.e., estimating the difference between e and E) but one
ply deeper into the game tree. I can continue this process until I reach a
terminal node.
To estimate E, we start at the end of the variation we have selected at random
and apply the formula
E = Delta + Average(e)
where the average is taken over all next-ply nodes and Delta is the
recursively computed estimate of the inaccuracy of e for the
randomly-selected successor.
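To pin the recursion down, here is a minimal sketch of this estimator in
Python. The engine interface is a hypothetical stand-in (not JF's or
Snowie's actual API): is_over/result handle finished games,
successors(pos) returns every roll's probability together with the best
resulting position, and evaluate(pos) is the static equity e, all reported
from one fixed player's viewpoint.

import random

def vr_estimate(engine, pos):
    # One variance-reduced trial:  E(pos) ~ Average(e over the 1-ply
    # successors) + the recursively estimated error of e at one successor
    # drawn at random with the dice probabilities.
    if engine.is_over(pos):
        return engine.result(pos)                  # exact terminal value
    succ = engine.successors(pos)                  # [(prob, position), ...]
    avg_e = sum(p * engine.evaluate(s) for p, s in succ)
    chosen = random.choices([s for _, s in succ],
                            weights=[p for p, _ in succ], k=1)[0]
    delta = vr_estimate(engine, chosen) - engine.evaluate(chosen)
    return avg_e + delta                           # E = Delta + Average(e)

def vr_rollout(engine, pos, trials):
    # Average many single-variation trials, exactly as in a plain rollout.
    return sum(vr_estimate(engine, pos) for _ in range(trials)) / trials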
How does this procedure reduce variance? Let's take a look. The variance of
our estimate of E equals the variance of our estimate of Delta plus the
variance of our estimate of Average(e). But Average(e) is a constant, so its
variance is zero. So the variance of our estimate of E is simply the variance
of our estimate of Delta.
Now, Delta is the estimate of E-e for a single successor position. And since e
is very, very accurate, the variance of Delta is very, very low. The standard
deviation of E-e is 0.035 or so. Compare this low error rate with the typical
standard deviation of 0.5 for a single game rollout. Reducing the standard
deviation by a factor of 0.5 / 0.035 = 14.28 would normally require 14.28 *
14.28 = 203.9 times as many trials. We need vastly fewer trials to get a
specific level of accuracy.
The factor of 203.9 is not pure profit, alas, because we have to do 1-ply
searches at every node when we use variance reduction. By contrast, a normal
rollout does only static evaluation. This typically costs a factor of 20 or
so, which is a shame. But we still should be very happy with a net increase
of a factor of 10 in overall throughput.
As I stated before, I believe that this is the algorithm used in JF. However,
this belief is based upon deciphering the user's manual, which refers only to
certain theorems from the theory of Markov chains. There may be other
formulations of this algorithm, or there may be other ways to apply the theory
to reduce variance.
Warm Regards,
One significant technique is called "stratified sampling." The goal is to choose
your samples according to a theoretical model of how samples ought to be
distributed.
For example, when you do a JF rollout with a multiple of 36 trials, JF
allocates trials equally over the 36 possible dice rolls. This ensures that
the variations exactly match the results of the first ply of rolls. And if
the number of trials is a multiple of 1296, then JF will do an exhaustive
enumeration out to 2 ply.
Another technique is to concentrate effort on those rolls that have the
highest variance. For example, suppose that the side-to-move is on the bar,
and 25 numbers fan. Then all 25 of those numbers transpose to the same
variation. Moreover, suppose that we observe the results of fanning to be
consistent: the opponent always wins. Accordingly, this category has 0
variance. Since other rolls have higher variance, we should concentrate our
efforts on other rolls.
To compensate for the fact that we distribute trials non-uniformly, we scale
the results for each roll. To continue our example, suppose we do 50 trials
of the 25 fanning rolls (losing a single point each time), and 1000 trials
for each of the other 11 rolls, losing 0.5 points each time. Our equity is
computed as follows:
average value in 50 trials is -1, times 25/36 = -25/36.
average value in 11000 trials is -0.5, times 11/36 = -5.5/36.
Total value: -30.5/36
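That reweighting is simple enough to show as a few lines of Python, using
the numbers from the example (the data layout is just an illustration):

# Each stratum: (true probability of the stratum, list of observed results).
strata = [
    (25 / 36, [-1.0] * 50),       # the 25 fanning rolls, pooled: 50 trials
    (11 / 36, [-0.5] * 11000),    # the other 11 rolls: 1000 trials each
]

# Weight each stratum's sample average by its true probability.
equity = sum(w * sum(results) / len(results) for w, results in strata)
print(equity)                     # -30.5/36, about -0.847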
Something is seriously wrong if you reduce variance and end up with a
different answer. The goal of variance reduction is to cause the statistical
process to converge more quickly to the same answer as without variance
reduction.
It is true that if you change networks or the dice or the algorithm then you
will not get the same answer over a limited number of trials. However, if you
use the same neural network then any valid variance reduction method will
converge (in the limit of an arbitrarily large number of trials) to the same
answer as without that variance reduction method.
Warm regards,
This is clever. Doubling the number of trials in variations in which a cube is
accepted reduces the variance of our estimate in those variations by 50%.
The impact on run time is reasonable. Even though we double the double/take
trial count, we do not double the run-time, since the variation is only split
*after* the double. Moves prior to the double are common to the two trials.
The bottom line is that the method halves the variance of double/take trials
with considerably less than a doubling of the run-time.
A few implementation details for implementation-minded readers. First, there
are pathological situations where the rollout would explode. (These
situations are the same as the situations where the cube-using equity value
is undefined. Check out the Deja News archive for related threads.) The
number of "splits" of the variation tree must be limited to prevent
explosions.
Second, you have to weigh the double/take variations on an equal footing with the
double/drop variations. This means that the pair of double/take games must be
counted as a single game, not two games.
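Pulling the description and these details together, here is a toy,
runnable sketch of the idea in Python. Real backgammon is replaced by a
crude random walk on one side's winning probability, and the doubling and
take points are arbitrary round numbers, so nothing here is anyone's
actual rollout code; the point is the control flow: a taken double never
raises the cube, it splits the trial into two continuations that are
averaged and counted as one game, with a cap on the number of splits.

import random

def toy_game(p, cube=1, owner=None, splits_left=4):
    # One trial.  p = probability that side A wins; the cube value is
    # deliberately never raised.  Returns A's (possibly fractional) result.
    while 0.0 < p < 1.0:
        # A doubles when doing well, if the cube is centered or A owns it.
        if p > 0.70 and owner in (None, "A"):
            if p >= 0.80:
                return cube                      # B passes: A wins the cube
            if splits_left > 0:                  # B takes: split, don't double
                a = toy_game(p, cube, "B", splits_left - 1)
                b = toy_game(p, cube, "B", splits_left - 1)
                return 0.5 * (a + b)             # the pair counts as one game
        # Symmetrically, B doubles when A's chances are poor.
        if p < 0.30 and owner in (None, "B"):
            if p <= 0.20:
                return -cube                     # A passes: B wins the cube
            if splits_left > 0:
                a = toy_game(p, cube, "A", splits_left - 1)
                b = toy_game(p, cube, "A", splits_left - 1)
                return 0.5 * (a + b)
        p = min(1.0, max(0.0, p + random.uniform(-0.08, 0.08)))
    return cube if p >= 1.0 else -cube

if __name__ == "__main__":
    random.seed(0)
    trials = 5000
    print(sum(toy_game(0.5) for _ in range(trials)) / trials)   # near 0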
Warm Regards,
I'm curious to know if anyone has done this with one program playing against
another. For example, rollout Jellyfish playing Snowie, then Jellyfish
playing Motif -- or whatever. Then switch sides. Then see if the plays are
at least ranked the same in each rollout set.
Here is why I ask this. Many years ago I had a computer gin game. I was
better than the two top computer players in this game, but I had to use a
different knocking strategy for each. And when I reversed my knocking
strategies with them, I lost more often than won. So considering that
computer backgammon players have their own styles, I wonder if the variable
of playing against different computer players has been tested in rollouts.
I'm not very familiar with the concept of rollouts but my understanding is
you take a position and do rollouts with each move that could be made given
the dice at the time. Then after hundreds of these rollouts one can rate
the possible moves. And in each test it is the computer playing against
itself.
But, I am a little confused exactly how this is done and thought an example might
help clear things up.
bshe...@hasbro.com wrote:
> Let E be the true cubeless equity function, and e be our static evaluation
> function (e.g. a neural net). Rollouts estimate E by playing out a number of
> trial games and averaging the results. A simple rollout does not exploit the
> fact that e is a very good approximation to E. Variance reduction reduces the
> number of trials we need to achieve a specific level of accuracy by exploiting
> the fact that e is a very good approximation of E. How does it work?
>
Okay... let's say we want to do a rollout of some position called P0 and have a
function e (a neural net) which computes cubeless equity. Suppose e(P0) = 0.60.
> Instead of regarding the rollout as establishing a value for E, let's think of
> it as establishing a value for E - e. In words, the rollout will estimate the
> error in our static evaluation.
>
> Let's begin by doing a 1-ply search, then estimate E as the average of the 1-
> ply values of e. This would be completely correct if e were perfect, but since
> e isn't perfect, we need to estimate the error involved. How can we do that?
>
So there's a function, call it e1, whose value is defined as a 1-ply search over
e.
Suppose e1(P0) = 0.62
> Simple. Let's just pick a 1-ply position at random, and estimate the error
> involved in its evaluation. Note that I have just recreated the problem that
> I posed at the root (i.e., estimating the difference between e and E) but one
> ply deeper into the game tree. I can continue this process until I reach a
> terminal node.
>
Here is where we roll out the position, producing the positions P1, P2, ... till
the game ends and we get some result R.
Suppose it looks like this:
P, e(P), e1(P)
-- --- ---
P0 0.60 0.62
P1 0.50 0.51
P2 0.60 0.58
P3 0.55 0.57
P4 0.72 0.68
result (R) = +1.0
> To estimate E, we start at the end of the variation we have selected at random
> and apply the formula
>
> E = Delta + Average(e)
>
> where the average is taken over all next-ply nodes and Delta is the
> recursively computed estimate of the inaccuracy of e for the
> randomly-selected successor.
>
This is where I get confused.
Is Delta = sum over all positions except P0 of (e1 - e)? = ([0.51 - 0.50] +
[0.58 - 0.60] + [0.57 - 0.55] + [0.68 - 0.72]) = -0.03
So for each trial, this Delta would be added to a running total (call it Dtot),
and
E = e1(P0) + (Dtot / numtrials)
Is this right? (And that according to the above the static eval of the original
position, e(P0), and the final result, R, are irrelevant)
Unless we assume that all variance reduction algorithms give the same
standard deviation after the same number of games (i.e. that there is
only one variance reduction algorithm), the values must be different.
We use variance reduction because after a fixed number of games we'll
get a result that is closer to the theoretically most precise result
from that neural net. If one algorithm is "better" than another, it
will report an even closer value after the same number of games -- in
other words it reports a different value.
As an example, suppose the theoretically most precise equity of a
position for a given neural net (not necessarily the true equity of a
position, but rather the best that a particular net can give us) is
.350, then after 36 games with no variance reduction the standard
deviation might be .150 and the equity reported might be .220. If we use
a variance reduction method that reduces variance by a factor of 10,
i.e. 36 games is equivalent to 360 games, the standard deviation might
be about .047, and we'd expect a value closer to .350 than .220 is. If we use
a really good variance reduction so the standard deviation is .025
(equivalent to 1296 games, or a factor of 36), again we'd expect a
value closer to .350.
Since none of the neural nets give you a choice between two different
variance reduction methods this can't be actually demonstrated. However
one can demonstrate the difference between no variance reduction (which
is a valid algorithm, although a poor one!) and some variance
reduction. Try this if you have Snowie:
Pick a position. Set the score to double-match point. Do a 2-game
rollout selecting "rollout with doubling cube in play." Do another
2-game rollout without that option selected but everything else the
same. You'll get rather different results.
-Michael J. Zehr
PS. I think Brian was thinking of a different situation. If you have
two variance reduction algorithms then the equity they
eventually converge to should be the same, i.e. if after 1000 games
algorithm A reports a standard deviation of .025 and after 2000 games
algorithm B reports a standard deviation of .025, the equity results
should be the same, or at least within .025. I was addressing what A
and B report after the same 1000 games, not what they report when they
eventually converge.