
Ideas on computer players


Kevin Whyte

Feb 20, 1997

The recent thread about weaknesses of Jellyfish got
me moving on my plan to train my own program. I know I'm
not the only person to do this: first there was TD-Gammon,
but then JF, Loner, Motif, SnowWhite(?), etc. It seems
to me that with a few more good ideas we could have a
program that plays noticeably better than humans, and
uniformly well. Its rollouts would be, for practical
purposes, "the truth". I think this would do a lot for
our understanding of backgammon. Despite the commercial
value of chess programs, the chess people manage to have
open discussions of techniques, and there's even a
public-domain, well-commented, state-of-the-art program
to tweak. I'd like to see the same thing happen here. To
start things off, here's what I'm trying:

The basic idea is temporal difference training, as with
all the neural-net programs I know of. The big difference
is that I'm using a "parasite-driven" pool of starting
positions. What that means is that rather than just doing
training games from the initial position, I have a collection
of positions I train from. This collection evolves over
time, with the fitness criterion being the difference between
the 1-ply lookahead eval and the static eval. This paradigm
has had dramatic success in other areas. The hope is that
this will focus the training more effectively. Consider:

right now NN programs seem to have trouble with backgames
and outfield primes. Why? Well, the earlier training teaches
them to play in such a way as to avoid having that many men
back, so they rarely get into such a situation. Even when they
do, they play it badly and don't learn the right things. The
problem is that what is needed is for them to arrive at a position
one move from becoming a position they understand as good: i.e.
a backgame position w/ good timing where they're about to hit, or
an outfield prime that has rolled home. These positions are
rare, and very hard to fall into by mistake. However, with the
parasite approach, once one is found it will quickly prosper and
dominate the gene pool of the parasites and hence the program
will spend most of its time training in such positions. Anyway,
it might not work, but I'm hopeful. Whatever happens, I'll make
the program and associated code available as soon as it's in a
form I don't find embarrassing.


Alright, all you other amateur BG programmers, come on out.
Let's get some interesting discussions going and see what we can
do! It's an exciting time for backgammon, be part of it! One
thing that I think should be relatively easy for chess-type
programmers would be selective lookahead - a few ply might do
wonders for playing strength.

Kevin Whyte
kwh...@math.uchicago.edu


Stephen Turner

Feb 21, 1997, to Kevin Whyte

Kevin,

Your ideas are interesting, but there is one issue which you don't seem to have
considered: that of OVERTRAINING. It is possible to overtrain a neural net on
certain positions, to the detriment of its understanding of others. I'm not
sure how you would get round this.

There is another theoretical issue. You're using human bias to select the
positions. This could (I'm not sure) subvert the point of the neural net,
which is that it teaches itself everything. According to Gerry's papers, he
let it teach itself from random initial weights.

--
Stephen Turner sr...@cam.ac.uk http://www.statslab.cam.ac.uk/~sret1/
Stochastic Networks Group, Statistical Laboratory,
16 Mill Lane, Cambridge, CB2 1SB, England Tel.: +44 1223 337955
"Collection of rent is subject to Compulsive Competitive Tendering" Cam. City

Kevin Whyte

Feb 21, 1997

In article <330D8...@cam.ac.uk>, Stephen Turner <sr...@cam.ac.uk> wrote:
>Kevin,
>
>Your ideas are interesting, but there is one issue which you don't seem to have
>considered: that of OVERTRAINING. It is possible to overtrain a neural net on
>certain positions, to the detriment of its understanding of others. I'm not
>sure how you would get round this.
>
>There is another theoretical issue. You're using human bias to select the
>positions. This could (I'm not sure) subvert the point of the neural net,
>which is that it teaches itself everything. According to Gerry's papers, he
>let it teach itself from random initial weights.
>


I've gotten several replies along these lines, so I thought
I'd post a response. I must have been rather unclear to begin
with. There will be no human input in deciding the pool of
training positions (except to include the starting position
among them). The point is that the pool will be "evolving"
from one training round to the next. During a round of training,
each of the positions in the pool will be rated by how much
the net misjudges it (1-ply versus static eval) - the positions
which are more badly misjudged will get more representatives in
the next round (slightly mutated, perhaps). This should, I hope,
avoid overtraining, because when the net starts to be consistent
about a position, it will quickly "die out" and not be used
for training any more. Positions which show up while playing
out positions in the pool, and which are "misunderstood" (as
above), will be fed into the pool. Since I'm going to force the
initial position to remain in the pool all the time, this
should make sure the pool of training positions never stagnates.

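The pool-update rule described above can be sketched in a few lines of Python. Everything here (the board format, the eval stubs, the mutation step) is a stand-in for illustration, not code from any actual program:

```python
# Sketch of the "parasite-driven" pool described above. The eval
# functions and the mutation step are placeholder assumptions; a
# real program would use its network's static eval and legal moves.
import random

def static_eval(pos):
    # placeholder for the network's direct output on a position
    return sum(pos) / len(pos)

def one_ply_eval(pos):
    # placeholder for averaging static evals over all rolls/moves
    return static_eval(pos) + random.uniform(-0.1, 0.1)

def mutate(pos):
    # slightly perturb a position (stand-in for playing a move)
    new = list(pos)
    new[random.randrange(len(new))] += random.choice([-1, 1])
    return new

def evolve_pool(pool, start_pos, size):
    # fitness = how badly the net misjudges a position
    fitness = [abs(one_ply_eval(p) - static_eval(p)) for p in pool]
    # the starting position is always kept, so the pool never stagnates
    next_pool = [start_pos]
    while len(next_pool) < size:
        parent = random.choices(pool, weights=fitness, k=1)[0]
        next_pool.append(mutate(parent))
    return next_pool

start = [0] * 26          # say, 24 points plus the two bars
pool = [list(start) for _ in range(8)]
for _ in range(3):
    pool = evolve_pool(pool, start, size=8)
```

Positions the net already judges consistently get near-zero fitness and die out of the pool, which is the claimed defense against overtraining.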
Anyway, I'll let you know how it works.

Kevin Whyte
kwh...@math.uchicago.edu


Brian Sheppard

Feb 21, 1997

Kevin Whyte <kwh...@zarquon.uchicago.edu> wrote in article
<E5wwp...@midway.uchicago.edu>...

>
> The recent thread about weaknesses of Jellyfish got
> me moving on my plan to train my own program. I know I'm
> not the only person to do this: first there was TD-Gammon,
> but then JF, Loner, Motif, SnowWhite(?), etc. It seems
> to me that with a few more good ideas we could have a
> program that plays noticeably better than humans, and
> uniformly well. Its rollouts would be, for practical
> purposes, "the truth".

Dear Kevin,

I'm glad to see you are setting your sights high!

In this forum I have lately been sounding off against blind
acceptance of rollout data. You know the type ("A JF rollout
shows that 22/18 is .00001 better, so that settles the
issue...").

However, I do not have a negative opinion of JF as a player.
Far from it. My complaint is that analysis of positions has
become too mechanical. Positional analysis in this forum has
deteriorated into cut-and-paste of rollout data.

Back to the JF-as-player issue: you will find that there is not
much distance between JF and perfection. I believe that the
practical issue of playing to exploit a weaker opponent is a
far more significant factor than JF's occasional error.

> The basic idea is temporal difference training, as with
> all the neural-net programs I know of. The big difference
> is that I'm doing a "parasite driven" pool of starting
> positions. What that means is that rather than just doing
> training games from the initial position, I have a collection
> of positions I train from. This collection evolves over
> time, with the fitness criterion being the difference of
> 1ply look ahead eval and static eval.

This is a promising approach, which I believe has been used by
other backgammon developers. (Tom Keith uses a similar pattern
to develop Motif, for instance.)

There are a couple of things you should watch out for. I offer
this advice in the hope that you will avoid some pitfalls and
set aside some time to solve problems that you might have
overlooked.

First, straight TD training, as described by Sutton, is about 20
times as fast as doing a 1-ply search on every position. That is,
you can complete 20 times as many games of training in the same
amount of time. 1-ply lookahead training does develop lower
statistical variance as training progresses, so you get some of
that advantage back because you require fewer training games to
reach the same level of play. But nothing like a factor of 20.

Second, training is non-stationary. That is: as your evaluation
trains, the evaluation of each training position changes. This
implies that you need a strategy for recomputing the evaluations
of the parasite pool as evolution proceeds.

Third, straight TD from no initial knowledge has proven to be
sufficient for this task. An advanced training mechanism
might not be the answer.

Fourth, there is a condition attached to the sufficiency of TD:
you must have inputs to the neural network that are sufficient
to capture expert judgment. You might find that constructing
such a set of features is a very difficult task.

> right now NN programs seem to have trouble with backgames
> and outfield primes. Why? Well, the earlier training teaches
> them to play in such a way as to avoid having that many men
> back, so they rarely get into such a situation. Even when they
> do, they play it badly and don't learn the right things.

This is a promising theory, but isn't there an alternative
explanation? JF might not have the positional features it needs
to play such situations tactically well. For instance, does it
have a pattern for slotting the back of a prime? What about
a pattern for bringing checkers to bear on escape points? What
about rules for when it should split an anchor in a backgame?

If a neural network is missing an essential piece of positional
knowledge, then the network has very little chance of synthesizing
that knowledge.

Also, I recall that Tesauro had a trick for inducing TD-Gammon to
train positions that were unlikely to occur. The purpose of this
trick was to make TD-Gammon explore the search space thoroughly. It
was something like "start with a neural network that has a high
opinion of every position." Details are probably in one of his
TD-Gammon papers.

OK, that does it. I am looking forward to the fruits of your research,
and if you ever want to pass an idea by me, feel free.

Warm Regards,
Brian

Gargle Zndflxtzyh

Feb 24, 1997


>In article <330D8...@cam.ac.uk>, Stephen Turner <sr...@cam.ac.uk> wrote:

any ideas of what a good layout of neurons should be?

my initial thought is:
for each slot on the board have
1 neuron to represent empty,
1 neuron to represent 1 piece and
1 neuron to represent 2-3 and perhaps 1 neuron to represent 4+ pieces

(the last one might not be necessary).

is this way off?

too big/too small?

any comments?
--
"Its so easy to laugh at the Americans. I mean, what else can you do about a
country that insists on putting health-warnings on their low-alcohol beer, yet
is willing to sell Kalashnikovs to anyone with a driver's license?"
----- Anders Nielsen (jo...@diku.dk) -----

Brian Sheppard

Feb 26, 1997

Gargle Zndflxtzyh <jo...@diku.dk> wrote in article
<5erhr6$a...@vidar.diku.dk>...

> In <E5yys...@midway.uchicago.edu> kwh...@ford.uchicago.edu (Kevin Whyte) writes:
> >In article <330D8...@cam.ac.uk>, Stephen Turner <sr...@cam.ac.uk> wrote:
>
> any ideas of what a good layout of neurons should be?
>
> my initial thought is:
> for each slot on the board have
> 1 neuron to represent empty,
> 1 neuron to represent 1 piece and
> 1 neuron to represent 2-3 and perhaps 1 neuron to represent 4+ pieces
>
> (the last one might not be necessary).
>
> is this way off?

Something about your answer makes me think there might be some
confusion as to what a neuron is, so I figured I should write
a note to define the terms. If you knew this all along, then
I figure that maybe someone else will benefit.

A neural network consists of 3 types of elements:

1) Inputs, which describe the state of a system (like the
board configuration in a backgammon application),

2) Outputs, which are trained to approximate some important
measure of the system (like winning and gammon chances in
a backgammon application), and

3) Neurons, which are computational elements. These transform
the inputs into the outputs by a form of magic that is rarely
humanly understandable.

I have one paragraph for the computationally inclined:
A neuron typically takes the form Logistic(Summation(Constant * Input)),
where Logistic is the function 1 / (1 + exp(-x)). In words: a neuron
is a linear sum of inputs "squashed" by a function whose output is
bounded. If you want to generalize this concept, you can change the
squashing function to something else (tanh is equivalent to logistic,
but an interesting choice is erf(), the Gaussian error function), or
you can allow the linear sum to include the results of other neurons.
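That paragraph translates directly into code. In this sketch the weights are arbitrary made-up numbers, not taken from any real network:

```python
# A neuron of the form Logistic(Summation(Constant * Input)),
# exactly as described above. The weights are arbitrary examples.
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias=0.0):
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return logistic(total)   # "squash" the linear sum into (0, 1)

out = neuron([1.0, 0.0, 1.0], [0.5, -0.3, 0.2])  # some value in (0, 1)
```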

Now, the reason that I figure you misunderstand the term "neuron"
is that you are describing how to encode the board position, which
means that you are talking about Inputs, rather than Neurons.

To respond to your actual question about whether this is a good
encoding: Yes, it is. Gerald Tesauro published a paper quite a while
ago (I think it was in the journal "Artificial Intelligence" around
1989) about the precursor to TD-Gammon, named Neurogammon. The
Neurogammon feature set allocated inputs as follows:

1 boolean input, set if Black has a blot on this point,
1 boolean input, set if Black has a block (2 or more men),
1 boolean input, set if Black has exactly 3 men on this point,
1 boolean input, set if Black has exactly 4 men on this point,
1 integer input, set to the number of men beyond 4 on this point.

There are 5 inputs per side per point. There are 24 points, plus the
bar (one bar for each color). I don't remember how Tesauro handled
men borne off, nor how he handled the side to move.
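In code, the per-point allocation might look like the following sketch. The function name and board convention are assumptions; only the five-input scheme comes from the description above:

```python
# Five inputs per side per point, following the allocation above.
def encode_point(n):
    # n = number of one side's men on this point
    return [
        1.0 if n == 1 else 0.0,    # a blot (exactly one man)
        1.0 if n >= 2 else 0.0,    # a block (2 or more men)
        1.0 if n == 3 else 0.0,    # exactly 3 men
        1.0 if n == 4 else 0.0,    # exactly 4 men
        float(max(0, n - 4)),      # number of men beyond 4
    ]
```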

Tesauro supported his method by comparing it to another system of
inputs. He also argued that this system contains a lot of semantic
content (the concepts Blot, Block, Builder, and Overloaded are
inherent in the inputs), and so neural network capacity does not
need to be devoted to identifying those features.

Note that this input set, by itself, is not sufficient to play a
strong game. Tesauro's best program using this feature set played at
a "strong intermediate" level. This basically means that the program
would clobber a novice, but it still makes lots of errors.

Tesauro's feature set for TD-Gammon has never been published, since
IBM regards it to be a trade secret. But you have to figure it
contains a few strategic concepts (e.g. the safe-play-versus-bold play
criteria from Magriel) and tactical factors (like counts of shots
or point-making rolls). There is a lot of scope for creativity here.

Best Wishes,
Brian


Greycat Sharpclaw

Feb 27, 1997

There is an allegation that jo...@diku.dk (Gargle Zndflxtzyh) wrote:


>any ideas of what a good layout of neurons should be?

>my initial thought is:
>for each slot on the board have
>1 neuron to represent empty,
>1 neuron to represent 1 piece and
>1 neuron to represent 2-3 and perhaps 1 neuron to represent 4+ pieces

>(the last one might not be necessary).

Aside from an uncommon use of the word neuron...

The cases of 2 pieces and of 3 pieces are *very* different... 3 pieces
mean you can use the point as a source of a man without leaving a
blot.

I recommend, at the least, using the binary data 0, 1, 2, 3, 4+. The
difference in number of men on a point seems to ignorant me to start
becoming less important when you have at least 4. But every piece up
through 4 is very important in terms of what it implies for future
positions. I don't think you can even crudely cut into group cases
for less than that number.

Greycat

Gre...@tribeca.ios.com
Does anyone have any spare tunafish??


David Montgomery

Feb 27, 1997

In article <E5wwp...@midway.uchicago.edu> kwh...@zarquon.uchicago.edu (Kevin Whyte) writes:
> right now NN programs seem to have trouble with backgames
>and outfield primes. Why? Well, the earlier training teaches
>them to play in such a way as to avoid having that many men
>back, so they rarely get into such a situation. Even when they
>do, they play it badly and don't learn the right things. The
>problem is that what is needed is for them to arrive at a position
>one move from becoming a position they understand as good: i.e.
>a backgame position w/ good timing where they're about to hit, or
>an outfield prime that has rolled home. These positions are
>rare, and very hard to fall into by mistake. However, with the
>parasite approach, once one is found it will quickly prosper and
>dominate the gene pool of the parasites and hence the program
>will spend most of its time training in such positions. Anyway,
>it might not work, but I'm hopeful. Whatever happens, I'll make
>the program and associated code available as soon as it's in a
>form I don't find embarrassing.

This is a good idea, I think. Your approach can probably be
used to get a more uniform level of performance, but that doesn't
necessarily translate into improved performance overall. (In
fact, if the program improves performance in some less common
situation by sacrificing performance in more common situations,
the program will perform less well overall, in real games.)

It is difficult to get a single network to play all sorts of
situations at a very high level, because the strategies are
so different. To get the best performance, you may need to
partition the input space. If you can come up with the right
features, as someone else suggested, that might also work,
even with one network, but finding those features isn't so
easy.

David Montgomery
monty on FIBS
mo...@cs.umd.edu

David Montgomery

Feb 27, 1997

In article <01bc23ed$e5165a80$3ac0...@polaris.mstone.com> "Brian Sheppard" <bri...@mstone.com> writes:
[...]

>Neurogammon feature set allocated inputs as follows:
>
> 1 boolean input, set if Black has a blot on this point,
> 1 boolean input, set if Black has a block (2 or more men),
> 1 boolean input, set if Black has exactly 3 men on this point,
> 1 boolean input, set if Black has exactly 4 men on this point,
> 1 integer input, set to the number of men beyond 4 on this point.
>
>There are 5 inputs per side per point. There are 24 points, plus the
>bar (one bar for each color). [...]
>
>Tesauro [...] argued that this system contains a lot of semantic

>content (the concepts Blot, Block, Builder, and Overloaded are
>inherent in the inputs), and so neural network capacity does not
>need to be devoted to identifying those features.

Actually, Tesauro argues that the encoding of the board he used
for TD-Gammon (before he added features) is "knowledge-free". You're
right that it isn't really, but it's still a very unsophisticated
encoding.

I think you're thinking of the earlier encoding he tried, in which
both the final board state and information on how the board changed
due to the move were encoded. Of that encoding he wrote
that the ideas of "clearing", "slotting", "breaking", "making",
and "stripping" a point were represented. It turns out that this
probably isn't a good approach, because the value of a position
in no way depends on the move that was used to get to it.

(Well, against human opponents it does, but let's ignore that :-).

>Tesauro's feature set for TD-Gammon has never been published, since
>IBM regards it to be a trade secret.

TD-Gammon 1.0 (the one that Robertie originally wrote about in
_Inside_Backgammon_ ) used NeuroGammon's feature set, which is
described in an article by Tesauro and Sejnowski on an earlier
program (Artificial Intelligence vol 39 pp 357-390). They list
the following features:

- pip count
- degree of contact*
- number of points occupied in the inner board
- number of points occupied in the opponent's inner board
- checkers back in opponent's inner board
- presence of a "prime" formation*
- blot exposure (the probability that a blot can be hit)*
- strength of blockade*

There are different ways to calculate the features marked
with an asterisk. Presumably by a "prime" formation they
meant a full 6-prime. For blot exposure you have to define
who a hitter is (and even what a blot is) -- my guess is they
used an approach similar to Berliner. It's clear from the
article that for strength of blockade they used something
similar to Berliner, but Berliner had several different
related blockade strength measurements.

These features are a good place to start. You might also
want to look at Berliner's articles for more ideas. I
agree emphatically with Brian's comments on coming up
with features:

>There is a lot of scope for creativity here.

David Montgomery
monty on FIBS
mo...@cs.umd.edu


Brian Sheppard

Feb 28, 1997

David Montgomery <mo...@cs.umd.edu> wrote in article
<5f4tea$7...@twix.cs.umd.edu>...

> In article <01bc23ed$e5165a80$3ac0...@polaris.mstone.com> "Brian Sheppard" <bri...@mstone.com> writes:
> Actually, Tesauro argues that the encoding of the board he used
> for TD-Gammon (before he added features) is "knowledge-free". You're
> right that it isn't really, but it's still a very unsophisticated
> encoding.

Yes, it is unsophisticated. But the concepts that he included are the
atoms that backgammon molecules are made of.

A truly knowledge-free encoding (using just 26 integer variables, with
positive values for our guys and negative for the other guys) would
have to use a lot of its capacity just to recognize the number of Blots
that each side has.

> I think you're thinking of the earlier encoding he tried, in which
> both the final board state and information on how the board changed
> due to the move were encoded. Of that encoding he wrote
> that the ideas of "clearing", "slotting", "breaking", "making",
> and "stripping" a point were represented. It turns out that this
> probably isn't a good approach, because the value of a position
> in no way depends on the move that was used to get to it.

Yes, I remember now that Tesauro tried to evaluate moves (rather than
positions) at one time. But I thought that he also tried a truly
knowledge-free encoding. Am I mistaken?

> TD-Gammon 1.0 (the one that Robertie originally wrote about in
> _Inside_Backgammon_ ) used NeuroGammon's feature set, which is
> described in an article by Tesauro and Sejnowski on an earlier
> program (Artificial Intelligence vol 39 pp 357-390). They list
> the following features:
>
> - pip count
> - degree of contact*
> - number of points occupied in the inner board
> - number of points occupied in the opponent's inner board
> - checkers back in opponent's inner board
> - presence of a "prime" formation*
> - blot exposure (the probability that a blot can be hit)*
> - strength of blockade*

These were published, but I still maintain that TD's input features
have never been described in public. It may seem like hair-splitting,
but bear with me for a moment. The discussion will show how large
the gap is between a good idea and a program like JellyFish.

The items listed above (pip count, contact, etc.) are backgammon
concepts. To transform concepts into input features, you need
an algorithm that specifies how to compute that input feature from
a truly raw board description. And that's not easy, as I can show
from a few examples.

Let's start with a simple concept: blockade. The easiest input
feature is to compute the longest sequence of consecutive
points possessed by each side. But this has a clear problem, in
that a broken 5-prime split into groups of 2 and 3 will be rated as
though it were a 3-prime. Clearly inadequate.

So you change your computation: the input feature is the largest
number of points occupied within any six consecutive points. Now
the broken 5-prime is rated as a 5. But there is still a problem.
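That second computation is easy to state precisely. Here is one way, assuming (for illustration only) that the board is just a list of one side's checker counts per point:

```python
# Largest number of blocked points within any six consecutive
# points -- the improved blockade feature described above.
def blockade_strength(points):
    blocked = [1 if n >= 2 else 0 for n in points]
    return max(sum(blocked[i:i + 6]) for i in range(len(blocked) - 5))

# A broken 5-prime (2 points, a gap, then 3 points) rates as 5:
board = [0] * 24
for p in (4, 5, 7, 8, 9):
    board[p] = 2
```

Under the first (longest-consecutive-run) feature, the same position would have scored only 3.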

That problem is scaling. You see, each neuron multiplies this term by
a constant, and so the things represented by this input are considered
by the neuron to be in strictly linear proportion. That means that a
6-prime is considered to be twice as valuable as a 3-prime, for instance.
My sense of backgammon proportionality says that a 3-point blockade (not
necessarily consecutive!) is no better than a sieve, whereas a 6-prime
is simultaneously an irresistible force and an immovable object.

OK, so you decide to have 6 inputs, each boolean, indicating the
presence of a 1-block, 2-block, 3-block, ... 6-block. What's wrong now?

You quickly discover that every position always has at least 1
point with 2 men on it. In fact, every position has 2 such points within
6 pips of one another. So the 1-block and 2-block inputs do not help
your program to differentiate positions. You may as well remove them.

You might think that you are done, but you still haven't distinguished
between solid blockades and broken blockades. And you still have to
figure out whether any enemy men are trapped behind the blockade. And
are those men at the edge, ready to escape? And do you have the leading
or trailing edge of the blockade slotted, so you are ready to extend?

I still haven't described how Berliner computed blockading. If I recall,
he took the contents of the 14 points extending forward from each position
and computed the length of time it would take to move an enemy man
through those points. That's a different way of looking at it! And
maybe it's a better way.

I hope I have convinced you that converting a concept into an input is
a long process. Just about every term you add to your program will
require extensive debugging. My understanding is that strong programs
have over 400 input features. About 250 of these are raw board
description, so that leaves 150 features that have to be composed
and debugged manually. Your understanding of backgammon concepts
will be put to the test.

Let's change focus for a moment. I want to show you a subtle problem
with 4 of the concepts that Tesauro says were included in Neurogammon.
Actually, the problem is not subtle at all. The problem is that the
following concepts are redundant (meaning that they should not
be inputs to a backgammon neural network):

- pip count
- number of points occupied in the inner board
- number of points occupied in the opponent's inner board
- checkers back in opponent's inner board

You might believe that such crucial backgammon concepts could not
possibly be redundant, but they are for a subtle technical reason.
Each of these concepts is a linear combination of the raw network
encoding. That is, each can be computed by summing up multiples of
the raw count of checkers on each point. There is no need to have
such a feature because every neuron computes a linear function of
all inputs, and a linear function of a linear function is... another
linear function. Bottom line: since each neuron can synthesize these
concepts as needed, there is no reason to include them. And they
should be removed, since including them just wastes time and space.
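The linearity claim is easy to check concretely. In this sketch (the board format is an assumption for illustration), the pip count is literally a dot product with fixed coefficients, which is exactly what one neuron's weighted sum can compute on its own:

```python
# Pip count as a linear combination of the raw per-point counts.
def pip_count(counts):
    # counts[i] = our checkers on point i+1 (from our own viewpoint)
    return sum(n * (i + 1) for i, n in enumerate(counts))

# The same value as a dot product with constant coefficients --
# one slice of a neuron's linear sum over the raw inputs.
coeffs = [i + 1 for i in range(24)]

def pip_as_linear(counts):
    return sum(c * n for c, n in zip(coeffs, counts))
```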

So what am I to conclude from this? Did Tesauro (who is one of the
preeminent figures in the neurocomputing field, who serves on the
editorial panels of leading journals and conferences, who has devoted
at least a decade of research into neural nets and backgammon) really
include 4 completely redundant features in his neural network, then
publish that fact in the leading journal of artificial intelligence?

It is possible that he did. It is no big deal to make an error,
especially one of such small significance, when dealing with a
problem this hard. On the other hand, there is an alternative
theory: the measures he included in his program were not the simple
linear functions that you would conclude on the basis of reading
his paper, and the true identity of those functions is being hidden
because IBM regards it to be proprietary information. This latter
explanation is consistent with the tradition of game-program authors,
who will tell you _what_ was accomplished, but not _how_.

I hope that I have shown that even simple backgammon concepts can
be difficult to code. In fact, just deciding _what_ to compute is
hard, since backgammon concepts often do not map directly to easily
computable functions. Even after you decide what to compute you have
to watch out for redundancy arising from highly related concepts and
from linear combinations.

Brian

Stephen Turner

Feb 28, 1997

Brian Sheppard wrote:
>
> The following concepts are redundant (meaning that they should not
> be inputs to a backgammon neural network):
>
> - pip count
> - number of points occupied in the inner board
> - number of points occupied in the opponent's inner board
> - checkers back in opponent's inner board
>
> You might believe that such crucial backgammon concepts could not
> possibly be redundant, but they are for a subtle technical reason.
> Each of these concepts is a linear combination of the raw network
> encoding. [...] since each neuron can synthesize these
> concepts as needed, there is no reason to include them. And they should
> be removed since including them just wastes time and space.
>
> So what am I to conclude from this? Did Tesauro (who is one of the
> preeminent figures in the neurocomputing field, who serves on the
> editorial panels of leading journals and conferences, who has devoted
> at least a decade of research into neural nets and backgammon) really
> include 4 completely redundant features in his neural network, then
> publish that fact in the leading journal of artificial intelligence?
>
> It is possible that he did. It is no big deal to make an error,
> especially one of such small significance, when dealing with a
> problem this hard. On the other hand, there is an alternative
> theory: the measures he included in his program were not the simple
> linear functions that you would conclude on the basis of reading
> his paper, and the true identity of those functions is being hidden

I conclude neither of the above. I think that the pip count is the normal pip
count, but that Tesauro included it deliberately so as to give the network a
'helping hand' in finding the pertinent functions of the input. As you've
pointed out, the best backgammon neural nets rely on human judgement as to
what are the right inputs, beyond the simple board descriptors. Although the
network could (would, probably) learn to calculate the pip count, including it
as an input tells it directly that this feature is important, and lets the
network use its neurons to calculate more subtle things.

--
Dr Stephen Turner sr...@cam.ac.uk http://www.statslab.cam.ac.uk/~sret1/

Brian Sheppard

Feb 28, 1997

Stephen Turner <sr...@cam.ac.uk> wrote in article
<331705...@cam.ac.uk>...

> Brian Sheppard wrote:
> > It is possible that he did. It is no big deal to make an error,
> > especially one of such small significance, when dealing with a
> > problem this hard. On the other hand, there is an alternative
> > theory: the measures he included in his program were not the simple
> > linear functions that you would conclude on the basis of reading
> > his paper, and the true identity of those functions is being hidden
>
> I conclude neither of the above. I think that the pip count is the normal
> pip count, but that Tesauro included it deliberately so as to give the
> network a 'helping hand' in finding the pertinent functions of the input.
> As you've pointed out, the best backgammon neural nets rely on human
> judgement as to what are the right inputs, beyond the simple board
> descriptors. Although the network could (would, probably) learn to
> calculate the pip count, including it as an input tells it directly that
> this feature is important, and lets the network use its neurons to
> calculate more subtle things.

The issue raised in this post has a fairly technical answer, so if you are
just interested in TD-Gammon's feature set, you can skip to the last
paragraph.

Technical answer: my calculations indicate that gradient-descent training
(as used by Tesauro) learns the relevant functions just as quickly without
redundant features. You can verify that the backpropagation speed of
gradient-descent training is not affected by the addition of linear
combinations of input parameters. I conclude that such features are simply
irrelevant.
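The representational side of this claim is easy to check with a small numerical sketch (a toy single sigmoid unit with made-up weights, not code from any of the programs discussed): a unit given a redundant input x3 = 2*x1 + 3*x2 computes exactly the same output as one without it, once the weight on x3 is folded back into the weights on x1 and x2.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Unit A: two raw inputs only (weights are arbitrary illustrative values).
w1, w2, bias = 0.4, -0.7, 0.1

def unit_a(x1, x2):
    return sigmoid(w1 * x1 + w2 * x2 + bias)

# Unit B: adds a redundant input x3 = 2*x1 + 3*x2 with weight w3, and
# compensates by subtracting w3's contribution from the weights on x1, x2.
w3 = 0.25

def unit_b(x1, x2):
    x3 = 2.0 * x1 + 3.0 * x2
    return sigmoid((w1 - 2.0 * w3) * x1 + (w2 - 3.0 * w3) * x2
                   + w3 * x3 + bias)

# The two units agree on every input: the redundant linear feature adds
# no representational power to the unit.
for x1, x2 in [(0.0, 0.0), (1.0, 2.0), (-3.0, 0.5)]:
    assert abs(unit_a(x1, x2) - unit_b(x1, x2)) < 1e-12
```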

I agree with you that Tesauro might have thought that adding such features
was helpful. Or he might have evolved his program from previous
architectures in which they actually were useful.

On the subject of TD-Gammon's feature set, the bottom line is this:
IBM has publicly stated that TD-Gammon's neural network is a trade secret.
This is a legal term which includes the requirement that such information
has never been publicly divulged. Nothing you read in any description of
TD-Gammon (or Neurogammon) constitutes a description of the input features.

Brian


Stephen Turner
Mar 3, 1997

Brian Sheppard wrote:
>
> Technical answer: my calculations indicate that gradient-descent training
> (as used by Tesauro) learns the relevant functions just as quickly without
> redundant features. You can verify that the backpropagation speed of
> gradient-
> descent training is not affected by the addition of linear combinations of
> input parameters. I conclude that such features are simply irrelevant.
>
> I agree with you that Tesauro might have thought that adding such features
> was helpful.

That's an interesting observation, of which I was not aware. Do you have a
reference in the literature to any proof of such a result?

>
> On the subject of TD-Gammon's feature set, the bottom line is this:
> IBM has publicly stated that TD-Gammon's neural network is a trade secret.
> This is a legal term which includes the requirement that such information
> has never been publicly divulged. Nothing you read in any description of
> TD-Gammon (or Neurogammon) constitutes a description of the input features.
>

Agreed. Though Tesauro has certainly told us quite a lot of things about the
feature set.

--

Brian Sheppard
Mar 3, 1997

Stephen Turner <sr...@cam.ac.uk> wrote in article
<331AC5...@cam.ac.uk>...

> Brian Sheppard wrote:
> >
> > Technical answer: my calculations indicate that gradient-descent
> > training (as used by Tesauro) learns the relevant functions just as
> > quickly without redundant features. You can verify that the
> > backpropagation speed of gradient-descent training is not affected by
> > the addition of linear combinations of input parameters. I conclude
> > that such features are simply irrelevant.
> >
> > I agree with you that Tesauro might have thought that adding such
> > features was helpful.
>
> That's an interesting observation, of which I was not aware. Do you have
> a reference in the literature to any proof of such a result?

I have not seen a proof published, but it's not hard. Expand the gradient
of the error-function of a hidden node when one of the inputs is a linear
function of the other inputs. The resulting gradient is a linear function
of the gradient of the error with the linear input removed. It follows
that by changing the bias or the learning rate you can make the two
formulations exactly equivalent.
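The expansion described here can be verified numerically. A minimal sketch with a single sigmoid unit, squared error E = (y - t)^2 / 2, and a third input fixed as x3 = a*x1 + b*x2 (all constants below are arbitrary, not from any actual net): the gradient on the redundant weight comes out as exactly the same linear combination of the other two gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

a, b = 2.0, 3.0                  # the redundant input: x3 = a*x1 + b*x2
w = [0.4, -0.7, 0.25]            # weights for x1, x2, x3
bias, target = 0.1, 1.0
x1, x2 = 1.5, -0.5
x3 = a * x1 + b * x2

z = w[0] * x1 + w[1] * x2 + w[2] * x3 + bias
y = sigmoid(z)
delta = (y - target) * y * (1.0 - y)   # dE/dz for sigmoid + squared error

# Gradient of the error with respect to each weight: dE/dwi = delta * xi.
g1, g2, g3 = delta * x1, delta * x2, delta * x3

# The gradient on the redundant weight is the same linear combination of
# the other gradients, so it carries no new training signal.
assert abs(g3 - (a * g1 + b * g2)) < 1e-12
```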

Brian

Peter Fankhauser
Mar 8, 1997

Brian Sheppard wrote:
>
> There is no need
> to have such a feature because every neuron computes a linear function
> of all inputs, and a linear function of a linear function is...another
> linear function. Bottom line: since each neuron can synthesize these
> concepts as needed, there is no reason to include them. And they should
> be removed since including them just wastes time and space.
>

hmmm, afaik, Tesauro used the sigmoid function (f(x)=1/(1+e^-x), x being
the weighted sum of all inputs) to compute the activation of neurons.
This isn't very linear, is it? In fact, if one wouldn't use a nonlinear
function here (e.g. a smooth threshold like the sigmoid), a neural net
could not approximate nonlinear functions.
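The nonlinearity of the sigmoid is easy to see numerically (a toy check, not code from any of the nets discussed): a linear function f would satisfy f(a + b) = f(a) + f(b) - f(0), and the sigmoid clearly does not.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

a, b = 1.0, 2.0
lhs = sigmoid(a + b)                            # about 0.953
rhs = sigmoid(a) + sigmoid(b) - sigmoid(0.0)    # about 1.112

# A linear function would make these equal; the sigmoid misses by a wide
# margin. It is squashing the *linear* weighted sum through this
# nonlinearity that lets a multi-layer net approximate nonlinear
# functions of its inputs.
assert abs(lhs - rhs) > 0.1
```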

Tesauro mentioned that inputs such as pipcount, priming factor, etc.
improve the overall performance (i.e. a net trained with these inputs
beats a net trained without them). My own experiments verify this (and
if you ask any of the fibs-nn creators like Wittmann, Egger - the first
advice they give you is to add as many additional inputs as you can
think of).


funk

Brian Sheppard
Mar 10, 1997

Peter Fankhauser <fank...@darmstadt.gmd.de> wrote in article
<332198...@darmstadt.gmd.de>...

> Brian Sheppard wrote:
> > There is no need
> > to have such a feature because every neuron computes a linear function
> > of all inputs, and a linear function of a linear function is...another
> > linear function. Bottom line: since each neuron can synthesize these
> > concepts as needed, there is no reason to include them. And they should
> > be removed since including them just wastes time and space.
>
> afaik, Tesauro used the sigmoid function (f(x)=1/(1+e^-x), x being the
> weighted sum of all inputs) to compute the activation of neurons. This
> isn't very linear, is it? In fact, if one wouldn't use a nonlinear
> function here (e.g. a smooth threshold like the sigmoid), a neural net
> could not approximate nonlinear functions.

As you point out, the input to the sigmoid (x, that is) is a linear
function of the inputs to the net. And therefore: it is pointless to
include inputs that are linear combinations of other inputs.

> Tesauro mentioned that inputs such as pipcount, priming factor, etc.
> improve the overall performance (i.e. a net trained with these inputs
> beats a net trained without them).

Tesauro has stated that the *totality* of all advanced features improves
the play. He has not stated that pipcount changes the play.

My calculations show that pipcount cannot improve the play of a network
that already has the raw input features. Neither can it improve the
training time.

> My own experiments verify this (and if you ask any of the fibs-nn
> creators like Wittmann, Egger - the first advice they give you is to
> add as many additional inputs as you can think of).

I also advise you to add as many inputs as you can think of, with the
sole exception that it is pointless to add something that is already
there. A linear combination of other inputs is a case in point.
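To make the pip-count case concrete: under a raw encoding that records the number of checkers on each point, the pip count is a plain dot product of that vector with fixed distance weights, so any neuron's weighted sum can synthesize it directly. A sketch (the encoding is illustrative, not TD-Gammon's actual input scheme):

```python
# board[i] = number of my checkers i pips away from bearing off
# (index 0 = borne off; 25 = the bar). This is the standard backgammon
# starting position for one side.
board = [0] * 26
board[6] = 5    # five checkers on my 6-point
board[8] = 3    # three on my 8-point
board[13] = 5   # five on the midpoint
board[24] = 2   # two back checkers

# Pip count = sum over points of (checkers * distance): a fixed linear
# combination of the raw inputs, hence already computable by any single
# neuron from the raw board encoding.
pip_weights = list(range(26))
pip_count = sum(c * w for c, w in zip(board, pip_weights))
print(pip_count)  # 167, the familiar opening pip count
```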

Brian

