MuZero beats AlphaZero


Weber Yan

Nov 21, 2019, 10:25:32 PM
to LCZero
DeepMind published a paper about MuZero, a new approach to learning, which they evaluated on several board games and Atari video games: https://arxiv.org/pdf/1911.08265.pdf

From what I understand from a quick browse of the paper, the innovation compared to the AlphaZero-type approach is that MuZero doesn't "know" the rules in advance; it is therefore a more general learning algorithm, which can be used in more open-ended domains.

They compared it with AlphaZero on Go and MuZero came out ahead; this is an exact quotation:

"In Go, MuZero slightly exceeded the performance of AlphaZero, despite using less computation per node in the search tree (16 residual blocks per evaluation in MuZero compared to 20 blocks in AlphaZero)"

Very interesting news, I hope they will publish some game records too!

A new approach to learning without knowing any rules!

Warren D Smith

Nov 22, 2019, 10:15:01 AM
to LCZero
On 11/21/19, Weber Yan <zhub...@gmail.com> wrote:
> DeepMind published a paper about MuZero, a new approach to learning, which
> they evaluated on several board games and Atari video games:
> https://arxiv.org/pdf/1911.08265.pdf
>
> From what I understand from a quick browse of the paper, the innovative
> part compared to AlphaZero type of approach is that MuZero doesn't "know"
> the rules in advance, therefore is a more general learning algorithm, which
>
> can be used in more open-ended domains.
>
> They tested it against AlphaZero for go and MuZero won, this is an exact
> quotation:
>
> *"In Go, MuZero slightly exceeded the performance of AlphaZero, despite
> using less computation per node in the search tree (16 residual blocks per
> evaluation in MuZero compared to 20 blocks in AlphaZero)"

--I presume that the part of the neural net which learned, in Go, which moves are legal essentially caused the self-creation of a sub-network for detecting "simple ko," an important Go concept. The rules of Go are very simple. It then proceeded to learn Go in the normal AlphaZero fashion, so from then on it was essentially AlphaZero plus an added ko detector. It is then not surprising that it would outperform AlphaZero without a ko detector.

But more generally in Go, it is illegal to repeat prior positions. This can occur in more ways than "simple ko" (call these "complex ko"), but in practical games complex ko arises much less often than simple ko. I doubt the MuZero system was able to comprehend the full rules of Go, including all possible complex ko situations; i.e., it probably never fully learned the rules of the game. This presumably did not appreciably hurt its strength because (1) complex ko arises rarely and (2) they artificially prevented MuZero from playing illegal moves.
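
To make that distinction concrete, here is a toy sketch of my own (not from the paper, and not Leela Zero or Lc0 code): a "simple ko" check only needs the single previous position, while a full repetition (positional superko) check needs the whole game history.

```python
# Toy illustration of simple ko vs. full positional-superko checking.
# "position" stands for any hashable board encoding (e.g. a tuple of stones)
# computed after the candidate move and its captures have been applied.

def violates_simple_ko(candidate_position, previous_position) -> bool:
    # Simple ko: the move may not recreate the position of one move ago.
    return candidate_position == previous_position

def violates_superko(candidate_position, position_history: set) -> bool:
    # Complex ko / superko: the move may not recreate ANY earlier position.
    return candidate_position in position_history
```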

This "mu > alpha" phenomenon was not seen in any game other than go
(in particular not in chess or shogi). In chess, it presumably would
need to learn a pinned-piece detector and check-detector, which might
be of some use (if they are playing full-legal-move chess rather than
pseudo-legal -- which they did not tell us); learning to detect
promotion, en passant and castling would not seem useful because plain
alphazero already had info about that encoded in its board
representation.

I think the more useful lesson to learn from this is: if you manually added detectors of certain useful chess concepts and features to Leela's chessboard representation, and Leela's neural net then used that "enhanced chessboard" as its input rather than merely a "bare" board, then I would expect an increase in Leela's strength.



--
Warren D. Smith
http://RangeVoting.org <-- add your endorsement (by clicking
"endorse" as 1st step)

Warren D Smith

Nov 22, 2019, 10:32:07 AM
to LCZero
> I think the more-useful lesson to learn from this is:
> if you manually added to leela's chessboard representation, detectors
> of certain useful chess concepts and features, and then leela's neural
> net used that "enhanced chessboard" as its inputs, not merely a "bare"
> board, then I would expect an increase in leela's strength.

--So for example: suppose Leela's chessboard representation included not only "bit planes" for each piece type X (e.g. the bit plane for "white rooks" has 1s in the locations of the white rooks), but ALSO bit planes for "locations piece type X could reach in M moves" (for, e.g., M=0,1,2,3; right now it only has M=0), or bits saying "this piece is pinned."

Those might help. Leela probably is already self-learning something like that, but providing it directly would (a) be computed faster, speeding up game play, and (b) bypass the need to learn it, reducing the number of training games needed.
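
As a rough illustration of what such an "enhanced chessboard" input could look like, here is a small sketch using the python-chess library and numpy. The plane names, the M=1 "reachable in one legal move" definition (side to move only), and the function itself are my own illustration, not Lc0's actual input encoding.

```python
import chess
import numpy as np

def extra_planes(board: chess.Board) -> dict:
    """Build a few illustrative 8x8 input planes beyond the plain piece placement."""
    planes = {}

    # M=0 planes: current piece locations (roughly what the net already receives).
    for color in (chess.WHITE, chess.BLACK):
        for piece_type in chess.PIECE_TYPES:
            plane = np.zeros(64, dtype=np.float32)
            for sq in board.pieces(piece_type, color):
                plane[sq] = 1.0
            name = f"m0_{chess.COLOR_NAMES[color]}_{chess.piece_name(piece_type)}"
            planes[name] = plane.reshape(8, 8)

    # M=1 planes (side to move only): squares each piece type can legally reach in one move.
    for piece_type in chess.PIECE_TYPES:
        plane = np.zeros(64, dtype=np.float32)
        for move in board.legal_moves:
            piece = board.piece_at(move.from_square)
            if piece is not None and piece.piece_type == piece_type:
                plane[move.to_square] = 1.0
        planes[f"m1_{chess.piece_name(piece_type)}"] = plane.reshape(8, 8)

    # "This piece is pinned" plane: 1 on squares whose piece is absolutely pinned to its king.
    plane = np.zeros(64, dtype=np.float32)
    for color in (chess.WHITE, chess.BLACK):
        for sq in chess.SquareSet(board.occupied_co[color]):
            if board.is_pinned(color, sq):
                plane[sq] = 1.0
    planes["pinned"] = plane.reshape(8, 8)

    return planes
```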

Shawn S.

Nov 22, 2019, 5:51:46 PM
to LCZero
If illegal moves cause an adjudication resulting in an immediate loss, Lc0 will learn the rules quickly. But it seems like it would unnecessarily take up space in the network's weights.

asterix aster

Nov 22, 2019, 6:36:15 PM
to LCZero
I don't think illegal moves lead to an instant loss. They are simply masked at the root, so they get zero visits. Deeper in the tree they are not masked, so it has to learn that on its own, but that still does not lead to any illegal moves at any point during an actual game.
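
A minimal sketch of that root masking, assuming a flat policy vector and a 0/1 legality mask (my own illustration, not Lc0's actual code): illegal moves get zero prior, so their children never receive visits.

```python
import numpy as np

def mask_root_policy(policy_logits: np.ndarray, legal_mask: np.ndarray) -> np.ndarray:
    """policy_logits: raw network outputs over all encodable moves.
    legal_mask: 1.0 where the move is legal in the root position, 0.0 otherwise."""
    logits = np.where(legal_mask > 0, policy_logits, -np.inf)   # illegal moves -> -inf
    exp = np.exp(logits - logits[legal_mask > 0].max())         # numerically stable softmax
    priors = np.where(legal_mask > 0, exp, 0.0)
    return priors / priors.sum()                                # renormalize over legal moves only
```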

Warren D Smith

Nov 22, 2019, 7:53:32 PM
to LCZero
> --I presume that the part of the neural net which learned, in go,
> which moves are legal,
> essentially caused the self-creation of a sub-network for detecting
> "simple ko," an important go concept. The rules of go are very
> simple. It then proceeded to learn go in normal alphazero fashion
> from then on, so essentially from then on, it was alphazero plus an
> added ko-detector. So it then is not surprising it would outperform
> alphazero without a ko detector.

--Also, if they are playing Go rules in which suicide is illegal, it had to learn a suicide detector, which probably means it had to learn a "liberties counter" or something close to it. Again, these are simple and useful Go features to know about, and it makes sense that AlphaZero plus a ko detector plus a liberties counter would outperform plain AlphaZero.
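
For what a "liberties counter" computes, here is a toy flood-fill of my own (not from the paper): it collects the group containing a given stone and counts its adjacent empty points.

```python
def count_liberties(board, r, c):
    """board: 2D list of 'B', 'W', or '.'; (r, c): coordinates of a stone in the group."""
    color = board[r][c]
    assert color in ('B', 'W')
    size = len(board)
    seen, liberties, stack = set(), set(), [(r, c)]
    while stack:
        y, x = stack.pop()
        if (y, x) in seen:
            continue
        seen.add((y, x))
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < size and 0 <= nx < size:
                if board[ny][nx] == '.':
                    liberties.add((ny, nx))     # empty neighbor = liberty of the group
                elif board[ny][nx] == color:
                    stack.append((ny, nx))      # same-colored stone joins the group
    return len(liberties)
```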

And if I am reading their graph right, it outperformed AlphaZero by 200-300 Elo, which is pretty serious.

Cristian Garcia

Dec 3, 2019, 3:47:32 PM
to LCZero
The "learning to play without knowing the rules" is good PR but not accurate. A0's NN doesn't know the rules either, only the game engine used by MCTS does. On the other hand, Mu0 learns the "game engine" (the g function in the paper) and uses it to plan and predict, in other word, it has to learn the policy AND the rules.

Opposite to what is stated in the thread, having the model learn the rules might let Mu0 learn these complex and simple ko stuff much "better", since its in fact forced to learn about it (because they are rules).
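
A schematic sketch of the three learned functions described in the paper (representation h, dynamics g, prediction f); the class, layer sizes and tensor shapes here are my own illustration, not a released implementation. The key point is that recurrent_inference rolls the learned model forward where AlphaZero would consult the real rules.

```python
import torch
import torch.nn as nn

class MuZeroNets(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        # h: observation (history) -> initial hidden state s_0
        self.representation = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        # g: (hidden state, action) -> next hidden state; the learned "game engine"
        self.dynamics = nn.Sequential(nn.Linear(hidden_dim + action_dim, hidden_dim), nn.ReLU())
        self.reward_head = nn.Linear(hidden_dim, 1)
        # f: hidden state -> (policy logits, value)
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.value_head = nn.Linear(hidden_dim, 1)

    def initial_inference(self, observation: torch.Tensor):
        s = self.representation(observation)
        return self.policy_head(s), self.value_head(s), s

    def recurrent_inference(self, state: torch.Tensor, action_onehot: torch.Tensor):
        # AlphaZero's MCTS would apply the real rules here; MuZero applies g instead.
        s_next = self.dynamics(torch.cat([state, action_onehot], dim=-1))
        return self.reward_head(s_next), self.policy_head(s_next), self.value_head(s_next), s_next
```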

DBg

Dec 4, 2019, 3:27:52 AM
to LCZero
If illegal moves cause an adjudication resulting in an immediate loss, Lc0 will learn the rules quickly. But it seems like it would unnecessarily take up space in the network's weights.
 
How can you measure the "unnecessarily" in your statement? How do you judge, measure, or rate the necessity of the extra space an architecture modification takes, or its subsequent computation cost?

Zeta

Dec 4, 2019, 3:47:23 AM
to LCZero
This paragraph is interesting: "in chess we increased the history to the last 100 board states to allow correct prediction of draws."

Does Lc0 do something like this?
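
As a sketch of what the quoted sentence describes (my own illustration, not Lc0's actual input format), stacking the last N positions as input planes so the network can see repetitions building toward a draw might look like this:

```python
import numpy as np
from collections import deque

def encode_history(history: deque, planes_per_position: int, n_history: int = 100) -> np.ndarray:
    """history: deque of (planes_per_position, 8, 8) arrays, most recent last.
    n_history=100 follows the paper's choice for chess; older slots stay zero."""
    stacked = np.zeros((n_history * planes_per_position, 8, 8), dtype=np.float32)
    for i, planes in enumerate(reversed(history)):
        if i >= n_history:
            break
        stacked[i * planes_per_position:(i + 1) * planes_per_position] = planes
    return stacked
```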

Shawn S.

Dec 4, 2019, 5:52:53 AM
to LCZero
Well, I mean it will simply depend on whether it plays chess better I suppose.  I just don't understand how.

Warren D Smith

Dec 4, 2019, 10:34:23 AM
to Shawn S., LCZero
On 12/4/19, Shawn S. <zad...@gmail.com> wrote:
> Well, I mean it will simply depend on whether it plays chess better I
> suppose. I just don't understand how.
>
> On Wednesday, December 4, 2019 at 2:27:52 AM UTC-6, DBg wrote:
>>
>> If illegal moves cause an adjudication resulting in an immediate loss, Lc0
>> will learn the rules quickly. But it seems like it would unnecessarily take
>> up space in the network's weights.

--It doesn't just "take up space." The point is that what the network learns while learning to understand the rules can actually be USEFUL knowledge for the purpose of figuring out which moves are good, and for evaluating positions.

If so, then MuZero will actually play stronger than AlphaZero, because it effectively is AlphaZero that has been pre-taught useful stuff -- which (without that pre-teaching) it would never have learned, or would only have learned as a crude, clumsy approximation in some inefficient way that would have wasted more neurons.

However, this mu > alpha phenomenon was not observed in chess; it was only observed in Go. In chess, mu and alpha seemed of equal strength, so the mu "pre-teaching" ultimately was neither helpful nor harmful.

ER

Dec 4, 2019, 1:22:58 PM
to LCZero
The difference in mu0 is how the board is represented. In alpha0 the board is represented exactly as it is: each piece and its location, for example "there is a rook on a1". In mu0, the network learns attributes about the board (which might have no human-equivalent meaning) and represents the board only by those attributes (e.g. "the opponent's king safety is 1.23", though even that example is human-centric). The moves are chess moves, but the board and the boards in the search space are represented only by NN attributes.

DBg

Dec 4, 2019, 2:11:43 PM
to LCZero
Well, I mean it will simply depend on whether it plays chess better I suppose.

Exactly. One could consider toy chess variants as a way to make comparisons. These would cut down on the resource load and allow one to ask the kind of question you have, which is a good question.

  I just don't understand how.

I have read that 3x3, 4x4, and maybe even 5x5 board variants have been solved. Since lc0 does not have Google's megaflops or teraflops to throw out the window, or at the problem (or at harder-than-necessary training), these would be nice test beds.

The whole lc0 experimental design would not have to be applied for these tests; they could be a side project, TBD.

Also, other kinds of simplification could be used, such as filtering the input variables to leave only pawns and kings and making the engines play that game of chess (more draws, probably), but it would still be an interesting experiment.
This could also be done simply by initializing the weights corresponding to the missing pieces.

I have some trouble myself with how the sequence should be encoded: whether as context (which would kick in as the game advances during self-play), where its effect could influence the policy training and end up in its storage without being needed during play, or whether it should be an integral part of the chess state-space definition. But I don't see this as an argument not to try.

Other experiments could be to distill the learned policy data (however stored) into sequence-oriented machine architectures, or better yet, neural nets hybridized with HMMs (hidden Markov models), where the hidden nodes (not chess move-tree nodes) could learn the insertions and deletions of quiet moves, and the rest (the NN?) learned the sequence patterns or motifs modulo those quiet-move interludes. That could perhaps be obtained from learned policies. I have not thought through all these options, but these could be ways to go about considering sequence-outcome dependence. There are various entry points in the existing training scheme that can be modified to include such concerns.

If you meant how including such information anywhere in the training could improve the results, that is a good discussion to develop, if not already approached in the rest of this thread, which I intend to finish reading.

ER

Dec 4, 2019, 10:22:43 PM
to LCZero

This is a great article that articulates the difference between α0 and μ0 more clearly; as a bonus, it uses chess as an example:

https://medium.com/applied-data-science/how-to-build-your-own-muzero-in-python-f77d5718061a

From the article:
“This may indicate that it [MuZero] is finding more efficient ways to represent a position through its hidden representation than AlphaZero can find when using the actual board positions.”

Dariouch Babaï

Dec 5, 2019, 6:12:35 PM
to ER, LCZero
Thank you. Hidden representation: that's what hidden layers do, but that's also what makes some people skeptical, because of the black-box aspect.
I have not read it yet; this is just my quick reflex on hearing "hidden representation". Also, I would like to point out that it was Go, not chess, that showed improvement.
I still need to read the paper as well, but I have read secondhand that their chess experiment looked more like a control problem than an objective.

Meaning, they may not have left all the hidden layers or variables loose in all possible dimensions (like space)... I will read carefully and come back here to bash myself for having spoken too soon.

DBg

Dec 6, 2019, 2:15:58 PM
to LCZero
Just adding to my previous post, as promised, but sorry, no self-bashing today.
 
I went and made a first reading of the Mu paper, reading only the text for any claim of improvement of Mu over AlphaZero when applied to chess.
So I did not dream what I read about Google downplaying the use of chess from a target problem to a control problem, some kind of baseline.
They were happy it was in the same ballpark, I would guess, but nowhere is there any claim of improvement.

Perhaps, by having statements about Mu improving on Go next to a non-committal statement about AlphaZero (read: chess), an enthusiastic reader would drift toward thinking they also improved on the chess front. I don't think it was intentional; negative results are never published anyway, so there was no need to shoot oneself in the foot for the sake of clarity. Chess was for calibration.

Perhaps I missed some spots. But I am glad that the previous post gave the link to the Medium DIY article. That will allow me to better understand the algorithm, I hope, and how suited it might be to an immobile chessboard. The video games could not use the well-described move mechanics of the board games, because the "player" is making moves in an independent, dynamic environment, hence the need for more temporal complexity. I think this was well explained, a bit buried in mathematical or technical language, but I am fairly sure I understood at least that bit.

Perhaps, if you were playing rodeo-chess, then having a dynamic map of the environment would be really good. If my emphasis on an independently moving environment is warranted (a valid hypothesis, not thought-spam), then that leaves an enigma as to why Go was handled better and more economically in computer time. This would point to some non-optimal aspect in the policy part of the Alpha algorithms. I think it is at least worth considering as a possibility; waiting for counter-arguments or corrections of fact. Note that this non-optimal possibility would not have mattered when flops were not a limiting resource.



ER

Dec 6, 2019, 5:03:12 PM
to LCZero
Yes, you are right, and I should have clarified that the quote about the representation being more efficient was for Go, where there was an improvement. But I think that even though it has not improved over the state of the art in chess, it is so incredible that it matches the state of the art; I can't fully wrap my mind around the fact that it actually works. It is a breakthrough in AI, and it attests to how much deep learning has achieved and how far it can go. There are many open problems in AI that have no immediate payoff and no full representation of states (like a robot placing an item in a box without breaking it); I will not elaborate on this because it's far from the subject of chess. I think the paper is definitely worth reading.

DBg

Dec 8, 2019, 11:47:00 AM
to LCZero
Yes, worth reading, worth criticizing, and worth getting sad over the compromises scientists have to make, with conflicting pulls stretching them in opposite directions.
At least read it, and think about the different tools that have been proposed and the differences in context. The information is there, though not always in a well-defined contrasting table; for chess, we might need to make one: model building with carefully defined assumptions. I am actually glad that this paper opened the window of representation on an Lc0 forum.

Paul Berger

Dec 8, 2019, 3:40:16 PM
to LCZero
AFAIK, there is still no (re)implementation of this algorithm published on GitHub yet.
Also, DeepMind has the "OpenSpiel" framework to support (deep) reinforcement learning experiments (see https://github.com/deepmind/open_spiel), but I didn't find anything relevant to MuZero...