KataGo


Warren D Smith

Mar 16, 2021, 12:06:40 PM
to LCZero
Have you seen the open-source Go program "KataGo"?
https://github.com/lightvector/KataGo
It is Leela-like. However, it has various tricks that cause it to learn
much faster, certainly over 10x faster. Running on a single PC with a
single GPU is adequate for KataGo to reach superhuman level in 1 year
starting from zero knowledge. KataGo is currently the top-rated player
on CGOS (the Computer Go Server) for both 19x19 and 9x9 board sizes.
It is far superior to Leela Zero, the Go program. KataGo plays on
different board sizes with the same program and the same evaluator, it
works well for handicap Go, and it does not suffer from the usual
"slack play" syndrome, which is the Go equivalent of Leela Chess
"trolling" in won games.
The Leela Zero project now takes the view that (1) it is a finished
project and (2) it has been obsoleted, and hence (3) it recommends that
developer effort instead be devoted to the better programs KataGo and
SAI.

My point is that the same sort of tricks KataGo uses could be used for
Leela Chess.
If they sped up your learning 10x, well...

--
Warren D. Smith
http://RangeVoting.org <-- add your endorsement (by clicking
"endorse" as 1st step)

Warren D Smith

Mar 16, 2021, 12:20:22 PM
to LCZero
KataGo explanations:

David J. Wu:
Accelerating Self-Play Learning in Go
https://arxiv.org/abs/1902.10565

https://github.com/lightvector/KataGo/blob/master/docs/KataGoMethods.md

dka...@gmail.com

Mar 16, 2021, 1:01:38 PM
to LCZero
The MLH (moves left head) could be viewed as one of those KataGo "tricks." Unfortunately, Leela Chess doesn't have the kind of metrics and experiments needed to determine whether it helped.

Thomas Spark

Mar 19, 2021, 3:47:22 AM
to LCZero
Warren made a good point!
The KataGo project uses some remarkable and very efficient approaches.

A section from the paper (https://arxiv.org/abs/1902.10565) is worth considering:

"In 2017, DeepMind’s AlphaGoZero demonstrated that it was possible to achieve superhuman performance in Go without reliance on human strategic knowledge or preexisting data [18]. Subsequently, DeepMind’s AlphaZero achieved comparable results in Chess and Shogi. However, the amount of computation required was large, with DeepMind’s main reported run for Go using 5000 TPUs for several days, totaling about 41 TPU-years [17]. Similarly ELF OpenGo, a replication by Facebook, used 2000 V100 GPUs for about 13-14 days, or about 74 GPU-years, to reach top levels of performance [19].

In this paper, we introduce several new techniques to improve the efficiency of self-play learning, while also reviving some pre-AlphaZero ideas in computer Go and newly applying them to the AlphaZero process. Although our bot KataGo uses some domain-specific features and optimizations, it still starts from random play and makes no use of outside strategic knowledge or preexisting data. It surpasses the strength of ELF OpenGo after training on about 27 V100 GPUs for 19 days, a total of about 1.4 GPU-years, or about a factor of 50 reduction. And by a conservative comparison, KataGo is also at least an order of magnitude more efficient than the multi-year-long online distributed training project Leela Zero [14]. Our code is open-source, and superhuman trained models and data from our main run are available online."

What are the thoughts of the Lc0 creators?

Warren D Smith

Mar 19, 2021, 11:55:18 AM
to Thomas Spark, LCZero
And KataGo has actually improved since the quote Thomas Spark repeated,
so it now requires considerably LESS than 1.4 GPU-years. I.e., TS
understated the truth.

Here are some ideas for Leela Chess that would be somewhat
analogous -- call it the "Lc1 project" (?):

1. KataGo does NOT use the bare board position as input to its neural
net; it uses the board position enhanced with a bunch of very cheap
information that humans believe is relevant to Go play. So for Leela
Chess: add various cheap features, such as material counts, to the
board representation as additional net inputs. (This is "nonzero," but
still "zero" in the weaker sense that KataGo's learning starts from a
totally clueless neural net.)
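
To make this concrete, here is a minimal sketch (Python, using the
python-chess library; the feature choice, normalization, and plane
layout are my own illustration, not Lc0's actual input format) of what
such cheap extra inputs might look like:

    import chess
    import numpy as np

    PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                    chess.ROOK: 5, chess.QUEEN: 9}

    def cheap_features(board: chess.Board) -> np.ndarray:
        """Cheap scalar features broadcast as constant 8x8 planes,
        KataGo-style."""
        feats = []
        for color in (chess.WHITE, chess.BLACK):
            # Material count for each side (kings excluded), normalized.
            material = sum(v * len(board.pieces(pt, color))
                           for pt, v in PIECE_VALUES.items())
            feats.append(material / 39.0)
            # "The 2 bishops" flag: extremely cheap, plausibly useful.
            feats.append(float(len(board.pieces(chess.BISHOP, color)) >= 2))
        # One constant plane per scalar, to concatenate to the board rep.
        return np.stack([np.full((8, 8), f, dtype=np.float32) for f in feats])

    planes = cheap_features(chess.Board())   # shape (4, 8, 8)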

2. Learn to predict more than just the game outcome:
KataGo does not predict only the game outcome (1 bit); it predicts the
ownership at game end of every location (361 bits to learn from in each
training game). It seems obvious this must speed up learning
tremendously. Leela Chess went in this direction when it invented the
"moves left" predictor, but more could be done. E.g. you could predict
which pawns will promote. You could predict which pieces will still be
on the board at game end (or, say, just 20 ply ahead). You could
predict on which square checkmate will occur. You could predict the
TYPE of game end (checkmate, stalemate, perpetual check, 50-move rule,
threefold repetition; and for the first three, who did it to whom). All
of these would be predicting a lot more than 1 bit per game. If we
predict N bits, then the learning speed is presumably about N times
greater.
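
Here is a toy sketch of what such auxiliary heads might look like
(PyTorch; the head names, sizes, and the tiny trunk are invented for
illustration -- a real Lc0 net is far larger and shaped differently):

    import torch
    import torch.nn as nn

    class MultiHeadNet(nn.Module):
        """Tiny trunk plus a value head and several auxiliary heads."""
        def __init__(self, channels=64):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Conv2d(12, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
            flat = channels * 64
            self.value = nn.Linear(flat, 3)         # win / draw / loss
            self.survival = nn.Linear(flat, 64)     # P(occupant of sq survives)
            self.mate_square = nn.Linear(flat, 64)  # where checkmate occurs
            self.end_type = nn.Linear(flat, 5)      # mate/stalemate/perpetual/
                                                    # 50-move/repetition
        def forward(self, x):
            h = self.trunk(x).flatten(1)
            return (self.value(h), torch.sigmoid(self.survival(h)),
                    self.mate_square(h), self.end_type(h))

    # Training would simply sum one loss per head, e.g. cross-entropy on
    # value/mate_square/end_type and binary cross-entropy on survival.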

3. KataGo does not merely predict "probability of win." The game-end
value is not just a win/lose sign bit; it is something like
sign(GameOutcome) + 0.1*arctan(FinalScore). That was not the exact
formula Wu used, but in any case it regards a win by 73 points as worth
more than a win by 5, etc. This completely disagrees with AlphaGo,
which insisted on just 1 bit and insisted that people like Wu were dead
wrong. So, for chess, do not regard the final score as merely
{-1, 0, +1}; add something saying "winning with more material is worth
more than merely winning." E.g. if I mate your bare king with my 2
queens, that is worth more than if you have lots of material and I have
only a rook (but still mate you). The reason this is a good idea (if it
is) is that it helps prevent "slack play," a.k.a. "trolling." Another
version of this idea would be to regard mating you on move 20 as worth
more than mating you on move 197.
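
For concreteness, here is one possible shaping of the training target
(plain Python; every constant here is an arbitrary guess of mine, not a
tuned or recommended value):

    import math

    def shaped_outcome(outcome: int, material_diff: int, ply: int) -> float:
        """Shaped game-end value from White's point of view, in the
        spirit of sign(outcome) + 0.1*arctan(score).
        outcome: +1/0/-1; material_diff: pawns of material White ends
        up ahead by; ply: total game length in ply."""
        value = float(outcome)
        # Winning with more material left is worth a bit more ...
        value += 0.1 * math.atan(material_diff)
        # ... and a quick mate is worth a bit more than a slow one.
        if outcome != 0:
            value += outcome * 0.05 * math.exp(-ply / 100.0)
        return value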

4. KataGo trains a lot from highly randomized and handicap game starts.
The chess version would be to play a lot of Chess960 (Fischer random)
and Chess(960^2) (double-random, with the two back ranks randomized
independently) game starts, plus handicap starts where some pieces are
randomly removed, as well as the usual start position plus openings.
The point of this is more learning with less worry about overfitting,
and more understanding of how to play more kinds of positions.
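
A minimal sampler for such starts might look like this (Python with
python-chess; the probabilities are arbitrary, and for simplicity this
sketch does plain Chess960 rather than the double-random variant):

    import random
    import chess

    def random_gamestart(p960=0.5, p_handicap=0.25) -> chess.Board:
        """Sample a training start: Chess960, optional piece handicap,
        or the normal initial position."""
        board = (chess.Board.from_chess960_pos(random.randrange(960))
                 if random.random() < p960 else chess.Board())
        if random.random() < p_handicap:
            # Remove one random non-king, non-pawn piece from one side.
            color = random.choice([chess.WHITE, chess.BLACK])
            victims = [sq for sq, pc in board.piece_map().items()
                       if pc.color == color
                       and pc.piece_type not in (chess.KING, chess.PAWN)]
            board.remove_piece_at(random.choice(victims))
        return board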

5. KataGo trains a lot using altered-komi games.
The closest thing we have in chess to "komi" is time-odds games.
Add to the net input information about the remaining clock time for you
and your opponent, and play a lot of games using time odds. You can
also try, during training, artificially giving one side a time-odds
advantage for the rest of the game whenever the other side gets ahead.
The amount of time-odds advantage should not be too large (the ahead
side still needs to have the greater chances), but it should be enough
to even up the chances so that the training game still provides
information. I.e., if I have a winning position against you, with
prob(win)=0.99999, then the rest of the game would normally provide
zero information teaching the players how to play from then onward.
That is an idiotic waste of time. But if I have a 0.99999 position
against you and you have 5 times more time on your clock, that is a
different matter. You now have chances to trick me, and will learn how
to do that, while I will learn how to stop you doing it.
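
One crude way to implement that rebalancing during self-play (plain
Python; the threshold, scaling, and cap are all arbitrary guesses):

    def rebalance_clocks(win_prob, my_time, opp_time,
                         threshold=0.95, max_ratio=5.0):
        """If the side to move is winning overwhelmingly, grant the
        opponent extra clock time so the rest of the training game
        still carries information."""
        if win_prob > threshold:
            # Scale the odds with how lopsided the position is.
            ratio = min(max_ratio, 1.0 + 80.0 * (win_prob - threshold))
            opp_time = max(opp_time, ratio * my_time)
        return my_time, opp_time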

6. KataGo plays on many different board sizes, not just 19x19.
I doubt there is a worthwhile chess analogue, since 8x8 fits so well
with 64-bit computers, and so if you made the board size variable you
would lose a lot of juice.

=======================

Stockfish now uses 2 or 3 evaluators. The slowest and smartest is NNUE;
there are also faster, dumber evals. It uses the fast ones if one side
is ahead by enough.

Obvious analogue: give Leela several neural nets, large+smart and
small+fast, and use the fast one when it thinks one side is far enough
ahead. A related idea: nets that are only used in, e.g., the endgame.
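
The gating logic is simple enough to sketch (plain Python; the two nets
are placeholders for a big and a small evaluator, and the margin is an
arbitrary guess):

    class GatedEvaluator:
        """Route a position to a big or small net depending on how
        lopsided the cheap evaluation says the position is."""
        def __init__(self, big_net, small_net, margin=0.8):
            self.big, self.small, self.margin = big_net, small_net, margin

        def evaluate(self, position):
            v = self.small(position)      # cheap first pass
            if abs(v) >= self.margin:     # lopsided: fast net suffices
                return v
            return self.big(position)     # close game: pay for the big net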

=======================

Finally, the following is not in KataGo or anything else, but I
personally believe it would be a very, very valuable thing to have:
make the neural net learn to predict its own ERROR.
That is, suppose the neural net evaluates a position as X,
but 10 ply later the evaluation is Y. The error was E=|X-Y|.
Add a new output to the NN that predicts E. Or add several outputs
predicting a probability distribution of E, e.g. bits predicting that
0<E<1, 1<E<2, 2<E<4, 4<E<8, etc.
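
Generating the training target for that is trivial (plain Python; the
bucket edges just mirror the log-spaced ranges above):

    # Buckets: E<1, 1<=E<2, 2<=E<4, 4<=E<8, plus an overflow bucket.
    EDGES = [1.0, 2.0, 4.0, 8.0]

    def error_bucket(eval_now: float, eval_later: float) -> int:
        """Index of the bucket containing E = |X - Y|, the drift
        between the eval now and the eval 10 ply later."""
        e = abs(eval_now - eval_later)
        for i, edge in enumerate(EDGES):
            if e < edge:
                return i
        return len(EDGES)

    # The net would get a small softmax head over len(EDGES)+1 classes,
    # trained with cross-entropy against this index.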

Not only would this be very useful to know for many reasons; even if
this info were just thrown in the garbage, it would STILL be useful to
do this, because it would boost the learning rate (see above about
learning N bits rather than 1, which boosts the learning rate by a
factor of about N).

Charles Roberson

Mar 20, 2021, 1:22:17 PM
to Warren D Smith, Thomas Spark, LCZero
As I suggested years ago, everybody should read the research of Gerald Tesauro. It is from the 90s, using neural networks to play backgammon. There are several papers, some years apart, with various advancements, including "... any good knowledge that can be quickly precalculated should be part of the input layer ...".
 
  Charles Roberson


Thomas Spark

Mar 20, 2021, 2:02:59 PM
to LCZero
Ideas worth considering!

Let's enhance Lc0 with domain-specific additional information, as suggested by Warren and others.

Thank you, Warren, for pointing out the KataGo approach. I looked into the project (https://arxiv.org/abs/1902.10565, https://katagotraining.org/) and found that it incorporates some really efficient improvements that greatly accelerate self-play learning in Go, achieving a 50x reduction in computation over comparable methods.

Whereas AlphaZero required thousands of TPUs over several days and ELF required thousands of GPUs over two weeks, KataGo surpasses ELF's final model after only 19 days on fewer than 30 GPUs. Much of the speedup involves non-domain-specific improvements that might directly transfer to other problems. Further gains from domain-specific techniques reveal the remaining efficiency gap between the best methods and purely general methods such as AlphaZero.

I would also like to point out that "the KataGo approach is also at least an order of magnitude more efficient than the multi-year-long online distributed training project Lc0"!!

DBg

Mar 21, 2021, 9:11:07 AM
to LCZero
Domain-specific verified knowledge. There should be a way to allow
conflicting knowledge to be disambiguated by the resulting monster. The
whole point of the general approach was not to fall into biases, or at
least to fall into the minimal amount of them. Legal chess rules can't
be considered human biases.

I think that instead of wanting to instill "knowledge" into a maximally
empirical chess machine, one could learn from such a machine and go the
other way around: not inject but test human knowledge hypotheses, and
combine them, while still maintaining some way to distinguish the
relative contributions of those human knowledge elements when combined.

SF had a go at it, with its (still) predominantly big tree search with
some evaluations here and there. But they jumped into the sea.
Does Lc0 want to lose what it had going for it, namely least bias? Even
before SF12 (NNUE) there was the autotuning of the domain-knowledge
heuristic parameters on a pipeline of tests (that should be called
training, and training-scrutiny principles should be applied to it).

So I personally think that considering domain knowledge is great, and I
am curious about it. But care should be taken about the direction of
the flow of information. Interpretability: fun to say. But a purely
empirical measurement system is more interpretable than one that claims
to contain domain knowledge but, except for a few principal
"components," has no way to decide which does what, or whether there is
waste in the functional setup combining the elements.

Also, isn't Go a very symmetric game, with increasing territories,
while chess is laterally asymmetric (and full of domain knowledge, with
many mobile unit types), and with generally decreasing material? What
is the degree of belief in human Go knowledge? Can their knights go out
before bishops, or should they? I don't know about Go. Also, I care too
much about human chess knowledge to consider using it as input; it
should be output.

Warren D Smith

Mar 21, 2021, 1:37:12 PM
to DBg, LCZero
On 3/21/21, DBg <dariouc...@gmail.com> wrote:
> Domain-specific verified knowledge. There should be a way to allow
> conflicting knowledge to be disambiguated by the resulting monster.
> The whole point of the general approach was not to fall into biases,
> or at least to fall into the minimal amount of them. Legal chess
> rules can't be considered human biases.

--well, look, this "zero" thing is religion, not science. It was
interesting that one could really create superhuman players from zero.
But it also seems obvious that you are going to do better with nonzero.
And the scientific thing to do is to investigate what works better.

> Does Lc0 want to lose what it had going for it, namely least bias?

--bullshit. What Lc0 should want is (1) to be better and (2) to learn
faster, so that it gets better faster. Now certainly you could screw it
up by adding too much biased human crap. But that is a question that
must be resolved by experiment, not by religion.
If you look at what KataGo put in in terms of "human-inspired"
knowledge beyond the board position, it was very little, quite cheap,
and all stuff Wu was >99% sure ought to be beneficial. I certainly
would have put in more than Wu did, if it were me.

Returning to chess: obviously, material and some material-related
things like "the 2 bishops" are pretty useful to know about and
extremely cheap. If we tell it material, then it will gain a
considerable learning head start vs. Lc0 with zero.

> Also, I care too much about human chess knowledge to consider using
> it as input; it should be output.

--you wish. The biggest trouble with Lc0 (and even SF NNUE) is that
whatever it learned is encoded in a form totally incomprehensible to
humans. These projects have totally failed in that respect.

Dariouch Babaï

Mar 21, 2021, 1:51:12 PM
to Warren D Smith, LCZero
Agreed, but you may listen to yourself as well.
The zero-knowledge condition can be formulated mathematically as the
least informed prior used as the initial condition of the global
training process.

Not having a proper mathematical form for the whole process as Lc0
evolves may make that difficult to see.
And in your email you do seem aware of the notion of bias. So we agree
to watch for that.

My point was not to keep RL as it is, but to learn the maximum from it.
(I am not done with that, in my current knowledge and opinion, which
needs some updating; I am working on that.)

Actually my point, beyond that quarrel, is that we need a measurement
system to test where this boundary (of biased knowledge) is.

I even wonder whether using only net-vs-net performance measures might
not co-evolve into Red Queen (?) behavior across chess space, or settle
on a subset where all engines compete and never challenge each other
outside their best, now-common learning ground.

The uniform prior is a guarantee that the result is the maximally
empirical measure of the chess evaluation function, taking position as
input and outcome as output, as training and net size increase or
improve.

Now, that does not prevent this from being used side by side with less
empirical engines, but only under very constrained tournament
conditions (question the constraints, and think about where biases can
come from, as part of the wheel of science, and you might see less
religion behind the "zero" branding). The biases should be measured
before jumping in this direction, or while doing it.

Dariouch Babaï

Mar 21, 2021, 1:51:53 PM
to Warren D Smith, LCZero
Nobody tried.


On 3/21/2021 at 1:37 PM, Warren D Smith wrote:

Warren D Smith

Mar 21, 2021, 2:22:32 PM
to Dariouch Babaï, LCZero
Actually, now that I think about it, we *do* have some evidence.

Suppose we added mobility / attacked-square / pin info to the bare
board as NN input.
Would that help?

Well, I claim we KNOW it would help. Proof:
the DeepMind team had a follow-up paper where the NN was actually not
taught the rules and had to learn them itself, i.e. it learned to
predict which moves were legal and which illegal. This is in a sense
"even more zero than zero," and it worked, actually yielding more
strength. But why? Well, the answer is that it was learning MORE than
just the "game result"; it was also learning "which move is legal or
not," which is tremendously more learning going on. It is like learning
from 1000 bits of info each move rather than 1 bit per entire game,
which is over a 10000x info-flow speedup. And once it had learned move
legality, it automatically already knew about mobility, pins, checks,
and attacked squares. It then USED that (now built-in) knowledge,
effectively as extra NN inputs, and continued on from there, learning
now from game results.

So OBVIOUSLY, if that pin + check + mobility/attacked-square info were
provided from the start as NN inputs, we would have done better still.
Q.E.D.

Well... maybe that was not really a "proof," but it is pretty good
evidence, and a pretty good story at the very least.
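
For what it's worth, those inputs are cheap to compute. A minimal
sketch (Python with python-chess; the plane layout is my own
illustration):

    import chess
    import numpy as np

    def tactical_planes(board: chess.Board) -> np.ndarray:
        """Three 8x8 planes: squares attacked by the side to move,
        squares attacked by the opponent, and pinned pieces."""
        us, them = board.turn, not board.turn
        planes = np.zeros((3, 8, 8), dtype=np.float32)
        for sq in chess.SQUARES:
            r, f = chess.square_rank(sq), chess.square_file(sq)
            planes[0, r, f] = float(bool(board.attackers(us, sq)))
            planes[1, r, f] = float(bool(board.attackers(them, sq)))
            pc = board.piece_at(sq)
            if pc is not None and board.is_pinned(pc.color, sq):
                planes[2, r, f] = 1.0
        return planes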

esch...@gmail.com

Mar 21, 2021, 4:49:11 PM
to LCZero
Isn't that "proving" the opposite? If MuZero had less information (nothing about move legality) and performed better than AlphaZero, that implies that giving move legality to AlphaZero was actually harmful. So giving even more information about pins/checks/mobility might do additional "harm".

I can see it both ways, and I don't think we'll know until we try. Informed inputs might accelerate learning and allow more of the net to focus on more abstract concepts. Or they might drive the net very quickly to a good local minimum, without allowing it a chance to learn useful subtleties about the game.

Dariouch Babaï

Mar 21, 2021, 4:50:06 PM
to Warren D Smith, LCZero
Well, if it can learn from the problem of move legality to develop such
an emergent principle, why does it automatically follow that any
formulation of the non-formalized, non-quantified "information"
associated with the concepts you named provides a boost?

What if you chose your formulation wrong, and were actually making it
harder to learn, because the net would have to compensate for a priori
incomplete or wrong information? (Not that the concepts are not
important; but just putting arbitrary functions with free parameters to
tune alongside a neural network, and giving them arbitrary labels
reminiscent of chess concepts, may not be enough preparation to do what
you propose.)

I understand that it could help, but I am not sure that chess knowledge
has been studied from a quantitative point of view that can be injected
right away into NN training.

There are many ways to write mathematical functions involving the
notions you refer to, and I am not sure that a linear combination, or
some ad hoc functions with however many game-phase partitions (if
position could determine phase, when training with positions), would
give an improvement.

One would have to progress component by component, and establish a
method to measure whether each contribution is meaningful.
Next step: figure out how the components combine. Is putting in a bag
of individually helpful components going to sort itself out, with
respect to overlaps or conflicts (positions where they are all subtly
involved at the same time, maybe)? Could it not make their respective
tuned parameters as indistinguishable as individual NN units within a
randomly chosen layer of the same neural net, and also make training
more difficult?

So perhaps comparing the quantitative validation of Go knowledge and
chess knowledge might be a good discussion, if training and testing
performance are improved so much (both or either). But in any case,
there are some things to iron out first.

I would look into Maia for the kind of statistics that might be needed
(not the engines themselves, as those are not your objective, but how
they use chess databases and find characteristics).

I am actually ignorant of the state of knowledge about chess knowledge
at the quantitative level (it does not have to be precise; otherwise,
if it were already known, why train anything? But it should come from a
functional family flexible enough to dock with the empirical NN and the
other feature components).

Thanks for the information about the early experiments. I was curious
about that legality-training question. Why did they skip it in
subsequent experiments?

Worth trying, but with the full context spelled out and a testing
methodology to go with it.

dka...@gmail.com

Mar 21, 2021, 6:03:35 PM
to LCZero
Scorpio NN tried that a while ago. I don't think the additional inputs helped all that much.

On Sunday, March 21, 2021 at 1:22:32 PM UTC-5 warre...@gmail.com wrote:

Deep Blender

Apr 5, 2021, 8:12:10 AM
to LCZero
1. As far as I have seen in the paper, the additional input channels contain objectively correct information. Similar ideas for chess might be: the number of attackers of certain pieces or squares, the number of defenders, etc. There could also be a movement map for each piece, containing all of its legal moves.
On the other hand, piece values are not an objective measure that is always true.
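
A rough sketch of those two kinds of channels (Python with
python-chess; the shapes and encodings are my own illustration):

    import chess
    import numpy as np

    def movement_map(board: chess.Board) -> np.ndarray:
        """64x64 map: entry [from, to] = 1 iff that move is legal for
        the side to move (promotions collapse onto their squares)."""
        m = np.zeros((64, 64), dtype=np.float32)
        for move in board.legal_moves:
            m[move.from_square, move.to_square] = 1.0
        return m

    def attacker_count_plane(board: chess.Board, color: bool) -> np.ndarray:
        """8x8 plane counting how many pieces of `color` attack each
        square; run it for both colors to get attackers and defenders."""
        plane = np.zeros((8, 8), dtype=np.float32)
        for sq in chess.SQUARES:
            plane[chess.square_rank(sq), chess.square_file(sq)] = \
                len(board.attackers(color, sq))
        return plane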

2. Go is a lot more static than chess, because the stones don't move around like chess pieces do. Making those kinds of predictions in Go makes a lot of sense, and it is surely a useful feature to learn for improving the neural network. I don't see how this idea could be translated to chess end positions. What might be useful features to predict: how likely it is for a piece to be moved within the next X moves, how likely it is for a piece to be captured within Y moves, how likely it is for a square to have more/fewer attackers/defenders within the next Z moves, etc.
Having auxiliary targets to learn can definitely accelerate the learning process, but it is impossible to predict the impact, as it may even harm overall performance.

3. The objective of the game is to win. That is AlphaGo's target, and it is objectively correct. In Go, you need more territory than your opponent to win, and having a lot more territory than your opponent will never decrease your chances of winning. In chess, you can lose a game even though you have more material; there is no material-to-winning correlation like the territory-to-winning correlation in Go.
Valuing a win in fewer moves more highly could be useful, from my point of view, as it might help to uncover more tactics during training. Though it is tricky: when it harms the win chances, it is not worth much.



On Friday, March 19, 2021 at 4:55:18 PM UTC+1 warre...@gmail.com wrote:

DBg

Apr 5, 2021, 1:50:50 PM
to LCZero
Good thinking, in my opinion. But it points to trying and testing,
keeping some pure empirical measure from RL, as good as it can be, to
keep track of what improves. It does not have to be of a Go nature
(thanks for the synthetic formulation; I lack that while struggling
through the fog). But those high-level Go belief inputs should have
analogous auxiliary versions in chess to try first.

Maybe not enough low-level statistics have been done (or the past
questions were limited in objective); I don't know. I just saw a lot of
good work in the Maia paper using the Lichess online data (which may
need some fixing at certain rating levels), and their statistical work
prior to Maia seems to show the quality of the database at levels below
2000 (with a more complex model than Glicko needed above that, although
I don't think the paper requires it; only, maybe, assumptions prior to
training design, if their objective included high-level ratings of
pairing averages -- by the way, why keep looking only at pairing
averages? p(a,b,c,...,?)).

We could compensate for the less consensual value of chess knowledge
(or less tested; that might be just what is needed to find good
helpers) with some search for which parts of chess theory to try first,
and with the formulation that best statistically explains the gamut of
possibilities across various levels of play. Again, going tangential
but related: I don't know enough about Go, about the community and its
theoretical stance, attitude, and degree of certainty.

But certainly, or hopefully, human knowledge can be made compatible
with some sort of testing ground that Lc0 might provide (which itself
can be pondered by looking at endgames; see other threads). I will keep
re-reading your post for its components.