[Computer-go] CGOS source on github


Hiroshi Yamashita

Jan 18, 2021, 3:28:29 AM
to computer-go
Hi,

I have published current CGOS source on github.

https://github.com/yssaya/CGOS

There are some changes, such as:

1. Recent 300 games on the cross-table page.
2. WGo viewer.
3. 7.0 komi and draw handling for rating calculation.
4. Shorter pgn file for BayesElo (cgosBayes).
5. Number-only account names are forbidden.
6. Bug fix for sending info to all clients: 'catch {puts $soc "info $msg"}'
7. badusers.txt for accounts that do not remove dead stones or time out too often.

Thanks,
Hiroshi Yamashita
_______________________________________________
Computer-go mailing list
Compu...@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Rémi Coulom

Jan 18, 2021, 9:41:56 AM
to computer-go
Hi,

Thanks to you for taking care of CGOS.

I have just connected CrazyStone-57-TiV. It is not identical to the old CrazyStone-18.04, but it should be similar. CrazyStone-18.04 was the last version of my program that used TensorFlow; CrazyStone-57 is the first version that did not, and it runs with my current code. So it should be stronger than CrazyStone-18.04, and yet I expect it will get a much lower rating.
A possible explanation for the rating drift is that most of the old MC programs have disappeared. They won easily against GNU Go, and were easily beaten by the CNN programs. The Elo statistical model is wrong when different kinds of programs play against each other. When the CNN programs had to earn a rating by playing directly against GNU Go, they did not manage to climb as high as when the MC programs sat between them and GNU Go. I'll try to investigate this hypothesis further with the data.
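To make the hypothesis concrete, here is a toy simulation (a sketch with invented win probabilities, not CGOS data) in which each tier crushes the one below, but the top tier beats the bottom one by less than the Elo model would predict from chaining the two gaps. Fitting Elo ratings with and without the middle tier moves the top program by several hundred points:

```python
import random

random.seed(0)

# Invented, deliberately Elo-inconsistent win probabilities:
# CNN and MC each win 95% against the tier below, but CNN beats
# GNU Go "only" 97% of the time -- far less than the Elo model
# would imply by chaining the two 95% results together.
P = {("cnn", "mc"): 0.95, ("mc", "gnu"): 0.95, ("cnn", "gnu"): 0.97}

def a_wins(a, b):
    p = P.get((a, b))
    return random.random() < p if p is not None else random.random() >= P[(b, a)]

def fit_elo(pairs, n_games=200_000, k=1.0):
    """Fit ratings by running incremental Elo updates over random games."""
    r = {"gnu": 0.0, "mc": 0.0, "cnn": 0.0}
    for _ in range(n_games):
        a, b = random.choice(pairs)
        expected = 1.0 / (1.0 + 10 ** ((r[b] - r[a]) / 400.0))
        delta = k * ((1.0 if a_wins(a, b) else 0.0) - expected)
        r[a] += delta
        r[b] -= delta
    return {name: round(v - r["gnu"]) for name, v in r.items()}  # anchor gnu at 0

# With the MC tier present, the CNN program is rated via two easy "steps":
print(fit_elo([("cnn", "mc"), ("mc", "gnu"), ("cnn", "gnu")]))
# Rated only by direct games against GNU Go, it lands hundreds of Elo
# lower, even though its true strength is unchanged:
print(fit_elo([("cnn", "gnu")]))
```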

Rémi

uurtamo

Jan 18, 2021, 10:35:19 AM
to computer-go
It's a relative ranking versus who you actually get to play against.

Sparsity of actual skill will lead to that kind of clumping.

The only way a rating can meaningfully climb by playing GNU Go or your direct peers is exponentially slowly -- you'd need to lose to GNU Go half as often (or win all the time over twice as many games) to gain more points. So although the rating would eventually increase, it would flatten out pretty quickly.

Good point about the MC programs. A more dramatic approach would be to remove GNU Go altogether.


Hiroshi Yamashita

Jan 18, 2021, 3:22:09 PM
to computer-go
Hi,

> The Elo statistical model is wrong when different kinds of programs play against

I have had a similar experience.
I once calculated ratings for Japanese shogi women professionals.
The strongest woman, Ichiyo Shimizu, has a rating of 1578 Elo.
Her winrate against men pros is 18% (163 games), and against women pros it is 65% (523 games).
Her rating computed without the games against women pros is 1286 Elo.
That is a 292 Elo difference (1578 - 1286).
It is because women pros usually play against other women pros; games between women pros and men pros are rare.
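As a back-of-the-envelope illustration (a sketch using the standard Elo logistic formula, not the actual rating fit), those two winrates anchor her rating to two very different pools:

```python
import math

def elo_gap(winrate):
    """Rating difference implied by a winrate under the Elo model:
    p = 1 / (1 + 10**(-d/400))  =>  d = -400 * log10(1/p - 1)."""
    return -400.0 * math.log10(1.0 / winrate - 1.0)

# Shimizu's winrates from the post (treated as if against a single
# average opponent in each pool, which is a simplification):
print(round(elo_gap(0.18)))  # ~ -263: about 263 Elo below the men she played
print(round(elo_gap(0.65)))  # ~ +108: about 108 Elo above the women she played
```

Whichever pool dominates the game sample largely decides which anchor the fitted rating follows, which is how the 292 Elo discrepancy arises.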

I think a similar thing happens on CGOS. There have been three eras: Zen, Leela Zero, and KataGo.
There are only a few Zen vs Leela Zero games.
CrazyStone-18.04's rating may depend on Zen-15.7-3c1g.
http://www.yss-aya.com/cgos/19x19/cross/CrazyStone-18.04.html

Zen's absence may be one reason for this drift.

Thanks,
Hiroshi Yamashita

Rémi Coulom

Jan 21, 2021, 8:20:35 AM
to computer-go
I checked, and CrazyStone-57-TiV is using the same neural network and hardware as CrazyStone-18.04. Batch size, cuDNN version, and time management heuristics may have changed, but I expect that strength should be almost identical. CrazyStone-57-TiV may be a little stronger.

So it seems that the rating drift over 3 years is about 450 Elo points, and the "All Time Ranks" are a bit meaningless.

Can you produce a list where CrazyStone-57-TiV is renamed to CrazyStone-18.04? It may be enough to fix the drift.

I need the machine for something else, so I disconnected the GPU version. CrazyStone-81-15po is running 15 playouts per move on the CPU of a small machine, and will stay.

Rémi

Hiroshi Yamashita

Jan 21, 2021, 11:20:58 AM
to computer-go
Hi,

This is the original BayesElo list; I update it manually. This is the latest:
http://www.yss-aya.com/cgos/19x19/bayes.html
CrazyStone-18.04 4065
CrazyStone-81b-TiV 4032
Zen-15.7-3c1g 3999
CrazyStone-57-TiV 3618

This list renames CrazyStone-57-TiV to CrazyStone-18.04:
http://www.yss-aya.com/cgos/19x19/bayes_20210121_rename_CrazyStone-57-TiV_to_CrazyStone-18.04.html
CrazyStone-81b-TiV 4051
Zen-15.7-3c1g 3968
CrazyStone-18.04 3778

Would you like me to continue renaming CrazyStone-57-TiV?
The drift looks fixed for CrazyStone, but only a little for the other programs.

Rémi Coulom

Jan 21, 2021, 12:01:11 PM
to computer-go
Thanks for computing the new rating list.

I feel it did not fix anything. The old Zen, cronus, etc. have almost no change at all.

So it is not a good fix, in my opinion. There is no need to change anything in the official ratings.

The fundamental problem seems to be that the Elo rating model is too wrong for this data, and there is no easy fix for that.

Long ago, I had thought about using a more complex multi-dimensional Elo model. The CGOS data may be a good opportunity to try it. I will try when I have some free time.

Rémi

David Wu

Jan 21, 2021, 7:45:18 PM
to compu...@computer-go.org
One tricky thing is that there are some major nonlinearities between different bots early in the opening that break Elo model assumptions quite blatantly at these higher levels. 

The most noticeable case of this is Mi Yuting's flying dagger joseki. I've noticed, for example, that in particular matchups between pairs of bots (e.g. one particular KataGo net as White versus ELF as Black, or one version of LZ as Black versus some other version as White), as many as 30% of games may enter this joseki, and the bots' preferences can happen by chance to line up such that they consistently play down a path where one side hits a blind spot and begins the game with an early disadvantage. Each bot has different preferences, so each possible pairing more or less arbitrarily either runs into such a trap or not.

And having significant early-game temperature in the bot itself doesn't always help as much as you would think, because this particular joseki is so sharp that a bot can easily have a strong enough preference for one path or another (even when that preference is ultimately wrong) to override any reasonable temperature. Sometimes adding temperature or extra randomness only mildly changes the frequency of the sequence, or merely varies how long it takes before the joseki and the trap/blunder happen anyway.

If games are to begin from the empty board, I'm not sure there's an easy way around this except having a very large variety of opponents.

One thing that I'm pretty sure would mostly "fix" the problem (in the sense of producing a smoother metric of general strength across a variety of positions, not heavily affected by just a few key lines) would be to semi-arbitrarily take a very large sample of positions from a wide range of human professional games, from, say, move 20, and have bots play starting from these sampled positions, in pairs, once with each color. This would still include many AI openings, because of the way human pros in the last 3-4 years have quickly integrated and experimented with them, but it would also introduce much more variety in general than occurs in any head-to-head matchup.
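For what it's worth, here is a rough sketch of how a server might sample such forced openings (assuming the sgfmill library for SGF parsing; the directory layout and the idea of replaying the moves as GTP `play` commands to both engines are my own assumptions):

```python
import glob
import random

from sgfmill import sgf  # pip install sgfmill

GTP_COLS = "ABCDEFGHJKLMNOPQRST"  # GTP coordinates skip the letter I

def vertex(row, col):
    """Convert sgfmill's (row, col), with row 0 at the bottom, to a GTP vertex."""
    return f"{GTP_COLS[col]}{row + 1}"

def sample_opening(sgf_dir, n_moves=20):
    """Pick a random pro game and return its first n_moves as GTP 'play'
    commands, to be sent to both engines before the game starts."""
    path = random.choice(glob.glob(f"{sgf_dir}/*.sgf"))
    with open(path, "rb") as f:
        game = sgf.Sgf_game.from_bytes(f.read())
    commands = []
    for node in game.get_main_sequence()[1 : n_moves + 1]:  # skip root node
        colour, move = node.get_move()
        if colour is None or move is None:  # setup node or pass
            continue
        commands.append(f"play {colour.upper()} {vertex(*move)}")
    return commands

# Each sampled opening would then be played twice, once with each engine
# taking each color, as suggested above.
print(sample_opening("./pro_games", n_moves=20))
```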

This is almost surely a smaller problem than simply having enough games mixing between different long-running bots to anchor the Elo system. And it is not the only way major nontransitivities can show up (e.g. ladders). But to take a leaf from computer chess, playing from sampled forced openings seems to be common practice there, and maybe it's worth considering in computer Go as well, even if it only fixes what is currently the smaller of the issues.
 

Hiroshi Yamashita

Jan 22, 2021, 3:45:58 AM
to computer-go
Hi,

> The most noticeable case of this is with Mi Yuting's flying dagger joseki.

I'm not familiar with this joseki.
I found an explanation by Hirofumi Ohashi, 6d pro, from half a year ago on the HCCL mailing list.
The following is a quote.
-------------------------------------------------------------
https://gokifu.net/t2.php?s=3591591539793593
It seems that it is called the flying dagger joseki in China.
This shape, direct 3-3 then the lower tsuke (Black's 9th move, B6), has
been researched jointly by humans and AI, but is still inconclusive.
After the kiri (Black's 15th move, E4), the mainstream move is White A,
but depending on the version of KataGo, White B may be recommended. By
the way, the KataGo I'm using now is 1.3.5, which is from just a short
time ago.

This kind of joseki is not good for zero-type programs. Ladders and
capturing races are intricately combined. In AlphaGo's published
self-play games (both the AlphaGo Zero and Master versions), this
joseki is rare.
-------------------------------------------------------------

I found this joseki in kata1_b40s575v100 (black) vs LZ_286_e6e2_p400 (white).
http://www.yss-aya.com/cgos/viewer.cgi?19x19/SGF/2021/01/22/733340.sgf

Mi Yuting's Wikipedia page mentions this joseki.
https://zh.wikipedia.org/wiki/%E8%8A%88%E6%98%B1%E5%BB%B7
KataGo has a special option for it.
https://github.com/lightvector/KataGo/blob/4a79cde56e81209ce4e2fd231b0f2cbee3a8354b/cpp/neuralnet/nneval.cpp#L499

> a very large sampling of positions from a wide range
> of human professional games, from say, move 20, and have bots play starting
> from these sampled positions, in pairs once with each color.

This sounds interesting.
I will think about another CGOS instance that handles this.

Thanks,
Hiroshi Yamashita

Rémi Coulom

Jan 22, 2021, 8:08:02 AM
to computer-go
Hi David,

You are right that non-determinism and bot blind spots are a source of problems with Elo ratings. I add randomness to the openings, but it is still difficult to avoid repeating some patterns. I have just noticed that the two wins of CrazyStone-81-15po against LZ_286_e6e2_p400 were caused by very similar ladders in the opening.
Such a huge blind spot in such a strong engine is likely to cause rating compression.

Rémi

Claude Brisson via Computer-go

Jan 22, 2021, 8:31:29 AM
to compu...@computer-go.org, Claude Brisson

Hi. Maybe it's a newbie question, but since ladders are part of the well-defined topology of the goban (as is the number of current liberties of each chain of stones), can't feeding those values to the networks (from the very start of the self-teaching process) help with large shichos and sekis?

Regards,

  Claude

David Wu

Jan 22, 2021, 9:30:00 AM
to compu...@computer-go.org
I agree, ladders are definitely the other most noticeable way that Elo model assumptions get broken. Pure-zero bots have a hard time with them, and they can easily cause difference(A,B) + difference(B,C) to be very inconsistent with difference(A,C). If some of A, B, C always handle ladders very well and some are blind to them, then you are right that probably no amount of opening randomization can smooth it out.

David Wu

unread,
Jan 22, 2021, 9:50:29 AM1/22/21
to compu...@computer-go.org
On Fri, Jan 22, 2021 at 3:45 AM Hiroshi Yamashita <y...@bd.mbn.or.jp> wrote:
This kind of joseki is not good for zero-type programs. Ladders and
  capturing races are intricately combined. In AlphaGo's published
  self-play games (both the AlphaGo Zero and Master versions), this
  joseki is rare.
-------------------------------------------------------------

I found this joseki in kata1_b40s575v100 (black) vs LZ_286_e6e2_p400 (white).
http://www.yss-aya.com/cgos/viewer.cgi?19x19/SGF/2021/01/22/733340.sgf

Hi Hiroshi - yep. This is indeed a joseki that was partly popularized by AI and jointly explored with humans. It is probably fair to say that it is by far the most complicated common joseki known right now, and more complicated than either the avalanche or the taisha.

Some zero-trained bots will find and enter this joseki; some won't. The ones that don't play it in self-play have a significant chance of being vulnerable to it if an opponent plays it against them, because there are a large number of traps and blind spots that cannot be solved if the net doesn't have experience with the position. And even some experience is not always enough. For example, ELF and Leela Zero have learned some lines, but are far from perfect. There is a good chance that AlphaGo Zero or Master would have been vulnerable to it as well. KataGo at the time of 1.3.5 was vulnerable too: the joseki only rarely came up in self-play, and therefore was never learned and correctly evaluated, so from the 3-3 invader's side the joseki could be forced, and KataGo would likely mess it up and be losing the game right at the start. (The most recent KataGo nets are much less vulnerable now, though.)

The example you found is one where this has happened to Leela Zero. In the game you linked, move 34 is a big mistake: Leela Zero underweights the possibility of move 35, and is then blind to the seemingly-bad-shape move 37, and as a result is now in a bad position. The current Leela Zero nets consistently make this mistake, *and* consistently prefer playing down this line, so against an opponent happy to play it with them, Leela Zero will lose many games right in the opening, all the same way.
 
Anyway, the reason this joseki is responsible for more such distortions than other joseki seems to be that it is so sharp and, unlike most other common joseki, contains at least 5-6 enormous blind spots in different variations that zero-trained nets variously have trouble learning on their own.

> a very large sampling of positions from a wide range
> of human professional games, from say, move 20, and have bots play starting
> from these sampled positions, in pairs once with each color.

This sounds interesting.
I will think about another CGOS instance that handles this.

I'm glad you're interested. I don't know if move 20 is the right number (I just threw it out there); maybe it should be varied, and it might take some experimentation. And I'm not sure it's worth doing, since it's still probably only the smaller part of the problem in general. As Remi pointed out, ladder handling will likely always continue to introduce Elo nontransitivity, and probably all of this matters less than generally having a variety of long-running bots to help stabilize the system over time.
 

David Wu

Jan 22, 2021, 10:07:44 AM
to compu...@computer-go.org
Hi Claude - no, generally feeding liberty counts to neural networks doesn't help as much as one would hope with ladders, sekis, and large capturing races.

The thing that is hard about ladders has nothing to do with liberties: a trained net is perfectly capable of recognizing the atari; that part is extremely easy. The hard part is predicting whether the ladder will work without playing it out, because whether it works depends extremely sensitively on the exact placement of stones all the way on the other side of the board. A net that fails to predict this well might prematurely reject a working ladder (which is very hard for the search to correct), or be highly overoptimistic about a non-working ladder (which takes the search thousands of playouts to correct in every single branch of the tree where it happens).

For large sekis and capturing races, liberty counts usually don't help as much as you would think. This is because approach liberties, ko liberties, big-eye liberties, shared versus unshared liberties, and throw-in possibilities all significantly affect the "effective" liberty count. Also, very commonly you have bamboo joints, simple diagonal or hanging connections, and other shapes where the whole group is not physically connected, which again makes the raw liberty count not so useful. The neural net still ultimately has to scan over the entire group anyway, computing these things.

David Wu

Jan 22, 2021, 10:45:38 AM
to compu...@computer-go.org
@Claude - Oh, sorry, I misread your message: you were also asking about ladders, not just liberties. In that case, yes! If you outright tell the neural net as an input whether each ladder works or not (doing a short tactical search to determine this), or something equivalent, then the net will definitely make use of that information. There are some bad side effects even to doing this, but it helps in the most common cases. This is something the first version of AlphaGo did (before they tried to make it "zero"), and something that many other bots do as well. But Leela Zero and ELF do not, because they attempt to remain "zero", i.e. as free as possible from expert human knowledge and specialized feature crafting.

uurtamo

Jan 22, 2021, 11:03:03 AM
to computer-go
Also, frankly, this is not a problem for a rating system to handle.

A rating system shouldn't be tweaked to handle the eccentricities of its players, beyond the general assumptions about how a game's result is determined (e.g. does it allow for "win", "draw", and "undetermined", or just "win").

s.


Dan Schmidt

Jan 22, 2021, 12:01:28 PM
to compu...@computer-go.org
The primary purpose of a rating system is to predict the results of future games accurately (this is the usual axiom, at least).

In a one-dimensional rating system such as Elo, where each player's skill is represented by a single number, it is impossible to have a (non-wacky) system in which A is expected to beat B in a two-player match, B is expected to beat C, and C is expected to beat A.

So if the players are eccentric in that respect, a one-dimensional rating system is always going to have real problems with accurate predictions.
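Concretely, the Elo prediction depends only on the rating difference, so favoritism is transitive by construction (a minimal sketch of the standard formula):

```python
def elo_expected(r_a, r_b):
    """Elo's predicted score for A against B depends only on r_a - r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# elo_expected(r_a, r_b) > 0.5 exactly when r_a > r_b, and ">" on the
# reals is transitive, so predicting A over B, B over C, and C over A
# simultaneously is impossible for single-number ratings.
```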

Dan

Darren Cook

Jan 23, 2021, 5:34:24 AM
to compu...@computer-go.org
> ladders, not just liberties. In that case, yes! If you outright tell the
> neural net as an input whether each ladder works or not (doing a short
> tactical search to determine this), or something equivalent to it, then the
> net will definitely make use of that information, ...

Each convolutional layer should spread the information across the board.
I think AlphaZero used 20 layers? Each 3x3 filter grows the receptive
field by one point in each direction, so 20 layers see the whole 19x19
board, though the signal from the opposite corner of the board might
end up a bit weak.

I think we can assume it is doing that successfully, because otherwise
we'd hear about it losing lots of games in ladders.

> something the first version of AlphaGo did (before they tried to make it
> "zero") and something that many other bots do as well. But Leela Zero and
> ELF do not do this, because of attempting to remain "zero", ...

I know that zero-ness was very important to DeepMind, but I thought the
open-source dedicated Go bots that copied it did so because AlphaGo
Zero was stronger than AlphaGo Master after 21-40 days of training.
I.e., in the rarefied atmosphere of super-human play, that starter
package of human expert knowledge was considered a weight around its
neck.

BTW, I agree that feeding in the results of a tactical search would
make programs stronger, all else being equal. But it is branching code,
and so much harder to parallelize.

Darren

Brian Lee

Jan 23, 2021, 9:19:54 AM
to compu...@computer-go.org
DeepMind has published a number of papers on how to stabilize RL strategies in a landscape of nontransitive cycles. See https://papers.nips.cc/paper/2018/file/cdf1035c34ec380218a8cc9a43d438f9-Paper.pdf

I haven't fully digested the paper, but what I'm getting from it is this: if you want your evaluation environment to be more independent of the population of agents you're evaluating against, you should first compute a max-entropy Nash equilibrium over the agents, and then evaluate against that equilibrium distribution.

To give a concrete example from the paper, imagine the CRPSS, the Computer Rock Paper Scissors Server. Imagine there are currently four bots connected: a Rock-only bot, a Paper-only bot, and two Scissors-only bots. The max-entropy Nash equilibrium is (1/3, 1/3, 1/6, 1/6). So the duplicated Scissors bots are naturally detected, and their impact on the rating distribution is negated. With CGOS's current evaluation scheme, the Rock bot would appear to have a higher Elo score, because it has more opportunities to beat up on the two Scissors bots.
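Here is a quick numerical check of that example (a sketch; the payoff matrix is the standard zero-sum encoding, and the mixture is the one quoted from the paper):

```python
import numpy as np

# Zero-sum payoffs for (Rock, Paper, Scissors #1, Scissors #2):
# +1 = row beats column, -1 = row loses, 0 = tie.
A = np.array([
    [ 0, -1,  1,  1],   # Rock
    [ 1,  0, -1, -1],   # Paper
    [-1,  1,  0,  0],   # Scissors #1
    [-1,  1,  0,  0],   # Scissors #2
], dtype=float)

uniform = np.full(4, 0.25)
nash = np.array([1/3, 1/3, 1/6, 1/6])

# Raw average result against the connected population (roughly what a
# winrate-based rating sees): Rock looks best purely because Scissors
# is duplicated.
print(A @ uniform)   # [ 0.25 -0.25  0.    0.  ]

# Against the max-entropy Nash mixture, every strategy scores ~0, so
# duplicating Scissors no longer inflates anyone.
print(A @ nash)      # all entries 0 up to floating point
```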

The paper also proposes a vector extension to Elo that can better predict outcomes under these nontransitive cycles.

Given that what we have is (at a macro level) duplication of various bot lineages, and (at a micro level) rock-paper-scissors relationships between bots in sharp openings, this paper seems quite relevant.

David Wu

Jan 23, 2021, 10:40:36 AM
to compu...@computer-go.org
On Sat, Jan 23, 2021 at 5:34 AM Darren Cook <dar...@dcook.org> wrote:
Each convolutional layer should spread the information across the board.
I think AlphaZero used 20 layers? Each 3x3 filter grows the receptive
field by one point in each direction, so 20 layers see the whole 19x19
board, though the signal from the opposite corner of the board might
end up a bit weak.

I think we can assume it is doing that successfully, because otherwise
we'd hear about it losing lots of games in ladders.

Unfortunately, we can't assume that based on that observation.

If you observe what is going on with Leela Zero and ELF, and with MiniGo and SAI as well - all of which are reproductions of AlphaZero with different hyperparameters and infrastructure, and none of which include a ladder feature - I think you will find that *all* of them have at least some trouble with ladders. So this is empirical evidence that the vanilla AlphaZero algorithm, when applied to Go with a convolutional resnet, often has ladder problems.

And seeing how these reproductions behave also makes it clear how your observation can be true at the same time.

Which is: with enough playouts, MCTS is able to solve ladders well enough at the root position and in the upper levels of the tree for all these bots to avoid losing outright - usually a few tens of thousands of playouts are plenty. So ladders just harm strength by degrading evaluation quality deeper in the tree, in ways that are harder to see. That is the kind of thing that might cost more like 20-50 Elo (a pure guess, just my intuition for the *very* rough order of magnitude with this much search on top), rather than losing every game.

The bigger problem happens when you try any of these bots on weaker hardware that only gets a few playouts - low-end GPUs, mobile hardware, etc. - or with the numbers of playouts that people often run CGOS bots with, namely 200 playouts, 800 playouts, and so on. You will find that they are still clearly top-pro-level or superhuman at almost all aspects of the game... except for ladders! And at these low numbers of playouts, that does include outright losing games due to ladders, or making major misjudgments about a sequence that will depend on a ladder 1-3 moves in the future.

Sometimes this even happens in the low thousands of playouts. For example, the attached SGF shows such a case, where Leela Zero using almost the latest 40-block network (LZ285), with 2k playouts per move (plus tree reuse), attempted to break a ladder, failed, and then played out the ladder anyway and lost on the spot.

It is also true that neural nets *are* capable of learning judgments related to ladders, given the right data. Some time back, using some visualizations of KataGo's net, I found that it actually traces a width-6 diagonal band across the board from ladders! But the inductive bias is weak, and the structure of the game tree for ladders is hard (it's like the classic "cliff walking" problem in RL turned up to the max), so it's a chicken-and-egg problem: starting from a net that doesn't understand ladders yet, the "MCTS policy/value-improvement operator" is empirically very poor at bootstrapping the net into understanding them.
 
> something the first version of AlphaGo did (before they tried to make it
> "zero") and something that many other bots do as well. But Leela Zero and
> ELF do not do this, because of attempting to remain "zero", ...

I know that zero-ness was very important to DeepMind, but I thought the
open source dedicated go bots that have copied it did so because AlphaGo
Zero was stronger than AlphaGo Master after 21-40 days of training.
I.e. in the rarefied atmosphere of super-human play that starter package
of human expert knowledge was considered a weight around its neck.

The PR and public press around AlphaZero may give that impression - it certainly sounds like a more impressive discovery if not only can you learn from zero, but doing so is actually better! But I'm confident that this is not true in general, and that it depends on what "expert knowledge" you add, and how you add it.

You may note that the AlphaGo Zero paper makes no mention of how long or on how many TPUs AlphaGo Master was trained (or if it does, I can't find it), so it's hard to say what Master vs Zero shows. Also, it says that AlphaGo Master still made use of handcrafted Monte-Carlo rollouts, which I can easily believe that jettisoning could lead to a big improvement. And it's at least plausible to me that not pretraining on human pro games might give better final results (*but* this is unclear; at least, I don't know of any paper that runs this as a controlled test).

But there are other bits of "expert knowledge" that do provide an improvement over being pure-zero if done correctly, including:
* Predicting the final ownership of the board, not just the win/loss.
* Adding a small/mild term for caring about score, rather than just win/loss.
* Seeding a percentage of the self-play training games to start from positions based on external or expert-supplied games or board positions (this is the main way KataGo went from being highly vulnerable to Mi Yuting's flying dagger, like other zero bots, to playing it decently well and now often winning games based on it, depending on whether the other side happens to shoot themselves in the foot with one of the trap variations).

And yes, for now it also includes:
* Adding ladder status as an input to the neural net (a rough sketch follows below).
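As an illustration of that last item, here is a minimal sketch of ladder-status input planes in the spirit of AlphaGo's handcrafted features. `ladder_capture` and `ladder_escape` are hypothetical stand-ins for a short tactical search (a few dozen forced moves), not a real API:

```python
import numpy as np

BOARD_SIZE = 19

def ladder_planes(board, legal_moves, ladder_capture, ladder_escape):
    """Two binary feature planes:
    plane 0: playing at (row, col) starts a ladder capture that works,
    plane 1: playing at (row, col) is a ladder escape that works."""
    planes = np.zeros((2, BOARD_SIZE, BOARD_SIZE), dtype=np.float32)
    for row, col in legal_moves:
        if ladder_capture(board, (row, col)):
            planes[0, row, col] = 1.0
        if ladder_escape(board, (row, col)):
            planes[1, row, col] = 1.0
    return planes

# The planes would simply be concatenated with the usual stone/history
# planes before being fed to the net, e.g.:
#   net_input = np.concatenate([stone_planes, ladder_planes(...)], axis=0)
```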

(Attachment: 0_94.sgf)