Self Elo - What to expect!


Peter Borrmann

Nov 10, 2018, 8:41:30 PM
to LCZero
I would strongly suggest removing the self-Elo of the last 100 nets from the front page. At least in the current, long-lasting phase of small Elo increases, it tells us more or less nothing.
It is a random walk with huge noise and a small signal, and it tends to confuse people.

The easiest fix is to change the default display to the full self-Elo figure or to the Elo estimates. Long term it would be nice to display interesting statistics.
---------------------------
I made a simple model to sample the behaviour of the self-Elo with the following parameters:
  • The underlying distribution is multinomial, with 40% draws and the win and loss probabilities adjusted to a gain of 1 Elo point per net
  • Simulation of 100 nets, which is about the number displayed in the main window
  • 450 test games per net
Source code in R is below. 

Here are some random samples (on average, over many samples, the self-Elo climbs by about 100 points):

[Figures: Rplot12.jpeg, Rplot11.jpeg, Rplot10.jpeg, Rplot08.jpeg — four random samples of the simulated self-Elo trajectories]
library(ggplot2)

# Score fraction -> Elo difference (standard logistic conversion;
# this definition was missing from the original listing)
elo <- function(s) -400 * log10(1 / s - 1)

p1   <- 0.302   # win probability
p2   <- 0.298   # loss probability (p1 - p2 tuned for ~1 Elo gain per net)
draw <- 0.4     # draw probability
n    <- 450     # test games per net

# Each loop iteration is one random sample; ges collects them for faceting
ges <- data.frame()
for (i in 1:4) {
  x <- rmultinom(100, n, c(p1, p2, draw))   # 100 nets: wins, losses, draws
  y <- t(x) %*% c(1, 0, 0.5)                # match score of each net
  selfelo <- cumsum(elo(y / n))             # accumulated self-Elo
  ges <- rbind(ges, data.frame(run = i, net = 1:100, selfelo = selfelo))
}
ggplot(ges, aes(net, selfelo)) + geom_line() + facet_grid(run ~ .)

Trevor G

Nov 11, 2018, 4:17:36 PM
to Peter Borrmann, LCZero
Yes, completely agree with this. As somebody who has played a lot of poker, I am quite familiar with the way accumulating variance works. If every net were actually increasing by some fixed Elo amount, you could expect a “self-play Elo” graph to look like a simulated graph of poker winnings by a player who has some fixed win rate (in the long run walks like this are subject to the central limit theorem and outcomes become normally distributed). In either case, it is easy for these graphs to go down for a while, even when “down” does not reflect the real truth. Or vice versa.

Since there are free poker variance simulators online, I thought I’d plug in numbers to approximately reflect the self-Elo graph given constant Elo growth.

I am using:
An Elo gain of 1 point per net.
A standard deviation of 12 Elo (I didn't know exactly what this should be; I just saw that the confidence intervals are mostly around ±24, so I used half of that as an approximation).
200 nets simulated.

The Elo gain translates to 1 "BB/100 hands".
The standard deviation translates to 12 "BB/100 hands".
This represents 200 nets, i.e. "20,000 hands" (so there are 100x as many points on the graph).
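
For anyone without a simulator handy, the same walk is easy to sample directly in R; a minimal sketch, assuming the normal approximation described above (mean gain 1 Elo per net, SD 12 Elo, 200 nets; the seed is only for reproducibility):

# Each net's measured gain is ~ Normal(mean = 1, sd = 12) Elo;
# the displayed graph accumulates these noisy measurements.
set.seed(42)   # illustrative seed, not from the post
gains <- rnorm(200, mean = 1, sd = 12)
plot(cumsum(gains), type = "l",
     xlab = "net", ylab = "apparent self-Elo")
# The true trend is +200 Elo, but single samples routinely dip for long stretches.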


MathAndreas

Nov 12, 2018, 8:14:16 AM
to LCZero
Thanks for your posts and the very convincing graphs. In fact, this method of obtaining self-play Elo numbers and making graphs has been very confusing and nearly useless from the beginning. I have followed LCZero very intensively since the beginning but haven't participated in the discussion until recently.
The self-play Elo numbers and graph of Leela Zero Go, on the other hand, are much more useful, because that project uses gating: the next candidate net must score 55% against the current best net. This has worked very well, so that almost every Go net is stronger than any previous one. With no gating it makes nearly no sense to compare the newest net to the previous one. I am not arguing against running without gating, though.
But we need other methods to monitor playing strength. I wonder why this hasn't been done already.

Here I propose a very simple method which has disadvantages but is much, much better than the status quo:
1. Compare every new net with the current reference net (current RN) to get a self-play Elo.
2. Assign every reference net a more or less realistic Elo by testing it against AB engines and the older RNs.
3. From time to time, choose a new current RN when the Elo difference to the best net becomes too large.
That's all. Not much additional work compared to the current method, is it?
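
A minimal R sketch of steps 1 and 2 combined, assuming the standard logistic score-to-Elo conversion (the helper name and the numbers are purely illustrative, not an existing LCZero function):

# Rate a new net against a fixed reference net of known strength,
# instead of chaining comparisons net-to-net.
elo_vs_reference <- function(wins, losses, draws, reference_elo) {
  s <- (wins + 0.5 * draws) / (wins + losses + draws)  # score fraction
  reference_elo - 400 * log10(1 / s - 1)               # absolute Elo estimate
}
# e.g. a net scoring 240-180-80 against a reference rated 3000:
elo_vs_reference(240, 180, 80, 3000)   # about 3042

Because every net is measured against the same fixed anchor, the errors no longer accumulate from net to net.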

Using a method like this would have avoided so much confusion, disappointement and discussions, wouldn't it?
Andreas (yes, I am a mathematician)

Trevor G

Nov 12, 2018, 1:30:07 PM
to MathAndreas, LCZero
There would still be the same sort of accumulating variance with gating (re: Leela Zero Go), except that the graph is guaranteed to go up... They say they accept a network only if they have 95% confidence that it is better than the last. But if they are only accepting 1 network in 20, it is reasonable to assume that most of that "95% confidence" comes from variance (I don't know what percentage of networks pass gating, but judging by the fail marks, by far most of them fail). There is explicit selection bias in this scheme, so a "true" 95% confidence would require many more games to pass gating. In the current scheme, I'd expect that far more than 1 out of 20 networks that pass gating are actually weaker than the last (but with gating you never see any marks lower than the previous one).
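
A quick R sketch makes this selection effect concrete. Assume, purely for illustration, a fixed 400-game gate with a 55% score cutoff and no draws (the real Leela Zero Go procedure may differ), and candidates that are in truth exactly as strong as the incumbent:

# Candidates with ZERO true improvement, gated at a 55% score
set.seed(1)
n_games <- 400
scores  <- rbinom(10000, n_games, 0.5) / n_games  # equal-strength matches
passed  <- scores[scores >= 0.55]                 # the gate
length(passed) / 10000                            # ~2% slip through anyway
mean(-400 * log10(1 / passed - 1))                # apparent gain: roughly +35 Elo

Even with zero real progress, every accepted net adds a spurious upward jump to the graph.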

I’m not extremely familiar with what they’re doing, so maybe I’m missing something. But my take is that the Leela Zero Go graph as well as lczero’s has quite a lot of variance, and the underlying Elo reality is not represented well in either case.

BUT, if the ratio of expected Elo gain to variance is big enough, then you would see more signal and less noise. Maybe this is still the case with Leela Zero Go.



Al Z

Nov 13, 2018, 2:55:12 PM
to LCZero
So now that the 3xxx Elo is going up, it's not accurate? Can someone explain exactly how this Elo is calculated?

Joseph Ellis

Nov 13, 2018, 3:21:10 PM
to LCZero
It isn't an issue of the calculation but of the margin of error being quite large relative to the typical amount of change being measured.
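
For reference, each point on the graph is the previous net's rating plus the Elo equivalent of the new net's score in a test match against it (the conversion below is the standard logistic one used in Peter's simulation above; the exact test parameters may differ):

# score fraction s of the new net vs. the previous one -> Elo increment
elo <- function(s) -400 * log10(1 / s - 1)
elo(0.52)   # a 52% score adds about 13.9 Elo to the running total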

MathAndreas

Nov 13, 2018, 7:42:51 PM
to LCZero
The big problem is that the margin of error becomes larger and larger: the more nets are tested, the larger the error.
This is because each net is only compared to the previous one, so the errors accumulate.
This can be avoided easily with a better method like the one I proposed, always comparing against some fixed net of known strength.
Virtually every AB engine project does this to find stronger versions.
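
The growth follows the usual square-root law for a sum of independent errors. A quick R illustration, assuming the roughly 12 Elo per-test standard deviation Trevor used above:

# SD of the cumulative self-Elo after chaining k independent comparisons
sd_per_test <- 12
k <- c(1, 25, 100)
data.frame(nets = k, sd_cumulative = sd_per_test * sqrt(k))
# after 100 chained nets the SD of the displayed self-Elo is already 120 Elo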

But for Leela Zero Go it works much better because of the gating, so far fewer nets are compared (only one test run per candidate, and only 189 accepted nets in 12 months!).
Of course there is still inaccuracy, and a "self-play Elo inflation" of roughly a factor of 3.
@Trevor: Yes, surely this is the case with Leela Zero Go; you see more signal and less noise.
But my point is that we can easily do much better even than Leela Zero Go by using my method, which gets us approximately the real Elo!
This is possible precisely because we don't use gating, so there is no need to compare with the previous net.
That is the point: LCZero abandoned gating but didn't change the method for estimating Elo.

Well, I should post my proposal on Discord to the developer team. Surely they have known about the problem all along but haven't fixed it,
probably because the manpower to implement the change was lacking. I understand that. Perhaps I should offer to implement the change myself;
being a software developer, that should be possible.