Statistical Analysis of the evaluation quality and implications


Peter Borrmann

unread,
Nov 8, 2018, 1:09:34 PM11/8/18
to LCZero
I was wondering what the differences between the networks are. The analysis below uses three measures:
  • Kendall tau correlation between eval and result: measures whether the evaluations have the right order, disregarding the absolute values. Quite similar to area under the curve, but allowing a non-binary outcome (in this case with draws).
  • Pearson correlation between Q-value and result: measures how consistent predictions and game results are.
    • Q-values are calculated from eval as: Q = atan(Eval*100/290.680623072)/1.548090806
  • The Q-value itself
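A minimal sketch of these three measures in Python (using scipy); the eval/result arrays below are hypothetical placeholder data, not from the actual matches:

```python
import math

from scipy.stats import kendalltau, pearsonr

def q_from_eval(eval_pawns):
    """Map an eval (in pawns) to a Q-value in (-1, 1),
    using the constants from the post."""
    return math.atan(eval_pawns * 100 / 290.680623072) / 1.548090806

# Hypothetical data: evals (in pawns, from White's view) and game results
# (1 = White win, 0.5 = draw, 0 = White loss) for a set of positions.
evals = [0.3, -1.2, 2.5, 0.0, -0.4]
results = [1, 0, 1, 0.5, 0.5]

qs = [q_from_eval(e) for e in evals]
ktau, _ = kendalltau(qs, results)  # rank agreement; handles ties (draws)
rho, _ = pearsonr(qs, results)     # linear consistency of prediction vs result
```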
Q1: What makes the difference between T20 and T30?

We tested the network numbers given in the graphs with 200 to 500 games at TC 15s and 60s per 40 moves, with random openings from Silver's book (each opening played twice with reversed colors). In this case with tablebases.

[Figure: Rplot01.jpeg]

[Figure: Rplot02.jpeg]
  • Both engines seem to have almost equal predictive power.
  • ktau increases as expected with higher move numbers.
  • Test20 seems to have some advantage in the opening and midgame, while T30 seems to perform better in the endgame.
  • More time removes some endgame trolling (fewer moves on average).

[Figure: Rplot03.jpeg]
The Q-values complete the picture:
  • T20 is indeed quite a bit better in the midgame
  • T30 catches up in the endgame (if the game reaches more than 70 moves)
  • Evaluations are almost symmetric
Hypothesis:
  • The (unexpected) strength of T30 is merely due to better endgames through tablebase rescoring!
Questions:
  • Does the strength of T20 stem from the "tiny" better midgame evaluations or from a better policy network, suggesting better trial moves?
  • Could T10 and T20 be improved by continuing training with TB rescoring (going back to higher learning rates)?
BTW: The Elo differences are consistent with other tests: ~150 Elo points for T30 at 15s/40 and ~70 at 60s/40.


Q2: Is T20 even better than T10 in midgame?

[Figure: Rplot04.jpeg]
T20 seems to have a better evaluation function than T10 until move 30. Then T10 takes control.

[Figure: Rplot05.jpeg]
  • Both engines almost perfectly agree that T20 is better than T10 in the midgame
  • In the end T10 is almost 250 Elo points ahead due to its better endgame!
Implications/Hypotheses:
  • High C_PUCT and large iteration numbers seem to perform well in the midgame
  • Lower C_PUCT favors T10 due to better endgames (and probably the transition into the endgame)
  • Tablebase rescoring seems to improve the endgame quite a bit
  • A better understanding of policy (selection of nodes in MCTS) and evaluation (selection of the next move) will be the key to improving game quality
  • An optimal C_PUCT should be adjusted to the complexity of the game position:
    • Simple: depending on the pieces left on the board
    • Advanced: switching within one MCTS search depending on how tactical or strategic the subtree is
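The "simple" piece-count variant could be as little as the following sketch; the thresholds and C_PUCT values here are purely hypothetical placeholders, not tuned numbers:

```python
def cpuct_for_position(piece_count, high=3.0, low=1.2):
    """Illustrative piece-count-dependent C_PUCT: explore widely while
    many pieces are on the board, favor depth as material comes off.
    All thresholds and values are hypothetical, not tuned."""
    if piece_count >= 20:   # opening / early midgame: explore
        return high
    if piece_count <= 8:    # endgame: favor depth of visits
        return low
    # Linear interpolation in between
    frac = (piece_count - 8) / (20 - 8)
    return low + frac * (high - low)
```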
Further ideas for simple testing:
  • Understanding C_PUCT:
    • Repeat the analysis above with different C_PUCTs and time controls, and probably some more moves
    • Retrain some networks for a few million games with a C_PUCT depending on the number of pieces
  • Rescore the last 10 million games of T20 with TB rescoring (no need for new games)
  • Retrain 10-20 networks of T20 or T30 with new games at C_PUCT=3.0
Goodie:

I am running a match between 30950 (before the drop in self-Elo) and 30999 (after the drop).

Up to now: despite the self-Elo numbers, 30999 is 43 Elo points ahead! (Score of lc0_30950 vs lc0_30999: 46 - 71 - 86 [0.438])
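The quoted Elo gap can be checked from the raw score with the standard logistic Elo formula (a general formula, nothing lc0-specific):

```python
import math

def elo_diff(wins, losses, draws):
    """Elo difference implied by a match score (logistic model)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400 * math.log10(score / (1 - score))

# 46 - 71 - 86 from lc0_30950's point of view gives score ~0.438,
# i.e. roughly -43 Elo for 30950, so +43 for 30999.
print(round(elo_diff(46, 71, 86)))  # -43
```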


Here is what the Q-value tells us: up to move 120 the new net is better!


[Figure: Rplot06.jpeg]
Hypo:
  • With short time controls (or large C_PUCT) the endgame capabilities drop dramatically (and average game length increases).

Thanks, you made it all the way. Sorry for this lengthy post.






Dave Whipp

unread,
Nov 8, 2018, 2:00:42 PM11/8/18
to Peter Borrmann, LCZero
Thank you for this analysis: really interesting; and hopefully actionable!

--
You received this message because you are subscribed to the Google Groups "LCZero" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lczero+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lczero/278d610c-6557-4f24-b35d-f674297a21d3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Joseph Ellis

unread,
Nov 8, 2018, 3:28:16 PM11/8/18
to LCZero
I don't suppose you could plot a random sample of 250 - 500 training games?

Gökcan AKALIN

unread,
Nov 8, 2018, 5:07:12 PM11/8/18
to LCZero
Thank you for your great dedication and effort to analyze the situation at hand.

On Thursday, November 8, 2018 at 21:09:34 UTC+3, Peter Borrmann wrote:

LuckyDay

unread,
Nov 8, 2018, 5:29:18 PM11/8/18
to LCZero
Very interesting analysis. It seems that higher cpuct values are better for opening/midgame strength (likely due to greater exploration), but lower cpuct values are better for endgame strength (likely due to greater depth of visits). It appears this can be partially mitigated by TB rescoring.

However, it does make me wonder: would it be possible to vary cpuct in self-training games in a decay-like fashion? So rather than having something like temp decay, you would have cpuct starting at a high value such as 5 and then gradually decreasing in relation to game move number, down to say 1 or 2?
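Such a decay could be sketched as an exponential schedule by move number, analogous to temperature decay; the starting value, floor, and half-life below are purely illustrative:

```python
def cpuct_decay(move_number, start=5.0, floor=1.5, halflife=30):
    """Hypothetical cpuct decay for self-play: start wide at move 0 and
    settle toward `floor` as the game progresses. Every `halflife` moves
    the gap to the floor halves. Constants are illustrative, not tuned."""
    return floor + (start - floor) * 0.5 ** (move_number / halflife)
```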

David Rodríguez Sánchez

unread,
Nov 8, 2018, 5:56:16 PM11/8/18
to LCZero
Amazing piece of work.
Thank you! And please keep the good testing

Daniel Rocha

unread,
Nov 8, 2018, 6:17:10 PM11/8/18
to lcz...@googlegroups.com
Is it possible to use both engines where they are the best? Like passing the torch at some point?

On Thu, Nov 8, 2018 at 20:56, David Rodríguez Sánchez <davidro...@gmail.com> wrote:
Amazing piece of work.
Thank you! And please keep the good testing



--
Daniel Rocha - RJ

Lito

unread,
Nov 8, 2018, 10:29:51 PM11/8/18
to LCZero
Very enlightening analysis. Lots of gratitude to you.

On Thursday, November 8, 2018 at 10:09:34 AM UTC-8, Peter Borrmann wrote:
I was wondering what are the differences between the networks...
    .......... 

Enqwert

unread,
Nov 9, 2018, 2:01:12 AM11/9/18
to LCZero
Thank you for the effort, if you can repeat the test with different NNs in the future, it would be nice to see progress in different areas.

S. W.

unread,
Nov 9, 2018, 3:32:56 AM11/9/18
to LCZero
great stuff, thank you

John D

unread,
Nov 9, 2018, 5:18:45 AM11/9/18
to LCZero
On Thursday, November 8, 2018 at 4:29:18 PM UTC-6, LuckyDay wrote:
> very interesting analysis. It seems that higher cpuct values are better for opening-midgame strength (likely due to greater exploration) but lower cpuct values are better for endgame strength (likely due to greater depth of visits). This can be partially mitigated by TB rescoring it appears.
>
>
> However it does make me wonder, would it be possible to vary cpuct in self-training games in a decay-like fashion? so rather than having something like temp decay, you would have cpuct starting at a high value such as 5, and then gradually decreasing in relation to game move number down to say 1 or 2?

I think it should be variable in some way, but maybe the method could be much cruder. E.g. high CPUCT in the first phase, lowered along with (or some time after?) the learning rate, then perhaps lowered again. Basically 'seeding' the net with knowledge acquired from the wide tree, then narrowing and deepening the focus on crucial opening lines and endgame techniques that may be getting lost with the higher value.

Something akin to your proposed method would seem to make sense if test10 isn’t ultimately shown to be superior in the opening phase...right now I can say for certain that test30 is nowhere close, nor was test20 a week or so ago, the latter of which is what makes me pessimistic about the high value there.

Peter Borrmann

unread,
Nov 9, 2018, 11:30:50 AM11/9/18
to LCZero
John,

the final outcome of the match presented was consistent with previous tests (around +250 Elo for test10). Does "nowhere" (see below) mean that test20 was in no game phase close to or better than test10? The figure shows the mean Q-values for 10-move buckets. I was very surprised by the result too, and it seems worth drilling down deeper. I have two assumptions on this:
  • Moderate average positional advantages cannot be converted into wins by test20
  • The distributions might be different, and test10 could find tactical clues in the midgame (a sort of fat tail) which can be converted into a win.
I am currently running more games to investigate the game dynamics a bit further. The obvious idea is to look at the deltas of the Q-values to classify the underlying process, but this requires a lot more games.
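The 10-move bucketing of Q-values described above can be sketched like this (plain Python; the data points are hypothetical):

```python
from collections import defaultdict

def mean_q_by_bucket(moves_and_qs, bucket_size=10):
    """Group (move_number, q_value) pairs into `bucket_size`-move buckets
    and return {bucket_start_move: mean Q}, as in the plots above."""
    buckets = defaultdict(list)
    for move, q in moves_and_qs:
        start = ((move - 1) // bucket_size) * bucket_size + 1
        buckets[start].append(q)
    return {start: sum(qs) / len(qs) for start, qs in sorted(buckets.items())}

# Hypothetical data: (move number, Q-value from the engine's view)
data = [(3, 0.1), (7, 0.2), (12, 0.4), (18, 0.6)]
print(mean_q_by_bucket(data))
```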

Peter Borrmann

unread,
Nov 9, 2018, 11:32:38 PM11/9/18
to LCZero
I repeated the match between T10 and T20 with a 120s/40 time control. Under this condition T10 is better in all game phases, and the Elo difference increases as well.

As stated previously, this seems to confirm:
  • tactical qualities (low cpuct) are important in all game phases
  • to leverage better tactical training, the node count has to allow a certain depth
  • within the endgame, bad tactical quality shows early on

[Figure: Rplot07.jpeg]

Jupiter

unread,
Nov 14, 2018, 8:32:34 AM11/14/18
to LCZero
Thanks for sharing your work.
 
  • Both engines seem to have almost equal predictive power.
  • ktau increases as expected with higher move numbers.
  • Test20 seems to have some advantage in the opening and midgame, while T30 seems to perform better in the endgame.
  • More time removes some endgame trolling (fewer moves on average).

In the ktau plot at TC 60, it seems like both are underperforming in the ending if the ending goes beyond 140 moves (considering both engines use EGTs). But at TC 15 the ending is much better. BTW, what hardware did you use to run the engine match? How many EGT men did you use? Could you show the average number of pieces remaining from move 140 and up?

Peter Borrmann

unread,
Nov 14, 2018, 9:10:31 AM11/14/18
to LCZero
@Jupiter

The statistics get worse with higher move numbers. I should have added the move observation counts or error bars. I would not take that too seriously.

Hardware is an Intel Xeon Skylake, 2 × 10 cores, 128 GB RAM + an Nvidia 1080.

adazacom

unread,
Nov 14, 2018, 9:58:14 AM11/14/18
to LCZero
Would it ever be possible to plot against number of pieces on the board (rather than move number) or is that too expensive?