On Thursday, November 8, 2018 at 4:29:18 PM UTC-6, LuckyDay wrote:
> Very interesting analysis. It seems that higher cpuct values are better for opening/midgame strength (likely due to greater exploration), while lower cpuct values are better for endgame strength (likely due to greater depth of visits). It appears this can be partially mitigated by TB rescoring.
>
>
> However, it does make me wonder: would it be possible to vary cpuct in self-play training games in a decay-like fashion? So rather than having something like temp decay, cpuct would start at a high value such as 5 and then gradually decrease with the game move number, down to say 1 or 2?
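> (Purely to make the idea concrete, and not anything implemented in lc0: a toy Python sketch of a cpuct value that decays with the move number; all numbers are made up.)
>
> def cpuct_for_move(move_number, start=5.0, end=1.5, decay_moves=60):
>     # Hypothetical schedule: linear decay from `start` to `end` over the
>     # first `decay_moves` moves of a self-play game, then constant.
>     t = min(move_number / decay_moves, 1.0)
>     return start + (end - start) * t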
>
> On Friday, November 9, 2018 at 5:09:34 AM UTC+11, Peter Borrmann wrote:
> I was wondering what the differences between the networks are. The analysis below uses three measures (a toy sketch of computing them follows the list):
> - Kendall tau correlation between eval and result: measures whether evaluations have the right order, disregarding the absolute value. Quite similar to area under the curve, but allowing a non-binary outcome (in this case with draws).
> - Pearson correlation between Q-value and result: measures how consistent prediction and game result are. Q-values are calculated from eval as Q = atan(Eval*100/290.680623072)/1.548090806.
> - The Q-value itself.
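> (Not from the original post: a minimal Python sketch, assuming per-position (eval, result) pairs extracted from the games, of how these three measures could be computed with scipy. The sample numbers are made up.)
>
> import numpy as np
> from scipy.stats import kendalltau, pearsonr
>
> def q_from_eval(eval_pawns):
>     # Q-value mapping quoted above; eval assumed to be in pawns
>     return np.arctan(eval_pawns * 100.0 / 290.680623072) / 1.548090806
>
> # Hypothetical data: evals at a fixed move number and the final results from
> # the same side's point of view (1 = win, 0.5 = draw, 0 = loss).
> evals = np.array([1.3, -0.2, 0.0, 2.5, -1.1, 0.4])
> results = np.array([1.0, 0.5, 0.5, 1.0, 0.0, 0.5])
>
> ktau, _ = kendalltau(evals, results)                # ordering consistency, draws allowed
> pearson, _ = pearsonr(q_from_eval(evals), results)  # consistency of Q with the result
> print(ktau, pearson, q_from_eval(evals))            # plus the Q-value itself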
> Q1: What makes the difference between T20 and T30?
>
>
> We tested the network numbers given in the graphs with 200 to 500 games at TC 15s and 60s per 40 moves, with random openings from Silver's book (each played twice with reversed colors). In this case with tablebases.
>
> Both engines seem to have almost equal predictive power.
> - ktau increases as expected with higher move numbers.
> - Test20 seems to have some advantage in the opening and midgame, while T30 seems to perform better in the endgame.
> - More time removes some endgame trolling (fewer moves on average).
>
> The Q-Values complete the picture:
> - T20 is indeed quite a bit better in the midgame.
> - T30 catches up in the endgame (if the game reaches more than 70 moves).
> - Evaluations are almost symmetric.
> Hypothesis:
> The (unexpected) strength of T30 is merely due to better endgames through tablebase rescoring!
>
> Questions:
> - Does the strength of T20 stem from the "tiny" better midgame evaluations or from a better policy network, i.e. better trial moves?
> - Could T10 and T20 be improved by continuing training with TB rescoring (going back to higher learning rates)?
> BTW: The Elo differences are consistent with other tests: ~150 Elo points for T30 at 14s/40 and ~70 at 60s/40.
>
> Q2: Is T20 even better than T10 in midgame?
>
> T20 seems to have a better evaluation function than T10 until move 30. Then T10 takes control.
>
> - Both engines almost perfectly agree that T20 is better than T10 in the midgame.
> - In the end, T10 is almost 250 Elo points ahead due to its better endgame!
> Implications/Hypos:
> - High C_PUCT and large iteration numbers seem to perform well in the midgame.
> - Lower C_PUCT favors T10 due to better endgames (and probably the transition into the endgame).
> - Tablebase rescoring seems to improve the endgame quite a bit.
> - A better understanding of policy (selection of nodes in MCTS) and evaluation (selection of the next move) will be the key to improving game quality.
> - An optimal C_PUCT should be adjusted to the complexity of the game position (toy sketch of the simple variant below):
>   - Simple: depending on the pieces left on the board.
>   - Advanced: switching within one MCTS search depending on how tactical or strategic the subtree is.
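> (A toy illustration of the "simple" variant, not an actual lc0 feature: pick C_PUCT from the number of pieces left on the board, here using the python-chess library; the thresholds are made up.)
>
> import chess  # python-chess
>
> def cpuct_for_position(board: chess.Board, high=3.0, low=1.2, endgame_pieces=10):
>     # Hypothetical thresholds: high C_PUCT while many pieces remain,
>     # a lower one once the position has thinned out towards an endgame.
>     pieces = len(board.piece_map())  # counts kings and pawns as well
>     return high if pieces > endgame_pieces else low
>
> print(cpuct_for_position(chess.Board()))  # start position -> 3.0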
> Further ideas for simple testing:
> Understanding C-PUCT:
>
> - Repeat the analysis above with different C_PUCTs and time controls, and probably some more moves.
> - Retrain some networks for a few million games with a C_PUCT depending on the number of pieces.
> - Rescore the last 10 million games of T20 with TB rescoring (no need for new games).
> - Retrain 10-20 networks of T20 or T30 with new games at C_PUCT = 3.0.
> Goodie:
>
>
> I am running a batch between 30950 (before the drop in self-Elo) and 30999 (after the drop).
>
>
> Up to now: despite the self-Elo numbers, 30999 is 43 Elo points ahead! (Score of lc0_30950 vs lc0_30999: 46 - 71 - 86 [0.438])
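> (For reference, assuming the standard logistic Elo model: a 0.438 score corresponds to 400*log10(0.438/0.562) ≈ -43 Elo for 30950, i.e. 30999 is about 43 points ahead, matching the number above.)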
>
> Here is what the Q-value tells us: up to 120 moves the new net is better!
>
> Hypo:
> With short time controls (or a large c_PUCT) the endgame capabilities drop and rise dramatically from net to net, which would explain the noisy self-Elo.
>
>
> Thanks, you made it all the way. Sorry for this lengthy post.
I think it should be variable in some way, but maybe the method could be much cruder. E.g. high CPUCT in the first phase, lowered along with (or some time after?) the learning rate, then perhaps lowered again. Basically ‘seeding’ the net with knowledge acquired from the wide tree, then narrowing and deepening the focus on crucial opening lines and endgame techniques that may be getting lost with the higher value.
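To make that cruder version concrete, a toy sketch (all thresholds made up and keyed to self-play games produced, not to anything lc0 actually exposes):

CPUCT_SCHEDULE = [
    (0,          5.0),   # early training: wide, shallow exploration
    (20_000_000, 3.0),   # dropped roughly alongside the first learning-rate drop
    (35_000_000, 2.0),   # late training: deeper, more selective search
]

def cpuct_for_training(games_played):
    # Return the last schedule entry whose threshold has been reached.
    value = CPUCT_SCHEDULE[0][1]
    for threshold, cpuct in CPUCT_SCHEDULE:
        if games_played >= threshold:
            value = cpuct
    return value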
Something akin to your proposed method would seem to make sense if test10 isn’t ultimately shown to be superior in the opening phase...right now I can say for certain that test30 is nowhere close, nor was test20 a week or so ago, the latter of which is what makes me pessimistic about the high value there.