How to follow the test20 experiment


ovi...@gmail.com

Sep 11, 2018, 5:13:28 AM
to LCZero
I have been following this project since May. I started enjoying Leela's play in the old main nets. Then test10 started growing in strength, TCEC, CCCC, etc.

I play chess and can follow games, etc, but I have no background whatsoever in AI. Therefore I could not understand well what was happening with the project.
 
When test20 started, I decided to try to understand it better and enjoy how it evolves... Most of the decisions and technical discussions happen in the Discord channel, so
I started reading carefully, trying to extract clues about what was happening. I realized many people have the same problem; for many of us the only parameter to follow was self-elo,
but it can be totally misleading.

In a recent thread I wrote some messages about what I have learned about test20...

This is the thread:


I will copy those messages here and continue reporting some news about test20, etc, trying to explain what is going on with it.
I must stress from the very beginning that I am not an expert, so take these explanations and news with caution (maybe someone with more knowledge can correct my mistakes).

Below I copy my previous messages... I apologize for repeating them, but I want this thread to be self-contained.

First message:

"  The main "experiment" in test20 is to change cpuct. Test10 was using close to 1 and now is 5. That one is the important parameter.
 The point is, according to other studies, that cpuct low have fast gains but then it slows down. Test10 gained about 80-85% of its final strenght very soon, as you mention.
On the other hand a cpuct of 5 would start slower but then will have a much steadier growth and a final higher ceiling (btw, test10 is about 3500 elo, we should not expect a 4000 elo ceiling either).
 So, although it seems test20 is going too slow, there will be a point when it reaches test10 in the future. The experiment is set to last for about 40-44M games so it would not be very wise to stop right now or we will learn nothing.
There are other aspects, such as accuracy, tactical sharpness, etc, that are supposed to improve as well (again, I am not an expert).
Another parameter that is being changed is the resign threshold. That causes spikes when it changes...

Finally, it was decided not to do many learning rate (LR) changes. In previous tests, every time self-elo moved, people clamored for an LR change. That proved not to be very useful. Now it has been decided to do only 3 LR changes; the first is at about 11M games (very soon). That will probably make elo grow faster (now, with a high LR, exploration is maximized but the net also "forgets" easily).
Again, it is not very wise to stop the experiment before even reaching the first LR drop.

So, maybe, do not look at the self-elo graph (totally misleading). Look at the MTGOstark CCRL sheet instead. Be patient.
 
Anyway, nobody guarantees the experiment will be successful (as usual in experiments).
Another parameter to be tested in the future is the number of nodes in training games (800 at present), and the effect of increasing them (again, a long run). Patience!!

I would like somebody who really knows to write something in the blog to explain these things. People naturally look at the self-elo graph and get worried.

I hope this helps. "
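To make the LR plan from the first message a bit more concrete, here is a rough sketch (my own illustration, not project code) of that kind of stepwise schedule; the milestone game counts are only the approximate ones mentioned above, not official values.

# Rough sketch (not project code) of a step-wise LR schedule like the one described above.
# The milestones are only approximate numbers from this thread, not official values.
def learning_rate(games_played_millions):
    if games_played_millions < 11:      # before the first planned drop (~11M games)
        return 0.2
    elif games_played_millions < 25:    # later milestones are placeholders, purely illustrative
        return 0.02
    elif games_played_millions < 35:
        return 0.002
    else:
        return 0.0002

print(learning_rate(5))    # 0.2  -> fast, "forgetful" learning
print(learning_rate(15))   # 0.02 -> finer, steadier adjustments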

Second message:

 " It is true that enter into discord is like to fishing!! (you sometimes get something)...

Somebody sent this graph... I find it very illustrative.


Test10 was something like the blue line (cpuct about 1) and test20 is like the orange one... (I think this is from another work, another publication). In our case it would show elo strength, going up instead of down, but you get the idea.

Many people ask for things to be proved... however, every experiment can take several weeks, since you need to change only one parameter: one run is the control (test10), you change one parameter (test20) and see how it affects the results (elo strength). It is slow, sometimes boring, but it is the way science goes.

Again, I think it would be nice to have a (not very deep) explanation of how these experiments (the project) proceed. I feel the devs think explanations will produce a lot of technical questions that in turn will generate more questions, explanations, etc, etc. And they work part time... Understandable.

A final remark: people who understand (not me) are all saying test20 is going well and is promising. Since they were able to produce a 3500 engine, they deserve some time to develop things... "

Third message:

 "Again fishing in discord waters (I apologize for using the word "fish" in this forum)..

LR at present is 0.2; it will probably be lowered to 0.02, 0.002, 0.0002.

Not in a symmetrical way (I mean not equidistant). It seems the first LR lowering is going to be delayed a bit because of spikes in self-elo. The devs are waiting for some technical parameters (well beyond me) to stabilise. I would say it will be at 16-18M games....

I will report when I get more news... "


ovi...@gmail.com

Sep 11, 2018, 5:24:56 AM
to LCZero
Now time for more news:

The elo rating reached a record in the MTGOstark list:

 ID 20460   2489 elo



The elo rating reached a record in the lc0 vs SF list:

 ID 20435  2521 elo



These are the best elo values BEFORE the first LR drop. Therefore, the strength of test20 nets is about 2500 after 12M games.

As a comparison, for test10, the net with about the same number of games (about NN 10161, the one sent to TCEC) was at about 3100 elo (in the MTGOstark list).

As you see, growth was faster in test10 than in test20.

Ludo

Sep 11, 2018, 6:08:38 AM
to LCZero
Thank you for the update. Has the LR been lowered yet? I did not see any announcement about that.

Cheers.

Graham Jones

Sep 11, 2018, 6:19:46 AM
to LCZero
Yes, just. From the dev-log on 10 Sep:
 "LR drop for test20 - 0.02 is the new LR - first net produced using the new LR will be Number 20493 (id 20497)"

Curious

Sep 11, 2018, 6:21:44 AM
to LCZero
I got the impression from #dev-log yesterday that it has (just) been lowered. Looks like it gave an immediate increase in self-play Elo.

ovi...@gmail.com

Sep 11, 2018, 6:22:52 AM
to LCZero
Yes, that is the next message  ;-)

Finally, LR has been lowered from 0.2 to 0.02, starting from network NN 20497 (about 13M games).

(You can notice the self-elo graph is smoother from then on, since the changes in net weights are smaller...)

The LR drop has been marked by a black dot in the "Elo estimate" graph (the alternative to the self-play elo graph on the front web page).


When LR is high, the changes in weights are so big that sometimes it is difficult to retain knowledge (colloquially, you could say she learns and forgets easily...). Now, with a lower but still big LR, the gains are expected to be faster.

In fact, we already have an elo estimate "post-LR reduction":

 From "Elo vs SF list"

 NN 20435   elo  2521
 NN 20517   elo  2648  (120 elo gain!!)

 We expect (or hope) for a faster growth now. Maybe test20 could rise to 3000 soon.

 I will write when I know more...

Ludo

Sep 11, 2018, 7:43:07 AM
to LCZero
Okay thank you!

pw31

Sep 11, 2018, 4:07:54 PM
to LCZero
Thank you ovi..., that was an extremely useful and informative post!
Can I ask again about the definitions and effects of the parameters CPUCT and LR?  Where exactly
are they used and what do they do?  Is there some formula which would explain their meaning?

My understanding is that CPUCT is something like the width of the search.  For small CPUCT, the
search will be very focussed and deep, whereas for large CPUCT the search tree will include more
side branches, hence the search-tree becomes wider and shallower.  For small CPUCT, the playing
strength depends a lot on the accuracy of the move heatmap. If the net shows a low priority for a
good move, that move might never ever be considered.  For large CPUCT, the search will more likely
correct those inaccurate move priorities, but the search will be shallower.  So CPUCT is a parameter
for the tree search. 

LR seems to be something like an inverse memory timescale.  For large LR, training games will
sooner be forgotten, and new training games have a larger impact on changing the net.  For
small LR, lc0 will better remember old training games, but one has to be patient to see any changes.
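For what it's worth, the way I picture it is that LR is just the step size of the weight update, roughly like this (a toy gradient-descent sketch, not Leela's actual training code):

# Toy illustration of the learning rate as a step size (not Leela's actual code).
# A big LR means each batch of training games shifts the weights a lot
# (fast learning, fast "forgetting"); a small LR means tiny, careful steps.
def sgd_step(weights, gradients, lr):
    return [w - lr * g for w, g in zip(weights, gradients)]

weights = [0.5, -1.2, 0.3]
gradients = [0.1, -0.4, 0.05]
print(sgd_step(weights, gradients, lr=0.2))     # big jump
print(sgd_step(weights, gradients, lr=0.0002))  # barely moves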

Does that sound correct to you?

Thank you! Peter

Nathan Blaxall

Sep 11, 2018, 5:48:41 PM
to LCZero
Thank you very much for sharing your findings ovi !
I too am trying to piece things together.
From what I've picked up (and it agrees with what pw31 just posted):
  • LR = learning rate.  As I understand it (I've played around with some optimization stuff but I'm still learning this), the learning rate is how much the NN is changed by training.  If it's too high, the NN will diverge (and be all over the place in terms of learning).  If it's too low, the NN will learn but the steps will be so small it'll take forever to be optimized (take a looooong time to reach a decent playing strength).  So a good LR gets the NN to learn without taking ages (and with only a little jumping around).  Why start with an LR of 0.2 for a while and then decrease it to 0.02 etc?  I think there are two good reasons for this (someone who knows, feel free to correct me) ... one is that it's difficult to know what the LR should be (so better to start a bit high and then decrease it), but more importantly, a higher LR helps to avoid local minima.  A low LR would find it difficult to escape a local minimum (e.g. it would only get to a certain playing strength, e.g. 2000 elo - just an example), whereas a high LR would be more "jumpy" and would be more likely to (after a while) fall into an area with a deeper minimum.  Once there, a lower LR would let it converge to that minimum (it too might be a local minimum, but a much better one than the first, and this process might get the NN to 4000 elo strength).
  • cpuct (as I understand it) is a value that is used to work out how likely the search is to explore a move from a position.  MCTS has two parts that "fight" each other: exploitation (i.e. search the strongest-looking moves deeper) vs exploration (i.e. search weaker-looking moves, just in case one turns out to be strong - e.g. a queen sac).  The lower cpuct is, the more only the strongest moves are searched, and the higher cpuct is, the more other moves are searched too.  What tends to happen with a good cpuct is that the strongest-looking moves are searched deepest (which is good because Leela needs to make sure she is making a move that is still strong when looking several moves deep) and some other moves are checked (not as deep) just in case.  Interestingly, the search formula is arranged so that at the beginning of the search lots of moves are searched, but as the number of nodes searched increases, the strongest moves get a greater bias to be searched instead.  Pretty neat huh?  (There's a rough sketch of the selection formula just after this list.)
  • test20 is an experiment, and as you said ovi, it might work or it might not.  From what I've read on discord, this is my understanding: test20 might have gotten off to a bad start, or some of the variables that are different from test10 might not work out well.  Even the devs have some concern that test20 might not work out, but it's early days and the consensus is to wait and see if test20 will overtake test10.  If at some point that doesn't look likely, then the NN will be reset.  Note that there is already a test30 that is up & running to (from what I've read) test a minor bug fix which might have affected how well test20 started (and to test a few other cool ideas).  test30 is there at this stage to test a few things, but it might take over from test20 if test20 doesn't work out.  On the other hand, if test20 works out then everyone's patience will be rewarded.
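To make the cpuct part more concrete, here is a rough sketch of the kind of PUCT-style selection rule these engines use (my own simplification based on the AlphaZero paper; the exact lc0 formula and constants may differ):

import math

# Rough PUCT-style move selection (simplified from the AlphaZero paper; lc0's exact
# formula may differ).  For each candidate move we combine:
#   Q = average value of that move so far                           (exploitation)
#   U = cpuct * prior * sqrt(parent visits) / (1 + child visits)    (exploration)
def select_move(children, cpuct):
    parent_visits = sum(c["visits"] for c in children)
    def score(c):
        q = c["value_sum"] / c["visits"] if c["visits"] > 0 else 0.0
        u = cpuct * c["prior"] * math.sqrt(parent_visits) / (1 + c["visits"])
        return q + u
    return max(children, key=score)

# Made-up example: a well-explored "safe" move vs a barely explored one.
children = [
    {"name": "Nf3",  "prior": 0.3, "visits": 30, "value_sum": 18.0},  # Q = 0.6
    {"name": "Qxh7", "prior": 0.3, "visits": 2,  "value_sum": 0.4},   # Q = 0.2
]
print(select_move(children, cpuct=0.5)["name"])  # Nf3  -> low cpuct: exploitation dominates
print(select_move(children, cpuct=5.0)["name"])  # Qxh7 -> high cpuct: exploration dominates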

Nathan Blaxall

Sep 11, 2018, 6:42:03 PM
to LCZero
I should say (for those that don't know), optimization techniques try to find a good "minimum" because they try to minimize the error (which maximizes the playing strength).  A good minimum will be a pretty low one - it may not be the lowest, but hopefully fairly low (and certainly much better than a local minimum that it might fall into at the start if the settings aren't right).

Also there is a lot of talk about "temperature" on discord.
A higher temperature means that it'll occasionally randomly override its strongest move and try some other move.  Think of atoms moving around more when hotter.  This is good because otherwise the NN will always tend to do the same thing and won't learn from new sorts of moves (new ideas).
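Roughly, temperature just controls how randomly a move is picked from the search's visit counts, something like this sketch (my own illustration, not the actual lc0 code):

import random

# Toy illustration of temperature-based move choice (not the actual lc0 code).
# The candidate moves' visit counts from search are turned into probabilities:
# temperature around 1 keeps plenty of randomness, temperature -> 0 collapses to
# "always play the most-visited move".
def pick_move(moves, visits, temperature):
    if temperature == 0:
        return moves[visits.index(max(visits))]
    weights = [v ** (1.0 / temperature) for v in visits]
    return random.choices(moves, weights=weights, k=1)[0]

moves = ["e4", "d4", "Nf3"]
visits = [500, 250, 50]
print(pick_move(moves, visits, temperature=1.0))  # usually e4, but sometimes d4 or Nf3
print(pick_move(moves, visits, temperature=0))    # always e4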
From what I've read on discord (from what I can understand):
  •  During self-play training games - this is the stage that generates data for the NN to use to adjust itself (hopefully closer to the optimum).  During this stage, temperature is on to try new things occasionally (temperature=1).  There is debate about whether temperature affects Leela in the endgame and her ability to detect perpetual check draws etc, because she might be learning that if she waits then the other side (during self-play) will eventually make a sub-optimal move (because temperature is on) and she'll win.  Some think the temperature should be a decaying one (so that it doesn't affect endgames so much), but this is not currently used, mainly because Google didn't decay theirs for A0, but also because some think that if it decays then Leela won't learn new ideas in the endgame.
  • During training-match testing - this is the stage to see if an adjustment to the NN from the above stage is any good or not, and it plays matches against the previous NN.  Here, temperature is initially on, but it decays quite fast; this has the effect of making these matches play different openings, but then play as if temp=0 for the rest of the game (i.e. make what it thinks are the best moves).
  • During competition matches (like the current CCCC), temperature is always zero (so it plays what it considers to be the strongest moves right from the start).
It's really interesting to think about this stuff, and the knowledgeable people on discord are helpful, and I'm impressed how helpful & patient the devs are.  It's a really good community.

ovi...@gmail.com

Sep 12, 2018, 4:38:46 AM
to LCZero
Good news!!

LR lowering is working quite well

According to ccrl list:

NN 20460   2491
NN 20500   2657
NN 20520    2887

(that is more than 300 elo in just 60 nets!!)

According to Leela vs SF list:

NN 20435  2525
NN 20517  2650
NN 20546  2740

(about 200 elo in 110 nets!!)

Note that these tests were made with different hardware, time controls, etc, so the numbers are not directly comparable, only tendencies...

Btw, in this case, as elo has increased so much, even self-elo shows a big rise.

Self-elo
  
 NN 20490   3205
 NN 20560   3678

(400 self-elo increase in 70 nets).

In general, self-elo shows too many fluctuations to be useful...

Another way to watch how Leela is improving is on Twitch.

For instance, twitch/edosani often shows matches between the latest nets and external engines of about the same strength. Now the opponents are about 3000 elo!

 I hope this tendency can last for several days and reach maybe 3100-3200 elo. 

After the rise is over, there will be time for another LR drop.

I am a bit busy at work these days, but as soon as possible I will try to write about what I have learned about these parameters: LR and training batch; CPUCT and available training nodes (they are closely related).

bjua...@gmail.com

Sep 12, 2018, 4:53:21 AM
to LCZero
My question may seem naïve, but would it not be wiser to lower LR by 5% every 100,000 games to smooth the results?

ovi...@gmail.com

Sep 12, 2018, 5:20:26 AM
to LCZero
It is not naive at all, it is clever. There are a lot of proposals to have a "dynamic" LR or CPUCT, changing over the course of the experiment. The problem is that each experiment takes about 2 months!!!

For instance, should LR change linearly, exponentially, or as a second-order polynomial? With which parameters for those functions?
If experiments lasted 1 day, you could afford to do many trials... in this case, it is not possible. Many people complain about why the devs try to follow DeepMind's steps with A0 so closely. The answer is that they had very powerful hardware and surely made a lot of tests... unfortunately, they did not publish all they knew...

Now it is necessary to painfully rediscover all these things.
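As a rough illustration of why this is not a trivial choice: lowering LR by 5% every 100,000 games is basically an exponential decay, and it takes roughly 4.5M games to fall by a factor of 10, after which it keeps falling. A toy comparison (my own numbers, purely illustrative; only the first hard drop of the current plan is shown):

# Toy comparison (not project code) of two ways of lowering the learning rate.
def lr_step(games):           # current approach: one hard drop at ~13M games (first drop only)
    return 0.2 if games < 13_000_000 else 0.02

def lr_smooth(games):         # the suggestion: multiply by 0.95 every 100,000 games
    return 0.2 * (0.95 ** (games // 100_000))

for g in [0, 4_500_000, 13_000_000, 20_000_000]:
    print(g, round(lr_step(g), 5), round(lr_smooth(g), 5))
# The smooth schedule reaches 0.02 only after ~4.5M games and then keeps dropping far
# below it, so its overall shape (and how to tune it) is itself a separate experiment.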

pw31

Sep 12, 2018, 5:24:42 AM
to LCZero
Hello Nathan, thanks for your explanations, including the one about temperature during training.
One question: is the progress of test30xxx visible / monitored anywhere?

Nathan Blaxall

Sep 12, 2018, 5:33:52 AM
to LCZero
This will show some limited info about the networks for test30 (but no graphs or anything): http://lczero.org/networks/2
... that's all I know.

ovi...@gmail.com

Sep 12, 2018, 6:22:50 AM
to LCZero
I have made a graph for test10 (we can compare it to the one for test20 when it is finished).
I have taken elo values from the CCRL list. To simplify the graph, I have only included nets that break the earlier record; only record-breaking nets are included.

This is what results:

Test10-elovsnets.JPG


LR drops are marked with red squares (at 10077, 10320 and 11013).

With LR = 0.2 (at the beginning), elo grew from 0 to 2866 (82.2% of the total elo)
     LR = 0.02:   2866 to 3250 (11.0%)
     LR = 0.002:  3250 to 3420 (4.9%)
     LR = 0.0002: 3420 to 3488 (1.9%)

So, with CPUCT 1.2, most of the growth happened at the very beginning. From the graph you can also see that the first LR drop causes a big rise, but the next drops produce smaller changes.

 I hope you like it!

pw31

Sep 12, 2018, 6:31:51 AM
to LCZero
Very informative again, ovi..., thank you!

pw31

Sep 12, 2018, 6:37:02 AM
to LCZero
Is it known whether similar manipulation of parameters was applied to Leela Zero (the Go program),
see http://zero.sjeng.org/? The visible steps below are associated with increases of the network size.

[image attachment]

bjua...@gmail.com

Sep 12, 2018, 7:47:57 AM
to LCZero
Thank you OVI for your enlightening answer. I intuitively thought that taking 95% of the previous value of LR could eliminate the rebounds that we see with each change. Dividing LR by 10 seemed like a brutal approach...

Ludo

Sep 12, 2018, 10:43:59 AM
to LCZero
I was sent here for an explanation of the issue in https://groups.google.com/forum/#!topic/lczero/lkVD4bT9fGg, but can't see how it is related. Anyone with a clue?

pw31

Sep 12, 2018, 12:22:26 PM
to LCZero
Ludo, this post has a lot of information about what is going on with the test20xxx series at the moment,
why it was set up in the first place, and some explanations why the progress looks a bit slow at the moment,
but has accelerated somewhat since Id20497.

ovi...@gmail.com

Sep 12, 2018, 12:55:54 PM
to LCZero
Well, I do not know why anyone pointed to this thread, because there was no explanation about that here... I will try my best.

In test20 I said the main parameters were CPUCT and a changing LR... (but that is not all; there is another parameter that has been changing, which I did not mention so as not to complicate things...).

The other parameter to look at is the resign threshold.
What is that parameter for? Well, training games are played to the very end. That means some games are played out in totally desperate positions where there is little to learn. They waste resources as well. DeepMind, with A0, described in their paper that they finished some % of the games early; that is, they implemented a resign policy.

So you choose a threshold. If the evaluation is worse than that, the game is finished. The higher the threshold, the higher the %. However, you have to choose it carefully or you will reject games that were not really lost (those are called false positives).

In previous runs (testXX), a fixed threshold was used and the % of rejected games was monitored, etc.
In test20, Tilps (username of the dev in charge of this run) is also testing a "dynamic resign threshold". He analyzes how many games have been rejected and gets an estimate of how high the resign threshold should be.
Every X games there was something like this: "RS is 3%, optimal suggestion is 5%, it will be increased to 3.5%", etc. So RS is adjusted.
Unfortunately, for reasons still not known, that produced spikes in the self-elo graph (it seems the discontinuities affect the training badly). After every spike, the net needs some time to recover... This is one of the reasons the LR drop was delayed so much (you can see the spikes in the "Elo estimates" graph shown below the self-elo graph).
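In code terms, I imagine the resign logic as something roughly like this (my own sketch of the idea, not the actual lc0 implementation; the function names and numbers are made up):

# Rough sketch of the resign idea (not the actual lc0 code; names and values are made up).
# During a training game: stop early if the side to move thinks the position is hopelessly lost.
def should_resign(win_probability, resign_threshold=0.05):
    return win_probability < resign_threshold

# Dynamic adjustment: a fraction of flagged games is played out to the end anyway,
# to measure "false positives" (games flagged as lost that were actually drawn or won).
# The threshold is then nudged so the false-positive rate stays acceptable.
def adjust_threshold(current_threshold, false_positive_rate, target_rate=0.05):
    if false_positive_rate > target_rate:
        return current_threshold * 0.9   # resign less eagerly
    return current_threshold * 1.1       # resign earlier, save more resources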

Well, Tilps decided to do another experiment to try some patches (I do not know exactly what he is doing). The important thing for this explanation is that this is the Test30 run.

For some technical reason, well beyond me, networks from test20 and test30 are put together in 


The first column is just the order in which the networks are produced. The second one is the important one (the number column). You can see that those which start with 20XXX are from test20
and those with 30XXX are from test30. There are limited resources for test30 and it started recently, so these nets are still very weak...

Thomas Kaas chose ID 20563, Number 30015 (so this is the 15th net from test30).

Networks from test20 are indeed playing at about 3000 elo...

(see twitch/edosani right now)

net 20548 is playing Naum46 (estimated 3098 elo) result: 3-1 for Leela and crafty24.1 (estimated 2881 elo) result: 2.5-1.5 for Leela.

It is not a very good idea to have both sets of nets in the same place, but there is not much room to put everything.

I hope this explanation helps!

Thomas Kaas

Sep 12, 2018, 1:01:04 PM
to LCZero
I didn't even know there was a test30. Confused at the moment.

Ludo

Sep 12, 2018, 3:39:54 PM
to LCZero
Thank you for clarifying this!

A Thule

Sep 12, 2018, 6:13:41 PM
to LCZero
There’s lots of great explanation in this thread. Can someone take a moment and explain network size (such as A x B, where A, B are integers)? I get that A x B describes a matrix of values sized A by B, but what are A and B: where do they come from?

Nathan Blaxall

Sep 12, 2018, 9:26:46 PM
to LCZero
 This is what I saw on discord...
A x B is referring to the architecture of the NN:
A = number of blocks in the NN
B = number of filters
In a 10x128 network, there are 128 unique filters for each of the 10 blocks.
Also, it's two convolutional layers per block, which both have their own unique learned filters, so a rough estimate of the number of unique 3x3 filters learned for convolution is 128 * 2 * 10 = 2560 filters in the above case of a 10x128 network.

Someone else can provide more explanations of the above.
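If it helps, here's a back-of-the-envelope version of that count (my own rough arithmetic; it ignores the input stem, the policy and value heads, batch-norm parameters, biases, etc.):

# Back-of-the-envelope count for an "A x B" (blocks x filters) network.
# My own rough arithmetic -- ignores the input stem, the policy/value heads,
# batch-norm parameters, biases, etc.
def rough_tower_size(blocks, filters):
    total_filters = blocks * 2 * filters               # two 3x3 conv layers per block
    weights = blocks * 2 * filters * filters * 3 * 3   # each 3x3 layer: filters*filters*9 weights
    return total_filters, weights

print(rough_tower_size(10, 128))   # (2560, 2949120) -> ~3M weights in the residual tower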

ovi...@gmail.com

Sep 13, 2018, 6:33:20 AM
to LCZero
I have been working on a graph comparing test10 and test20.

There is a problem in making a direct comparison: the nets have different numbers of games, so 100 NNs in test10 is not the same (in games) as 100 NNs in test20. Furthermore, even within test10 or test20, the number of games changes from one net to the next.

To make the graph:

a) All the data is taken from the CCRL list by MTGOStark (he deserves the credit for the calculations).

b) Only nets that set an elo record are included in the graph, i.e. those whose elo is higher than all the previous ones. That makes the graph monotonic (always growing). The data usually goes up and down; this way the graph is clearer. (There is a rough sketch of this filtering and the games conversion just after this list.)

c) In order to calculate the number of games corresponding to each net:
  Test10:  52129000 games, 1261 nets, 41300 games/net on average.
  Test20:  15398000 games, 555 nets (I removed the first 56 networks made with random data), 27700 games/net on average.

d) I also include the positions of the LR drops for both runs.
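For anyone who wants to reproduce it, the filtering in b) and the conversion in c) are roughly this (a sketch with made-up sample data; the real elo values come from the MTGOStark sheet):

# Sketch of how the graph data was built (made-up sample data; real elo values
# come from the MTGOStark CCRL sheet).
elo_by_net = [(10005, 1200), (10010, 1150), (10020, 1400), (10030, 1380), (10040, 1600)]

# b) keep only record-breaking nets, so the curve is monotonically increasing
records, best = [], float("-inf")
for net, elo in elo_by_net:
    if elo > best:
        best = elo
        records.append((net, elo))

# c) convert the net index into an approximate game count using the average games/net
GAMES_PER_NET_TEST10 = 41_300
points = [((net - 10000) * GAMES_PER_NET_TEST10, elo) for net, elo in records]
print(points)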

Here it is! 

ComparisonTest10-20.JPG


As you can see, things are very promising! The first LR leg had bigger growth for test10, but the second LR leg is recovering fast. From the graph alone, it seems test20 elo might finish higher than test10 elo. (We must still be cautious...). Let's see.

 I will add some data analysis when it is done.


pw31

Sep 13, 2018, 9:13:18 AM
to LCZero
Excellent visualisation, ovi, thank you very much!

Shawn S.

Sep 13, 2018, 1:13:24 PM
to LCZero
test20 has a shape more like A0 with a long slowdown before the spike.

Nathan Blaxall

Sep 13, 2018, 8:42:18 PM
to LCZero
For those who like self-play graphs (elo is misleading, and so are the up & down jumps, but the general long-term direction should still be up):


Ludo

Sep 14, 2018, 2:09:58 PM
to LCZero
Does the training run 2 correspond to network 30xxx?

Does anyone have a clue as to why it exhibits such a smooth and quick progress curve compared to the 20xxx? Maybe some kind of progressive LR decrease?

Where can we find the general info about this training run 2?

Thanks

Nathan Blaxall

Sep 14, 2018, 5:41:45 PM
to LCZero
Also click on the new "full Elo graph" which is only for test30.

The graph is deceptive because it's a different scale ... it's doing roughly about as well as test20 from what I can tell.

On discord #announcements there is a post:

Tilps

"Test30 is running in parallel with Test20. Current plans for Test30 is to experiment with new network initialization strategy to see if it solves the spike issue, then experiment with Tablebase rescoring. Let me know if you'd like to contribute to Test30 instead of Test20 and I'll redirect your contributions on the server."

Note that when he says network initialization, he's talking about the random initialization of the NN before it starts.  Apparently test20 had a bug where it wasn't initialized properly.

Also note the Tablebase rescoring hasn't been turned on yet for test30 (it may be turned on in a couple of days - and I'm excited as I think it could make a big difference in endgames and positions leading up to endgames).

Ludo

Sep 15, 2018, 1:48:54 AM
to LCZero
Nice, thank you! You are right about the scale, they are doing about the same indeed. Excited to see what the use of TBs in the learning process will provide as well.

Ludo

Sep 17, 2018, 4:07:00 AM
to LCZero
Nice comparison. Thanks for sharing that.

Right now, after 19.2M training games, self elo has reached 4120 (NN 20695). In the 10xxx run, this was reached after 22M training games (NN 10323). If this tendency is confirmed, then we have reached the point where self elo is higher in 20xxx than in 10xxx based on the number of training games, meaning it is even more likely that the elo ceiling will be higher with 20xxx.

However, self elo has proved to not be very meaningful / reliable so far. Is there someone here who would be willing to make a test match between IDs 10323 and 20695 to see if this comparison means something? I would do it myself but I do not have the hardware for that...

Thanks!

ovi...@gmail.com

Sep 17, 2018, 5:25:57 AM
to LCZero
Hello Ludo,

Self-elo is totally misleading... according to several tests, NN 2XXXX is now at about 3000 elo (but growing well...). On the other hand, NN 10323 was about 3200-3250 elo.

I personally would like to see a match between the latest NN2XXXX net and 10161 (the one that was playing TCEC div4). NN10161 was about 3150 elo (still higher), but it would be interesting to compare styles...

It is supposed that test20 nets, with their higher cpuct, should be more "creative", more tactical.

In any case, mind that test20 is "slow cooking". What matters is not the number of training games (the experiment is planned to last until 80M), but the ceiling, the maximum elo. Slow growth, higher ceiling...

Ludo

Sep 18, 2018, 2:47:40 AM
to LCZero
OK thanks ovi. I did not know where to look for the "true" (non-self) rating of the 10xxx nets before posting this.

Hans Ekbrand

unread,
Sep 18, 2018, 6:50:09 PM9/18/18
to LCZero
Well done summarizing and explaining lc0 training!

I'd like to add another source of information about how the training process evolves: tensorboard. http://testtraining.lczero.org/#scalars

Leela's NN is trained using TensorFlow, and TensorFlow comes with a utility called TensorBoard, where you can follow the training process. To be more precise, each NN is the outcome of one training pass, and in the graphs shown in TensorBoard each training pass is one measurement, i.e. one point.

The first graph on the tensorboard is accuracy; it represents the accuracy of the predictions the NN makes about the outcomes of the games, after training has finished. The higher the accuracy, the better Leela is at predicting outcomes, i.e. at evaluating positions. Technically, training aims at minimising the loss rather than maximising the accuracy, but MSE loss, the second graph on tensorboard, is almost a mirror of accuracy: if the accuracy goes up, MSE loss goes down, albeit not necessarily linearly (I think).

Orange dots represent T20, red dots represent T30. The orange and red dots are based on predictions made on data that the NN was not shown during the training. In the MSE graph there are also blue dots, which represent values for predictions made on data that the NN was shown during the training pass. T10 had a problem with overfitting, indicated by a gap between blue and orange dots, a problem that neither T20 nor T30 has. In general, it is easier to predict outcomes that you have been shown during training, so generally the blue dots are slightly "better" (lower loss); as long as the difference is small, the NN learns stuff that generalises well to unseen data (new positions).

The fourth graph, "Policy Loss", shows (I think) the difference between the move the NN suggests before search and the move actually chosen after search.
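Very roughly, and with the caveat that this is my own sketch rather than the actual lczero training code, the two losses look something like this:

import numpy as np

# My own rough sketch of the two losses, not the actual lczero training code.
def mse_value_loss(predicted_outcomes, actual_outcomes):
    # outcomes are game results seen from the side to move, e.g. +1 / 0 / -1
    p, a = np.asarray(predicted_outcomes), np.asarray(actual_outcomes)
    return float(np.mean((p - a) ** 2))

def policy_loss(predicted_move_probs, search_move_probs):
    # cross-entropy between the net's raw move probabilities and the move
    # distribution produced by the search (the training target)
    q, p = np.asarray(predicted_move_probs), np.asarray(search_move_probs)
    return float(-np.sum(p * np.log(q + 1e-12)))

print(mse_value_loss([0.8, -0.2], [1.0, 0.0]))          # lower is better
print(policy_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))    # lower is better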

These statistics are instrumental for:

 1. deciding when to lower the Learning Rate,
 2. spotting problems with overfitting,
 3. telling whether or not the network is learning anything (in combination with Self Elo and Elo)

T20 had a problematic initialisation, and there are spikes in the curves which we don't really understand. In short, T20 is not a perfect net. However, T20 was given a rather long training with a high LR, and training with a high LR means that weird stuff that she might have "learnt" in the beginning is likely to be forgotten. That being said, the curves for T30 are gorgeous.

I'm not an expert, so parts of what I've stated might be wrong.

ovi...@gmail.com

Sep 19, 2018, 3:34:52 AM
to LCZero
Thank you very much, Hans, for your explanation. Usually nobody explains anything about TensorBoard, but the devs comment about these parameters a lot.
 I understand a lot more now!

ovi...@gmail.com

Sep 19, 2018, 4:56:32 AM
to LCZero
Just to keep updating news...

Tilps has changed the resign threshold. This is why self elo shows a drop. It is expected to recover soon (let's hope).

Ludo

Sep 19, 2018, 6:12:54 AM
to LCZero
Thanks for the update ovi. Where do you find this info? I feel annoyed about the lack of communication on these types of decisions and why they are suddenly applied...

Curious

Sep 19, 2018, 6:23:09 AM
to LCZero
Discord, #dev-log channel contains all such changes.

Daniel Smith

Sep 19, 2018, 6:57:50 AM
to LCZero
If he hadn't changed the resign threshold, would we have expected to see self elo continue to rise steadily?

ovi...@gmail.com

Sep 19, 2018, 7:02:16 AM
to LCZero
Daniel, just look at the "Full elo graph" on the front web page. You will see that self-elo had already been growing steadily for quite a long time. Will it continue? Who knows...

(anyway, the important thing is real elo, not self-elo...)

Ludo

Sep 19, 2018, 12:29:55 PM
to LCZero
Thanks for the link, I was looking for that page again, but I can't find my way in Leela's labyrinthine website architecture.

The curves for T30 look great indeed! I am impressed that the MSE is already better for T30 than for T20. This should mean the eval function of T30 is better!

I would be interested in some match test between both.

Trevor G

Sep 19, 2018, 12:54:03 PM
to smin...@hotmail.fr, LCZero
I don't think it necessarily means the eval is better... It might be easier to get a lower MSE loss from training data where an engine plays completely randomly than from training data using a stronger engine. Or perhaps vice versa. I don't know how the metaparameters are set for T20 and T30, but it could be that one set of metaparameters yields training data that is simply easier to get a lower MSE loss on.

I guess it's likely to be better, just saying it's not guaranteed based on that single number.


Trevor G

Sep 21, 2018, 9:18:23 AM
to Ludovic Moreau, LCZero
I understand that Leela does self-play training.

I think in general, it is probably mostly true that lower MSE loss = better network... But as you can see in tensorboard, MSE doesn't simply go down as the network generation goes up. The reason for this, I think, is that there's a complex dynamic between the strength of the network and the ease with which training targets can be fitted:
(1) Training reduces MSE ==> network strength goes up
(2) Network strength goes up ==> self-play training game quality is better
(3) Training game quality is better ==> more difficult to fit train targets as the target objective function essentially increases in complexity (generally speaking)
(4) More difficult to fit training data ==> MSE goes up

So MSE is not necessarily exactly correlated with network strength. (Plus this doesn't even say anything about the policy head and targets).

I agree that the best way to compare the two is to have a match between the two.


On Thu, Sep 20, 2018 at 3:52 PM Ludo <smin...@hotmail.fr> wrote:
Thank you Trevor.

I actually do not think the situation that you are describing can occur, because Leela trains only against herself. So the eval function is the same for black and white. If one side makes mistakes, then the other side cannot identify them as mistakes and take advantage of them to win the game, because that would mean it has a better eval function than the side which makes the mistakes.

If it happens that one side makes mistakes and the other side manages to win, it should be out of pure luck. The opposite situation (one side making mistakes and the other side being unable to convert to a win) would also occur equally often. With the large number of games being played, both situations cancel out with no impact on MSE.

What you are describing could happen, though, if Leela was training against a weaker engine. However, since she only trains against herself, what you describe sounds contradictory to me.

Or am I missing something? Anyway, the best way to check would be to set up a test match between T30 and T20.

Ludo

Sep 21, 2018, 10:57:50 AM
to LCZero
Thanks for posting my reply. I have not been able to post anything in this thread these last days. I am trying to run a match between the 2 nets to check what we were saying. I am still in the process of figuring out how to proceed, though... I just created a thread to ask for help.