Possible to SWA the last few test10 networks?

Views: 1,192

MindMeNot

2018/11/06 15:55:11
To: LCZero
I'd like to know if somebody has already tried to average the last networks of test10, or something crazy like that.

Trevor G

2018/11/06 20:29:33
To: MindMeNot、LCZero
That would be a very easy experiment to do. Somebody should try this. Maybe same with some of the latest 2xxxx nets.

One thing, though: my reading of SWA was that it helps guide the weights toward flatter optima that generalize better, and so I *think* a key point is that it should be done early in training. With such low learning rates at the end, I'm not sure averaging would make that much of a difference.
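For reference, the averaging at the heart of SWA is trivial — a toy sketch (not Leela's actual implementation) where each snapshot is just a flat array of a network's weights:

```python
import numpy as np

def swa_average(snapshots):
    """Equal-weight SWA: the element-wise mean of weight snapshots
    captured at different points during training."""
    return np.mean(np.stack(snapshots), axis=0)

# Toy example: three snapshots of a 4-weight "network"
snaps = [np.array([1.0, 2.0, 3.0, 4.0]),
         np.array([1.2, 1.8, 3.1, 3.9]),
         np.array([0.8, 2.2, 2.9, 4.1])]
print(swa_average(snaps))  # -> [1. 2. 3. 4.]
```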

On the other hand, reinforcement learning of games tends to produce cyclical instabilities. It's kind of a rock-paper-scissors thing, where the net plays rock for a while, then learns that paper is better, then scissors, and then it's back to rock.... DeepMind's use of history replay in deep Q-learning, and of big training window sizes for AlphaZero, was supposed to help stabilize this. But I'm sure it still exists to some extent. Maybe one SWA pass at the end can help erase that effect? If there is a benefit in that respect, then my hypothesis is it's likely best to average a bunch of nets that exactly span the training window size (to help ensure you mix all the rocks, papers, and scissors together). Or maybe there's a good way to look for cycle periods and try to match that - it could be as simple as graphing the value output at the start position across a long range of networks.

Either way, yeah somebody ought to try this. I’m curious what would happen.



--
You received this message because you are subscribed to the Google Groups "LCZero" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lczero+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lczero/f75ef487-56dd-47d7-9680-b32f7dc61fd1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jon Mike

2018/11/06 21:10:05
To: LCZero
In September, I mentioned something crazy like that...on the thread "an idea for manipulating the weights files

"After downloading Ender and seeing the weights files... I realized the values could be averaged with other networks to "mix" or "marry" them.  Also by changing and testing them scientifically we could find the relationships of the locations and values (strengths of connections).  Furthermore the larger networks could be squeezed up or down or even distilled keeping only the strongest connections.  Distillation of the connections would make a 20xxx or 30xxx network run much faster while keeping the most important connections alive.  "

but I was told... 


Jon, 
I think it would be beneficial for you to spend a bit of time reading in-depth about neural nets and deep learning, as I think you've been given some conflicting information perhaps.  The suggestions you've made -- namely averaging different networks together and reducing weight values to concentrate learning -- are not going to be successful as neural nets simply don't work that way. 


In short, no one thought it was a good idea then, and while this is different from SWA averaging, it is still averaging nonetheless. I think choosing a specific few networks and averaging their weights would be an interesting and telling experiment that should be done. I tried to do it but couldn't get the weights files into numerical form. If you know how to convert the weights to numerical form and would share some famous nets with this community (4049, 4052, 9049, 11248, 11250, 11258, 11262, etc.), I for one could use them. (I couldn't get protobuf to do this and gave up, frustrated.)

Trevor G

2018/11/06 21:54:33
To: Jon Mike、LCZero
I tend to agree with whoever responded that this is going to be an unsuccessful experiment. However, given the success of SWA in test30, why not try the closest thing that mimics it on trained nets? So I think it should be done.

I’ve written or adapted (the weights parsing was from the original lczero project) some Python code to do things like automatically download all of the weights networks, and read and write weights files (e.g., one experiment a while back was to reduce the rule50 weights). I think it’s all in my lczero_tools GitHub repo.

To convert from protobuf to a readable text file and back, you can use the net.py tool in the lczero-training repo.

So if anybody here knows some python, or can learn it, and wants to do some of these experiments, I can help show you some code to get you started. But I don’t have the time to do this right now.




LuckyDay

2018/11/07 1:14:36
To: LCZero
I think it's a worthwhile experiment, if it hasn't been done before.

It's still not entirely clear to me why SWA works (even after reading the paper). It seems that weights tend to cluster around a local minimum, and that averaging the weights tends to approximate the local minimum a bit better than the individual weights themselves do, but it is not clear whether that is proven or just a coincidence that seems to bear out.
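One way to build intuition is a deliberately contrived toy case: gradient descent on a quadratic with a learning rate so large the iterate never settles. It bounces symmetrically around the minimum forever, yet the average of the iterates lands exactly on it (an illustration, not a proof):

```python
def gd_iterates(w0, lr, steps):
    """Plain gradient descent on f(w) = (w - 5)^2; the gradient is 2*(w - 5)."""
    w, out = w0, []
    for _ in range(steps):
        w = w - lr * 2 * (w - 5)
        out.append(w)
    return out

# With lr = 1.0 the update overshoots and w flips between 10 and 0 forever:
iters = gd_iterates(0.0, 1.0, 100)
print(iters[:4])                # [10.0, 0.0, 10.0, 0.0]
print(sum(iters) / len(iters))  # 5.0 -- the averaged iterate sits on the minimum
```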

I do think an SWA could be done for the last couple of test10 nets; I think it could possibly end up being marginally stronger than the rest of the nets, including 11248, but I wouldn't expect a huge improvement.

MindMeNot

2018/11/07 3:34:20
To: LCZero
If the test30 Elo graph by aloril is anything to go by, it took 20 nets for SWA to re-plateau. So either 20 nets' worth of additional test10 self-play are required, or a way is needed to average networks 11242-11262 to mimic that.

Enqwert

2018/11/07 3:59:39
To: LCZero
I have also read the paper. I think it is intuitive that with SWA the model will generalise better, as it takes averages of weights. It makes sense to visualise the loss space as local minimums clustered in some areas. Using weighted averages ensures that the model moves around areas that have denser local minimums. After the cyclic LR change, new weights are created and averaged with the previous weights to create a small pull; the weighted average ensures that the model is still dominated by the weights from before the LR change. In a 3D space, SWA makes very good sense to me.
However, for SWA to work well, the new NNs being averaged should be sufficiently "different" to gain the benefit of generalisation, or it will be like averaging the same thing. The last test10 NNs may be too similar in that respect. Averaging them while skipping some NNs in between could be an interesting experiment, and perhaps bring a few Elo. Actually, if test30 can become as strong as test10, it would be very interesting to average the two, as they are trained in completely different ways. Averaging some old strong NNs like 390, 395, or 11089 with NNs of similar strength, but sufficiently different, could also be interesting.

Art Shoe

2018/11/07 4:45:09
To: LCZero
In the original thread by Jon Mike, Veeno posted this explanation of why averaging the weights of dissimilar networks won't work:

{Begin quote}
That might produce viable neural networks if the two you're mixing are from the same learning line (i.e. one is the ancestor of the other) and they're relatively close to each other in the genealogy. The resulting neural network will almost certainly not be better than the newer one of the two however, because the averaging will just undo some of the learning that happened since the older network.

In general, this procedure will not produce viable neural networks, even for networks with identical topology (i.e. number of layers and neurons in each layer). The reason is that every time a neural network starts learning from scratch, it starts with tiny random weights for each neuron. Although incredibly small, these initial weights are tiny biases which predispose certain neurons to slowly be specialised into certain roles as the neural network learns. For example, the 15th neuron in the 10th layer might slowly end up being specialised to evaluate the worth of sacrificing a knight (I have no idea whether that is the sort of abstract role a neuron could end up playing in this sort of neural network, but that's irrelevant). In the other network, this same role could end up being performed by the 45th neuron in the 12th layer, while the 15th in the 10th layer has a completely different job. The more a neural network learns, the more it relies on certain neurons being specialised for performing certain roles - and performing them effectively - for the entire network to be able to function properly. Averaging the weights of these two neural networks would be akin to attempting to "average" the role the heart and the kidneys play in a human body.
{End quote}

If Veeno is correct, which I think he is, it seems clear that only averaging weight values from the same network run that are reasonably close to each other chronologically could produce anything of value. If we picture the LR as the distance the weights move between two consecutive iterations of the NN, then the value of a weight will keep bouncing around, fluctuating near the minimum, but it will never settle on it. Averaging the fluctuating values approximates the minimum to within negligible differences if the number of samples is high enough - as, for example, in the case of run 20, whose weights have been fluctuating without progress for dozens of millions of games.

Enqwert

2018/11/07 6:30:19
To: LCZero
"Art Shoe"
If the roles of the nodes are decided randomly at the beginning of a run, you are right that we cannot average different NNs from different runs. Still, averaging NNs from the same run can be interesting. There can be paths connecting local minimums that are even deeper than both. Handpicking some strong NNs and averaging them should be an easy experiment once we can obtain the weights.

Art Shoe

2018/11/07 14:55:55
To: LCZero
By definition, a local minimum cannot have a "deeper path" connected to it. However, climbing over surrounding elevations can lead to another, better minimum. Averaging weights for strong NNs from the same run does sound promising, but I wouldn't try it with NNs too many generations away from each other, because of possible node drift (neuron clusters slowly changing specialization), and also because averaging only makes sense if we average weights hovering around the SAME minimum. If a particular weight in your network run has already switched over to a new local minimum, averaging those values will most likely end up on an elevation instead of a better minimum.

Another problem with this approach is that the averaging affects all weights: it would improve the minimums for some weights while hurting others. Overall, a stronger NN could be produced, but it's unlikely.

Art Shoe

2018/11/07 15:20:22
To: LCZero
All the weights evolve at the same time, and an adjustment of any weight propagates like ripples throughout the network. Averaging weights seems to me a good tool for yanking the NN out of balance completely, while retaining most of what it has learned. It could be a great way to shake up a stagnating run. Making several such attempts with different NNs from the run could provide multiple starting points for future retraining. However, if we averaged without subsequent retraining with dozens of millions of games, we would just get a crippled NN.

Another way to shake things up from time to time would be to cycle LR. But the scope of the two is not the same, and they can be used concurrently. While a cyclic LR would always affect only the latest NN, the averaging could start a new branch using any two NNs in the run that fit the profile. 

If we had unlimited training resources, that could be an interesting thing to try.


Enqwert

2018/11/07 16:41:57
To: LCZero
But local minimums are defined by us. That means we define some points in the loss surface as "local minimums" with the information available to us. We do not know whether they are true local minimums, as we don't know the shape of the loss surface. So there can be "actual" lower paths connecting our "defined" minimums. The devs say the SWA used for test30 is a weighted average of 30 sets of weights! So it seems averaging is not as destructive as we believe.

As you said, the NNs should not be too far apart to have a meaningful average. Actually, this is an easy experiment if we can get the weights. We will see what happens.

Art Shoe wrote: "By definition, a local minimum cannot have a deeper path connected."

Trevor G

2018/11/07 16:52:28
To: enqwe...@gmail.com、LCZero
You can get the weights. They're all available on the lczero website. You just need to take some and average them to do your experiment.


Enqwert

2018/11/07 17:01:16
To: LCZero
Do you mean the weight files? Do you know how to get the weight values from the files?

Jon Mike

2018/11/07 17:22:19
To: LCZero
I have been wanting the same thing for some time. The current weight files are non-numerical. I think the net.py tool probably enables the conversion to numerical form (I haven't had the time to try). Once in numerical form (using Ender's weights, as an example), I could not find a convenient way to manage the manipulations I wanted (for starters, simple averaging). I tried using Excel but couldn't get the export file to work because of added spaces and lines...

Trevor G

2018/11/07 18:11:14
To: enqwe...@gmail.com、LCZero
Yes...
Use this to convert from protobuf to text files and back again: https://github.com/LeelaChessZero/lczero-training/blob/master/tf/net.py
Once in text format, here's an example where I updated the rule-50 weights by a constant coefficient: https://gist.github.com/so-much-meta/c048da7c8c1c654be344714d5f2bb60c

Just make some changes to this gist. You'd need:
1. A function to read the weights from a single file -- note you'd want to replace lines 64-65 in that gist with something like... weights = [float(weight_str) for weight_str in line.split()] ...and build a big list of lists of weights.
-- The gist as-is only reads the first weights line because it's only interested in the rule-50 input weighting, but here you want all weights, so get rid of line 64 (the "if idx==0").

2. A function to average the weights... For two weights files, it could look like this:
import numpy as np

def average_weights(weights1, weights2):
    result = []
    for weight_line1, weight_line2 in zip(weights1, weights2):
        # element-wise mean of the corresponding weight lines
        line_result = (np.array(weight_line1) + np.array(weight_line2)) / 2
        result.append(line_result)
    return result

3. A function to output the weights. Refer to line 39 in the gist for changing a numerical weights line to a string, and lines 67-69 for outputting to a file.


Again... This is really not a very difficult thing to do. And I hope some of the people here who offer all of these ideas take the time to actually do the things they think *somebody* should try. If that means needing to learn some basic programming skills, then take the time to do that.



Trevor G

2018/11/07 18:21:56
To: enqwe...@gmail.com、LCZero
The weights file format in text looks like this...


——
2
123.456 123.456 123.456
123.456 123.456
...
——

That is, it’s the number “2” by itself in a single line representing the version of the weights file.

Then it’s followed by a bunch of lines where each line is a bunch of floating point numbers separated by a space (each line of numbers represents the weights for a single operation - like convolution or adding a bias).

The number of lines in the file, and the number of values in each specific line, need to remain consistent. To average weights files, you just average the corresponding numbers.
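Putting that together, a minimal script for the text format just described (a version line, then one line of space-separated floats per operation) might look like the sketch below. The file paths are placeholders, and all input files must come from networks with identical architecture:

```python
import numpy as np

def read_weights(path):
    """Parse a text-format weights file into (version, list of arrays)."""
    with open(path) as f:
        version = f.readline().strip()
        layers = [np.array(line.split(), dtype=np.float64)
                  for line in f if line.strip()]
    return version, layers

def average_weights_files(paths, out_path):
    """Write the element-wise average of several text weights files."""
    versions, nets = zip(*(read_weights(p) for p in paths))
    assert len(set(versions)) == 1, "weights-file versions must match"
    with open(out_path, 'w') as f:
        f.write(versions[0] + '\n')
        for rows in zip(*nets):  # corresponding line from each file
            avg = np.mean(rows, axis=0)
            f.write(' '.join(repr(float(x)) for x in avg) + '\n')
```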

Jon Mike

2018/11/08 12:51:44
To: LCZero
@Anyone who can help,

I would gladly be the "somebody" to do the work, but unfortunately I could not get "protocol buffers editor" or net.py to translate the weights to numerical form. I don't know if protobuf is the right program for the translation, and I don't know how to use net.py. Upon downloading and executing net.py, it flashes a black cmd console screen for a millisecond.

I don't know what specific steps I should take.  

My goal is to be able to execute the conversion of the weights to numerical form.  How do I use net.py to do this?  Please help me in this and I will gladly share the results of my future experiments!

On Wednesday, November 7, 2018 at 5:21:56 PM UTC-6, Trevor wrote:
...

gvergh...@gmail.com

2018/11/09 11:33:30
To: LCZero
Network id 31029 extracted to 592 MB!

2
0.0004703402519226074 0.002410292625427246 0.001045137643814087 -0.001787811517715454 0.00011108815670013428

Trevor G

2018/11/09 12:28:41
To: gvergh...@gmail.com、lcz...@googlegroups.com
With protobuf, it is 4 bytes per weight (I think it's using 32-bit floats).
In decompressed text files, as you can see, a single weight can easily be 20 characters, which is 20 bytes - so yeah, a lot bigger. But it's easy to keep the text compressed (net.py can read .txt.gz, and I think the example code I shared can read/write .txt.gz).

It's perfectly possible to open the protobuf, make the changes, and write back to protobuf... But as a first run for these types of experiments, I think it would be easier to work with text files.
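Handling the compressed variant needs nothing beyond the standard library — gzip.open in text mode is a drop-in replacement for open (the file names below are just for illustration):

```python
import gzip

def open_weights(path, mode='rt'):
    """Open a weights text file, transparently handling .gz compression."""
    opener = gzip.open if path.endswith('.gz') else open
    return opener(path, mode)

# Write a tiny compressed weights file, then read it back:
with gzip.open('weights.txt.gz', 'wt') as f:
    f.write('2\n0.1 0.2 0.3\n')

with open_weights('weights.txt.gz') as f:
    print(f.read())  # prints the version line and the weights line back
```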


--
You received this message because you are subscribed to the Google Groups "LCZero" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lczero+un...@googlegroups.com.

Jon Mike

2018/11/09 13:45:54
To: LCZero
@gvergh,
Interesting to compare how the precision of the numbers increases over training time (earlier weights were much less precise, with many fewer decimals). If the numbers become more precise as training goes on, it seems to suggest a few things:
  • The network can be reduced.
    • The weights can be rounded to any number (1, 2, 3, etc.) of places beyond the first non-zero digit. (Perhaps this operation would only increase strength, keeping the base function while reducing size and increasing evaluation speed very significantly.)
  • The network seems sensitive to randomization in its base-level connection structure, but quickly becomes set in its ways, with the added precision yielding much less important changes.
    • This means starting multiple smaller networks with the same parameters would yield random base structures very different from one another, some much closer to "correct" than others.
    • I believe this can be seen early on in the height of the initial spike.
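For what it's worth, the rounding operation itself is easy to sketch (whether it actually preserves strength would need testing — this is just an illustration, using the weight values quoted earlier in the thread):

```python
from math import floor, log10

def round_sig(x, digits=3):
    """Round a float to `digits` significant figures."""
    if x == 0:
        return 0.0
    return round(x, -int(floor(log10(abs(x)))) + (digits - 1))

weights = [0.0004703402519226074, -0.001787811517715454, 0.00011108815670013428]
print([round_sig(w) for w in weights])  # -> [0.00047, -0.00179, 0.000111]
```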

gvergh...@gmail.com

2018/11/13 20:14:32
To: LCZero
@Trevor

By the way, does your "train_to_pgn.py" work on protobuf training data?

If not, do you know of a way to convert the data to pgn ?

Trevor G

2018/11/13 20:35:33
To: gvergh...@gmail.com、LCZero
Hmmm... didn’t realize training data was ever put in protobuf format... so the answer to that is no. However, I think there was an email on here a while back where Alexander shared a location that has both training data and training data PGNs.


gvergh...@gmail.com

2018/11/13 21:18:18
To: LCZero

Yes, I'm aware of the training pgns -- http://data.lczero.org/files/

I wanted the selfplay games done on my pc in pgn too.

And the latest binary supports SE and so the training data is unique.

Here's a sample...

gameready trainingfile ./data-lqykyuuweljj/game_000000.gz gameid 0 player1 white result whitewon moves d2d4 d7d5 g1h3 g8f6 e2e3 b8c6 g2g3 a7a5 h3g5 e7e5 c1d2 e5d4 f1d3 a5a4 e1g1 a4a3 b1a3 h7h6 g5f3 d4e3 d2e3 c6d4 e3d4 f8a3 b2a3 e8g8 f1e1 d8d6 d4e5 d6b6 d1d2 f6e4 d2e3 b6a5 h2h4 f8e8 e3f4 a5a3 h4h5 c8d7 e5c7 a3c5 d3e4 d5e4 c7d6 c5h5 f3e5 d7e6 f4h4 h5f5 h4e4 f5h5 e4f3 h5h3 c2c4 f7f6 e5g6 e6c4 e1e8 a8e8 g6e7 g8h7 f3e4 h7h8 e4c4 h3h5 d6a3 h8h7 c4b3 h7h8 g1g2 b7b5 a1h1 h5g4 h1h4 g4e2 b3f7 e8a8 e7f5 a8d8 f7g7
tournamentstatus win 1 0 lose 0 0 draw 0 0

Not a big deal as I output the display to a txt file and scrape the pgns --- but a script would be nice :)
game_000000.gz

Trevor G

2018/11/13 21:41:28
To: gvergh...@gmail.com、LCZero
I'll check to see if I can parse that...


Trevor G

2018/11/13 21:54:21
To: gvergh...@gmail.com、LCZero
Looks like it still works (well, what I have in lczero_tools - I don't know about the gist).

Do this:
import gzip
from lcztools.testing.train_parser import TrainingGame
with gzip.open('game_000000.gz') as f:
    data = f.read()
game = TrainingGame(data, 'game_000000')
print(game.get_pgn())

And you'll get this:
[Event "game_000000"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "1-0"]

1. d4 d5 2. Nh3 Nf6 3. e3 Nc6 4. g3 a5 5. Ng5 e5 6. Bd2 exd4 7. Bd3 a4 8. O-O a3 9. Nxa3 h6 10. Nf3 dxe3 11. Bxe3 Nd4 12. Bxd4 Bxa3 13. bxa3 O-O 14. Re1 Qd6 15. Be5 Qb6 16. Qd2 Ne4 17. Qe3 Qa5 18. h4 Re8 19. Qf4 Qxa3 20. h5 Bd7 21. Bxc7 Qc5 22. Bxe4 dxe4 23. Bd6 Qxh5 24. Ne5 Be6 25. Qh4 Qf5 26. Qxe4 Qh5 27. Qf3 Qh3 28. c4 f6 29. Ng6 Bxc4 30. Rxe8+ Rxe8 31. Ne7+ Kh7 32. Qe4+ Kh8 33. Qxc4 Qh5 34. Ba3 Kh7 35. Qb3 Kh8 36. Kg2 b5 37. Rh1 Qg4 38. Rh4 Qe2 39. Qf7 Ra8 40. Nf5 Rd8 41. Qxg7# 1-0

If it's in tar-file format, you can use the TarTrainingFile class...
from lcztools.testing.train_parser import TarTrainingFile
ttf = TarTrainingFile('file.tar.gz')
ttf.to_pgn('output_file.pgn')




Trevor G

2018/11/13 22:11:48
To: gvergh...@gmail.com、LCZero
One small caveat... The training data does not represent the final board position, though it does represent W/L/D result. Therefore, this script has to guess at what the final move is (it uses the most likely one). For example, if it's a win for white, and the policy says one mating move got 40% of node visits, but there's another mate which got 35% of node visits, this will choose the mating move with 40% node visits. However, actual training game self-play would play each in proportion to node visits. This script does make sure the final move selected does match results per the training target.

gvergh...@gmail.com

2018/11/13 23:06:17
To: LCZero

Thanks!! Works perfectly...

I wonder if you can do some magic with this...

It's training data from FENs -- selfplay from epd's, using dkappe's Bender.

But the fen is not output at all on the console -- I wonder why :(

Here's an example -- rnb2r2/p4k1p/2pb4/1p3pp1/6N1/2P3P1/P4PBP/R1B2RK1 w - -

gameready trainingfile ./data-ftnervsbegxq/game_000000.gz gameid 0 player1 white result whitewon moves g4e3 f5f4 e3c2 d6e5 c2d4 c8d7 f1e1 f8e8 g2f3 e5d4 e1e8 d4c3 f3h5 f7g7 a1b1 d7e8 h5e8 b5b4 g3f4 g5f4 c1f4 a7a5 g1f1 a5a4 f1e2 b4b3 a2a3 c3d4 e2d3 d4c5 b1g1 g7f8 e8h5 h7h6 f4h6 f8e7 g1g8 e7d6 h6f4 d6e7 g8b8 a8b8 f4b8 c5a3 h5d1 a3b4 b8e5 b4d6 e5d6


game_000000.gz

gtl

2018/12/18 8:53:33
To: LCZero


On Tuesday, November 6, 2018 at 9:55:11 PM UTC+1, MindMeNot wrote:
I'd like to know if somebody has already tried to average the last networks of test10, or something crazy like that.

Has anyone tested this yet?

Also, is there a Python tool to read and write the network files? If so, I could carry out this test.