lc0 progress vs. Spike 1.2


Raimund Heid

May 16, 2018, 2:33:50 PM
to LCZero
A few weeks ago I selected Spike 1.2 as an opponent for lc0 since it has a roughly equal rating (the first matches with NN ids < 230 indicated that Spike was about 30-40 rating points above lc0).

I chose a tc of 30.0+0.5 seconds and played 500 games with cutechess, using a set of 250 standard opening positions (from Nunn, Noomen, Silver). With these settings the complete match lasts about 10 hours on my PC with AMD FX-8350 + Nvidia GeForce 750 1GB.

So far lc0 achieved the best result with the famous NN id 235 (rated 5525 on http://lczero.org/networks):

   # PLAYER             : RATING  ERROR   POINTS  PLAYED    (%)
   1 Spike 1.2 Turin    :   15.1   11.2    271.0     500   54.2%
   2 lczero v0.8 Id253  :  -15.1   11.2    229.0     500   45.8%

Today I played the same match with NN id 299 (rated 5647 on http://lczero.org/networks):

   # PLAYER             : RATING  ERROR   POINTS  PLAYED    (%)
   1 Spike 1.2 Turin    :   44.6   11.3    311.5     500   62.3%
   2 lczero v0.10 Id299 :  -44.6   11.3    188.5     500   37.7%

This result is very strange, isn't it? lc0's match performance is ~60 points weaker with a NN that is supposed to be over 120 (lc0-)points better. Any explanations?
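For reference, the Elo gap implied by a match score can be estimated with the standard logistic formula. The following is a minimal sketch, assuming draws count as half a point and ignoring the error margins; the function name is illustrative, and this is not necessarily how the rating tool that produced the tables above computes its figures:

```python
import math


def elo_diff_from_score(score_fraction: float) -> float:
    """Elo advantage implied by a score fraction strictly between 0 and 1."""
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))


# Spike's score fractions, taken from the two tables above.
spike_vs_id253 = elo_diff_from_score(271.0 / 500)
spike_vs_id299 = elo_diff_from_score(311.5 / 500)

print(f"Spike vs Id253: {spike_vs_id253:+.1f} Elo")
print(f"Spike vs Id299: {spike_vs_id299:+.1f} Elo")
print(f"lc0 change between the matches: {spike_vs_id253 - spike_vs_id299:+.1f} Elo")
```

This puts Spike roughly 29 points ahead in the first match and roughly 87 points ahead in the second, a swing of a little under 60 points against lc0.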

Regards

Raimund

Trevor

May 16, 2018, 2:44:43 PM
to LCZero
I suspect that the ELO graph hasn't been at all accurate since the regression and fix. I saw a discussion somewhere (Discord or the GitHub issues list) where somebody suggested recreating the graph using non-buggy engines. I think this is the way to go. Preferably, they'd go all the way back to before the network size was increased.

Tadeusz R

May 16, 2018, 2:46:25 PM
to LCZero
Explanation is HERE.

Trevor G

May 16, 2018, 2:56:12 PM
to LCZero
(I don't know, though. Doing this graph reconstruction might take an unreasonable amount of computation for the purpose of just fixing ratings; I don't know how many games and nodes-per-move-per-game would be necessary to do this.)

--
You received this message because you are subscribed to the Google Groups "LCZero" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lczero+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lczero/2705a317-f792-44c1-bc72-6cef56f36ca9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Raimund Heid

May 16, 2018, 3:01:50 PM
to LCZero
This page explains why the lc0 rating dropped after id 235, but not why a NN performs weaker although it is rated 120 points higher than another one. Maybe the rating calculation has to be examined as well.

On Wednesday, May 16, 2018 at 8:46:25 PM UTC+2, Tadeusz R wrote:
Explanation is HERE.

Kevin Kirkpatrick

May 16, 2018, 3:27:17 PM
to LCZero
I believe this is due to the fact that ELO estimates are based on new ID vs current ID; and there is not a linear relationship between corruption of knowledge and drop in ELO.

Overly simplified example: Say ID 100 is totally healthy with ELO 5000. But a bug comes along that starts to corrupt its knowledge. From ID 100 to ID 101, the bug erases its understanding of how the knight moves. From ID 101 to ID 102, it erases its knowledge of how the bishop moves. At 103, the bug is fixed, and over the next 10 generations, the knowledge of both bishop and knight is restored.

Obviously, ID 101 is going to be much worse than ID 100: ID 100 knows how to use the knight, ID 101 does not. Let's say this drops the ELO from 5000 to 4900.
Also, ID 102 is going to be worse than ID 101 (ID 102 can't use bishop or knight; ID 101 only can't use the knight). This might also be an ELO loss of 100, dropping from 4900 to 4800.

So at ID 102, we've dropped 200 ELO. However, were we to pit ID 100 directly against ID 102 (a healthy network vs. one that can use neither bishop nor knight), we might see a much larger ELO drop of 800. Measured against ID 100, the actual performance of ID 102 might come out at 4200, not the published 4800.
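The arithmetic of this example can be sketched as follows. All numbers are the hypothetical ones from the example, not measurements, and the variable names are illustrative:

```python
from itertools import accumulate

# Hypothetical figures from the example: ID 100 is the healthy baseline.
base_rating = 5000

# Rating changes measured only between neighboring networks:
# ID 100 -> 101 and ID 101 -> 102.
neighbor_deltas = [-100, -100]

# The published rating is the running sum of neighbor-to-neighbor changes.
published = list(accumulate(neighbor_deltas, initial=base_rating))

# Assumed result of a direct ID 100 vs ID 102 match (hypothetical).
direct_delta_100_vs_102 = -800
direct_rating_102 = base_rating + direct_delta_100_vs_102

# How far the chained (published) rating overstates ID 102's strength.
overestimate = published[-1] - direct_rating_102

print("published ratings for IDs 100-102:", published)
print("ID 102 rated directly against ID 100:", direct_rating_102)
print("overestimate from chaining:", overestimate)
```

Chaining adds up the small neighbor-to-neighbor differences, so any strength gap that only shows up across several IDs never enters the published figure.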

Moving forward, the bug is fixed, and as both knight and bishop knowledge is relearned, ID 103 gains +50 ELO over 102 (to 4850), 104 gets +50 over 103 (4900), and so forth. By ID 112, the ELO has climbed to 5300.

And yet, ID 100 had "100 IDs-worth" of knowledge of bishop and knight. ID 112 only has "10 IDs-worth" of bishop and knight. So even though 112 has a published ELO of 5300, it is still much worse than ID 100 (ELO 5000).

Raimund Heid

May 16, 2018, 3:40:40 PM
to LCZero
Kevin - if your explanation is correct (it sounds plausible to me) - wouldn't the best way to continue be to use id 100 to get 103, skipping 101 and 102? I know that this discussion began a few days ago and I didn't follow it closely. Would it be possible to simply skip 101 and 102?

Raimund

Björn Holzhauer

May 16, 2018, 3:41:16 PM
to LCZero
The problem is that relative Elo differences from version to version do not necessarily give you the difference in strength several versions apart, particularly when odd things happen in between. I have a 5+3 blitz tournament of ID 300 and 237 vs. Stockfish 1 and Stockfish 1.5.1 (picked to be slightly weaker and stronger than the two versions) running on my 6-core machine with a 1080 Ti GPU. Overall, the results are roughly in line with the CCRL projections given in the FAQ. 237 is clobbering 300 in their head-to-heads (and in fact Stockfish 1.5.1 is clobbering Stockfish 1). In contrast, the results across engines are less extreme, which hints that, unless you are careful, engines may sometimes end up being optimized for beating other versions of themselves rather than other engines. From everything you report, what I have seen, and what others report, I suspect the network has just not recovered from its previous issues yet. I would expect it to do so eventually, and perhaps it even learns something along the way through this rather random perturbation...


On Wednesday, May 16, 2018 at 9:27:17 PM UTC+2, Kevin Kirkpatrick wrote:

Frank Müller

May 16, 2018, 3:46:47 PM
to LCZero
Have you somehow mixed up 235 and 253?

Raimund Heid

May 16, 2018, 3:53:02 PM
to LCZero
Yes, id 253 achieved the best result, not 235. Sorry for the typo.

Kevin Kirkpatrick

May 16, 2018, 4:11:40 PM
to LCZero
Yes - this is precisely the argument for "once confidence is gained that all bugs causing corruption are fixed, reset back to pre-corruption net"; and exactly why, "but the self-play ELO has recovered!" is not that strong a counter-argument.

I believe (though this changes with the weather) Leela developers are planning to revert nets once they're confident that they've identified/zapped the root problem.

Stephen Frost

May 16, 2018, 4:37:17 PM
to LCZero
I did some quick matches last night, putting up NN300 against NN251, NN226 and NN200.

Limited data set (too few games and too few openings) but indications were that NN300 was somewhere between NN200 and NN226 in terms of strength, but definitely stronger than NN200.

I also did some quick games with Ruffian (because it defaults to play a lot of different openings) against NN300, NN251 and NN226, but this was less convincing and again, too few games on my hardware.

Would be interested in whether others have also been testing.

Jesse Jordache

May 16, 2018, 4:48:06 PM
to LCZero
I just posted a game on another thread from the latest network.  I'll repost it here: http://lczero.org/match_game/277318

If you think that the pre-freakout Leela would ever produce... that.. as a match game, then you haven't been paying attention.

As far as the elo issue goes, just leave it. It's a rough guide, and the disclaimers are enough. If you want comparisons against external engines, there are plenty of those in the FAQ on the git site. Besides, you can stack the curve up against DeepMind's famous 700k cycles graph, whose elo was generated the same way.

svoi s

May 16, 2018, 4:55:57 PM
to LCZero
yes, of course here and there

On Wednesday, May 16, 2018 at 11:37:17 PM UTC+3, Stephen Frost wrote:

Frank Müller

May 16, 2018, 5:00:21 PM
to LCZero
This was just a user with a buggy setup... nothing to worry about.

Stephen Frost

May 16, 2018, 5:10:29 PM
to LCZero

On Thursday, May 17, 2018 at 6:55:57 AM UTC+10, svoi s wrote:
yes, of course here and there

Thanks ... those seem to match up pretty well to what I observed. 

Trevor G

May 16, 2018, 7:59:14 PM
to Stephen Frost, LCZero
Match games were played between buggy engines, right? Then the elo graph for a while has represented a bad metric of comparing the strength between networks.

While there’s truth to the idea that there’s a problem when you generate this data only by playing matches between neighboring networks... The simpler explanation is that for a long span of network IDs, none of the data points in the graph are really all that meaningful. That is, unless you’re interested in knowing how the networks perform against each other in a buggy engine, but I’m not sure why anybody would be all that interested in knowing that.



RexT294

May 17, 2018, 9:25:01 AM
to LCZero
Version 0.10 is probably tactically weaker than 0.8. I tested it against Spike 1.4 and got that impression from four games played yesterday.

On Wednesday, May 16, 2018 at 8:33:50 PM UTC+2, Raimund Heid wrote: