[L.e.e.l.a] Requesting for Test Ideas

Cscuile

Oct 29, 2018, 8:01:14 AM
to LCZero
I have run out of things to test that are possible with my hardware. If you have any ideas or requests, please feel free to tell me.

So far I have done the following things:
- Tested Stockfish Dev Aug 18 Scaling
- Tested Leela ID 10480 Scaling (Similar to latest network, Scaling stops at around 400k nodes)
- Recreated the AlphaZero vs. Stockfish 8 Scaling chart with Leela
- Created the LeelaFish Ratio to scientifically compare GPUs to CPUs
- Periodically tested Test20 and Test30
- Searched for the strongest Networks available (ID 11258)
- Estimated AlphaZero's Elo


What else would you like me to test?

Thanks,
Cscuile



LuckyDay

Oct 29, 2018, 8:26:07 AM
to LCZero
fantastic work so far!

not sure if you'd want to embark on another scaling test so soon after the previous one, but I think it would be interesting to know if leela's scaling is affected by higher cpuct values, such as in test20, or any other parameters that might affect scaling that you can think of

Lothar Jung

Oct 29, 2018, 8:48:54 AM
to LCZero
Perhaps the scaling of AlphaZero 2080 (ti)?

Matt Blakely

Oct 29, 2018, 10:44:46 AM
to LCZero
First off - great work Cscuile!

A few ideas:
 - Test scaling of the new 20-series nets - it seems to me they may scale better than 10-series.
 - Take scaling tests further and test with 1M+ nodes to be sure
 - Test scaling of SWA or SE nets, though you'd have to get your hands on one

Let me think about it more...

Jon Mike

Oct 29, 2018, 10:52:11 AM
to LCZero
I too am very interested in the scaling of the 20xxx networks compared to the 10xxx networks.  According to theory, the larger networks should scale better but we need an expert tester to confirm. 

Impressive scientific work!  :)

ovi...@gmail.com

Oct 29, 2018, 12:27:02 PM
to LCZero
First of all, thank you for testing!!

I have an idea that is not related to scaling (in addition to testing the Elo strength of nets, which is still very necessary).

The idea is to test ending strength.

In fact, Dietrich is already doing this for his specialized net "ender".

He has a big set of ending positions in PGN format (11- or 12-man positions).
The idea is to play matches against an external engine (usually SF9 with or without TBs).

For instance: 11248 against SF9 from those starting positions (each one also played with colors reversed), etc. Leela can be tested with or without TBs to assess how much they help.
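A minimal sketch of the color-reversed pairing described above, assuming the starting positions have already been extracted as FEN strings (the set mentioned in the post is in PGN format). The names `Game` and `reversed_color_schedule` are hypothetical and only illustrate the scheduling idea; they are not part of any existing tool.

```python
from typing import List, NamedTuple


class Game(NamedTuple):
    fen: str    # starting position (assumed to be a FEN string)
    white: str  # name of the engine playing White
    black: str  # name of the engine playing Black


def reversed_color_schedule(fens: List[str], engine_a: str, engine_b: str) -> List[Game]:
    """Schedule every starting position twice, swapping colors the second time."""
    schedule = []
    for fen in fens:
        schedule.append(Game(fen, engine_a, engine_b))
        schedule.append(Game(fen, engine_b, engine_a))
    return schedule
```

Playing each position from both sides means that a position favoring one color does not bias the match result toward either engine.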

We do not know whether test20 plays endings better or worse than test10. The same goes for test30 (for test30 it is especially interesting to play with TBs, since it has been trained with TB rescoring).

In summary, the aim is to measure ending strength rather than overall strength (endings are Leela's weakest point).

The idea could be to test the latest nets, but you could also select several nets to follow how ending strength evolves with training.

This area has not been explored much, and I think it is much, much cheaper than (and different from) scaling tests.

I hope this helps. Thanks again!!

You can find starting positions and an idea on how ender works here:

Daniel Smith

Oct 29, 2018, 12:34:16 PM
to LCZero
I really like your scaling test. Thank you for doing that.

I would like to see a comparison between AZ and SF that's as true to the AZ paper as possible. I think that when we make the AZ->SF comparison more fair, by allowing EGTB, using the latest SF, using fair cache size, etc, we're not recreating the results from the AZ paper. I would like to see how the results compare to the actual AZ paper, meaning:

- 1 GB hash table size
- SF8
- No endgame tablebases or opening books
- The same console command that was used to generate moves (some say the DeepMind crew used a console command that didn't make sense)

garrykli...@gmail.com

Oct 29, 2018, 1:07:47 PM
to LCZero
I mostly look weekly for your Elo update on the main page; it helps my confidence that improvement is happening!



kamp...@gmail.com

Oct 29, 2018, 1:31:55 PM
to LCZero
Hey Cscuile,

you have done really nice work for the project! I must agree with what others mentioned: the scaling behaviour of the 2xxxx nets should be investigated.
You may have noticed my own Elo-measurements (lc0-2080 vs SF9-4CPU), where the Leela-Ratio is about 2.6, and her strength seems to be much closer to 11248 than for example MTGOStark claims to have measured.
Looking forward to your future work!

Cscuile

Oct 29, 2018, 3:16:50 PM
to LCZero
Thanks everyone! Glad to see you guys enjoy my tests!

It looks like I may be trying test20 scaling directly against SF(4 or 5). I expect these results to be similar to test10 since no major search or network architecture changes have been made. But you can't know for sure without testing. 

I would like to go above 2 million nodes for this one, but it may take weeks or even months depending on the nodes per move on my hardware. If it gets too much for me, I may need help from anyone who has access to a 2080TI. 

Thanks!

Lee Sailer

Oct 29, 2018, 3:50:03 PM
to LCZero
How about testing the lazy hybrid?  This would be a 3 way test between SFdev, LeelaLatest, and a SF-Leela hybrid.

I call it Lazy Hybrid because I have only the simplest idea in mind.  Maybe play Leela to move 30, then switch to SF.  More complex ideas for when to switch are less "lazy". 
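The simplest version of that switch rule could be sketched as below. This is only an illustration: the function name is hypothetical, and whether "to move 30" includes move 30 itself is an assumption made here.

```python
def lazy_hybrid_engine(fullmove_number: int, switch_move: int = 30) -> str:
    """Return which engine picks the move at this point in the game.

    Assumption: Leela plays up to and including `switch_move`,
    and Stockfish plays every move after that.
    """
    return "leela" if fullmove_number <= switch_move else "stockfish"
```

A less "lazy" variant would replace the move-number test with some other condition, such as the number of pieces left on the board.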

ovi...@gmail.com

Oct 29, 2018, 4:12:15 PM
to LCZero
Cscuile, scaling tests are always interesting, of course. But we are about to change the LR for test20, so a test before and after the drop might be interesting.

Cscuile

Oct 29, 2018, 8:05:04 PM
to LCZero
+Lee If someone could create a working system combining SF and Leela, I would gladly test it. The main difficulty is deciding whose evaluation, Stockfish's or Leela's, is more accurate in a given position.

Cscuile

Oct 29, 2018, 8:08:31 PM
to LCZero
Ovi, to save time, I may test 2 million nodes directly against 200k-400k. Since we know that around 200k is the point at which Test10 stopped scaling, if the match against 200k/400k yields a net 0 Elo, then it's safe to conclude that 2 million nodes and 200k/400k play at the same strength. To be on the safe side, 400k should probably be used.
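Whether such a match really shows "net 0 Elo" depends on the error bar, which shrinks only slowly as games are added. Below is a rough sketch of how an Elo difference and an approximate 95% interval can be derived from a win/draw/loss record. It assumes the standard logistic Elo formula and a normal approximation; the function names are illustrative, and this is not necessarily the method behind the rating lists discussed in this thread.

```python
import math


def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by an average score between 0 and 1."""
    score = min(max(score, 1e-9), 1 - 1e-9)  # avoid division by zero at 0% or 100%
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, low, high) for a match result, using a normal approximation."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    half_width = z * math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - half_width),
            elo_from_score(score + half_width))
```

For example, `elo_with_margin(20, 60, 20)` gives an estimate of 0 Elo, but the interval around it is still tens of Elo wide after 100 games, so a small real difference could not be ruled out at that sample size.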

Misha Golub

Oct 31, 2018, 3:33:56 AM
to LCZero
If you test at what node count the best move changes, over a large number of positions, it may contribute to understanding scaling limits.
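One way to sketch this measurement for a single position: query the engine at increasing node budgets and record the smallest budget from which the best move no longer changes. The callable `best_move_at` is a hypothetical stand-in for an actual engine query at a fixed node limit, and the budgets are whatever the tester chooses.

```python
from typing import Callable, Optional, Sequence


def settle_budget(best_move_at: Callable[[int], str],
                  budgets: Sequence[int]) -> Optional[int]:
    """Smallest tested node budget from which the best move stays the same
    up to the largest tested budget (None if no budgets are given)."""
    ordered = sorted(budgets)
    if not ordered:
        return None
    moves = [best_move_at(n) for n in ordered]
    final_move = moves[-1]
    settle = ordered[-1]
    # Walk down from the largest budget while the move matches the final choice.
    for nodes, move in zip(reversed(ordered), reversed(moves)):
        if move != final_move:
            break
        settle = nodes
    return settle
```

Collecting this value over many positions would give a distribution of "settling" node counts, which could then be compared with the node count at which match results stop improving.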

Cscuile

Nov 3, 2018, 3:58:46 PM
to LCZero
3.6 million nodes total per 2 moves is too time-consuming for me. I can't get enough games for anything statistically significant. I will need help with this.

MindMeNot

Nov 4, 2018, 10:20:42 AM
to LCZero
Could you re-test 10751?

Cscuile

Nov 4, 2018, 1:57:11 PM
to LCZero
Sure, good suggestion. I'll do that in the future.

Cscuile

Nov 5, 2018, 7:58:11 PM
to LCZero
MindMeNot, ID 10751 is around -52 Elo relative to SF9. The 112XX series is certainly stronger.

MindMeNot

Nov 11, 2018, 5:01:17 AM
to LCZero
But according to your own rating list, you calculated 10751 to be -20 Elo relative to SF9. 11258v18 is also -20 relative to SF9.

Cscuile

Nov 11, 2018, 9:42:20 AM
to LCZero
First off, that is on my old list, without fishtest's SF version comparisons. Second, that was directly against SF8, not SF9, so the Elo was inflated. Third, the Elo error bar, when all these factors are included, puts the first test within bounds.