Testing SF against lc0

Safirini

May 26, 2020, 9:44:05 PM
to FishCooking

I suggest testing Stockfish not only in self-play against itself, but also against its main opponent, lc0, especially before the TCEC championship.

jeremiah...@gmail.com

May 26, 2020, 11:42:41 PM
to FishCooking
TCEC 18 has started. 

Михаил Чалый

May 27, 2020, 12:24:09 AM
to FishCooking
1) We don't have hardware suitable for lc0 on most of the machines (80+% of machines are noob machines w/o GPU);
2) testing against an external engine will double the error bars and thus increase the number of games needed for a reliable test by a factor of 4;
3) you will need to write new logic, since a) you can't use SPRT, and b) you need to normalize Leela/SF speed to some value (as we do with TC in normal tests), so you will need asymmetric time controls;
4) there is zero proof that patches that work in self-play don't work vs Leela. Actually, there is evidence to the contrary, such as the fastgm 16-core list, where SF 11 is ahead of SF 10 by 50 Elo, which is exactly what we measure when we match them against each other.
So, tl;dr: it's hard or almost impossible to do, and the gains are not really there anyway.
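
As a rough illustration of the arithmetic behind point 2: the standard error of a measured score shrinks as 1/sqrt(N), so error bars that are twice as wide need about four times as many games to get back to the same precision. A minimal Python sketch; the per-game standard deviation of 0.4, the 10,000-game baseline, and the function names are illustrative assumptions, not fishtest values:

```python
import math


def score_standard_error(per_game_sd: float, games: int) -> float:
    """Standard error of the mean score after `games` independent games."""
    return per_game_sd / math.sqrt(games)


def games_needed(per_game_sd: float, target_se: float) -> int:
    """Approximate number of games whose standard error is at most `target_se`."""
    return math.ceil((per_game_sd / target_se) ** 2)


# Assumed per-game standard deviation of the score (illustrative only).
SD = 0.4
baseline_games = 10_000
baseline_se = score_standard_error(SD, baseline_games)

# If the noise doubles, keeping the same error bar takes about 4x the games.
doubled_noise_games = games_needed(2 * SD, baseline_se)
print(baseline_se, doubled_noise_games)
```

The sketch only shows the square-root scaling; whether an external opponent really doubles the error bars depends on the test setup.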

tmar...@gmail.com

May 27, 2020, 2:04:35 AM
to FishCooking
Yes, and it will likely be too late for any 2020 championship, since Leela is a bad bitch, about 100 Elo points above Stockfish, and with accurate positional play she will overshadow the beloved fish, despite Stockfish having a fantastic search and being tactically superior (but Stockfish has difficulty putting Leela into situations where this can be used to win). This is currently Jan Ullrich (SF) vs Lance Armstrong (lc0). If we see a surprise, it will likely be that Stockfish doesn't make 2nd place. (I know that we are comparing GPU and CPU, and such a comparison needs to be taken with a grain of salt, but this is what we can expect in the next tournaments.)

Now, I am truly a nobody in this community, but I actually support the idea of Stockfish also testing against lc0 (or antifish, though that project seems to be dead). I had planned to write this as a separate post, but decided to jump in here when somebody else suggested it. It should, however, be a previous Leela version that Stockfish scores about 50% against (to measure the actual improvements). There may be a limit on the hardware available for these tests (what does Leela do to get access to such hardware?). However, even with fewer games played against Leela, there can be an indication of improvement (or regression). The confidence interval could be specified differently, and the goal could be non-regression.

*I would, however, also like to suggest adding a test of known positions as part of CI. Stockfish ought to have a file with positions where it should either play or avoid a certain move (or moves).*

The positions added to the file should be examples where Stockfish has made a mistake. Depth could optionally be specified (with some defaults). Patches should achieve a score comparable to master. The test should replay each position x times (possibly with a default, or depending on depth). That would not just be a benefit but might also reduce the amount of work in some situations. E.g., vondele did some of this manually in the anti-suicide patch. Positions added to this file should be approved by the maintainers (just like regular patches).
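
A minimal sketch of what such a position check could look like, assuming the file uses EPD-style lines with the standard `bm` (best move) and `am` (avoid move) opcodes. The function names, the `choose_move` callback, and the stub engine are hypothetical stand-ins for illustration; nothing here is an existing Stockfish or fishtest feature:

```python
from typing import Callable, Dict, List


def parse_epd_line(line: str) -> Dict[str, object]:
    """Split an EPD-style line into its position and its bm/am move lists."""
    fields = line.strip().split(" ", 4)
    position = " ".join(fields[:4])  # board, side to move, castling, en passant
    operations = fields[4] if len(fields) > 4 else ""
    record: Dict[str, object] = {"position": position, "bm": [], "am": []}
    for operation in operations.split(";"):
        parts = operation.split()
        if parts and parts[0] in ("bm", "am"):
            record[parts[0]] = parts[1:]
    return record


def position_passed(record: Dict[str, object], move: str) -> bool:
    """A move passes if it is a listed best move (when any) and not an avoid move."""
    if record["bm"] and move not in record["bm"]:
        return False
    return move not in record["am"]


def suite_score(lines: List[str], choose_move: Callable[[str], str],
                repeats: int = 3) -> float:
    """Fraction of passed attempts, trying each position `repeats` times."""
    passed = 0
    total = 0
    for line in lines:
        record = parse_epd_line(line)
        for _ in range(repeats):
            total += 1
            passed += position_passed(record, choose_move(record["position"]))
    return passed / total if total else 0.0


# Two illustrative positions: one with a move to find, one with a move to avoid.
SUITE = [
    "r1bqkbnr/ppp2ppp/2np4/4p3/2B1P3/5Q2/PPPP1PPP/RNB1K1NR w KQkq - bm Qxf7#;",
    "rnbqkbnr/pppp1ppp/8/4p3/8/5P2/PPPPP1PP/RNBQKBNR w KQkq e6 am g4;",
]


def stub_engine(position: str) -> str:
    """Hypothetical stand-in for a real engine call at some fixed depth."""
    return "Qxf7#" if "5Q2" in position else "e4"


print(suite_score(SUITE, stub_engine))
```

A patch and master could each be run through `suite_score` and the two fractions compared; the repeat count matters only when the engine is non-deterministic, for example with multiple threads.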

The point of the above is that some yellow/neutral patches may still be improvements if they improve Stockfish's play both against lc0 and in known problematic positions. Conversely, patches that 'improve' self-play while moving in the wrong direction against lc0 and in known positions may not be actual (or at least not relevant) improvements.

Since this is my first post, I would like to thank the developers of Stockfish. You have made an amazing chess program. The code inside Stockfish is built on your great (human) ideas, and I hope more of these can later bring Stockfish back to being number 1.

Best Regards

Safirini

May 27, 2020, 10:06:04 AM
to FishCooking


1. They certainly work, but testing against Leela would show much more precisely what needs to be worked on and how the patches are progressing.
P.S. You have a Discord.
On Wednesday, May 27, 2020 at 7:24:09 AM UTC+3, Михаил Чалый wrote:
1) We don't have hardware suitable for lc0 on most of the machines (80+% of machines are noob machines w/o GPU);
2) testing against an external engine will double the error bars and thus increase the number of games needed for a reliable test by a factor of 4;
3) you will need to write new logic, since a) you can't use SPRT, and b) you need to normalize Leela/SF speed to some value (as we do with TC in normal tests), so you will need asymmetric time controls;
4) there is zero proof that patches that work in self-play don't work vs Leela. Actually, there is evidence to the contrary, such as the fastgm 16-core list, where SF 11 is ahead of SF 10 by 50 Elo, which is exactly what we measure when we match them against each other.
So, tl;dr: it's hard or almost impossible to do, and the gains are not really there anyway.

Adam Kirby

May 27, 2020, 1:31:04 PM
to FishCooking
Leela is nowhere near 100 Elo above SF. The recent 4000-game megamatch at CCC shows close to zero difference: +6 Elo for SV 3010, which is considered the best net.

michel.va...@uhasselt.be

May 27, 2020, 2:11:38 PM
to FishCooking


On Wednesday, May 27, 2020 at 6:24:09 AM UTC+2, Михаил Чалый wrote:
1) We don't have hardware suitable for lc0 on most of the machines (80+% of machines are noob machines w/o GPU);
2) testing against an external engine will double the error bars and thus increase the number of games needed for a reliable test by a factor of 4;
3) you will need to write new logic since you a) can't use SPRT,

I agree with what you write, except this. It is perfectly possible to use SPRT against a third engine. I actually once wrote a program to do SPRT against a collection of foreign engines (to answer a question by Robert Hyatt, who then strangely lost interest in the question).
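
To illustrate why SPRT does not require self-play, here is a minimal sketch of a Wald SPRT on paired results against a third engine, using a normal approximation. It is not the program mentioned above and not fishtest's implementation. The pairing scheme (patch and master each play the same opening against the foreign engine, and the difference of their scores is one observation), the hypothesized means `mu0` and `mu1`, and all function names are assumptions made for the example:

```python
import math
from statistics import mean, pvariance
from typing import Sequence, Tuple


def sprt_bounds(alpha: float = 0.05, beta: float = 0.05) -> Tuple[float, float]:
    """Wald's lower and upper stopping bounds for the log-likelihood ratio."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)


def sprt_llr(diffs: Sequence[float], mu0: float, mu1: float) -> float:
    """Normal-approximation LLR of H1 (mean = mu1) against H0 (mean = mu0).

    Each element of `diffs` is (patch score - master score) for one pair of
    games played against the same foreign engine from the same opening.
    """
    n = len(diffs)
    if n < 2:
        return 0.0
    variance = pvariance(diffs)
    if variance == 0:
        return 0.0  # no spread observed yet, so no usable evidence
    return n * (mu1 - mu0) * (2 * mean(diffs) - mu0 - mu1) / (2 * variance)


def sprt_decision(llr: float, alpha: float = 0.05, beta: float = 0.05) -> str:
    """Stop and accept a hypothesis once the LLR crosses a bound."""
    lower, upper = sprt_bounds(alpha, beta)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


# Toy data: the patch outscored master in two of four game pairs.
example_llr = sprt_llr([1.0, 0.0, 1.0, 0.0], mu0=0.0, mu1=0.5)
print(example_llr, sprt_decision(example_llr))
```

The stopping rule is the same as in self-play; only the definition of an observation changes, and the extra noise from the third engine shows up as a larger variance and therefore a slower-moving LLR.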

Safirini

May 27, 2020, 3:07:26 PM
to FishCooking
It’s better to test and find weaknesses in advance than to take painful blows during the championship and lose it.

garrykli...@gmail.com

May 27, 2020, 3:43:43 PM
to FishCooking
Don't go by those, only by long time controls, especially when comparing AB vs NN.

tmar...@gmail.com

May 27, 2020, 3:47:13 PM
to FishCooking
I do apologize for my very weak guesstimate of 100 Elo points. Besides being a weak guess in itself, the number of draws (and the difficult positions that Stockfish often manages to save) must have confused me. Looking at the recent tournaments (13 CCCC -18 +8 =176 and TCEC 17 superfinal -17 +12 =71), we see a difference of ~17-18 points. It is, however, still clear that Leela wins many more games against Stockfish than she loses. Even if the recent 4000-game match is a better measure of the Elo difference, she is still superior.
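
The Elo figure can be checked from the win/loss/draw counts with the standard logistic Elo formula. A minimal Python sketch, with the results taken from Stockfish's point of view; the function name is illustrative:

```python
import math


def elo_difference(wins: int, losses: int, draws: int) -> float:
    """Elo difference implied by a match score under the logistic Elo model.

    Undefined for a 0% or 100% score, where the formula diverges.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / score - 1.0)


# Results quoted above, counted from Stockfish's side.
cccc_13 = elo_difference(wins=8, losses=18, draws=176)
tcec_17 = elo_difference(wins=12, losses=17, draws=71)
print(round(cccc_13, 1), round(tcec_17, 1))
```

Both matches come out at roughly 17 Elo in Leela's favour. With only a few hundred games each, the error bars on these figures are wide.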

I will furthermore add that there may not be any proof that patches passing self-play tests fail against Lc0, but I don't follow the conclusion that it is therefore a bad idea to test against Leela. Is the claim that, because self-play can produce patches that also improve performance against Leela, there cannot exist "near-neutral" patches in self-play that yield bigger improvements against Leela (and/or solve problematic positions where Stockfish is known to go wrong)?

Михаил Чалый

May 31, 2020, 12:02:12 AM
to FishCooking
Every time Stockfish is not number one, about once every two weeks, people say that it should test against the number one engine.
Five years ago it was Komodo, then Houdini (SF 8+), now Leela. There can be patches that are Elo-neutral but perform well vs Leela. Maybe. But we want Stockfish to _play as strong as possible_ and not "play well vs Leela".
And whenever it gets stronger, it (surprise!) starts to play stronger against the engine that is number 1. Stockfish didn't need to test anything against Komodo to surpass Komodo between SF 7 and SF 8, Stockfish didn't need to play against Houdini to surpass it really fast, and Stockfish doesn't need to test against Leela to surpass it (yes, Leela is also developing, but my wild guess is that Stockfish dev would win a match against a gen-30 Leela net by much more than the +1 point it managed in the SuFi they played). And every other point stands: it would take a huge software development effort, huge hardware problems would have to be solved to even run this match, etc.

Safirini

Jun 1, 2020, 5:20:04 AM
to FishCooking
of course

On Sunday, May 31, 2020 at 7:02:12 AM UTC+3, Михаил Чалый wrote:
Every time Stockfish is not number one, about once every two weeks, people say that it should test against the number one engine.
Five years ago it was Komodo, then Houdini (SF 8+), now Leela. There can be patches that are Elo-neutral but perform well vs Leela. Maybe. But we want Stockfish to play as strong as possible and not "play well vs Leela".
And whenever it gets stronger, it (surprise!) starts to play stronger against the engine that is number 1. Stockfish didn't need to test anything against Komodo to surpass Komodo between SF 7 and SF 8, Stockfish didn't need to play against Houdini to surpass it really fast, and Stockfish doesn't need to test against Leela to surpass it (yes, Leela is also developing, but my wild guess is that Stockfish dev would win a match against a gen-30 Leela net by much more than the +1 point it managed in the SuFi they played). And every other point stands: it would take a huge software development effort, huge hardware problems would have to be solved to even run this match, etc.