Ranking of evaluation functions


Jack Lo

Oct 27, 2021, 4:32:11 AM
to LCZero
Adriaan de Groot showed that there is not much difference between an ordinary player and a grandmaster in the depth to which they calculate variations: in both cases it is a little more than 3 moves ahead. What differs is their ability to evaluate the position on the board.

This inspired me to test how much chess engines differ in the quality of their evaluation. I set the analysis depth to 6 plies and ran matches between the engines. As a reference I took LC0 with the Maia 1900 network, which is trained on the games of players rated around 1900 ELO. Here are the results:

No Engine ELO
1 LC0 384x30-t60-4485 (sergio-V) 2650
2 LC0 hanse-69722-vf2 (lczero.org) 2625
3 LC0 384x30-2021_0518_1740_16_793 (lczero.org) 2600
4 LC0 192x15-2021_1016_0414_39_071 (lczero.org) 2500
5 LC0 J13B.2-178 (jhorthos 320x24) 2450
6 LC0 256x20-t40-1541 (sergio-V) 2400
7 LC0 LS15-20x256SE-jj-9-75000000 (leelenstein) 2350
8 LC0 128x10-2021_0726_2120_38_663 (lczero.org) 2300
9 LC0 11258-96x8-se-5 (dkappe) 2150
10 LC0 11258-64x6-se (dkappe) 2100
11 Rybka 2.3.2a 2000
12 LC0 Maia 1900 (maiachess.com) 1900
13 Komodo 3 1750
14 Hiarcs 11.2 1700
15 Komodo 7 1650
16 Houdini 1.5a 1600
17 Komodo 1 1600
18 ProDeo 2.6 (Rebel) 1500
19 Komodo 12 1500
20 Stockfish 4 1500
21 Stockfish 1 1450
22 Stockfish 14 NNUE 1450
23 SlowChess 2.7 1450
24 Fire 8 NNUE 1350
25 Stockfish 7 1300
26 Stockfish 11 1250

The first places go to LC0; Stockfish sits at the bottom. The strongest engine with a classical evaluation is Rybka.

In recent years, developers have focused on increasing the depth of analysis. The evaluation function has been simplified to speed it up, and at the same time the accuracy of evaluation has decreased. NNUE brought some improvement, although as you can see it did not help much here.

The only exception is the LC0 project. You can see that the larger the network used, the stronger the engine plays. Maybe an even higher ELO would be possible if someone trained a 520x40 net on the games of grandmasters rated above 2700. Unsupervised (self-play) training of such a network would take very long, so how about using ready-made data, or games played by Stockfish?

On the other hand, Stockfish could be improved by using the evaluation function from LC0. Stockfish has the best search function. Combining the two (the best evaluation function and the best search function) could yield an ELO increase of hundreds of points. Of course, I am not the first to think of this. But this test shows a big difference in the quality of evaluation between LC0 and Stockfish, and it is worth thinking about how to reduce this gap.

Finally, a practical note. Using LC0 with a suitable network, you can get a pretty good sparring partner. To weaken its play further, you will also need to reduce the depth of analysis:

LC0 Maia 1600 5 plies 1550 ELO
LC0 Maia 1300 4 plies 1200 ELO




esch...@gmail.com

Oct 27, 2021, 8:47:27 AM
to LCZero
Except that ply for Lc0 isn't really ply at all.  It's just an arbitrary conversion function to make the displayed numbers look reasonable, so you are really comparing apples and oranges.

Even with traditional alpha-beta engines, ply isn't directly comparable between engines.  You could have two engines that search literally the exact same tree:  one is a ply 6 search with many moves extended, while the other is a ply 8 search with the non-extended moves reduced.
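This point can be made concrete with a toy model. This is just an illustration, not any real engine's extension or reduction rules: every node has one "tactical" and one "quiet" child; tactical moves are extended while a small budget lasts, and quiet moves get an extra ply of reduction. A nominal "depth 6" search then reaches leaves anywhere from ply 3 to ply 8, overlapping the leaves of a nominal "depth 8" search that only reduces:

```python
def leaf_plies(depth, ply=0, ext_budget=2):
    """Set of leaf depths reached by a toy search.

    Toy tree: each node has one 'tactical' and one 'quiet' child.
    Tactical moves are extended (no depth consumed) while ext_budget
    lasts; quiet moves are reduced, consuming two plies of depth.
    """
    if depth <= 0:
        return {ply}
    leaves = set()
    if ext_budget > 0:
        # extend the tactical move: same remaining depth, one ply deeper
        leaves |= leaf_plies(depth, ply + 1, ext_budget - 1)
    else:
        # budget exhausted: tactical move consumes one ply as normal
        leaves |= leaf_plies(depth - 1, ply + 1, 0)
    # quiet move gets a late-move-reduction-style extra ply dropped
    leaves |= leaf_plies(depth - 2, ply + 1, ext_budget)
    return leaves

print(sorted(leaf_plies(6)))                # [3, 4, 5, 6, 7, 8]
print(sorted(leaf_plies(8, ext_budget=0)))  # [4, 5, 6, 7, 8]
```

So a "6 ply" search with extensions actually reaches ply-8 leaves, while a "8 ply" search with reductions terminates some lines at ply 4, which is why nominal depth is not comparable across engines.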

Charles Roberson

Oct 27, 2021, 10:24:26 AM
to esch...@gmail.com, LCZero
The issue you are missing is that, around the time of Stockfish 4, the trend in search algorithms was to try selective forward pruning, which had not worked well before. With the speed of newer computers, selective forward pruning started working. I am not talking about LMR. Stockfish pushed this boundary more than most. That is why you see the newer engines playing worse when limited to 6 plies.

Charles Roberson


Jack Lo

Oct 28, 2021, 8:22:41 AM
to LCZero
These results are of course very approximate, but there is also no way to compare evaluation functions directly. I have assumed that a depth-6 search is reasonably fair to each engine. Nonetheless, the differences are so large that the overall conclusions I have presented may make sense. Whether they can be implemented is another matter.

The assumption that LC0 had to be trained unsupervised lost its meaning when the Stockfish and LC0 teams began working together. Stockfish's team, on the other hand, may be stuck in a local optimum, and using a better evaluation function could help escape it and make further progress.

Stockfish 14.1 has just been released. Test results:

Stockfish 14.1 - Stockfish 14 +23-7=10 TP=+147 ELO
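As a side note for anyone reproducing these numbers: the TP figures in this thread appear to follow the standard logistic Elo model, where a match score s implies a rating difference of -400·log10(1/s - 1). A minimal Python sketch (the ±800 cap for shutout matches is my assumption, matching the -800 quoted for a 0-40 result):

```python
import math

def performance_elo(wins, losses, draws):
    """Rating difference implied by a match result under the standard
    logistic Elo model: expected score = 1 / (1 + 10**(-diff/400))."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    if score in (0.0, 1.0):
        # a shutout implies an unbounded difference; cap it at +/-800
        return math.copysign(800.0, score - 0.5)
    return -400.0 * math.log10(1.0 / score - 1.0)

print(round(performance_elo(23, 7, 10)))  # Stockfish 14.1 - Stockfish 14: 147
```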

urib...@gmail.com

Oct 28, 2021, 9:53:17 AM
to LCZero
Part of the difference between strong players and weak players is their ability to calculate forward; it is not only the evaluation function.

The best players can calculate even 10 moves ahead when they need to, while many weaker players cannot.

Note that I disagree with the evaluation comparison because different engines have different pruning.

On Wednesday, October 27, 2021 at 11:32:11 UTC+3, Jack Lo wrote:

Warren D Smith

Oct 28, 2021, 1:32:22 PM
to LCZero
The strongest players can play over 40 simultaneous blindfold games with a score above 80%. And, in situations calling for it, they can calculate a combination a long way ahead. Most chess players cannot play even one blindfold game, and cannot correctly calculate a long way ahead even in situations where the only evaluation needed is "massive material discrepancy and/or checkmate", so that the ordinary and grandmaster players have no difference in their ability to evaluate. So whoever thinks de Groot proved otherwise is wrong, either because they did not understand de Groot, or because de Groot erred. Which? Probably both.


--
Warren D. Smith
http://RangeVoting.org <-- add your endorsement (by clicking
"endorse" as 1st step)

DBg

Oct 28, 2021, 4:13:31 PM
to LCZero
Shouldn't one be talking about a scoring function, and reserve "evaluation" for the static evaluation of full position information, mapped directly from input to output without invoking tree search and min-max integration over many branches? Such integrated optimization (min-max over the tips of the considered branches) has usually been called the score, I thought. This is just a terminology point, and also my interest. I do find this discussion interesting anyway: the effect of pruning heuristics, if I understand it well.

Jack Lo

Oct 29, 2021, 3:49:43 AM
to LCZero
Some quotes:

"The studies involve participants of all chess backgrounds, from amateurs to masters. They investigate the cognitive requirements and the thought processes involved in moving a chess piece. The participants were usually required to solve a given chess problem correctly under the supervision of an experimenter and represent their thought-processes vocally so that they could be recorded."

"One of the main themes in Thought and Choice in Chess is the difference between the expert (grandmaster) and the amateur. De Groot demonstrated that this difference does not consist of more and deeper calculation, or of a different method of thinking. It depends on the enormous amount of finely tuned experience and knowledge that the grandmaster activates as soon as he looks at a position, which enables him to, almost in a glance, get to the essence of the position and to see the most promising moves."

"Amongst them were some strong grand masters (Keres, Alekhine, Euwe, Flohr and Fine) and some masters. To investigate what happens in between, De Groot wrote down many oral reports and investigated them with meticulous care."

This research does not rule out the possibility that a player may analyze the position more deeply in some cases. It does, however, indicate that after only a few seconds a grandmaster can see possibilities that a club player will not see even after a long time. Of course, deep calculations are not done in a few seconds; that is impossible.

I don't know if static position evaluation alone is sufficient. LC0 has such a mechanism, but it is probably not enough. I have experience with unsupervised learning for Reversi/Othello. The program played against itself (similar to the LC0 project), analyzing positions 4 moves ahead; then came training, more games, and so on. At one point I tried increasing the depth of analysis to 6 and 8, but this did not improve learning. I did not experiment with depths of 2 or 0, assuming the program would then make tactical errors too often. A similar argument can be made for chess.

More results:
Stockfish 14 - Rybka 2.3.2a +1-38=1 TP = -564 ELO
Stockfish 14.1 - Rybka 2.3.2a +5-31=4 TP = -269 ELO

Stockfish 14 - Houdini 1.5a +9-25=6 TP = -147 ELO
Stockfish 14.1 - Houdini 1.5a +19-15=6 TP = +35 ELO

Stockfish 14 - LC0 Maia 1900 +4-31=5 TP = -285 ELO
Stockfish 14.1 - LC0 Maia 1900 +15-16=9 TP = -9 ELO

Stockfish 14 - LC0 11258-64x6-se +3-34=3 TP = -359 ELO
Stockfish 14.1 - LC0 11258-64x6-se +4-31=5 TP = -285 ELO

Stockfish 14 - LC0 128x10-2021_0726_2120_38_663 +0-40=0 TP = -800 ELO
Stockfish 14.1 - LC0 128x10-2021_0726_2120_38_663 +2-35=3 TP = -407 ELO

Undoubtedly, great progress has been made.