Standardized test designs and centralized test results


David Rodríguez Sánchez

Oct 13, 2018, 7:45:38 AM
to LCZero
Dear all,

Testing is becoming a bit difficult to follow, and it is hard to find correlations between the results, so I have made a template with a "standardised" test design. That way you can all run the same type of tests, we can at least find all the results in the same location with the same structure, and if you fill in the same parameters, we can even draw some graphs.

The templates are here:

You can add yourself by copying a new sheet with your name, which you can use as your workspace.
I have already added usual testers such as Bigler, Blakely or Cscuile.

Let me know if you have access to it or if you have any questions.
Hope this helps organise the testing a bit.

Jupiter

Oct 13, 2018, 2:08:58 PM
to LCZero
I have made a suggestion at https://groups.google.com/forum/#!topic/lczero/ZKF8QZdUMNU

We may have a common goal.

Due to the huge number of network IDs to test and limited testing resources, I am limiting Lc0's average search to 2000 nodes/move, but I will use a time control (TC) rather than fixed nodes/move games. The target is to generate a minimum of 2000 test games per network ID. When playing against A/B engines, the A/B engines should use the TC based on CEGT 40/4 conditions. On my machine (i7-2600, 3.4 GHz) that is only TC 3m+1s. I don't have an NVIDIA GPU, so when I test Lc0 I will use TC 25m+5s for her. One purpose of doing this is to compare Lc0 at 2000 nodes against the strength of A/B engines at CEGT 40/4 conditions. Another purpose is to see whether network IDs can gain strength even when Lc0 is limited to an average of 2000 nodes/move. And of course we can also measure the strength gained by the engine alone through improvements in time management, MCTS, parameter optimizations and others.

Shawn S.

Oct 14, 2018, 12:11:49 AM
to LCZero
I am a big fan of the idea that Lc0 needs to be tested not only against AB engines, but also against net IDs from other pipelines. Test20 has been expected to be more tactical, which might mean it seems weak against AB engines, which are way more tactical. Styles make fights and styles make chess matches. So test20 vs test10 matches are important to run as well.

Matt Blakely

Oct 14, 2018, 9:00:43 AM
to LCZero
I support the goal and will look at these today.

I keep an informal testing sheet and can transfer some historical results.

My problem is lack of time... it may seem I'm on here a lot, but that's between other duties. I should have time today though (Sunday).



David Rodríguez Sánchez

Oct 14, 2018, 11:05:07 AM
to LCZero
Hi guys,
Thanks for the responses.

I believe the best approach would be for you to coordinate and REPEAT the same test cases/parameters, each under your own test conditions/environment.


So imagine Jupiter wants to perform a certain test.
He should specify the parameters in the Test Plan sheet (NN ID to be tested, Time Controls, ...) and insert his results in the table next to it.
Then Blakely checks the test Jupiter performed and repeats it with the same test conditions, but under his own environment (CPU, GPU...).


Later on, whenever Blakely, for example, wants to test a new NN with some other parameters, Jupiter can in turn repeat that same test. We will then be able to MEASURE and COMPARE how the NN is really evolving, as we will have different references producing results under the same specifications.


This way we will avoid the SMOKE tests (not very meaningful tests) that many people are running. They hardly provide any statistical evidence of where we are: they do not have enough samples, and they cannot be compared against other people's environments. We need test runs with the same number of samples (I set a fixed number of 50 games).
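As a rough illustration of why the number of games matters for statistical evidence, here is a minimal Python sketch that turns a win/draw/loss record into an Elo difference with an approximate 95% margin. It uses the standard logistic Elo formula and a normal approximation on the mean score; the function names are made up for this example and nothing here comes from the template itself.

```python
import math

def elo_from_score(score):
    """Convert a score fraction (strictly between 0 and 1) into an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a match result.

    Assumption: games are independent and the mean score is roughly
    normally distributed, so this is only an approximation.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    half = z * math.sqrt(var / n)
    # Clamp so the Elo conversion stays defined for lopsided results.
    low = max(score - half, 1e-6)
    high = min(score + half, 1.0 - 1e-6)
    return elo_from_score(score), elo_from_score(low), elo_from_score(high)

if __name__ == "__main__":
    # Placeholder records with the same score ratio but different sample sizes.
    for record in [(20, 10, 20), (800, 400, 800)]:
        elo, low, high = elo_with_margin(*record)
        print(sum(record), "games:", round(elo), round(low), round(high))
```

With an even 20/10/20 record over 50 games, the margin under these assumptions comes out close to ±90 Elo, while the same ratio over 2000 games narrows it to well under ±20. Whatever sample size is chosen, using the same one in every run keeps the margins comparable.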


I hope you all understand that it is important not only to collect all the data in the same document (so that it is easier to find and validate), but also to actually have a TEST PLAN, so that people willing to contribute with testing can select a test case that you have designed, and others can compare it against their own hardware.


I know you are all busy and are doing this just to help the project, but if we want to provide more meaningful data, it would be great if you could make the small effort to check what your fellow testers have done, and simply try to repeat it.


Thanks for the effort!

brian

Oct 14, 2018, 1:39:26 PM
to LCZero
The idea is good. Ideally there should be some record (a spreadsheet, maybe?) that lists the tests that have been completed and the tests that still need to be done. Also, all the re-tests you suggested should be grouped together so we know how often each test has been repeated.
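One possible way to picture that grouping, as a small Python sketch: each run carries the test specification it followed, and runs with an identical specification are collected together so repeats can be counted. The class and field names below (PlanEntry, RunResult, network_id and so on) are hypothetical and do not reflect the actual columns of the shared sheet.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class PlanEntry:
    """A test design: what is tested and under which conditions (hypothetical fields)."""
    network_id: str
    opponent: str
    time_control: str
    games: int

@dataclass
class RunResult:
    """One tester's execution of a PlanEntry on their own hardware."""
    plan: PlanEntry
    tester: str
    hardware: str
    wins: int
    draws: int
    losses: int

def group_by_plan(runs):
    """Group runs that share exactly the same test design."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run.plan].append(run)
    return dict(grouped)

def pending(plans, runs, min_repeats=2):
    """Return the designs that have been run fewer than min_repeats times."""
    grouped = group_by_plan(runs)
    return [p for p in plans if len(grouped.get(p, [])) < min_repeats]
```

Because PlanEntry is frozen, it is hashable and can be used directly as the grouping key; two runs only count as repeats of each other when every parameter matches.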

David Rodríguez Sánchez

Oct 14, 2018, 1:59:18 PM
to LCZero
Definitely.

This would be registered in the Test Plan sheet.
There you can add your test designs and results, so that you can keep track of them, and others can repeat them.

If you then want to add a new test case, you just add it in the table below the ones already recorded.