D Sceviour wrote: Here are the combined results of 1000 games at 12"+0.12". Arena only saved the pgn files and debug output for one half of the tournament, and Popochin forfeited approximately 24 games on time. There are still doubts about the quality of the results. Would it be useful to run a larger test at Blitz 2/0 as a comparison? It may take several days, and that cannot be done every time an engine adjustment is made.
Engine Score
1: Popochin 540/1000
2: Blackburne1.9.0 524/1000
3: Blackburne1.9.0a 436/1000
Name of the tournament: Arena 3.0 tournament
Level: Blitz 0:12/0.12
Here it IS statistically significant that 1.9.0 performs better than 1.9.0a, by some 40 Elo points.
Don't run 2-minute games unless they are necessary for some particular reason related to scaling. Try to stabilize the testing framework instead, for example in Cutechess.
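As a rough sketch of the arithmetic behind such a claim (in Python): the +/- figure reported by testing tools is essentially two standard deviations of the mean score, converted to Elo. The W/D/L split in the example call is hypothetical, since only the totals 524/1000 were posted.

    import math

    def elo_and_margin(wins, draws, losses):
        # Elo estimate and approximate +/- 2 SD margin from a win/draw/loss record.
        n = wins + draws + losses
        score = (wins + 0.5 * draws) / n
        # Per-game variance of the score around its mean.
        var = (wins * (1.0 - score) ** 2
               + draws * (0.5 - score) ** 2
               + losses * (0.0 - score) ** 2) / n
        sd = math.sqrt(var / n)          # standard deviation of the mean score

        def to_elo(s):
            return -400.0 * math.log10(1.0 / s - 1.0)

        margin = (to_elo(min(score + 2 * sd, 0.999))
                  - to_elo(max(score - 2 * sd, 0.001))) / 2
        return to_elo(score), margin

    # Hypothetical split for Blackburne1.9.0's 524/1000 (the draw count was not posted):
    print(elo_and_margin(400, 248, 352))

When the Elo difference between two versions is clearly larger than the combined margins, the result is significant in the 2 SD sense used here.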
Thank you for looking at this. The biggest problem noticed with testing short time controls is that the system steals clock cycles from one of the processes. I would still recommend 2-minute games to avoid this, but others may have a different opinion. What would be the minimum number of games recommended to establish a useful confidence level?
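For the minimum-games question, a back-of-the-envelope estimate (assuming a per-game score standard deviation of about 0.4 and a near-50% score, both assumptions rather than posted data) looks roughly like this:

    import math

    def games_needed(elo_target, sd_per_game=0.4):
        # Rough number of games for the 2 SD error bar to shrink below elo_target.
        # Near a 50% score, one unit of score corresponds to about 695 Elo.
        elo_per_score = 400.0 / (math.log(10) * 0.25)
        # Require 2 * elo_per_score * sd_per_game / sqrt(n) < elo_target.
        return math.ceil((2.0 * elo_per_score * sd_per_game / elo_target) ** 2)

    for target in (40, 20, 10, 5):
        print(target, "Elo ->", games_needed(target), "games")

Under those assumptions, a 40 Elo gap needs only a couple of hundred games, while resolving 5 Elo takes over ten thousand.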
I have read somewhere that when comparing two nearly identical engines, the difference in results is supposed to be amplified. Thus, another control engine was added for comparison. Is adding a third control engine the correct procedure for evaluating a change?
I am experimenting with the CuteChess GUI for the first time. Popochin failed to start, so something else will be substituted. More results may be posted later for comparison. There is a whole new world out there for CuteChess, with Python scripts and such, but that is beyond the scope intended for this discussion.
My opinion is that not adding other engines is best. Better to use version N+1 of your program against version N. In Cutechess-cli (and in the Cutechess GUI?) one can see the error margins. When the difference between engines becomes significantly larger than the error margins shown (2SD), you can say that the patch is either good or bad. Then you can stop the test and accept/reject the patch. It may take 100 games or 100,000 games, but if it needs 100,000 games, the patch is not important strength-wise.
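A minimal sketch of that stopping rule (the values are hypothetical; elo_diff and margin_2sd are whatever the tester's rating table reports):

    def verdict(elo_diff, margin_2sd):
        # Keep playing until the Elo difference clears the reported 2 SD error bar.
        if abs(elo_diff) <= margin_2sd:
            return "keep playing"
        return "accept patch" if elo_diff > 0 else "reject patch"

    print(verdict(+12, 30))   # keep playing
    print(verdict(+35, 30))   # accept patch

One caveat: stopping the moment the bar is cleared, after peeking at the score many times, inflates the false-positive rate somewhat; the more principled form of this idea is a sequential test such as the SPRT used by some testing frameworks.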
Laskos wrote: My opinion is that not adding other engines is best. Better to use version N+1 of your program against version N. In Cutechess-cli (and in the Cutechess GUI?) one can see the error margins. When the difference between engines becomes significantly larger than the error margins shown (2SD), you can say that the patch is either good or bad. Then you can stop the test and accept/reject the patch. It may take 100 games or 100,000 games, but if it needs 100,000 games, the patch is not important strength-wise.
Here are some temporary results from CuteChess GUI for discussion. Can the Standard Deviation be extracted from this information?
Rank Name ELO +/- Games Score Draws
1 Counter_10 145 115 33 70% 24%
2 Blackburne1.9.0 -66 109 32 41% 25%
3 Blackburne1.9.0a -75 103 33 39% 30%
49 of 600 games finished.
Level: Tournament 60/1
Yes, look at +/- (2SD). But they are rough for 3 engines, and so are the Elo differences. You can use BayesElo or Ordo on the PGN. Run head to head with 2 engines, the 2 versions of your program, and you will see the correct difference and error margins in Cutechess.
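For the BayesElo route, the classic interactive session looks something like this (commands from memory, so check the documentation of your copy):

    bayeselo
    ResultSet>readpgn tournament.pgn
    ResultSet>elo
    ResultSet-EloRating>mm
    ResultSet-EloRating>exactdist
    ResultSet-EloRating>ratings

Here mm computes the maximum-likelihood ratings, exactdist the error margins, and ratings prints the table with confidence bounds.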
Laskos wrote: Yes, look at +/- (2SD). But they are rough for 3 engines, and so are the Elo differences. You can use BayesElo or Ordo on the PGN. Run head to head with 2 engines, the 2 versions of your program, and you will see the correct difference and error margins in Cutechess.
I assume you are saying that the SD can indicate at what point the test has sufficient data to be significant, and thus when the test can be stopped. Thus for 1.9.0, the +/-94 margin is greater than 69, so there is insufficient data. However, for Counter_10, the +/-96 margin is less than 145, so the Counter_10 test can be stopped.
Laskos wrote: Play 1000 games at 12''+0.12'', show the results.
There are doubts about the quality of the results. System interrupts have a tendency to disrupt one or the other engine at very short time controls. Two identical engines will not be offered the same amount of resources, which results in uneven and biased scores. Longer time controls seem to avoid this. It is possible to adjust system priorities for individual processes, but earlier experiments indicated this made no difference.
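For reference, the per-process priority adjustment mentioned above can be done along these lines (a sketch using the third-party psutil package, which is my assumption here rather than something used in these tests; as noted, it made no measurable difference in the earlier experiments):

    import psutil

    def boost_engine(pid, core=0):
        # Pin an engine process to its own core and raise its scheduling priority.
        p = psutil.Process(pid)
        p.cpu_affinity([core])
        if psutil.WINDOWS:
            p.nice(psutil.HIGH_PRIORITY_CLASS)
        else:
            p.nice(-5)   # Unix niceness; lowering it needs elevated privileges

    # e.g. boost_engine(pid_of_engine_one, core=0); boost_engine(pid_of_engine_two, core=1)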
A 1000-game test at 12"+0.12" is running now, but it will be several hours before results are available. However, the control engine Popochin is forfeiting on time, which will obviously spoil the results.
What communication protocol or interface do the engines use? You can use Cutechess-cli for testing if they support Xboard/Winboard or UCI. I don't have any problems with this time control in it. You can use a move overhead or time margin, if available, to avoid time losses. Use a fixed time per move, say 0.2 seconds, with an overhead or time margin of 20 ms, if the time management of these engines is on the extreme side.
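A command-line sketch of that setup (the engine names and paths are placeholders, not the actual binaries used here):

    cutechess-cli \
      -engine cmd=./blackburne-1.9.0a proto=uci \
      -engine cmd=./blackburne-1.9.0 proto=uci \
      -each tc=12+0.12 timemargin=20 \
      -games 2 -rounds 500 -repeat \
      -recover -pgnout blitz.pgn

For a fixed 0.2 seconds per move instead of a clock, replace tc=12+0.12 with st=0.2.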
LittleBlitzer is also a nice tool for fast testing.