D Sceviour wrote: Here are the combined results of 1000 games at 12"+0.12". Arena only saved the pgn files and debug output for one half of the tournament, and Popochin forfeited approximately 24 games on time. There are still doubts about the quality of the results. Would it be useful to run a larger test at Blitz 2/0 as a comparison? It may take several days, and that cannot be done every time an engine adjustment is made.
Engine Score
1: Popochin 540/1000
2: Blackburne1.9.0 524/1000
3: Blackburne1.9.0a 436/1000
Name of the tournament: Arena 3.0 tournament
Level: Blitz 0:12/0.12
Here it IS statistically significant that 1.9.0 performs better than 1.9.0a, by some 40 Elo points.
Don't run 2-minute games unless they are necessary for some particular reason related to scaling. Try to stabilize the testing framework instead, for example in Cutechess.
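As a rough sketch of the arithmetic behind such a claim (in Python): the +/- figure reported by testing tools is essentially two standard deviations of the mean score, converted to Elo. The W/D/L split in the example call is hypothetical, since only the totals 524/1000 were posted.

    import math

    def elo_and_margin(wins, draws, losses):
        # Elo estimate and approximate +/- 2 SD margin from a win/draw/loss record.
        n = wins + draws + losses
        score = (wins + 0.5 * draws) / n
        # Per-game variance of the score around its mean.
        var = (wins * (1.0 - score) ** 2
               + draws * (0.5 - score) ** 2
               + losses * (0.0 - score) ** 2) / n
        sd = math.sqrt(var / n)          # standard deviation of the mean score

        def to_elo(s):
            return -400.0 * math.log10(1.0 / s - 1.0)

        margin = (to_elo(min(score + 2 * sd, 0.999))
                  - to_elo(max(score - 2 * sd, 0.001))) / 2
        return to_elo(score), margin

    # Hypothetical split for Blackburne1.9.0's 524/1000 (the draw count was not posted):
    print(elo_and_margin(400, 248, 352))

When the Elo difference between two versions is clearly larger than the combined margins, the result is significant in the 2 SD sense used here.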
Thank you for looking at this. The biggest problem noticed with testing short time controls is that the system steals clock cycles from one of the processes. I would still recommend 2-minute games to avoid this, but others may have a different opinion. What would be the minimum number of games recommended to establish a useful confidence level?
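For the minimum-games question, a back-of-the-envelope estimate (assuming a per-game score standard deviation of about 0.4 and a near-50% score, both assumptions rather than posted data) looks roughly like this:

    import math

    def games_needed(elo_target, sd_per_game=0.4):
        # Rough number of games for the 2 SD error bar to shrink below elo_target.
        # Near a 50% score, one unit of score corresponds to about 695 Elo.
        elo_per_score = 400.0 / (math.log(10) * 0.25)
        # Require 2 * elo_per_score * sd_per_game / sqrt(n) < elo_target.
        return math.ceil((2.0 * elo_per_score * sd_per_game / elo_target) ** 2)

    for target in (40, 20, 10, 5):
        print(target, "Elo ->", games_needed(target), "games")

Under those assumptions, a 40 Elo gap needs only a couple of hundred games, while resolving 5 Elo takes over ten thousand.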
I have read somewhere that when comparing two nearly identical engines, the difference in results is supposed to be amplified. Thus, another control engine was added for comparison. Is adding a third control engine the correct procedure for evaluating a change?
I am experimenting with the CuteChess GUI for the first time. Popochin failed to start, so something else will be substituted. More results may be posted later for comparison. There is a whole new world out there for CuteChess, with Python scripts and such, but that is beyond the scope intended for this discussion.
My opinion is that not adding other engines is best. Better to use version N+1 of your program against version N. In Cutechess-cli (and in the Cutechess GUI?) one can see the error margins. When the difference between engines becomes significantly larger than the error margins shown (2SD), you can say that the patch is either good or bad. Then you can stop the test and accept/reject the patch. It may take 100 games or 100,000 games, but if it needs 100,000 games, the patch is not important strength-wise.
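A minimal sketch of that stopping rule (the values are hypothetical; elo_diff and margin_2sd are whatever the tester's rating table reports):

    def verdict(elo_diff, margin_2sd):
        # Keep playing until the Elo difference clears the reported 2 SD error bar.
        if abs(elo_diff) <= margin_2sd:
            return "keep playing"
        return "accept patch" if elo_diff > 0 else "reject patch"

    print(verdict(+12, 30))   # keep playing
    print(verdict(+35, 30))   # accept patch

One caveat: stopping the moment the bar is cleared, after peeking at the score many times, inflates the false-positive rate somewhat; the more principled form of this idea is a sequential test such as the SPRT used by some testing frameworks.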
Laskos wrote: My opinion is that not adding other engines is best. Better to use version N+1 of your program against version N. In Cutechess-cli (and in the Cutechess GUI?) one can see the error margins. When the difference between engines becomes significantly larger than the error margins shown (2SD), you can say that the patch is either good or bad. Then you can stop the test and accept/reject the patch. It may take 100 games or 100,000 games, but if it needs 100,000 games, the patch is not important strength-wise.
Here are some temporary results from CuteChess GUI for discussion. Can the Standard Deviation be extracted from this information?
Rank Name ELO +/- Games Score Draws
1 Counter_10 145 115 33 70% 24%
2 Blackburne1.9.0 -66 109 32 41% 25%
3 Blackburne1.9.0a -75 103 33 39% 30%
49 of 600 games finished.
Level: Tournament 60/1
Yes, look at +/- (2SD). But they are rough for 3 engines, and so are the Elo differences. You can use BayesElo or Ordo on the PGN. Run head to head with 2 engines, the 2 versions of your program, and you will see the correct difference and error margins in Cutechess.
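For the BayesElo route, the classic interactive session looks something like this (commands from memory, so check the documentation of your copy):

    bayeselo
    ResultSet>readpgn tournament.pgn
    ResultSet>elo
    ResultSet-EloRating>mm
    ResultSet-EloRating>exactdist
    ResultSet-EloRating>ratings

Here mm computes the maximum-likelihood ratings, exactdist the error margins, and ratings prints the table with confidence bounds.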
Laskos wrote: Yes, look at +/- (2SD). But they are rough for 3 engines, and so are the Elo differences. You can use BayesElo or Ordo on the PGN. Run head to head with 2 engines, the 2 versions of your program, and you will see the correct difference and error margins in Cutechess.
I assume you are saying that the SD can indicate at what point the test has sufficient data to be significant, and thus when the test can be stopped. Thus for 1.9.0, the +/-94 margin is greater than 69, so there is insufficient data. However, for Counter_10, the +/-96 margin is less than 145, so the Counter_10 test can be stopped.
Laskos wrote: Play 1000 games at 12''+0.12'', show the results.
There are doubts about the quality of the results. System interrupts have a tendency to disrupt one or the other engine at very short time controls. Two identical engines will not be offered the same amount of resources, which results in uneven and biased scores. Longer time controls seem to avoid this. It is possible to adjust system priorities for individual processes, but earlier experiments indicated this made no difference.
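For reference, the per-process priority adjustment mentioned above can be done along these lines (a sketch using the third-party psutil package, which is my assumption here rather than something used in these tests; as noted, it made no measurable difference in the earlier experiments):

    import psutil

    def boost_engine(pid, core=0):
        # Pin an engine process to its own core and raise its scheduling priority.
        p = psutil.Process(pid)
        p.cpu_affinity([core])
        if psutil.WINDOWS:
            p.nice(psutil.HIGH_PRIORITY_CLASS)
        else:
            p.nice(-5)   # Unix niceness; lowering it needs elevated privileges

    # e.g. boost_engine(pid_of_engine_one, core=0); boost_engine(pid_of_engine_two, core=1)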
A 1000-game test at 12"+0.12" is running now, but it will be several hours before results are available. However, the control engine Popochin is forfeiting on time, which will obviously spoil the results.
What communication protocol or interface do the engines use? You can use Cutechess-cli for testing if they support Xboard/Winboard or UCI. I don't have any problems with this time control in it. You can use a move overhead or time margin, if available, to avoid time losses. Use a fixed time per move, say 0.2 seconds, with an overhead or time margin of 20 ms, if the time management of these engines is on the extreme side.
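A command-line sketch of that setup (the engine names and paths are placeholders, not the actual binaries used here):

    cutechess-cli \
      -engine cmd=./blackburne-1.9.0a proto=uci \
      -engine cmd=./blackburne-1.9.0 proto=uci \
      -each tc=12+0.12 timemargin=20 \
      -games 2 -rounds 500 -repeat \
      -recover -pgnout blitz.pgn

For a fixed 0.2 seconds per move instead of a clock, replace tc=12+0.12 with st=0.2.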
LittleBlitzer is also a nice tool for fast testing.