Results of an engine-engine selfplay match

Discussion of chess software programming and technical issues.

Moderator: Ras

Rebel
Posts: 7475
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Results of an engine-engine selfplay match

Post by Rebel »

I invested some computer time to satisfy my curiosity about the number of games needed to test a change.

http://www.top-5000.nl/selfplay.htm

It's quite odd to see how capricious the percentages are as the time control increases.

Code: Select all

Results of an engine-engine selfplay match
        meant for discussion purposes

Engine-one  ProDeo 1.74
Engine-two  ProDeo 1.74 with an EVAL change in King Safety

 Blitz  5 seconds all   10,000 games     49.8 % 
 Blitz 10 seconds all   10,000 games     50.6 % 
 Blitz 20 seconds all    7,777 games     50.7 % 
 Blitz 40 seconds all   10,000 games     50.3 % 
 Blitz 80 seconds all    8,700 games     51.3 % 

Remarks

1. It seems the EVAL change works better as the time control increases.

2. Although Blitz-80 scores a full percentage point better than Blitz-40, the difference still falls within the error margin of 6 Elo according to ELOSTAT, so in theory an improvement is still not proven (a sketch of the underlying error-bar calculation follows below).
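
For reference, here is a minimal sketch of the usual normal-approximation calculation behind an error margin like that 6 Elo. It is not ELOSTAT's actual code, and the draw ratio is an assumed input: each game scores 0, 1/2 or 1, and the uncertainty of the mean score is converted to Elo through the logistic curve.

Code: Select all

#include <stdio.h>
#include <math.h>

/* 95% error bar (in Elo) for a match of n games with score p and
   draw ratio d, via the normal approximation. The inputs are the
   Blitz-80 line from the table; d = 0.40 is only an assumption. */

int main(void)
{
    double n = 8700, p = 0.513, d = 0.40;

    /* per-game score variance: E[x^2] - E[x]^2 with x in {0, 1/2, 1} */
    double var   = p*(1.0 - p) - d/4.0;
    double sigma = sqrt(var/n);             /* std. dev. of the mean score */

    double lo = p - 1.96*sigma, hi = p + 1.96*sigma;

    /* logistic score -> Elo conversion */
    double elo    = -400.0*log10(1.0/p  - 1.0);
    double elo_lo = -400.0*log10(1.0/lo - 1.0);
    double elo_hi = -400.0*log10(1.0/hi - 1.0);

    printf("score %.1f%% -> %+.1f Elo (95%%: %+.1f .. %+.1f)\n",
           100.0*p, elo, elo_lo, elo_hi);
    return 0;
}

With these assumed inputs the bar comes out near +/- 5.7 Elo for 8,700 games, and about +/- 7.5 Elo at 5,000 games, which is why a one-percent score difference sits right at the edge of significance.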
  
Graphs (see the link above)

The graphs below were made with a PGN utility and show the progress of each match; after every 100 games a data point is created and imported into Excel.

From the 5 graphs one might conclude that the first 1000 games of a match are pretty meaningless, given the randomness between two engines of almost equal strength.

5000 games looks like a reasonable number to conclude that there is an improvement, though not to establish its exact Elo.

The PGN tool will be made available later.
Ajedrecista
Posts: 2177
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Results of an engine-engine selfplay match.

Post by Ajedrecista »

Hello Ed:

Excellent work as usual!
1. It seems the EVAL change works better as the time control increases.
Are search improvements more meaningful at short time controls and eval improvements more meaningful at long time controls? It is just a newbie's guess, so please do not take it seriously.

I cannot answer you about the minimum number of games in a self-play match, but I made two small programmes some weeks ago... I have improved their output just today. Thanks to other Fortran codes, I now understand text formatting a little better.

I am sure that you do not need these programmes, but just in case I have uploaded them:

Elo_uncertainties_calculator.rar (0.6 MB)

Minimum_score_for_no_regression.rar (0.6 MB)

My two tiny programmes use an algorithm that may be similar to EloStat's; the results they produce seem logical in most cases. Here is an example of Minimum_score_for_no_regression:

Code: Select all

Minimum_score_for_no_regression, © 2012.

Calculation of the minimum score for no regression in a match between two engines:

 Write down the number of games of the match (it must be a positive integer, up to 1073741823):

5000

Write down the draw ratio (in percentage):

50

Write down k (for making confidence intervals of (mu) +/- (k*sigma) in a normal distribution); k must be positive:

1.96

Theoretical minimum score for no regression: 50.9796 %

Minimum number of won points for the engine in this match:      2549.0 points.

Minimum Elo advantage, which is also the negative part of the error bar:
  6.8106 Elo

End of the calculations.

Thanks for using Minimum_score_for_no_regression. Press Enter to exit.
In this case, for 5000 games, 50% draws and 1.96-sigma confidence (~95% confidence), I get that the improved engine must win by at least 2549 - 2451 to reach any conclusion. In this example, merely changing the confidence level to ~99% gives a new minimum result of 2564.5 - 2435.5.
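
Not knowing the internals of the Fortran programme, here is a small C sketch of one calculation that reproduces the run above; whether it matches the actual algorithm is an assumption. It solves p = 0.5 + k*sigma(p) by fixed-point iteration, where sigma(p) is the standard deviation of the mean score at the given draw ratio.

Code: Select all

#include <stdio.h>
#include <math.h>

/* Minimum score for no regression at k-sigma confidence: the
   smallest p with  p = 0.5 + k*sqrt((p*(1-p) - d/4)/n).
   The inputs are taken from the example run above. */

int main(void)
{
    double n = 5000, d = 0.50, k = 1.96;
    double p = 0.5;

    for (int i = 0; i < 50; i++)            /* fixed point, converges fast */
        p = 0.5 + k*sqrt((p*(1.0 - p) - d/4.0)/n);

    printf("minimum score: %.4f %%\n", 100.0*p);
    printf("minimum points: %.1f of %.0f\n", p*n, n);
    printf("minimum Elo advantage: %.4f\n", -400.0*log10(1.0/p - 1.0));
    return 0;
}

This prints 50.9796 % and 2549.0 points, matching the output above; the Elo figure lands within a few thousandths of the quoted 6.8106, so the programme's exact Elo formula may differ slightly.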

As you see, I calculate the minimum score and not the minimum number of games... but it is something.

The other programme (Elo_uncertainties_calculator) calculates error bars only with 1-, 2- and 3-sigma confidence; I hope that both programmes are bug-free within their limitations.

Please keep up the good work. If you are lucky, you will get more useful answers than mine in this topic.

Regards from Spain.

Ajedrecista.
jdart
Posts: 4420
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Results of an engine-engine selfplay match

Post by jdart »

Rebel wrote:I invested some computer time to satisfy my curiosity about the number of games needed to test a change.

..

5000 games looks like a reasonable number to conclude that there is an improvement, though not to establish its exact Elo.
This can be calculated whether or not the match shows superiority of one engine. I am currently using Bayeselo for this; it can determine a rating and also a confidence interval.
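
For anyone who has not used it, a typical Bayeselo session looks roughly like the lines below. This is from memory of Rémi Coulom's tool, so treat the exact command names as an approximation, not a reference:

Code: Select all

readpgn match.pgn
elo
mm
exactdist
ratings

readpgn loads the games, mm computes the maximum-likelihood ratings, exactdist computes the confidence intervals, and ratings prints the table with its +/- error bounds.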
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Results of an engine-engine selfplay match

Post by bob »

Rebel wrote:I invested some computer time to satisfy my curiosity about the number of games needed to test a change.

http://www.top-5000.nl/selfplay.htm

It's quite odd to see how capricious the percentages are as the time control increases.

Code: Select all

Results of an engine-engine selfplay match
        meant for discussion purposes

Engine-one  ProDeo 1.74
Engine-two  ProDeo 1.74 with an EVAL change in King Safety

 Blitz  5 seconds all   10,000 games     49.8 % 
 Blitz 10 seconds all   10,000 games     50.6 % 
 Blitz 20 seconds all    7,777 games     50.7 % 
 Blitz 40 seconds all   10,000 games     50.3 % 
 Blitz 80 seconds all    8,700 games     51.3 % 

Remarks

1. It seems the EVAL change works better as the time control increases.

2. Although Blitz-80 scores a full percentage point better than Blitz-40, the difference still falls within the error margin of 6 Elo according to ELOSTAT, so in theory an improvement is still not proven.
  
Graphs (see the link above)

The graphs below were made with a PGN utility and show the progress of each match; after every 100 games a data point is created and imported into Excel.

From the 5 graphs one might conclude that the first 1000 games of a match are pretty meaningless, given the randomness between two engines of almost equal strength.

5000 games looks like a reasonable number to conclude that there is an improvement, though not to establish its exact Elo.

The PGN tool will be made available later.
Don't agree with the last statement. It depends on the size of the improvement. If it is only 1 or 2 Elo, which is still significant, 5000 games is nowhere near enough, as the error bar is far wider than that... 30,000 games still has an error bar of +/- 4 Elo... For more significant improvements (or degradations, of course) fewer games are needed, but most of my changes are in 1's and 2's, not in 10's and 20's, meaning that I often need 100K games to get a reliable answer...
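
Since the error bar shrinks only with the square root of the number of games, the count needed for a given resolution can be estimated directly. A sketch of that calculation, with the draw ratio as an assumed input:

Code: Select all

#include <stdio.h>
#include <math.h>

/* Games needed so the 95% error bar is about +/- target Elo.
   Near a 50% score one unit of score is ~695 Elo (1600/ln 10),
   so  target = 1.96 * 695 * sqrt(var/n),  solved here for n. */

int main(void)
{
    double d     = 0.0;                  /* draw ratio (assumed)      */
    double var   = 0.25 - d/4.0;         /* per-game score variance   */
    double slope = 1600.0/log(10.0);     /* Elo per unit score at 50% */

    for (double target = 1.0; target <= 8.0; target *= 2.0) {
        double n = pow(1.96*slope/target, 2.0)*var;
        printf("+/- %.0f Elo needs about %.0f games\n", target, n);
    }
    return 0;
}

With no draws this gives about 29,000 games for +/- 4 Elo, consistent with the figure above, and roughly 116,000 for +/- 2 Elo; a higher draw ratio scales all counts down in proportion to 0.25 - d/4.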
Antonio Torrecillas
Posts: 92
Joined: Sun Nov 02, 2008 4:43 pm
Location: Madrid
Full name: Antonio Torrecillas

Re: Results of an engine-engine selfplay match

Post by Antonio Torrecillas »

For just two programs, I use WhoisBest from Rémi Coulom.
http://www.talkchess.com/forum/viewtopi ... 82&t=30624
I've run some tests with fixed-depth tournaments, changing only one parameter: the bishop pair. I agree with you that as you increase the depth or time control, the convergence is quicker (in number of games). I also agree with Robert Hyatt that you need a ton of games to get certainty. With 4000 different starting positions played with both colors, 8000 games were not enough to get certainty for a bishop pair of 20 versus 0.
As a side comment, I've found that tuning with games also has some quirks.
1. Tuning an imbalanced evaluation can create more imbalance.
Suppose you have an imbalance: a better evaluation for the knight than for the bishop. The tuning process will push the knight value up even further.
2. Synergy value.
The tuned parameter tends to take on a synergy value: the value that matches the capacity of the rest of the evaluation to drive the win when this feature is present.
Suppose you run a tournament where one player seeks the bishop pair (BB=20) and the other tries to avoid it (BB=-20). As soon as both players have the opportunity, a bishop-pair imbalance will be present. The result of the tournament then measures the capability of the rest of the eval to drive a win from the bishop pair. If some complementary knowledge is missing (mobility, doubled pawns), the current evaluation can be unable to bring out the advantage of the bishop pair.
A negative result in the tuning process can therefore mean that we need complementary knowledge to get an improvement.

This perspective is from a "weak engine". Maybe a more mature engine would come to different conclusions.
Rebel
Posts: 7475
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Results of an engine-engine selfplay match

Post by Rebel »

bob wrote:
Rebel wrote:I invested some computer time to satisfy my curiosity about the number of games needed to test a change.

http://www.top-5000.nl/selfplay.htm

It's quite odd to see how capricious the percentages are as the time control increases.

Code: Select all

Results of an engine-engine selfplay match
        meant for discussion purposes

Engine-one  ProDeo 1.74
Engine-two  ProDeo 1.74 with an EVAL change in King Safety

 Blitz  5 seconds all   10,000 games     49.8 % 
 Blitz 10 seconds all   10,000 games     50.6 % 
 Blitz 20 seconds all    7,777 games     50.7 % 
 Blitz 40 seconds all   10,000 games     50.3 % 
 Blitz 80 seconds all    8,700 games     51.3 % 

Remarks

1. It seems the EVAL change works better as the time control increases.

2. Although Blitz-80 scores a full percentage point better than Blitz-40, the difference still falls within the error margin of 6 Elo according to ELOSTAT, so in theory an improvement is still not proven.
  
Graphs (see the link above)

The graphs below were made with a PGN utility and show the progress of each match; after every 100 games a data point is created and imported into Excel.

From the 5 graphs one might conclude that the first 1000 games of a match are pretty meaningless, given the randomness between two engines of almost equal strength.

5000 games looks like a reasonable number to conclude that there is an improvement, though not to establish its exact Elo.

The PGN tool will be made available later.
Don't agree with the last statement. It depends on the size of the improvement. If it is only 1 or 2 Elo, which is still significant, 5000 games is nowhere near enough, as the error bar is far wider than that... 30,000 games still has an error bar of +/- 4 Elo... For more significant improvements (or degradations, of course) fewer games are needed, but most of my changes are in 1's and 2's, not in 10's and 20's, meaning that I often need 100K games to get a reliable answer...
From the output of the C program listed below, I am aware that 100,000 games is pretty rock-solid. It writes the file "excel.txt", whose contents you can import into Excel to make a histogram and watch the progression of a match in which the percentage settles on 50.0%.

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)                                      // emulate matches

{       int r,x,max,c,split; float win,loss,draw,f1,f2,f3,f4; char w[200],R[200],s[200]; int d,e;
        FILE *fpx; FILE *fpy; float tot=0; int rounds; int count=0; char *q;

        div_t result; srand((unsigned)time(NULL));  // seed the generator

again:  printf("Number of Games "); fgets(w,sizeof w,stdin); max=atoi(w);
        printf("Split depth     "); fgets(w,sizeof w,stdin); split=atoi(w);
        printf("Rounds          "); fgets(R,sizeof R,stdin); rounds=atoi(R);

same:   fpx = fopen("excel.txt","w");               // all rounds combined

loop:   count++; sprintf(w,"excel%d.txt",count);    // excel1.txt | excel2.txt | etc
        fpy = fopen(w,"w");

        x=0; win=0; loss=0; draw=0; tot=0; printf("\n");

next:   if (x==max) goto einde;

        r=rand()&3; if (r==0) goto next;            // redraw on 0: win, loss, draw each 1/3
        if (r==1) win++;
        if (r==2) loss++;
        if (r==3) draw++;

        if (r==1) tot=tot+1;                        // running score in points
        if (r==3) tot=tot+0.5;

        result=div(x+1,split); if (result.rem==0)   // datapoint every 'split' games
         { f1=x+1; f2=tot*100; f3=f2/f1;            // x+1 games played so far
           sprintf(s,"%.2f\t",f3); q=strstr(s,"."); if (q) q[0]=',';  // decimal comma for Excel
           fprintf(fpx,"%s",s); fprintf(fpy,"%s",s); }

        x++; if (x==(max/4)) goto disp;             // progress report at 25/50/75/100%
             if (x==(max/2)) goto disp;
             if (x==(max/4)+(max/2)) goto disp;
             if (x==max) goto disp;
        goto next;

disp:   f1=win+(draw/2); f2=loss+(draw/2); f4=x; f3=(f1*100)/f4; d=f1; e=f2;
        printf("%d-%d (%.1f%%)    ",d,e,f3);
        goto next;

einde:  fclose(fpy); rounds--; if (rounds) { fprintf(fpx,"\n"); goto loop; }
        fclose(fpx);
        printf("\n(Q)uit (A)gain ");
        fgets(w,sizeof w,stdin); c=w[0];            // getch() needs conio.h; read a line instead
        if (c=='q') return 0;
        if (c=='a') { printf("\n\n"); goto again; }
        rounds=atoi(R); goto same;
}
I have a couple of questions:

1. At which TC are you playing those 100,000 games?

2. Is it possible for you (or anybody else) to upload some of the 40,000 - 100,000 game matches in PGN? I want to make histograms of them to study (and share) their progress, for a better understanding of how many games it takes before 2 engines settle on a fixed percentage.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Results of an engine-engine selfplay match

Post by Don »

Rebel wrote:I invested some computer time to satisfy my curiosity about the number of games needed to test a change.

http://www.top-5000.nl/selfplay.htm

It's quite odd to see how capricious the percentages are as the time control increases.

Code: Select all

Results of an engine-engine selfplay match
        meant for discussion purposes

Engine-one  ProDeo 1.74
Engine-two  ProDeo 1.74 with an EVAL change in King Safety

 Blitz  5 seconds all   10,000 games     49.8 % 
 Blitz 10 seconds all   10,000 games     50.6 % 
 Blitz 20 seconds all    7,777 games     50.7 % 
 Blitz 40 seconds all   10,000 games     50.3 % 
 Blitz 80 seconds all    8,700 games     51.3 % 

Remarks

1. It seems the EVAL change works better as the time control increases.

2. Although Blitz-80 scores a full percentage point better than Blitz-40, the difference still falls within the error margin of 6 Elo according to ELOSTAT, so in theory an improvement is still not proven.
  
Graphs (see the link above)

The graphs below were made with a PGN utility and show the progress of each match; after every 100 games a data point is created and imported into Excel.

From the 5 graphs one might conclude that the first 1000 games of a match are pretty meaningless, given the randomness between two engines of almost equal strength.

5000 games looks like a reasonable number to conclude that there is an improvement, though not to establish its exact Elo.

The PGN tool will be made available later.

Ed,

You don't need 5000 games if the improvement is large, but if it's less than 5 Elo you probably need a lot more - and of course that depends on how much "error" you are willing to accept.

My standard test is 20,000 games for Komodo changes. I wish it were more, but that stresses the limits of our meager testing resources. If you use bayeselo you get error margins reported - but they are not valid unless you know how to interpret them. For example, you cannot just interpret them on the fly - they have meaning when you have specified in advance how many games you intend to run - otherwise you will watch the results and stop the test when you are "happy" with the result, which makes them invalid. It's like flipping a coin, being unhappy with the result, and then saying, "let's go for 2 out of 3" - you basically stack the deck when you do that.

I think there might be a way to use the error margins on the fly if you do different math, but I'm not that strong in statistics. One possibility is to stop a test when one side has an N-point advantage. That can lead to very short or very long matches. If you require a 100-game advantage and the players are evenly matched, the match will still terminate at some point. With this method you are basically saying that you can trust the result: if the true difference is small or even negative and yet the weaker player wins, it's not enough to be too concerned about (if N is high enough), and if one player is overwhelmingly superior you don't care either - so in either case you are not going to accept a bad change very often. (A toy simulation of this rule follows below.)
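
Here is a toy Monte Carlo of that stopping rule, with made-up numbers (a roughly 2 Elo edge for the new version and N = 40); nothing beyond the rule itself comes from the post above. It shows the rule always terminates and usually picks the stronger side.

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Play until one side leads by N points (wins minus losses; draws
   do not move the lead), then accept whoever is ahead. The win and
   loss probabilities give the new version roughly a 2 Elo edge. */

int main(void)
{
    double pwin = 0.253, ploss = 0.247;   /* the rest are draws (assumed) */
    int N = 40, trials = 10000, good = 0;
    long total_games = 0;

    srand((unsigned)time(NULL));

    for (int t = 0; t < trials; t++) {
        int lead = 0;                     /* wins minus losses */
        while (lead > -N && lead < N) {
            double u = (double)rand()/RAND_MAX;
            if      (u < pwin)         lead++;
            else if (u < pwin + ploss) lead--;
            total_games++;
        }
        if (lead >= N) good++;
    }
    printf("stronger side accepted %.1f%% of the time, "
           "average match length %.0f games\n",
           100.0*good/trials, (double)total_games/trials);
    return 0;
}

With these numbers the stronger side wins the race most of the time, and when it loses it, the accepted change is at worst only marginally negative - which is exactly the trade-off described above.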

There is no method on the planet that will guarantee you never accept a bad change - even a billion games cannot guarantee that - but you can get arbitrarily close if you are willing to wait....
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Results of an engine-engine selfplay match

Post by Adam Hair »

Don wrote:
Rebel wrote:I invested some computer time to satisfy my curiosity about the number of games needed to test a change.

http://www.top-5000.nl/selfplay.htm

It's quite odd to see how capricious the percentages are as the time control increases.

Code: Select all

Results of an engine-engine selfplay match
        meant for discussion purposes

Engine-one  ProDeo 1.74
Engine-two  ProDeo 1.74 with an EVAL change in King Safety

 Blitz  5 seconds all   10,000 games     49.8 % 
 Blitz 10 seconds all   10,000 games     50.6 % 
 Blitz 20 seconds all    7,777 games     50.7 % 
 Blitz 40 seconds all   10,000 games     50.3 % 
 Blitz 80 seconds all    8,700 games     51.3 % 

Remarks

1. It seems the EVAL change works better as the time control increases.

2. Although Blitz-80 scores a full percentage point better than Blitz-40, the difference still falls within the error margin of 6 Elo according to ELOSTAT, so in theory an improvement is still not proven.
  
Graphs (see the link above)

The graphs below were made with a PGN utility and show the progress of each match; after every 100 games a data point is created and imported into Excel.

From the 5 graphs one might conclude that the first 1000 games of a match are pretty meaningless, given the randomness between two engines of almost equal strength.

5000 games looks like a reasonable number to conclude that there is an improvement, though not to establish its exact Elo.

The PGN tool will be made available later.

Ed,

You don't need 5000 games if the improvement is large, but if it's less than 5 Elo you probably need a lot more - and of course that depends on how much "error" you are willing to accept.

My standard test is 20,000 games for Komodo changes. I wish it were more, but that stresses the limits of our meager testing resources. If you use bayeselo you get error margins reported - but they are not valid unless you know how to interpret them. For example, you cannot just interpret them on the fly - they have meaning when you have specified in advance how many games you intend to run - otherwise you will watch the results and stop the test when you are "happy" with the result, which makes them invalid. It's like flipping a coin, being unhappy with the result, and then saying, "let's go for 2 out of 3" - you basically stack the deck when you do that.

I think there might be a way to use the error margins on the fly if you do different math, but I'm not that strong in statistics. One possibility is to stop a test when one side has an N-point advantage. That can lead to very short or very long matches. If you require a 100-game advantage and the players are evenly matched, the match will still terminate at some point. With this method you are basically saying that you can trust the result: if the true difference is small or even negative and yet the weaker player wins, it's not enough to be too concerned about (if N is high enough), and if one player is overwhelmingly superior you don't care either - so in either case you are not going to accept a bad change very often.

There is no method on the planet that will guarantee you never accept a bad change - even a billion games cannot guarantee that - but you can get arbitrarily close if you are willing to wait....
Yes, there are methods for stopping a test earlier while maintaining a given confidence level. They are called sequential tests. I tried to flesh out a method for chess but never completed it. Lucas Braesch and Michel Van den Bergh had a productive discussion on this [url=http://talkchess.com/forum/viewtopic.ph ... at&start=0]topic[/url].
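
For completeness, the textbook sequential test is Wald's SPRT. Below is a sketch using a common normal approximation of the log-likelihood ratio; the model discussed in the linked thread may differ, and the hypotheses elo0/elo1 as well as the counts in the usage example are made up.

Code: Select all

#include <stdio.h>
#include <math.h>

/* H0: the change is worth elo0; H1: it is worth elo1.
   Returns +1 accept H1, -1 accept H0, 0 keep playing. */

static double score_of_elo(double e) { return 1.0/(1.0 + pow(10.0, -e/400.0)); }

int sprt_status(long w, long d, long l,
                double elo0, double elo1, double alpha, double beta)
{
    long n = w + d + l;
    if (n < 2) return 0;

    double s0 = score_of_elo(elo0), s1 = score_of_elo(elo1);
    double x  = (w + 0.5*d)/n;                /* mean score              */
    double m2 = (w + 0.25*d)/n - x*x;         /* per-game score variance */
    if (m2 <= 0.0) return 0;

    /* normal-approximation log-likelihood ratio of H1 versus H0 */
    double llr   = n*(s1 - s0)*(2.0*x - s0 - s1)/(2.0*m2);
    double lower = log(beta/(1.0 - alpha));   /* ~ -2.94 at alpha=beta=0.05 */
    double upper = log((1.0 - beta)/alpha);   /* ~ +2.94 at alpha=beta=0.05 */

    if (llr >= upper) return +1;
    if (llr <= lower) return -1;
    return 0;
}

int main(void)    /* made-up counts, tested between H0 = 0 and H1 = 5 Elo */
{
    printf("%d\n", sprt_status(1810, 3404, 1700, 0.0, 5.0, 0.05, 0.05));
    return 0;
}

The test is re-run after every game (or batch of games) and stops itself as soon as the ratio crosses a bound, which is how a sequential test keeps its error rates while usually finishing far earlier than a fixed-length match.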
Ajedrecista
Posts: 2177
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Correction of Adam's link.

Post by Ajedrecista »

Hello:
Adam Hair wrote:Yes, there are methods for stopping a test earlier while maintaining a given confidence level. They are called sequential tests. I tried to flesh out a method for chess but never completed it. Lucas Braesch and Michel Van den Bergh had a productive discussion on this [url=http://talkchess.com/forum/viewtopic.ph ... at&start=0]topic[/url].
I put the correct link:

http://talkchess.com/forum/viewtopic.ph ... at&start=0

This topic seems very interesting: there may be too much statistics in it for my limited knowledge, although I am able to understand a bit of it.

Regards from Spain.

Ajedrecista.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Results of an engine-engine selfplay match

Post by bob »

Rebel wrote:
bob wrote:
Rebel wrote:I invested some computer time to satisfy my curiosity about the number of games needed to test a change.

http://www.top-5000.nl/selfplay.htm

It's quite odd to see how capricious the percentages are as the time control increases.

Code: Select all

Results of an engine-engine selfplay match
        meant for discussion purposes

Engine-one  ProDeo 1.74
Engine-two  ProDeo 1.74 with an EVAL change in King Safety

 Blitz  5 seconds all   10,000 games     49.8 % 
 Blitz 10 seconds all   10,000 games     50.6 % 
 Blitz 20 seconds all    7,777 games     50.7 % 
 Blitz 40 seconds all   10,000 games     50.3 % 
 Blitz 80 seconds all    8,700 games     51.3 % 

Remarks

1. It seems the EVAL change works better as the time control increases.

2. Although Blitz-80 scores a full percentage point better than Blitz-40, the difference still falls within the error margin of 6 Elo according to ELOSTAT, so in theory an improvement is still not proven.
  
Graphs (see the link above)

The graphs below were made with a PGN utility and show the progress of each match; after every 100 games a data point is created and imported into Excel.

From the 5 graphs one might conclude that the first 1000 games of a match are pretty meaningless, given the randomness between two engines of almost equal strength.

5000 games looks like a reasonable number to conclude that there is an improvement, though not to establish its exact Elo.

The PGN tool will be made available later.
Don't agree with the last statement. It depends on the size of the improvement. If it is only 1 or 2 Elo, which is still significant, 5000 games is nowhere near enough, as the error bar is far wider than that... 30,000 games still has an error bar of +/- 4 Elo... For more significant improvements (or degradations, of course) fewer games are needed, but most of my changes are in 1's and 2's, not in 10's and 20's, meaning that I often need 100K games to get a reliable answer...
From the output of the C program listed below, I am aware that 100,000 games is pretty rock-solid. It writes the file "excel.txt", whose contents you can import into Excel to make a histogram and watch the progression of a match in which the percentage settles on 50.0%.

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)                                      // emulate matches

{       int r,x,max,c,split; float win,loss,draw,f1,f2,f3,f4; char w[200],R[200],s[200]; int d,e;
        FILE *fpx; FILE *fpy; float tot=0; int rounds; int count=0; char *q;

        div_t result; srand((unsigned)time(NULL));  // seed the generator

again:  printf("Number of Games "); fgets(w,sizeof w,stdin); max=atoi(w);
        printf("Split depth     "); fgets(w,sizeof w,stdin); split=atoi(w);
        printf("Rounds          "); fgets(R,sizeof R,stdin); rounds=atoi(R);

same:   fpx = fopen("excel.txt","w");               // all rounds combined

loop:   count++; sprintf(w,"excel%d.txt",count);    // excel1.txt | excel2.txt | etc
        fpy = fopen(w,"w");

        x=0; win=0; loss=0; draw=0; tot=0; printf("\n");

next:   if (x==max) goto einde;

        r=rand()&3; if (r==0) goto next;            // redraw on 0: win, loss, draw each 1/3
        if (r==1) win++;
        if (r==2) loss++;
        if (r==3) draw++;

        if (r==1) tot=tot+1;                        // running score in points
        if (r==3) tot=tot+0.5;

        result=div(x+1,split); if (result.rem==0)   // datapoint every 'split' games
         { f1=x+1; f2=tot*100; f3=f2/f1;            // x+1 games played so far
           sprintf(s,"%.2f\t",f3); q=strstr(s,"."); if (q) q[0]=',';  // decimal comma for Excel
           fprintf(fpx,"%s",s); fprintf(fpy,"%s",s); }

        x++; if (x==(max/4)) goto disp;             // progress report at 25/50/75/100%
             if (x==(max/2)) goto disp;
             if (x==(max/4)+(max/2)) goto disp;
             if (x==max) goto disp;
        goto next;

disp:   f1=win+(draw/2); f2=loss+(draw/2); f4=x; f3=(f1*100)/f4; d=f1; e=f2;
        printf("%d-%d (%.1f%%)    ",d,e,f3);
        goto next;

einde:  fclose(fpy); rounds--; if (rounds) { fprintf(fpx,"\n"); goto loop; }
        fclose(fpx);
        printf("\n(Q)uit (A)gain ");
        fgets(w,sizeof w,stdin); c=w[0];            // getch() needs conio.h; read a line instead
        if (c=='q') return 0;
        if (c=='a') { printf("\n\n"); goto again; }
        rounds=atoi(R); goto same;
}
I have a couple of questions:

1. At which TC are you playing those 100,000 games?

2. Is it possible for you (or anybody else) to upload some of the 40,000 - 100,000 game matches in PGN? I want to make histograms of them to study (and share) their progress, for a better understanding of how many games it takes before 2 engines settle on a fixed percentage.
I have several test times.

10s + 0.1s, where I can run 30K+ games in a little less than an hour. I do this for quick sanity tests on a change, since if I break something it shows up within a few minutes.

60s + 1s takes less than 12 hours to run and is a pretty solid test time, unless there are lots of different changes that need to be tested individually, in which case it becomes a little long.

300s + 2s is a 24-hour run. And I have actually run 60m + 1m, which is in the 2-week range, if I think a test needs to be validated at long time controls (which I very rarely do, for obvious reasons).

Very few of the changes I deal with are 1 or 2 Elo. If they are, then multiply any of the above times by 3, since it requires about 100K games to get resolution that accurate.


As far as the last question goes, I have a boatload of 30K-game matches, but the way I store them is less than convenient (though they could be merged). For each match played (which is for a specific test version at some fast time control) I have about 750 files in each directory; each file contains the games played between a single pair of opponents using a subset of my starting positions. The 750 files contain 30K games in total.
If you want, I could combine all 750 files into one, and repeat for as many different sets as you want. Just be aware that this is in the 30 MB per set range, if I looked correctly...
If you want I could combine all 750 files into one, and repeat for as many different sets as you want. Just be aware that this is into the 30mb per set range if I looked correctly...