Results of an engine-engine selfplay match
Moderator: Ras
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Results of an engine-engine selfplay match
BTW, one thing that is definitely missing in my data is a notion of "order". Since I use a hundred+ cluster nodes, with each node having several CPUs, I run MANY matches in parallel. Each individual match goes into a separate file, in the order the games were played. But the end result is 750 files, each with their own set of games between just two opponents and N positions with alternated colors. It is not quite the same as playing 30K games serially and looking at the first 500, then the first 1,000, etc. It is not convenient to have multiple nodes writing to a single file to attempt to preserve the overall order, unfortunately...
-
Rebel
- Posts: 7477
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: Results of an engine-engine selfplay match
Thanks for your expert opinion, Don. You know, in my days all this (playing 10,000+ games or so) was not possible. The last version, Rebel-12 (2003), played 400 x 30/all games for the final release. And so now I am wondering how many good changes I have thrown away during all those years.
Don wrote:
Rebel wrote:
I invested some computer time to satisfy my curiosity about the number of games needed to test a change.
http://www.top-5000.nl/selfplay.htm
It's quite odd to see the capriciousness of the percentages when playing at increasing time controls.
Code:
Results of an engine-engine selfplay match, meant for discussion purposes

Engine-one  ProDeo 1.74
Engine-two  ProDeo 1.74 with an EVAL change in King Safety

Blitz  5 seconds all   10,000 games   49.8 %
Blitz 10 seconds all   10,000 games   50.6 %
Blitz 20 seconds all    7,777 games   50.7 %
Blitz 40 seconds all   10,000 games   50.3 %
Blitz 80 seconds all    8,700 games   51.3 %

Remarks
1. It seems that with increasing time the EVAL change works best.
2. Blitz-80 vs Blitz-40, although a full percent better, still falls within the error margin of 6 Elo according to ELOSTAT. So in theory an improvement is still not proven.

Graphs (see the link above)
With a PGN utility the graphs below were made, which show the progress of each match. After each 100 games a datapoint is created and imported into Excel. From the 5 graphs one might conclude that the first 1,000 games in a match are pretty meaningless due to the random nature of two engines almost equal in strength. A reasonable number looks to be 5,000 games to conclude an improvement, but not its exact Elo. The PGN tool will be made available later.
Ed,
You don't need 5000 games if the improvement is large, but if it's less than 5 ELO you probably need a lot more - and of course that depends on how much "error" you are willing to accept.
My standard test is 20,000 games for Komodo changes. I wish it were more, but that stresses the limit of our meager testing resources. If you use bayeselo you get error margins reported - but they are not valid unless you know how to interpret them. For example, you cannot just interpret them on the fly - they have meaning when you have specified in advance how many games you intend to run - otherwise you will watch the results and stop the test when you are "happy" with the result, which makes it invalid. It's like flipping a coin, being unhappy with the result, and then saying, "let's go for 2 out of 3" - you basically stack the deck when you do that.
I think there might be a way to use the error margins on the fly if you do different math, but I'm not that strong in statistics. One possibility is to stop a test when one side has an N-point advantage. That can lead to very short or very long matches. If you require a 100-game advantage and the players are evenly matched, the match will still terminate at some point. With this method you are basically saying that you can trust the result, because if the difference is small or even negative and yet the weaker player wins, it's not enough to be too concerned about (if N is high enough), and if one player is overwhelmingly superior you don't care either - so in either case you are not going to accept a bad change very often.
There is no method on the planet that will guarantee that you will never accept a bad change; even a billion games cannot guarantee that, but you can get arbitrarily close if you are willing to wait....
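For readers who want to check the arithmetic behind the numbers above, here is a rough Python sketch - not ELOSTAT or BayesElo; the 35% draw rate, the 95% confidence level, and the simple stopping simulation are all illustrative assumptions - that converts a match score into an Elo difference, estimates the error margin, guesses how many games a small improvement needs, and simulates the "stop when one side is N wins ahead" rule mentioned above.
Code:
import math
import random

def score_from_elo(d):
    """Expected score for an Elo advantage d."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def elo_from_score(p):
    """Elo difference implied by an expected score p (0 < p < 1)."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def error_margin_elo(n_games, score, draw_rate=0.35):
    """Approximate 95% error margin of a match result, expressed in Elo."""
    win = score - draw_rate / 2.0               # P(win) implied by score and draw rate
    var = win + draw_rate / 4.0 - score ** 2    # per-game variance of the score
    se = math.sqrt(var / n_games)               # standard error of the mean score
    return elo_from_score(min(score + 1.96 * se, 0.999)) - elo_from_score(score)

def games_needed(elo_diff, draw_rate=0.35, z=1.96):
    """Roughly how many games until the 95% margin shrinks below elo_diff."""
    p = score_from_elo(elo_diff)
    win = p - draw_rate / 2.0
    var = win + draw_rate / 4.0 - p ** 2
    return int(var * (z / (p - 0.5)) ** 2)

def stop_at_lead(elo_diff, lead=100, draw_rate=0.35, trials=200, seed=7):
    """Monte-Carlo sketch of 'stop when one side is N wins ahead':
    fraction of matches won by the truly stronger side, and average length."""
    rng = random.Random(seed)
    p_win = score_from_elo(elo_diff) - draw_rate / 2.0
    better_wins, total_games = 0, 0
    for _ in range(trials):
        diff = games = 0
        while abs(diff) < lead:
            r = rng.random()
            if r < p_win:
                diff += 1                       # win for the stronger side
            elif r >= p_win + draw_rate:
                diff -= 1                       # loss; draws leave the lead unchanged
            games += 1
        better_wins += diff > 0
        total_games += games
    return better_wins / trials, total_games / trials

print(elo_from_score(0.513))          # the 51.3% Blitz-80 result is about +9 Elo
print(error_margin_elo(8700, 0.513))  # roughly the 6 Elo margin ELOSTAT reports above
print(games_needed(5))                # a 5 Elo change needs on the order of 10,000 games
print(stop_at_lead(5))                # reliability and average length of the N-wins-ahead rule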
Back then I used several other measurement tools to keep a grip on the randomness monster, such as looking at the progress of a match. Did the match have a constant score flow, or was it capricious? I tended to trust a constant score flow of 220-180 more than a capricious one of 230-170. Hence my interest in histograms and in splitting matches into parts.
Anyway, I am not in the competing business any longer, but the subject (many games) still fascinates me, and since I was planning an interesting experiment I needed some grip. I chose 4,000 for the moment. The experiment is at: http://www.top-5000.nl/eval.htm
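As an editorial aside, the chunking idea is easy to sketch. The snippet below is an illustration, not Ed's PGN utility - the simulated 51% engine, the 35% draw rate, and the 100-game chunk size are assumptions. It splits a match, given as a sequence of game results in played order, into 100-game chunks and prints each chunk's score, so a steady score flow can be told apart from a capricious one.
Code:
import random

def chunk_scores(results, chunk=100):
    """results: per-game scores (1, 0.5, 0) in the order the games were played."""
    results = list(results)
    return [sum(results[i:i + chunk]) / len(results[i:i + chunk])
            for i in range(0, len(results), chunk)]

def simulated_match(n_games, expected_score=0.51, draw_rate=0.35, seed=42):
    """Generate results for a hypothetical engine that truly scores 51%."""
    rng = random.Random(seed)
    p_win = expected_score - draw_rate / 2.0
    for _ in range(n_games):
        r = rng.random()
        yield 1.0 if r < p_win else (0.5 if r < p_win + draw_rate else 0.0)

# Print the score of each 100-game slice of a 4,000-game match.
for i, s in enumerate(chunk_scores(simulated_match(4000)), start=1):
    print(f"games {(i - 1) * 100 + 1:4d}-{i * 100:4d}: {s:.1%}")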
-
Rebel
- Posts: 7477
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: Results of an engine-engine selfplay match
That's impressive, Bob. I assume you can make fast progress with such monstrous hardware. Regarding the games, it should make no difference in theory as long as they are not sorted on result, but I think I will provide myself someday.
bob wrote:
BTW, one thing that is definitely missing in my data is a notion of "order". Since I use a hundred+ cluster nodes, with each node having several CPUs, I run MANY matches in parallel. Each individual match goes into a separate file, in the order the games were played. But the end result is 750 files, each with their own set of games between just two opponents and N positions with alternated colors. It is not quite the same as playing 30K games serially and looking at the first 500, then the first 1,000, etc. It is not convenient to have multiple nodes writing to a single file to attempt to preserve the overall order, unfortunately...
-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Results of an engine-engine selfplay match
In the old days we had the same problem. I remember testing RexChess with about 10 PCs, some Larry had, some I had, and some I borrowed from my dad. We had all of them cranking away playing games all night long and still could only come up with maybe 200 games!
Rebel wrote:
Thanks for your expert opinion, Don. You know, in my days all this (playing 10,000+ games or so) was not possible. The last version, Rebel-12 (2003), played 400 x 30/all games for the final release. And so now I am wondering how many good changes I have thrown away during all those years.
Back then I used several other measurement tools to keep a grip on the randomness monster, such as looking at the progress of a match. Did the match have a constant score flow, or was it capricious? I tended to trust a constant score flow of 220-180 more than a capricious one of 230-170. Hence my interest in histograms and in splitting matches into parts.
Anyway, I am not in the competing business any longer, but the subject (many games) still fascinates me, and since I was planning an interesting experiment I needed some grip. I chose 4,000 for the moment. The experiment is at: http://www.top-5000.nl/eval.htm
Even then I knew that if you had enough CPU power you could make rapid advancement. With one PC in the old days you could not reasonably even test 1 change unless it was a big improvement and thus we were relegated to only major improvements and intuition about the rest.
A good example of that is LMR. I actually stumbled across the concept on my own about 30 years ago. It was very intriguing because of the surprising depths I was getting, but all I could do was tinker a bit, and all I could easily detect was that it lost depth on some problems - which almost always meant the idea was no good. I hand-played 3 or 4 games and I remember it winning the first game, but then doing badly afterwards. Today, your first try at LMR probably won't help your program very much until you get a lot of the details correct. My first "modern" implementation was a close call, but Komodo is now about 90 Elo stronger because of it, as of when I last checked. There is NO way I could have properly developed this idea in the 80's with the hardware I had without stretching the very limits of my patience and devoting months of testing to it.
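For readers who have not implemented it, a minimal sketch of the LMR idea follows: a toy negamax over a reproducible random game tree, not Komodo's or RexChess's actual scheme. The four-move branching, the "reduce from the third move onward" threshold, and the one-ply reduction are all illustrative assumptions.
Code:
import random

def children(node):
    # Hypothetical move generator: four reproducible child nodes per node,
    # assumed already ordered (a real engine orders by hash move, captures, history...).
    return [node * 4 + i + 1 for i in range(4)]

def evaluate(node):
    # Hypothetical static evaluation at the leaves, deterministic per node.
    return random.Random(node).randint(-100, 100)

def negamax(node, depth, alpha, beta):
    if depth == 0:
        return evaluate(node)
    best = -10**9
    for move_no, child in enumerate(children(node)):
        # Late Move Reduction: after the first couple of moves, and only with
        # enough remaining depth, search one ply shallower than normal.
        reduction = 1 if (move_no >= 2 and depth >= 3) else 0
        score = -negamax(child, depth - 1 - reduction, -beta, -alpha)
        if reduction and score > alpha:
            # The reduced search unexpectedly beat alpha: verify with a
            # full-depth re-search before trusting the score.
            score = -negamax(child, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # beta cutoff: remaining moves cannot change the outcome
    return best

print(negamax(1, 8, -10**9, 10**9))
One of the details that matters in practice is that full-depth re-search: without it, reductions quietly discard too many good late moves.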
We sometimes talk about which computer chess ideas have been the most responsible for major software advances, but the real answer is massive automated testing. That's why, if you got back into computer chess, you would produce an original top-10 program in a few months, perhaps a year or two at most. (I didn't say it would be easy.)
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Results of an engine-engine selfplay match
The problem is that the games are grouped by (a) a specific pair of opponents and (b) a specific sub-set of starting positions. That makes it very difficult to get a time-linear representation of the match progress. Which opponent do you look at first? Strongest? Weakest? Which positions do you look at first? Easiest? Hardest? That's why it is hard to use this data to measure how quickly the results tend to trend toward the final result. If you take the individual files in any sort of order, things end up skewed a bit, because during the real match Crafty is playing against all the opponents at the same time, and I can watch the results as they roll in and are fed to BayesElo (I generally sample every 30 seconds for the fast tests to see how things are progressing). There is no way to get similar output without writing some messy code to combine that large PGN collection into one file and then randomly extract one game at a time to simulate the way the match is actually played...
Rebel wrote:
That's impressive, Bob. I assume you can make fast progress with such monstrous hardware. Regarding the games, it should make no difference in theory as long as they are not sorted on result, but I think I will provide myself someday.
But it could be done, obviously...
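One way it could be done is sketched below in Python, rather than whatever Bob's cluster scripts actually use: gather the per-match PGN files, split them into individual games on the '[Event ' tag, shuffle, and write one combined file whose order roughly simulates the interleaved play. The matches/ directory, the output file name, and the latin-1 encoding are assumptions.
Code:
import glob
import random

games = []
for path in glob.glob("matches/*.pgn"):       # hypothetical layout: one PGN file per match
    with open(path, encoding="latin-1") as f:
        text = f.read()
    # Every PGN game starts with an [Event "..."] tag; split on it and
    # re-attach the tag to the pieces that lost it.
    for chunk in text.split("\n[Event "):
        chunk = chunk.strip()
        if chunk:
            games.append(chunk if chunk.startswith("[Event") else "[Event " + chunk)

random.shuffle(games)                         # approximate the unknown interleaving

with open("combined_shuffled.pgn", "w", encoding="latin-1") as out:
    out.write("\n\n".join(games) + "\n")
print(f"wrote {len(games)} games in shuffled order")
The shuffled file can then be replayed one game at a time into the rating tool to approximate how the results trended during the real match.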