I defined a testing method based on the same principle but with a completely different implementation.
I took all games of at least 40 moves played at ICCF during 2008 between master-level (2400+) correspondence players.
From these games I extracted the EPD of the positions at moves 15, 25 and 35, both with White and with Black to move.
I then discarded all positions that appeared more than once in the resulting EPD file.
I was left with 4830 positions.
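The deduplication step can be sketched as follows (a minimal illustration, not the script I actually used; positions are compared on the first four EPD fields so that the same position reached in two different games counts as a duplicate and is dropped entirely):

```python
from collections import Counter

def unique_positions(epd_lines):
    """Keep only the positions that occur exactly once in the file.

    The comparison key is the first four EPD fields (piece placement,
    side to move, castling rights, en-passant square), so identical
    positions with different trailing opcodes still count as duplicates.
    """
    def key(epd):
        return " ".join(epd.split()[:4])

    counts = Counter(key(line) for line in epd_lines)
    return [line for line in epd_lines if counts[key(line)] == 1]
```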
Testing was done using Polyglot’s “epd-test” function. For each position the engine under test was given three seconds to find the move that the correspondence master had played, so a full run took about four hours per engine (4830 positions × 3 s).
The total number of “found” moves was recorded.
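Conceptually the scoring is just a match count over the position set. A sketch (this is not Polyglot’s actual code; `engine_move` is a stand-in for the real engine invocation at the fixed time limit):

```python
def count_found(epd_records, engine_move):
    """Count how often the engine reproduces the master's move.

    epd_records: list of (epd, master_move) pairs, where master_move is
    the move actually played in the correspondence game.
    engine_move: callable taking an EPD and returning the move the
    engine chose after its allotted think time (a stand-in here for
    the real engine call).
    """
    return sum(1 for epd, master_move in epd_records
               if engine_move(epd) == master_move)
```

In the real test the callable would drive the engine over UCI and return its best move after three seconds; here any function with that shape works.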
Testing was performed on a quad-core PC (Q6600 overclocked to 3.5 GHz) under Windows XP 64-bit.
Using this method, 16 engines with CCRL 40/40 ratings evenly distributed between 2590 and 3320 were tested. Where available, 64-bit versions of the engines were used.
Here is the list of the engines with the number of threads used by each one:
Rybka 3 - T4
Naum 4 - T4
Rybka 2.2n2 - T4
Zappa Mex-II - T4
DeepSjeng WC2008 - T4
Hiarcs 12 - T4
Bright 0.4a - T4
Glaurung 2.2 - T4
Naum 3.1 - T1
Fruit 2.3.1 - T1
Spike 1.2 Turin - T1
ChessTiger 2007.1 - T1
Colossus 2008b - T1
Aristarch 4.50 - T1
SOS 5.1 - T1
Yace 0.99.87 - T1
Here are the results:
Estimated CCRL 40/40 Abs(Error)
3164 3228 64
3101 3152 51
3032 3109 77
3109 3070 39
2905 3029 124
3161 3004 157
2979 2998 19
3007 2994 13
2917 2962 45
2912 2882 30
2869 2849 20
2774 2802 28
2827 2747 80
2646 2699 53
2760 2666 94
2619 2590 29
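The “Estimated” column is obtained by calibrating the raw found-move counts against the known CCRL ratings of these same engines; as I understand it, a low-order (linear) fit is the natural choice. A minimal sketch of such a calibration, with made-up counts since the raw counts are not reproduced here:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical found-move counts for four calibration engines --
# placeholders to show the mechanics, NOT the real test data.
counts = [2100, 2200, 2350, 2500]
ccrl = [2600, 2750, 2950, 3150]
slope, intercept = fit_line(counts, ccrl)

def estimate_elo(found):
    """Map a found-move count onto the CCRL scale via the fitted line."""
    return slope * found + intercept
```

Once the line is fitted on known engines, `estimate_elo` can be back-applied to them (as in the table above) or applied to a new engine’s count.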
So there is a trend: stronger engines tend to agree more often than weaker ones with the moves played by correspondence masters when given only a very short analysis time.
But is this correlation strong enough to be of practical value for real testing tasks?
The median error of the estimation is 58 Elo points: when we apply this rating procedure to the 16 engines, the estimate lands within 58 points of the real Elo for half of them.
But for the other half we can get a considerably worse value.
The test overestimates DeepSjeng by 157 Elo points and underestimates Zappa by 124 points.
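For the record, these error figures can be recomputed directly from the table above:

```python
# (estimated, ccrl) pairs copied from the results table above.
pairs = [
    (3164, 3228), (3101, 3152), (3032, 3109), (3109, 3070),
    (2905, 3029), (3161, 3004), (2979, 2998), (3007, 2994),
    (2917, 2962), (2912, 2882), (2869, 2849), (2774, 2802),
    (2827, 2747), (2646, 2699), (2760, 2666), (2619, 2590),
]

errors = [abs(est - ccrl) for est, ccrl in pairs]
worst = max(errors)                       # 157: the DeepSjeng over-estimation
mean_error = sum(errors) / len(errors)    # about 58 Elo points on average
```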
And this is when we back-apply the formula to the very engines from which it was built. We must always fear that applying it to a “foreign” engine could lead to an even larger error …
So the conclusion is evident: this test has some value (for example, for a quick preliminary rating of a completely unknown engine), but it is nowhere near precise enough for someone who is busy tuning an engine and needs to discriminate between two versions that will probably differ by no more than a few Elo points.
Marc
PS: I also tested quite a few variants of the test (more or less time allowed, larger or smaller numbers of positions, higher-order interpolation formulas, and so on) but was not able to get anything better than the example shown.