This has been discussed a good bit, but I just ran into a case that I thought might be interesting. I slightly modified the default piece values. Original values were P=1.0, N/B=3.25, R=5.0 and Q=9.7. I changed the R/Q values to 5.5 and 10.7. And ran a couple of fast game tests (32,000 games per run) at 10s+0.1s time controls. And found that this was worth about +10 Elo. I then re-ran the same test, but changed the time control to 5m+5s (much slower). Took a lot longer, but the interesting thing was this was a -20 Elo change. Nothing else changed between the two versions, both runs were 32,000 games against the usual opponents and positions.
We've had this discussion in the past where some claim that they've never seen a case where a program was better at fast games than at slow games or vice-versa. Here is a simple change that produces exactly that. Version A (original) is 10 Elo weaker than version B (new material scores) at very fast games. But version A is 20 Elo stronger at longer games. A 30 Elo change.
Goes to show that just fast games is not enough.
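(For concreteness, here is a minimal sketch of the two material sets in centipawns, the way a C engine typically stores them; the array layout and the names are illustrative only, not Crafty's actual source.)
Code: Select all
/* Illustrative only -- piece values in centipawns, indexed by piece type.
   The {none, P, N, B, R, Q} layout is an assumption for illustration. */
static const int value_original[6] = { 0, 100, 325, 325, 500,  970 };
static const int value_modified[6] = { 0, 100, 325, 325, 550, 1070 };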
fast vs slow games in testing
Moderator: Ras
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
-
Dann Corbit
- Posts: 12808
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: fast vs slow games in testing
I have a theory on this behavior (or, rather, some wild conjecture).
bob wrote: This has been discussed a good bit, but I just ran into a case that I thought might be interesting. I slightly modified the default piece values. Original values were P=1.0, N/B=3.25, R=5.0 and Q=9.7. I changed the R/Q values to 5.5 and 10.7. And ran a couple of fast game tests (32,000 games per run) at 10s+0.1s time controls. And found that this was worth about +10 Elo. I then re-ran the same test, but changed the time control to 5m+5s (much slower). Took a lot longer, but the interesting thing was this was a -20 Elo change. Nothing else changed between the two versions, both runs were 32,000 games against the usual opponents and positions.
We've had this discussion in the past where some claim that they've never seen a case where a program was better at fast games than at slow games or vice-versa. Here is a simple change that produces exactly that. Version A (original) is 10 Elo weaker than version B (new material scores) at very fast games. But version A is 20 Elo stronger at longer games. A 30 Elo change.
Goes to show that just fast games is not enough.
I think that deeper searches reveal the power of the mighty pieces.
The initial values are just hints to the program to give a good ranking to the pieces and to estimate what is a good trade and a bad trade. But as we search deeper and deeper, the mighty chessmen wreak far more havoc than the weaklings. In these situations, I think that search is revealing more and more truth about the true value of the chessmen. I think (even further) that if we could search 100 plies ahead in one second, we could assign all of the chessmen a value of 0 and their true value would be revealed by the search.
On the other hand, the picture that you present shows the opposite of my thought experiment.
On the other, other hand -- I did an experiment at 5 minutes + 5 seconds {ponder was on, 3.0 GHz} (only 200 games, unfortunately) which shows +16 Elo for:
Code: Select all
{0, 100, 366, 372, 566, 1125, 10000};
versus
Code: Select all
{0, 100, 325, 325, 500, 975, 10000};
The error bars overlap so it is clearly not decisive, but interesting anyway.
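(For readers unfamiliar with these tables: the entries are centipawn values indexed by piece type, presumably {none, pawn, knight, bishop, rook, queen, king}. A minimal, hypothetical sketch of how such a table is typically used to count material -- not Dann's actual code:)
Code: Select all
/* Hypothetical illustration of how a value table like the ones above is
   usually consumed: sum piece values over a simple piece-count array. */
enum { NONE, PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };
static const int piece_value[7] = {0, 100, 366, 372, 566, 1125, 10000};

int material(const int count[7])      /* count[p] = number of pieces of type p */
{
    int score = 0;
    for (int p = PAWN; p <= QUEEN; p++)   /* kings are never actually traded */
        score += count[p] * piece_value[p];
    return score;
}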
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: fast vs slow games in testing
I think the error bar is the killer here. I've had way too many cases where after 500 games something looks good, and by 5000 games, it has plummeted...
Dann Corbit wrote: I have a theory on this behavior (or, rather, some wild conjecture).
bob wrote: This has been discussed a good bit, but I just ran into a case that I thought might be interesting. I slightly modified the default piece values. Original values were P=1.0, N/B=3.25, R=5.0 and Q=9.7. I changed the R/Q values to 5.5 and 10.7. And ran a couple of fast game tests (32,000 games per run) at 10s+0.1s time controls. And found that this was worth about +10 Elo. I then re-ran the same test, but changed the time control to 5m+5s (much slower). Took a lot longer, but the interesting thing was this was a -20 Elo change. Nothing else changed between the two versions, both runs were 32,000 games against the usual opponents and positions.
We've had this discussion in the past where some claim that they've never seen a case where a program was better at fast games than at slow games or vice-versa. Here is a simple change that produces exactly that. Version A (original) is 10 Elo weaker than version B (new material scores) at very fast games. But version A is 20 Elo stronger at longer games. A 30 Elo change.
Goes to show that just fast games is not enough.
I think that deeper searches reveal the power of the mighty pieces.
The initial values are just hints to the program to give a good ranking to the pieces and to estimate what is a good trade and a bad trade. But as we search deeper and deeper, the mighty chessmen wreak far more havoc than the weaklings. For these situations, I think that search is revealing more and more truth about the true value of the chessmen. I think (even further) that if we could search 100 plies ahead in one second, that we could assign all of the chessmen the value of 0 and their true value would be revealed by the search.
On the other hand, the picture that you present shows the opposite ideas as my thought experiment.
On the other, other hand -- I did an experiment at 5 minutes + 5 seconds {ponder was on, 3.0 GHz} (only 200 games, unfortunately) which shows +16 Elo for:
Code: Select all
{0, 100, 366, 372, 566, 1125, 10000};
versus
Code: Select all
{0, 100, 325, 325, 500, 975, 10000};
The error bars overlap so it is clearly not decisive, but interesting anyway.
I've been doing lots of verification tests, where I use fast games to evaluate a change, and then a slow game match to verify the results. Many times they back each other up quite well. And then recently I have had several cases where something looked good in a fast game match, and dropped significantly at longer games. I'd love to test at 40/2hr or something, but that needs something like the next cluster we are looking at, which might be in the 2000 core range if things work out. At 8,000 games a day (assuming about 6 hours per game), it would actually be possible to test at that time control.
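(As a back-of-the-envelope illustration of why the game count matters so much -- my own approximation, not Bob's testing framework -- here is the 95% Elo error bar for an n-game match near a 50% score, plus the cluster throughput arithmetic:)
Code: Select all
/* Rough 95% Elo error bar for an n-game match near a 50% score.
   Assumes a per-game score standard deviation of about 0.4 (typical with
   draws); the 695 factor is d(Elo)/d(score) at 50%, i.e. 400/ln(10)/0.25. */
#include <math.h>
#include <stdio.h>

double elo_error_bar(int games)
{
    double se = 0.4 / sqrt((double)games);   /* standard error of the score */
    return 1.96 * se * 695.0;                /* convert to Elo, 95% interval */
}

int main(void)
{
    printf("200 games:   +/- %.0f Elo\n", elo_error_bar(200));     /* about 39 */
    printf("32000 games: +/- %.0f Elo\n", elo_error_bar(32000));   /* about 3  */
    printf("games/day on 2000 cores at 6h/game: %d\n", 2000 * 24 / 6); /* 8000 */
    return 0;
}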
-
rjgibert
- Posts: 317
- Joined: Mon Jun 26, 2006 9:44 am
Re: fast vs slow games in testing
I think the issue isn't just whether or not this can happen, but also if it does, why? I suspect at least 2 things may be going on here.
bob wrote: This has been discussed a good bit, but I just ran into a case that I thought might be interesting. I slightly modified the default piece values. Original values were P=1.0, N/B=3.25, R=5.0 and Q=9.7. I changed the R/Q values to 5.5 and 10.7. And ran a couple of fast game tests (32,000 games per run) at 10s+0.1s time controls. And found that this was worth about +10 Elo. I then re-ran the same test, but changed the time control to 5m+5s (much slower). Took a lot longer, but the interesting thing was this was a -20 Elo change. Nothing else changed between the two versions, both runs were 32,000 games against the usual opponents and positions.
We've had this discussion in the past where some claim that they've never seen a case where a program was better at fast games than at slow games or vice-versa. Here is a simple change that produces exactly that. Version A (original) is 10 Elo weaker than version B (new material scores) at very fast games. But version A is 20 Elo stronger at longer games. A 30 Elo change.
Goes to show that just fast games is not enough.
One, deeper search can offset small errors by a program. Two, some eval changes can result in a speed-up or slow-down of the search.
Perhaps greater search depth matters relatively more at faster time controls, while a better eval matters relatively more at slower time controls.
Is there a significant difference in time to depth between the two versions? If not, something else is going on that is maybe more subtle.
-
Eelco de Groot
- Posts: 4692
- Joined: Sun Mar 12, 2006 2:40 am
- Full name: Eelco de Groot
Re: fast vs slow games in testing
Hello all!
rjgibert wrote: I think the issue isn't just whether or not this can happen, but also if it does, why? I suspect at least 2 things may be going on here.
bob wrote: This has been discussed a good bit, but I just ran into a case that I thought might be interesting. I slightly modified the default piece values. Original values were P=1.0, N/B=3.25, R=5.0 and Q=9.7. I changed the R/Q values to 5.5 and 10.7. And ran a couple of fast game tests (32,000 games per run) at 10s+0.1s time controls. And found that this was worth about +10 Elo. I then re-ran the same test, but changed the time control to 5m+5s (much slower). Took a lot longer, but the interesting thing was this was a -20 Elo change. Nothing else changed between the two versions, both runs were 32,000 games against the usual opponents and positions.
We've had this discussion in the past where some claim that they've never seen a case where a program was better at fast games than at slow games or vice-versa. Here is a simple change that produces exactly that. Version A (original) is 10 Elo weaker than version B (new material scores) at very fast games. But version A is 20 Elo stronger at longer games. A 30 Elo change.
Goes to show that just fast games is not enough.
One, deeper search can offset small errors by a program. Two, some eval changes can result in a speed up or slow down of search.
Perhaps greater search depth has more importance at faster time controls relatively, while better eval has relatively greater importance at slower time controls.
Is there a significant difference in time to depth between the 2 versions? If not, something else is going on that is maybe more subtle.
Just as a first thought, and the more senior programmers here must have done these kinds of experiments hundreds of times. But nevertheless, the difference between the two sets of material values seems rather large to me: suddenly the Queen would be valued a pawn less at longer time controls? It just seems a bit fishy.
The trend that Bob sees is maybe not so surprising; at longer time controls, I would put more trust in the positional evaluations. If a side has bad King safety, at longer time controls it would be in more danger. Maybe King safety is a particularly bad example for my case because of its volatile nature. At long time controls the side under attack also has better chances of finding a correct defence if there is one.
So sharp continuations have to be of a different nature, but they may still pay off; you can see in human correspondence games that outcalculating your opponent is still the best way to win a correspondence game. It should follow from this reasoning that not all positional values will pay off more (at longer time controls). Also there will be lots of other effects that may come into play; with a more materialistic search (Crafty with the new values at short time controls) you should be able to search a bit deeper (within limits). But as Harm Geert pointed out in the thread started by Mark about piece values, changing the values of Rook and Queen only makes a difference if the two sides do not have equal numbers of these pieces. All in all I don't know if I would entirely trust this kind of result, even if it is based on so many games. Does the change make a difference in trade-offs? Is the effect different at longer time controls? Maybe that is something that could be measured, as a check that the new material values have the effects Bob would expect.
Regards, Eelco
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
-
Uri Blass
- Posts: 11125
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: fast vs slow games in testing
rjgibert wrote: I think the issue isn't just whether or not this can happen, but also if it does, why? I suspect at least 2 things may be going on here.
bob wrote: This has been discussed a good bit, but I just ran into a case that I thought might be interesting. I slightly modified the default piece values. Original values were P=1.0, N/B=3.25, R=5.0 and Q=9.7. I changed the R/Q values to 5.5 and 10.7. And ran a couple of fast game tests (32,000 games per run) at 10s+0.1s time controls. And found that this was worth about +10 Elo. I then re-ran the same test, but changed the time control to 5m+5s (much slower). Took a lot longer, but the interesting thing was this was a -20 Elo change. Nothing else changed between the two versions, both runs were 32,000 games against the usual opponents and positions.
We've had this discussion in the past where some claim that they've never seen a case where a program was better at fast games than at slow games or vice-versa. Here is a simple change that produces exactly that. Version A (original) is 10 Elo weaker than version B (new material scores) at very fast games. But version A is 20 Elo stronger at longer games. A 30 Elo change.
Goes to show that just fast games is not enough.
One, deeper search can offset small errors by a program. Two, some eval changes can result in a speed up or slow down of search.
Perhaps greater search depth has more importance at faster time controls relatively, while better eval has relatively greater importance at slower time controls.
Is there a significant difference in time to depth between the 2 versions? If not, something else is going on that is maybe more subtle.
The problem is that it is possible to achieve bigger depth with counterproductive pruning, so it is not clear to me that a program that searches slightly deeper in terms of plies has a better search.
The only good test may be testing at fixed depth.
Note also that a difference of 30 Elo between blitz and long time control may happen, but I have not seen a single case of an observed difference of more than 100 Elo between blitz and long time control for different programs.
The only program that is close to a 100 Elo difference in rating between blitz and long time control is Rybka 3, which is almost 100 Elo better at blitz based on the CEGT rating lists, but having a higher rating at blitz is a problem of almost all the top programs.
CEGT 40/120 list (columns: rank, program, rating, +, -, games, score %, average opponent rating, draw %; the figure in parentheses is the 40/120 rating minus the same program's 40/4 blitz rating from the list below)
1 Rybka 3 x64 2CPU 3117 22 22 582 72.3 % 2950 41.1 % (-92 elo relative to blitz)
2 Rybka 2.3.2a x64 2CPU 3034 15 15 1344 75.3 % 2841 38.9 %(-33 elo relative to blitz)
3 Naum 4 x64 2CPU 3024 28 29 287 44.4 % 3063 49.8 % (-43 elo relative to blitz)
4 Rybka 2.1c x64 2CPU 2989 15 15 1272 70.0 % 2841 39.9 %(-23 elo relative to blitz)
5 Rybka 2.3 x64 2CPU 2981 15 14 1319 69.2 % 2841 42.1 %(-16 elo relative to blitz)
6 Rybka 1.2f x64 2956 14 14 1401 68.4 % 2821 40.0 %(-2 elo relative to blitz)
7 Zappa Mexico II x64 2CPU 2943 15 15 1188 55.6 % 2904 45.9 %(+13 elo relative to blitz)
8 Deep Shredder 11 x64 2CPU 2932 16 16 1067 54.8 % 2898 42.4 %(-12 elo relative to blitz)
9 Naum 3 x64 2CPU 2925 15 15 1235 55.3 % 2887 43.6 %(-31 elo relative to blitz)
10 Zap!Chess Zanzibar x64 2CPU 2923 14 14 1350 61.9 % 2839 45.5 %(-8 elo relative to blitz)
11 Fritz 11 2902 16 16 1010 55.1 % 2866 45.1 %(-13 elo relative to blitz)
12 Naum 2.2 x64 2CPU 2897 13 13 1423 57.4 % 2845 48.6 % (-2 elo relative to blitz)
13 Deep Fritz 10 2CPU 2869 15 15 1330 54.8 % 2836 39.2 % (-21 elo relative to blitz)
14 Glaurung 2.1 X64 2CPU 2865 15 15 1101 49.8 % 2866 45.0 %(-39 elo relative to blitz)
15 HIARCS 11.2 2CPU 2860 15 15 1204 45.3 % 2893 41.7 % (-52 elo relative to blitz)
16 Bright-0.3d 2CPU 2845 17 17 935 47.6 % 2861 41.9 %(-22 elo relative to blitz)
17 Naum 2.1 x64 2CPU 2840 15 15 1150 51.1 % 2833 46.7 %(+1 elo relative to blitz)
18 HIARCS 11 2CPU 2836 16 16 1101 48.9 % 2843 39.5 %(-38 elo relative to blitz)
19 Deep Shredder 10 x64 2CPU 2834 14 14 1450 47.6 % 2851 40.5 %(-21 elo relative to blitz)
21 List 11.64 2CPU 2831 13 13 1550 47.5 % 2848 42.7 %(-34 elo relative to blitz)
22 Zap!Chess Paderborn x64 2CPU 2830 14 14 1300 47.9 % 2844 43.1 %(+1 elo relative to blitz)
23 Fruit 2.3.3f 2825 16 16 1050 45.1 % 2859 40.0 %(-35 elo relative to blitz)
24 Spike 1.2 Turin 2CPU 2815 13 13 1500 44.3 % 2855 43.7 %(-49 elo relative to blitz)
25 Deep Junior 10 2CPU 2813 15 15 1396 45.0 % 2848 35.4 %(-8 elo relative to blitz)
27 Toga II 1.3.1 2807 15 15 1300 43.6 % 2852 40.7 %
28 Fritz 10 2794 17 17 1050 46.3 % 2819 33.6 %
29 Toga II 1.2.1 2791 15 15 1150 43.3 % 2838 42.2 %
30 Deep Sjeng 2.7 2CPU 2790 16 16 1150 42.8 % 2840 40.1 %
31 Hiarcs 10 2780 16 16 1150 42.0 % 2836 38.2 %
32 Spike 1.2 Turin 2772 14 14 1450 40.4 % 2839 38.8 %
33 Fruit 2.2 2768 16 16 1100 40.1 % 2838 39.8 %
34 Fritz 9 2757 17 17 1000 40.8 % 2821 35.5 %
35 Hiarcs X50 UCI 2752 16 16 1088 40.9 % 2815 40.2 %
36 Glaurung 1.2 2CPU 2734 16 16 1200 37.6 % 2822 37.8 %
37 Ktulu 8 2724 15 15 1350 36.4 % 2821 37.0 %
38 Chess Tiger 2007 2699 17 17 1000 33.8 % 2817 38.1 %
CEGT 40/4 list (same columns as above)
4 Rybka 3.0 x64 2CPU 3209 15 15 1912 80.6% 2961 24.5%
13 Rybka 2.3.2a x64 2CPU 3067 10 10 2834 70.2% 2919 38.2%
12 Naum 4.0 x64 2CPU 3067 13 13 1600 63.2% 2973 39.8%
27 Rybka 2.1c x64 2CPU 3012 35 35 390 77.8% 2794 20.8%
34 Rybka 2.3 x64 2CPU 2997 20 20 900 72.2% 2832 29.9%
58 Rybka 1.2f x64 1CPU 2958 9 9 3978 63.8% 2859 35.5%
86 Zappa Mexico x64 2CPU 2930 9 9 3643 54.1% 2901 37.7%
71 Deep Shredder 11 x64 2CPU 2944 8 8 4443 50.0% 2944 32.6%
60 Naum 3.0 x64 2CPU 2956 11 11 2370 56.1% 2914 38.3%
85 Zap!Chess Zanzibar x64 2CPU 2931 15 15 1330 45.9% 2959 34.1%
95 Fritz 11 2915 6 6 8014 54.4% 2884 38.5%
106 Naum 2.2 x64 2CPU 2899 12 12 1836 52.6% 2880 42.6%
118 Deep Fritz 10 2CPU 2890 11 11 2790 49.2% 2896 28.9%
102 Glaurung 2.1 x64 2CPU 2904 11 11 2738 40.0% 2975 35.8%
97 Hiarcs 11.2 2CPU 2912 11 11 2608 51.4% 2902 36.8%
134 Bright 0.3d 2CPU 2867 15 15 1400 46.4% 2892 34.5%
161 Naum 2.1 x64 2CPU 2839 14 14 1423 49.7% 2841 38.5%
130 Hiarcs 11 2CPU 2874 17 17 1049 47.8% 2889 33.3%
149 Deep Shredder 10 x64 2CPU 2855 12 12 2150 47.8% 2870 30.2%
135 List-MP 11.64 2CPU 2865 15 15 1316 45.9% 2894 32.3%
172 Zap!Chess Paderborn x64 2CPU 2829 28 28 426 55.2% 2793 27.7%
140 Fruit 2.3.3f Test Beta 2860 8 8 4600 46.6% 2883 35.4%
137 Spike 1.2 Turin 2CPU 2864 6 6 8412 42.4% 2918 35.5%
184 Deep Junior 10 2CPU 2821 14 14 1698 42.3% 2875 25.9%
185 Toga II 1.3.1 2820 15 15 1300 51.0% 2814 38.5%
162 Fritz 10 2838 8 8 4620 50.0% 2839 28.9%
198 Toga II 1.2.1a 2804 9 9 4250 47.8% 2819 32.4%
182 Deep Sjeng 2.7 2CPU 2822 12 12 2178 43.5% 2867 33.4%
219 Hiarcs 10 2777 8 8 4590 55.1% 2741 30.2%
216 Spike 1.2 Turin 1CPU 2783 5 5 10214 45.9% 2812 33.7%
226 Fruit 2.2.1 2768 6 6 9203 55.5% 2729 31.4%
210 Fritz 9 2793 8 8 5754 54.4% 2762 27.0%
188 Hiarcs X50 2814 11 11 2519 49.3% 2819 33.5%
239 Glaurung 1.2 x64 2CPU 2746 33 33 304 53.5% 2722 28.6%
223 Ktulu 8 2773 6 6 8289 41.5% 2832 29.8%
240 Chess Tiger 2007.1 2745 8 8 4898 47.8% 2760 29.8%
Uri
-
MattieShoes
- Posts: 718
- Joined: Fri Mar 20, 2009 8:59 pm
Re: fast vs slow games in testing
With more datapoints, I suppose it'd be possible to construct a regression of optimal piece values based on TC*speed. If there's a 30 Elo difference with just those two settings, having piece values tuned to TC*speed could show a fairly significant strength jump, yes?
On a related note, have you experimented with changing piece values as the game progresses? Just from watching games, it seems to become much more important to have at least the same number of pieces since you eventually run out of pieces to protect your pawns...
Hmm, that'd be another interesting thing, adjusting piece values based on the number of pawns you have left... The most obvious would be something like KNP vs KB or something -- the pawn is worth far more than the knight at that point. Search would find this but correct default values based on situation could perhaps help in less obvious situations.
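(One common way to do this kind of adjustment is to interpolate piece values between middlegame and endgame sets by game phase, and to nudge the knight/bishop values by the number of pawns left. A minimal sketch under those assumptions -- the numbers are illustrative, not any particular engine's tuning:)
Code: Select all
/* Illustrative tapered piece values: phase runs from 24 (all minor/major
   pieces on the board) down to 0 (bare kings); values are centipawns.
   The pawn-count adjustment follows the common "knights like closed,
   pawn-heavy positions; bishops like open ones" heuristic. */
static const int mg_value[7] = {0, 100, 325, 325, 500,  975, 10000};
static const int eg_value[7] = {0, 120, 310, 330, 520, 1000, 10000};

int piece_value(int piece, int phase, int own_pawns)
{
    int v = (mg_value[piece] * phase + eg_value[piece] * (24 - phase)) / 24;
    if (piece == 2 /* knight */) v += 2 * (own_pawns - 5);  /* more pawns help N */
    if (piece == 3 /* bishop */) v -= 2 * (own_pawns - 5);  /* fewer pawns help B */
    return v;
}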
-
mcostalba
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: fast vs slow games in testing
I really would like to test at that speed, but I cannot do that. At that time control the depth is so low that futility doesn't kick in (currently the threshold is set at depth 7), so I need to reach at least depth 10-11 to have something real, and practically it means seeing the middle-game PV printed with a depth of 12-13 due to extensions.
bob wrote: And ran a couple of fast game tests (32,000 games per run) at 10s+0.1s time controls.
This constraint, given your hardware, yields the "realistic" minimum time control to use. I normally play at 1' + 0" under the Fritz GUI and 1' + 0.4" under Arena due to bigger (and crappy
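(A hedged sketch of the kind of depth-gated futility pruning Marco describes -- only below some depth threshold is a node even considered for pruning. The depth-7 gate matches his description, but the margin numbers are made up and this is not Stockfish's actual code:)
Code: Select all
/* Illustrative depth-gated futility pruning: only below some depth threshold
   (7 here, as in Marco's description) do we consider skipping nodes whose
   static eval is hopelessly below alpha.  Margins are made-up numbers. */
#define FUTILITY_DEPTH 7

int futility_margin(int depth)
{
    return 100 + 50 * depth;               /* centipawns, purely illustrative */
}

int can_prune_futile(int depth, int in_check, int static_eval, int alpha)
{
    if (in_check || depth >= FUTILITY_DEPTH)
        return 0;                 /* at deeper nodes futility never kicks in */
    return static_eval + futility_margin(depth) <= alpha;
}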
-
Karlo Bala
- Posts: 373
- Joined: Wed Mar 22, 2006 10:17 am
- Location: Novi Sad, Serbia
- Full name: Karlo Balla
Re: fast vs slow games in testing
Faster time controls favor queens, and it is logical. It is easier to play with a queen than against it when you have little time. 10.7 for the queen is exaggerated (that is B+2N+P), and the longer time control bears that out.
bob wrote: This has been discussed a good bit, but I just ran into a case that I thought might be interesting. I slightly modified the default piece values. Original values were P=1.0, N/B=3.25, R=5.0 and Q=9.7. I changed the R/Q values to 5.5 and 10.7. And ran a couple of fast game tests (32,000 games per run) at 10s+0.1s time controls. And found that this was worth about +10 Elo. I then re-ran the same test, but changed the time control to 5m+5s (much slower). Took a lot longer, but the interesting thing was this was a -20 Elo change. Nothing else changed between the two versions, both runs were 32,000 games against the usual opponents and positions.
We've had this discussion in the past where some claim that they've never seen a case where a program was better at fast games than at slow games or vice-versa. Here is a simple change that produces exactly that. Version A (original) is 10 Elo weaker than version B (new material scores) at very fast games. But version A is 20 Elo stronger at longer games. A 30 Elo change.
Goes to show that just fast games is not enough.
Best Regards,
Karlo Balla Jr.
-
hgm
- Posts: 28427
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: fast vs slow games in testing
It might also have to do with the evaluation. I noticed that at low search depth engines tend to blunder away pieces against a Queen in the end-game, when they think they can afford to leave them undefended because the actual loss is beyond the horizon. But programs that give a penalty for undefended pieces fare much better, as they prefer to keep their pieces defended even when they don't seem to lose them otherwise. And then there is no way they are going to lose them.
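(A minimal sketch of the undefended-piece penalty H.G. describes, with assumed names and a made-up penalty size -- not any specific engine's evaluation:)
Code: Select all
/* Illustrative "hanging piece" term: a small static penalty for every
   friendly non-pawn piece with no defender, so the engine prefers to keep
   its pieces protected even when no concrete loss is visible within the
   search horizon.  The name and the 15-centipawn value are assumptions. */
#define UNDEFENDED_PENALTY 15     /* centipawns per undefended piece */

int undefended_penalty(int num_pieces, const int defended[])
{
    int penalty = 0;
    for (int i = 0; i < num_pieces; i++)
        if (!defended[i])                  /* piece has no friendly defender */
            penalty += UNDEFENDED_PENALTY;
    return penalty;                        /* subtract from own evaluation */
}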