Let's say you get an idea for improving your chess program and decide to test it at some blitz time limit. You determine that the idea breaks even or performs a little worse. As usual, you would discard the idea and move on to something else to try. However, what if you instead decide to test it again at a much faster bullet time limit? And what if it then performs decidedly worse at bullet than at blitz? I think discarding the idea then becomes a mistake, because what you have actually demonstrated is that the idea scales well. Your idea performs badly at bullet and so-so at blitz, so doesn't that suggest it would do well at a slow time limit?
My point is that an idea is only truly bad if it both performs badly and scales badly. Conversely, even if it had been 100 Elo better at bullet but merely 50 Elo better at blitz, I think the idea would very likely be a bad one for slow games. You need to perform at least two tests at significantly contrasting time limits.
This brings me to a perhaps even more important point: you don't actually need to play slow games to determine whether an idea is a good one. You just need to use more than one time control to show that it scales well, so you may not need that cluster to do decent testing.
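To make the two-time-control check concrete, here is a minimal sketch in Python. The match scores are invented purely for illustration, and the only real content is the standard logistic score-to-Elo conversion; this is not anyone's actual test harness.
[code]
# Hypothetical illustration of the two-time-control comparison described above.
# The match scores are invented; only the arithmetic is real.
import math

def elo_diff(score):
    """Convert a match score (fraction of points) into an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Assumed results for the modified version against the same opponent pool.
bullet_score = 0.47   # e.g. 470/1000 points at a bullet time control
blitz_score = 0.495   # e.g. 495/1000 points at a blitz time control

bullet_elo = elo_diff(bullet_score)
blitz_elo = elo_diff(blitz_score)

print(f"bullet: {bullet_elo:+.1f} Elo, blitz: {blitz_elo:+.1f} Elo")
if blitz_elo > bullet_elo:
    print("Performance improves as the time control lengthens -> worth a slow-game test.")
else:
    print("Performance degrades as the time control lengthens -> likely a poor scaler.")
[/code]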
Something Evil
One thing that should be borne in mind is that the concept of scaling can be used to deceive your chess programming competitors by withholding your scaling data. You can steer them towards ideas that do well at fast time controls but scale poorly. Similarly, you can steer them away from ideas that perform poorly at fast time controls but scale really well. I think scaling is a more important attribute than raw playing strength, at least in the long run.
Note that such a deception can occur accidentally. Let's say you have some especially fast hardware. You test your idea and announce the result. Others decide to test your idea and get a poorer result, poor enough to make them dislike the idea, but what has really happened is that they have inadvertently shown the idea to scale well by testing on their slower hardware. Paradoxically, in this case a bad result may indicate an idea is promising! It's like I said: [i]you need to perform at least two tests at significantly contrasting time limits.[/i]
Scaling
bob
Re: Scaling
rjgibert wrote:
Let's say you get an idea for improving your chess program and decide to test it at some blitz time limit. You determine that the idea breaks even or performs a little worse. As usual, you would discard the idea and move on to something else to try. However, what if you instead decide to test it again at a much faster bullet time limit? And what if it then performs decidedly worse at bullet than at blitz? I think discarding the idea then becomes a mistake, because what you have actually demonstrated is that the idea scales well. Your idea performs badly at bullet and so-so at blitz, so doesn't that suggest it would do well at a slow time limit?

I disagree with most of the above. Here's why. Very fast games are much more sensitive to how a program allocates and uses time, which introduces a lot more randomness into the games. Ideas that slightly slow a program down look quite bad at very fast games, while they might be good at a longer time control and even better at normal time controls. Even decent blitz speeds require tens of thousands of games to get a decent error bar, and games like 1+1 cause such a test to take 12 hours on my cluster, or 12*256 hours for a single box.
I've found several ideas that look good in fast games but are significantly worse in long games. I don't think there is any way to "cheat" the statistical gods here and get away with shortening the test process, unless you are talking about an idea that is 50-100 Elo better. I have yet to find any of those in my cluster testing. Most are 3-5 Elo better, which takes 32,000+ games to measure accurately...
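The 32,000-game figure can be sanity-checked with a quick error-bar calculation. The sketch below assumes a roughly even score and a 40% draw ratio (both assumptions) and uses the usual normal approximation; it is an illustration of the statistics, not of the actual cluster methodology.
[code]
# Rough check of the "32,000+ games for a few Elo" figure. The 40% draw ratio
# is an assumption; the rest is the usual normal-approximation error bar.
import math

def elo_error_bar_95(games, draw_ratio=0.4, score=0.5):
    """Approximate 95% error bar (in Elo) for a match result near an even score."""
    win = score - draw_ratio / 2.0
    # Per-game variance of outcomes in {0, 0.5, 1.0}.
    variance = win * 1.0 + draw_ratio * 0.25 - score ** 2
    std_err = math.sqrt(variance / games)
    # Near 50%, one unit of score is worth 400 / ln(10) / (s * (1 - s)) Elo.
    elo_per_score = 400.0 / math.log(10) / (score * (1.0 - score))
    return 1.96 * std_err * elo_per_score

for n in (1000, 8000, 32000):
    print(f"{n:>6} games -> +/- {elo_error_bar_95(n):.1f} Elo (95% confidence)")
[/code]
With these assumptions the error bar only drops to about +/- 3 Elo once the match reaches roughly 32,000 games, which is consistent with the figure quoted above.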
I agree with your last statement. But for accurate results, you really need to test at a time control that is as close as possible to the one you actually intend to play at. I've had too many cases of very fast games looking good for a change, 1+1 games still looking good, but 30+30 games showing a 20-30 Elo loss with the new change.
mhull
Re: Scaling
bob wrote:
I agree with your last statement. But for accurate results, you really need to test at a time control that is as close as possible to the one you actually intend to play at. I've had too many cases of very fast games looking good for a change, 1+1 games still looking good, but 30+30 games showing a 20-30 Elo loss with the new change.

The processing cycles that are expended on today's hardware at short time controls would have made for long time controls on hardware 5 to 10 years ago. Is there a way to trend the "optimum" variables over time, so to speak? IOW, killer blitz settings of today (which are no good at standard today) would have been killer at standard some years ago. So where are the variables trending with increased compute power?
Matthew Hull
bob
Re: Scaling
mhull wrote:
The processing cycles that are expended on today's hardware at short time controls would have made for long time controls on hardware 5 to 10 years ago. Is there a way to trend the "optimum" variables over time, so to speak? IOW, killer blitz settings of today (which are no good at standard today) would have been killer at standard some years ago. So where are the variables trending with increased compute power?

I think the big issue is search. Search finds the "truth", while evaluation finds an approximation to the truth that is not very accurate. Which would you rather do: evaluate something (say a weak pawn), or have a search that can discover whether the pawn is _really_ weak or not by looking far enough ahead to determine this? The classic isolated queen pawn comes to mind. At times it is simply a weak target; at other times it is quite strong. An evaluation will have trouble with this. A search that can go deep enough will not.
Otherwise, it is very difficult to answer your question. The only way to do so is to play incredibly slow games to see what happens as you change values, but that is not practical for obvious reasons. The results only reveal themselves once hardware gets fast enough to show that today's long-T/C tuning is merely good for blitz on tomorrow's hardware, but not very good for tomorrow's long time controls.
I don't believe it is very easy to extrapolate forward to predict long-T/C results 5 years into the future, while it is trivial to extrapolate backward, since fast games today match up with slow games at some point in the past in terms of tree search space / node counts.
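The backward extrapolation is simple arithmetic once you assume an effective speedup rate. In the sketch below the doubling period is a pure assumption chosen for illustration, not a measured figure.
[code]
# Backward-extrapolation arithmetic: an assumed hardware/software speedup maps
# a fast time control today onto an equivalent, slower one some years ago.
# The doubling period is purely an assumption chosen for illustration.
DOUBLING_PERIOD_YEARS = 2.0  # assumed effective doubling time of engine speed

def equivalent_past_seconds(seconds_per_game_today, years_ago):
    """Seconds per game that would have searched a similar-sized tree years ago."""
    speedup = 2.0 ** (years_ago / DOUBLING_PERIOD_YEARS)
    return seconds_per_game_today * speedup

for years in (5, 10):
    minutes = equivalent_past_seconds(60, years) / 60.0
    print(f"60s/game today ~ {minutes:.0f} min/game {years} years ago "
          f"(assuming speed doubles every {DOUBLING_PERIOD_YEARS:g} years)")
[/code]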
rjgibert
Re: Scaling
bob wrote:
I disagree with most of the above. Here's why. Very fast games are much more sensitive to how a program allocates and uses time, which introduces a lot more randomness into the games. Ideas that slightly slow a program down look quite bad at very fast games, while they might be good at a longer time control and even better at normal time controls. Even decent blitz speeds require tens of thousands of games to get a decent error bar, and games like 1+1 cause such a test to take 12 hours on my cluster, or 12*256 hours for a single box.

Let's say you have a version X of your program that is your current version, and a new version Y that you want to compare it to that only has changes to its eval. Now you test each one twice (bullet + blitz) against an identical mixture of opponents, and let's say the result is that version Y is much weaker at bullet and only a little bit weaker at blitz. Which version do you prefer for a slow game, X or Y?
Note that the time management code in version X is identical to that in version Y. Problems caused by this code in one will be shared, more or less to the same degree, by the other at the same time limits. It may introduce noise into the result, but more importantly it shouldn't introduce bias.
bob
Re: Scaling
rjgibert wrote:
Let's say you have a version X of your program that is your current version, and a new version Y that you want to compare it to that only has changes to its eval. Now you test each one twice (bullet + blitz) against an identical mixture of opponents, and let's say the result is that version Y is much weaker at bullet and only a little bit weaker at blitz. Which version do you prefer for a slow game, X or Y?

If that is all I had to go on, I am not sure what I would choose. I have had too many cases where things looked good at fast time controls, both bullet and blitz, but were clearly worse at standard time controls. I have had cases where things looked bad at bullet/blitz, but better at standard.
I use your "trend" idea, but in a different way. If blitz is better than bullet, that triggers a longer-game test. If blitz is worse than bullet, I often stop there, and only occasionally try long games if my intuition tells me that the idea ought to work...
The main problem is that you still need tens of thousands of games at each level to get an accurate measure and weed out the randomness.
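As a rough paraphrase, the gating rule described above could be written as a tiny helper; the Elo inputs here are placeholders, and a real procedure would obviously also weigh the error bars just discussed.
[code]
# Tiny sketch of the gating rule described above. The inputs are hypothetical
# Elo estimates from bullet and blitz runs; real use would also weigh error bars.
def should_run_long_test(bullet_elo, blitz_elo, intuition_says_try=False):
    """Promote a change to a long-time-control run only if the trend is upward."""
    if blitz_elo > bullet_elo:
        return True             # result improves with more time -> test slow games
    return intuition_says_try   # otherwise only on a hunch

print(should_run_long_test(bullet_elo=-8.0, blitz_elo=3.0))   # True
print(should_run_long_test(bullet_elo=5.0, blitz_elo=-2.0))   # False
[/code]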
MattieShoes
Re: Scaling
Out of curiosity, do you remember any tests you've done that don't follow the trend? Where both lightning and standard time controls show gains but blitz is worse? Or vice versa?
I'm just curious what sort of change might cause that.
rjgibert
Re: Scaling
I think you're overlooking something rather obvious. You say that to make sure, you test with slow games, but this is not possible. The changes you have made to Crafty are legion, and most of them were tested on much slower hardware. You might do some retesting on your faster hardware from time to time, but you can't repeat all the tests you have made for all of your changes. In short, yesteryear's slow-game testing is effectively fast-game testing today, so whether you like it or not, you must do some kind of extrapolation, whether you are aware of it or not. The best you can do is settle on some rational, coherent procedure like the one I suggest, rather than pretend your slow-game testing makes what you do all that different. The fact that hardware is continually improving at least as fast as the software undermines your scheme. This is just one of many fundamental problems of testing in computer chess.
If you don't like judging scaling from 2 fast time limits, maybe you would prefer testing at 3 fast time limits to see whether the scaling curves upward or downward. The thing to do is to measure the general reliability of any such procedure to see just how useful it is.
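One way to read three such measurements is to fit the Elo gain against the logarithm of the time control and look at the trend and the curvature. The data points below are invented, and with only three noisy measurements any fit is indicative at best; this is a sketch of the idea, not a validated procedure.
[code]
# Sketch of the three-time-limit idea: fit the measured Elo gain against the
# logarithm of the time control and look at the trend and curvature. The data
# points are invented; with three noisy measurements this is indicative at best.
import math

# (seconds per game, measured Elo gain of the change) -- hypothetical numbers
results = [(10, -12.0), (60, -3.0), (180, 2.0)]

xs = [math.log(t) for t, _ in results]
ys = [e for _, e in results]
n = len(results)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
print(f"Trend: {slope * math.log(2):+.1f} Elo per doubling of the time control")

# Curvature direction: is the middle point above or below the straight line
# (chord) joining the fastest and slowest results?
chord_at_mid = ys[0] + (ys[2] - ys[0]) * (xs[1] - xs[0]) / (xs[2] - xs[0])
print("curves upward" if ys[1] > chord_at_mid else "curves downward (or is straight)")
[/code]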
Aleks Peshkov
Re: Scaling
If your bullet performance is bad, most probably your modification is badly tuned, or tuned for some magic search depth; it does not mean that it scales well. A badly tuned function may have its maximum at an arbitrary point and can easily degrade even with infinite time.
bob
Re: Scaling
MattieShoes wrote:
Out of curiosity, do you remember any tests you've done that don't follow the trend? Where both lightning and standard time controls show gains but blitz is worse? Or vice versa? I'm just curious what sort of change might cause that.

I don't remember any specifics. I am looking right now at an LMR modification. It looked bad at very fast time controls. At blitz, it appeared to be a 10-15 Elo improvement. But at 10+10 it dropped to a 10 Elo loss, and after an incomplete 30+30 run it looked like a -25 to -30 Elo change before I moved on to the next idea.
That isn't all that common, and I don't take eval changes to quite that extreme in testing, but I try to vet any search changes carefully since many are depth-sensitive.