Yes. Against a specific "weakness" the program shows, not against a specific weakness the opponent shows. That was my point. I am not tuning to "beat Glaurung 2 or Toga 2." I am tuning to play better chess overall, and then measuring against those opponents to see whether the result is better or worse. It is certainly possible, although not very likely, that a "better eval" would play worse against one of them. But I am using 4 opponents most of the time, sometimes 6 or 8, and aggregating the results to look at the overall change. + = good, - = bad.

mhull wrote:
But you used to. IIRC, Roman would make suggestions based on his observations of Crafty's play on ICC, or a player would lean on a known weakness (merciless), and you would tune to improve on the mistakes made.

bob wrote:
I think it might be a bigger danger if we were looking at games played in the testing and then trying to tune the program to improve on mistakes made, because that would be more like training against a specific opponent. We are not developing like that.

Gian-Carlo Pascutto wrote:
I don't think we have much data on whether performance increases against a small set of engines might be non-transitive against others.
We do know *for sure*, though, the minimum error margins caused by playing a limited number of games.
Since Bob is testing small changes, he's going to be more worried about the latter (which we are certain affects us) than the former (which is an unknown factor that may or may not affect us).
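To put a number on that error-margin point, here is a minimal sketch of the kind of calculation being referred to. This is not Crafty's actual test harness, and the game counts below are invented: with a fixed number of games, the 95% confidence interval on the measured score, and therefore on the Elo difference, has a floor that no choice of opponents can remove.

import math

def elo(score):
    # convert a score fraction (0..1, exclusive) to an Elo difference
    return -400.0 * math.log10(1.0 / score - 1.0)

def margin(wins, draws, losses):
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # variance of a single game result taking the values 1, 0.5, 0
    var = (wins + 0.25 * draws) / n - score ** 2
    se = math.sqrt(var / n)          # standard error of the mean score
    return score, elo(score - 1.96 * se), elo(score + 1.96 * se)

# hypothetical runs with the same 52% score but different game counts
for n in (800, 40000):
    w, d = int(0.42 * n), int(0.20 * n)
    l = n - w - d
    s, lo, hi = margin(w, d, l)
    print(f"{n:6d} games: score {s:.3f}, 95% Elo interval {lo:+.1f} .. {hi:+.1f}")

At 800 games the interval spans roughly 40 Elo, so a small change is invisible; at 40,000 games it shrinks to a few Elo, which is the whole argument for playing huge numbers of games.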
It is important to note that we never tuned against a weakness an opponent had, which is the more dangerous thing to do. The "Gambit Tiger" is one example. It could beat humans right and left with a speculative style of play. Against programs with weak king safety, it would also win with spectacular attacks. But against programs that were more solid, it would go down in flames when the attack failed, and the material sacrificed to initiate the attack led to an easy win for the opponent.
If I were watching individual cluster games and tuning to win the games we were losing or drawing, I'd be much more concerned that we were training to beat specific opponents, probably to the detriment of play against other opponents. But we are not doing that. We simply use the test results to accept or reject a change we made, a change that had nothing to do with the opponents and how they play.
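As an illustration of that accept-or-reject use of aggregate results, here is a hedged sketch; these are not the actual cluster scripts, and apart from the two opponents named above, the opponent labels and all of the counts are made up. The per-opponent tallies are summed, the overall score is compared with the old version's score against the same opponents, and the change is kept only when the difference is outside the noise.

import math

def score_and_se(wins, draws, losses):
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - s * s
    return s, math.sqrt(var / n), n

# (wins, draws, losses) of the changed version against each opponent -- invented data
results = {
    "Glaurung 2": (210, 95, 195),
    "Toga 2":     (190, 110, 200),
    "Opponent 3": (240, 90, 170),
    "Opponent 4": (260, 85, 155),
}

baseline = 0.515   # previous version's aggregate score, treated as exact for simplicity

totals = [sum(column) for column in zip(*results.values())]
s, se, n = score_and_se(*totals)

print(f"{n} games, aggregate score {s:.3f} vs. baseline {baseline:.3f}")
if s - 1.96 * se > baseline:
    print("+ = good: keep the change")
elif s + 1.96 * se < baseline:
    print("- = bad: throw the change away")
else:
    print("inside the error bars: inconclusive, play more games")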
Not quite. Kaufman claims their improvements have been the result of their testing, which is quite similar to mine, except that it takes them overnight to play 40,000 games on an 8-core box, where I can play somewhat longer games and get the results back in an hour. But our approaches are quite similar other than the speed at which we get results back. And that is all cluster testing offers: speed. Nothing that I couldn't learn with a single box, except that I learn it far faster.

But it is true that big increases have been found without the benefit of cluster testing. Shredder held its edge for a long time. Now Rybka has a large edge, not found with the aid of cluster testing.

bob wrote:
Our changes are based on intuition and chess skill / understanding; we simply use testing to confirm or disprove our intuition. We don't make a single change to try to beat a single opponent. I can tune king safety differently and improve the results against Glaurung 2 / Toga 2. But I know those changes are not generic in nature and have a high risk of backfiring against a different program.
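For a rough sense of scale on that turnaround comparison, the arithmetic below assumes a 12-hour overnight run and a 4x longer time control on the cluster; neither number comes from the post, they are only placeholders.

games           = 40_000
overnight_hours = 12      # assumed length of an "overnight" run
box_cores       = 8

# throughput of the 8-core box at its (faster) time control
games_per_core_hour = games / (overnight_hours * box_cores)   # ~417 games/core/hour

# cores needed to return the same 40,000 games in one hour
cores_needed = games / games_per_core_hour                     # = 96 at the same speed
longer_games = 4                                               # assumed slowdown for longer games

print(f"{games_per_core_hour:.0f} games per core per hour on the 8-core box")
print(f"{cores_needed:.0f} cores for a one-hour turnaround at that speed, "
      f"about {cores_needed * longer_games:.0f} cores at a {longer_games}x longer time control")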
thorough testing. They just can't come anywhere near the turnaround time I can produce; other than that, our approaches are quite similar.

Yet, cluster testing is without a doubt a very powerful tool. But it's natural for people to wonder how other championship projects (Shredder, Rybka) discovered their crushingly harmonious balance of techniques without large computing resources.