Sven Schüle wrote:lucasart wrote:Sven Schüle wrote:lucasart wrote:So you are saying that it's only a gain in self-play and just break-even against foreign opponents? I've heard so many claims like that, of patches that behave differently against foreign opponents than in self-play, but never seen any evidence myself.
It would be remarkable if this claim is correct. And it would certainly call our whole SF testing methodology into question!
It is obvious that the rating difference between two engines (or versions of the same engine) E1 and E2 depends on the set of opponents that you use to calculate it. If you obtain the rating difference from games between E1 and E2 only ("self-play") you get a difference D1. If you let E1 play against a gauntlet, then E2 against the same gauntlet, you get a difference D2 which may or may not be equal to D1. The reason for that is simply the non-transitivity of playing strength in chess.
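The relation between a match score and a rating difference can be sketched with the standard logistic Elo formula. This is only a toy illustration of D1 vs. D2; the scores below are made up, not measured data:

```python
import math

def elo_diff(score):
    """Convert a match score (fraction of points won, 0 < score < 1)
    to an Elo rating difference via the logistic Elo model."""
    return -400 * math.log10(1 / score - 1)

# D1: head-to-head ("self-play") difference between E2 and E1.
d1 = elo_diff(0.57)                    # e.g. E2 scores 57% against E1

# D2: difference between the two engines' scores against the same gauntlet.
d2 = elo_diff(0.55) - elo_diff(0.52)   # E2 vs. gauntlet minus E1 vs. gauntlet
```

Because the opponents differ, nothing forces `d1` and `d2` to be equal; with these invented numbers D1 comes out near 49 Elo while D2 is only about 21.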
For some engines E1/E2 practical tests might show that D2 is often very close or equal to D1, but for others (e.g. Crafty) this may be different. For SF as well as for other engines I'd say that you can't tell without actually trying. Also the personal goals may differ: while someone wants to get testing results that resemble common rating lists as closely as possible, someone else may be satisfied to find out whether a new version of his engine beats the previous version.
I don't argue the fact that D1 != D2. That's obvious, and typically self-play increases the rating difference. That is an observation backed by evidence!
But as for the claim that D1 and D2 can have opposite signs, I have never seen any valid evidence of that in my years of testing. Apart from hand-waving, I have never seen any proof.
PS: I'm not talking about a theoretical possibility here. I'm talking about a real life scenario backed by evidence (accounting for compounded error bars etc.)
Since you seem to agree that the case (D1 * D2 < 0), i.e. opposite signs, is theoretically possible, I would indeed expect evidence from the SF team that for SF (D1 * D2 >= 0) is always true and the opposite does not occur. Of course you may claim that the SF strategy has proven to be very successful, but as long as only D1 is known and never a "D2", you can't say how many accepted patches would have been rejected due to a negative D2, and, vice versa, how many rejected patches would have been accepted. I see no reason to believe that the case of opposite signs would practically never happen, and Crafty is one example where it has actually occurred, as reported by Bob. Maybe other people can provide data to support this.
Practically, for (D1 * D2 < 0) to occur you need a change that performs worse against more than one half of the gauntlet but better against the remaining part of it, or exactly vice versa, and in each case the previous version of your engine belongs to the remaining part.
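The scenario described above can be put into concrete numbers. These scores are entirely made up for illustration, not measured results: the new version E2 does worse against half of a four-engine gauntlet, better against the other half, and happens to beat its predecessor head-to-head.

```python
# Made-up per-opponent scores against a gauntlet of four engines.
e1_gauntlet = [0.50, 0.50, 0.55, 0.55]   # old version E1
e2_gauntlet = [0.42, 0.42, 0.60, 0.60]   # new version E2: worse vs. half, better vs. the rest

head_to_head = 0.56                      # E2's score in direct E2-vs-E1 games

d1_sign = 1 if head_to_head > 0.5 else -1        # self-play: E2 looks stronger
avg_e1 = sum(e1_gauntlet) / len(e1_gauntlet)     # 0.525
avg_e2 = sum(e2_gauntlet) / len(e2_gauntlet)     # 0.510
d2_sign = 1 if avg_e2 > avg_e1 else -1           # gauntlet: E2 looks weaker

assert d1_sign * d2_sign < 0   # opposite signs, i.e. the D1 * D2 < 0 case
```

So the sign flip requires nothing exotic, just enough non-transitivity spread across the gauntlet.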
I have never looked into the technical details of the SF testing framework but I think it should be fairly easy to provide a modified version of it that is based on gauntlets with a fixed set of opponents (possibly adding the previous SF version as another reference engine). Bob can do it with his cluster, and I'm pretty sure the SF group can do that as well!
The first time I saw this was fiddling around with king safety. I had removed the old offensive king safety code (this is a LONG time back, and this was the code that tried to initiate attacks, such as pawn-storms and such) which didn't work very well (it was somewhat passive or insensitive to what I considered to be significant king safety changes in the position/game.)
A rewrite and just basic tuning produced something that I thought was a bit better. To be sure I had not broken anything, I ran a crafty vs crafty' (crafty' had new king safety code) match and tuned a bit. And crafty' was winning by a pretty significant margin. When I dropped it into the gauntlet test, the roof fell in. As I looked at games to see what was going on, the new code was simply way too aggressive when the opponent had an idea of how to actually attack/defend on the kingside. It would initiate attacks that were speculative, and the opponent would defend reasonably and Crafty' would simply end up in positions that were wrecked from a pawn structure perspective (or worse, of course).
Quite a few times when I added something completely new to crafty, it would beat its older cousin, but then do worse against the gauntlet. That's why I have always maintained that self-play is not a bad testing/debugging approach, but by itself it can produce wrong answers. Most of my self-testing is for stress-testing things, as I get 2x the info when I am interested in "does this code break?" (notably parallel search code).
I have seen cases where self-test looks good, but gauntlet looks bad.
I have seen cases where self-test looks good, but gauntlet looks even (one has to think a bit about what to do here; in my case, if the code is cleaner or simpler, I'll keep it, otherwise not).
I have NOT seen a case (as of yet) where self-test looked bad but gauntlet looked good, although logic says that if the reverse is true, this must be possible also. However, since self-play is not my normal testing approach, this might simply reflect a lack of enough testing of both types.
There are other testing issues I have also previously reported. The most important was the time-control issue. To date, I do not recall ANY evaluation changes, other than king safety, that were sensitive to time control. I had a few (back in that same king safety testing I mentioned above) where faster games would suggest a change was good, but slower games would say "worse". This was a product of an eval saying one thing but a deeper search showing that it was wrong tactically. I have seen a LOT of cases where a search change was very sensitive to time control. My normal quick test has been to play 30K games at a time control of 10s + 0.1s increment, which completes in under an hour. My next stop has always been 1m + 1s, which is more like 12 hours or so. Normal eval changes have shown almost perfect correlation between the two time controls, but search changes not so much.
One good example was the SE/threat tests. At fast time controls they don't have much time to kick in, so if they are bad, very fast games don't make them look as bad as they really are. My testing over the past 6 months has been 1m + 1s exclusively, to give the search enough time to reach a depth where the extensions are hit often enough to really influence the game.
For SE particularly (Hsu SE, not ROBBOLITO SE) it seems to pass the "eye test" pretty well. It will spot some WAC tricks quite a bit sooner, and when you look at PVs, you see extensions in the right places (I had modified Crafty so that any non-check extension added a "!" to the end of the move, as I had done years ago for EGTB probes, where "!" was used to say "this is the only move that leads to an optimally short mate"). So looking at the output you think "this is not bad", which is EXACTLY what I had done in Cray Blitz. And it was almost exactly what Hsu had done in 1988: testing on tactical positions to see if it "looked better", and then playing a total of 20 self-test games, which was almost a random result. A good test position was Fine #70, for example, since white (and black on occasion) has exactly one move that preserves the win: Kb1! for the first move, and then just one correct move at each position beyond ply=1, depending on what black does (the corresponding-squares idea, sort of like very distant opposition). But each time, doing better tactically did not translate into doing better in real games.
As far as the idea Marco tested, that I am certain is bad. IE if you fail low on a normal null-move, and you fail high on a real move, extend. The reason I am sure here is that Hsu recommended a value of 150, which Thomas refined a bit (but still over a pawn). I tested a bunch of different values here, and when I dropped below an offset of 100, the scores started to plummet. A bound like alpha-31 or alpha-30 produced a -50 Elo change. Hsu originally used -150, with a pawn value of 128, for reference.

I also tried a bunch of different null-move R values. My current R = 3 + depth/6 was better, but not as good as no SE. I thought about this test for a while and concluded "bigger R means less overhead but less accuracy, and vice-versa." So I tried more aggressive R such as R = depth/2, and a few others that were pretty far out there. The problem is that the test is based on errors and overhead, which are inversely proportional.

LMR is a similar animal, and it took a while to get to where I am currently. Apparently LMR (at least for Crafty) makes SE/TE ineffective and redundant. As LMR gets more aggressive, null-move doesn't help as much as it used to. A few years ago someone asked "what is null-move worth?" and since I had not tested this in a while, I gave it a run. Removing null-move cost me 40 Elo. Removing LMR (which at the time was a static reduction, as pretty much everyone used initially) also cost me about 40 Elo. Removing BOTH cost me about 120 Elo. But what was funny was the overlap: no LMR/null-move was -120; no LMR cost me -40, which means null-move is adding 80; no null-move cost me -40, which means LMR was adding 80; but once I had one, no matter which one, the other only added 50% of what it would add by itself.
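The margin condition and the adaptive R discussed here can be sketched roughly as follows. This is a toy sketch, not Crafty's actual code; `passes_singular_test`, its argument shapes, and the score units (pawn = 128, as in Hsu's numbers) are all illustrative:

```python
def passes_singular_test(best_score, other_scores, alpha, margin=150):
    """Hsu-style singularity condition, toy version: the fail-high move
    counts as singular only if every alternative fails low against the
    lowered bound alpha - margin.  Hsu recommended a margin near 150
    (pawn = 128); as reported above, shrinking the offset below ~100
    (e.g. alpha - 31) cost around 50 Elo in Crafty."""
    if best_score <= alpha:
        return False                      # no fail-high, nothing to extend
    return all(s <= alpha - margin for s in other_scores)

def null_move_r(depth):
    """The adaptive null-move reduction mentioned above: R = 3 + depth/6."""
    return 3 + depth // 6
```

A small margin makes many moves look "singular" (more overhead), while a large margin misses real singular moves (less accuracy), which is the inverse trade-off described in the text.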
Hsu and Thomas eventually reported a +9 gain for SE after reasonable levels of testing, without null-move or LMR. After dumping and examining quite a few trees, I concluded that we already have a sort of SE. For example, the hash move is searched first, and I don't think anybody reduces the first move at any ply, hence it is extended automatically by not being reduced. History counters push moves with fail-high tendencies up in the move list, reducing them less (or pseudo-extending them more). I've become convinced that SE simply doesn't fit very well with LMR, since they are already doing almost exactly the same thing.
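The "implicit SE" effect described here can be sketched with an illustrative reduction schedule. The numbers and the history threshold are made up, not Crafty's; the point is only that the first move (normally the hash move) is never reduced, and moves with good history are reduced less:

```python
def lmr_reduction(move_number, depth, history_score, history_cutoff=800):
    """Illustrative LMR schedule (invented constants).  Relative to its
    reduced siblings, the first move is effectively 'extended' simply by
    being searched at full depth, and a good history score (fail-high
    tendencies) buys a move one ply of reduction back."""
    if move_number == 0 or depth < 3:
        return 0                          # first move: full depth, never reduced
    reduction = 1 if move_number < 6 else 2   # later moves reduced more
    if history_score > history_cutoff:
        reduction = max(0, reduction - 1)     # good history: reduce less
    return reduction
```

Under a scheme like this, the best-ordered move already gets the relative depth bonus that SE tries to confer explicitly, which is one way to read the redundancy claim above.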
Results from others would be interesting. My results have covered the past 6 months+, and all I have to show for it is a much cleaner search function with the original check extension and nothing else... The parallel search is a bit improved however, since the rewriting led me to fixing some split issues there that were a bit of a bottleneck.
I have saved both the SE and TE versions, but I am not sure I will re-visit them again. One note, I did NOT modify anything else when adding either SE or TE. IE I did not try a less aggressive LMR or null-move search. It is certainly possible that changes there might make a difference. One idea I thought about but did not test was to try to flag a position as "dangerous" and just dial back LMR for all or some of the moves, rather than trying to extend one. Right now I believe there is more to gain by working on ways to better order the moves. In years gone by, all we needed was a "good enough move" ordered first to optimize tree size. Now we need more since we treat moves differently depending on where they occur in the move list. Better ordering now has the chance of actually making a program play better, as opposed to just offering a small speed gain by producing smaller trees...