Robert Flesher wrote:
Don wrote:
I created a utility called similar which measures how different one chess program is from others. It does this by running 2000 positions from random games, noting how often the moves agree, and reporting as output the percentage of moves that match.
You can get it here:
http://komodochess.com/pub/similar.zip
Here is some sample output, comparing Robbolito with a few other programs:
------ RobboLito version 0.084 (time: 100 ms) ------
69.25 Houdini 1.5 w32 (time: 100 ms)
66.90 Rybka 3 (time: 100 ms)
61.70 Stockfish 1.9.1 JA 64bit (time: 100 ms)
61.35 Stockfish 1.8 JA (time: 100 ms)
59.80 Komodo64 1.2 JA (time: 100 ms)
59.15 Komodo 1.0 (time: 100 ms)
58.95 Stockfish 1.7.1 64bit (time: 100 ms)
58.95 Stockfish 1.6 64bit (time: 100 ms)
57.00 Fruit 2.3.1 (time: 100 ms)
56.20 Fruit 2.1 (time: 100 ms)
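For readers who want the gist of the measurement without unpacking the starkit, here is a minimal sketch of the idea in Python. It assumes the third-party python-chess library and two UCI engine binaries; the actual tool is a Tcl script with its own fixed position set, so treat this as an illustration only, not the real implementation.
[code]
# Sketch (not the actual Tcl tool): query two UCI engines on the same
# positions and report the percentage of identical best-move choices.
# Requires the python-chess library and paths to two engine binaries.
import chess
import chess.engine

def similarity(fens, engine_path_a, engine_path_b, movetime=0.1):
    """Percentage of positions on which both engines pick the same move."""
    ea = chess.engine.SimpleEngine.popen_uci(engine_path_a)
    eb = chess.engine.SimpleEngine.popen_uci(engine_path_b)
    matches = 0
    try:
        for fen in fens:
            board = chess.Board(fen)
            move_a = ea.play(board, chess.engine.Limit(time=movetime)).move
            move_b = eb.play(board, chess.engine.Limit(time=movetime)).move
            if move_a == move_b:
                matches += 1
    finally:
        ea.quit()
        eb.quit()
    return 100.0 * matches / len(fens)
[/code]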
I have not tested this on Windows, so I'm hoping to get some feedback specific to Windows.
The similar.exe is designed to run on 64-bit Windows and is actually a Tcl script wrapped up with a Tcl runtime using tclkit technology. I am also including the "starkit", which is platform independent but requires a tclkit runtime for your platform. It is similar to a jar file and can be taken apart, inspected, and modified if you wish, assuming you know how to work with starkits and such. Google for starkit and sdx.kit for more information.
Please let me know if you find this interesting or useful. Email me at
drd@mit.edu
Don
I found this interesting. I quote BB: "
I looked at what he wrote and it all sounds very reasonable. What is your point? I agree with everything he said. He said nothing here that was not immediately obvious to me.
I will add a couple of clarifications to what he is saying below (I realize that I am not directly responding to BB.)
From TalkChess (Don Dailey):
I created a utility called similar which measures how different one chess program is from others. It does this by running 2000 positions from random games, noting how often the moves agree, and reporting as output the percentage of moves that match.
Since this has been in the works for some time, I've had ample time to prepare any criticism. I will try to leave the semantics aside (though calling it a "clone tester" cries out for nuance) and stick with scientific observations. I must say that I would find such a tool to be valuable if it is done in a scientifically proper manner and its results are parsed according to their proper scope.
The scope I have been urging is that it not be taken too seriously, as I am well aware of its limitations.
Firstly, I would say that the utility measures how much the choice of best move output from one chess program differs from another. It is a different question to say how "similar" this makes them, which seems to be a word with many possible meanings. Indeed, it seems almost tautological to say that "clone testing" (or derivative, if you prefer) is better performed by an actual examination of the executables, though perhaps this is thought too time-consuming (or for the future "rental engines", maybe it is impossible). However, the utility does serve a useful purpose if its output has nonzero correlation with clones and/or derivatives.
I would like to point out that I chose that initial subject line to draw attention to the post. I regret the trouble it caused as people took it WAY too seriously and it has already been cast as a tool of the devil to persecute good people with.
The first problem I have with much of the discussion is that no sense of statistical error is ever mentioned. For instance, running a 1000-position suite should give a 95% confidence interval of only about plus/minus 30 positions. This is fairly easily remedied simply by appending the additional maths.
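To make the error bar concrete, here is a quick back-of-the-envelope calculation (a sketch, assuming a simple binomial model and a match rate of around 60%) that reproduces the plus/minus 30 positions figure for a 1000-position suite.
[code]
# Rough binomial error bar for a match-percentage measurement.
import math

def match_ci(n_positions, match_rate, z=1.96):
    """95% confidence half-width, in positions and in percentage points."""
    sd = math.sqrt(n_positions * match_rate * (1.0 - match_rate))
    return z * sd, 100.0 * z * sd / n_positions

print(match_ci(1000, 0.60))  # roughly (30.4, 3.0): +/- ~30 positions, ~3 points
[/code]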
That is why I have been completely transparent about how many positions are in the test, exactly how it works, what it does, and why it should not be taken too seriously. I really did not have the time and energy to turn this into a highly polished project, but the source code is there and anyone is free to improve upon it.
In particular, "false positives" should appear rather frequently in a large enough pool, and robust methods to minimise their impact should be used (the numbers seem largely to be in the 550-650 range for random data, and 650-700 for semi-correlated). I can't say I am particularly enamoured by the use of techniques seen in biometry to draw putative hierarchal relationships either.
Another problem is strength conflation, that is, two engines will play similar moves simply because there actually is a "best" move, and suitably strong engines will all agree. This effect is rather hard to measure, and always seems to be in the background.
This is obviously a potential issue. There was some effort to remove obvious moves, but if you tamper with the data too much it becomes more suspect. For example, if I use 2 programs to measure what an obvious move is, I am "fixing" the test in a way - the test will be biased against considering those 2 programs similar.
What I actually did was use several programs, and if they all agreed on a given move even from low to high depths I considered it too easy. I did not do this to prepare the data for this application; it was done for totally different reasons and I simply used those positions for the similarity tester.
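A rough sketch of that kind of "too easy" filter, again assuming python-chess and a pool of already-opened reference engines; the depth pair used here is made up for illustration and is not the setting Don actually used.
[code]
# Drop positions where every reference engine picks the same move at both a
# shallow and a deeper search depth. reference_engines is a list of opened
# chess.engine.SimpleEngine instances.
import chess
import chess.engine

def is_too_easy(fen, reference_engines, depths=(6, 14)):
    """True if every reference engine chooses the same move at every depth."""
    board = chess.Board(fen)
    choices = set()
    for engine in reference_engines:
        for depth in depths:
            info = engine.analyse(board, chess.engine.Limit(depth=depth))
            choices.add(info["pv"][0])
    return len(choices) == 1

# positions = [fen for fen in candidate_fens if not is_too_easy(fen, refs)]
[/code]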
In contrast, Toby Tal, for instance, was found to be a clone (or at least its move generator was) by giving it a battery of ten mate-in-1 positions with multiple solutions and seeing an exact match with RobboLito (or something in that family). Here is one possible way to take a first whack at the effect of strength. First test (say) 15 engines at 0.1s per move, getting 105 pairwise measurements. Then do the same at 1.0s per move. As engines should play stronger at 1s per move, presumably the typical overlap (among the 105 comparisons) should be greater. By how much? A little or a lot?
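A sketch of how that two-time-control check might be scored, assuming the per-engine move lists have already been collected (for example with the earlier similarity sketch); the helper names here are illustrative, not part of the actual tool.
[code]
# Compare the typical pairwise overlap of a pool of engines at a fast time
# control against the same pool at a slower one. Inputs are plain dicts
# {engine_name: [move per position]} collected beforehand.
from itertools import combinations
from statistics import mean

def agreement(moves_a, moves_b):
    """Percentage of positions on which the two move lists coincide."""
    return 100.0 * sum(a == b for a, b in zip(moves_a, moves_b)) / len(moves_a)

def mean_pairwise_overlap(moves_by_engine):
    """Average agreement over all engine pairs (15 engines -> 105 pairs)."""
    return mean(agreement(moves_by_engine[x], moves_by_engine[y])
                for x, y in combinations(moves_by_engine, 2))

# If moves_fast and moves_slow hold the collected moves at 0.1 s and 1.0 s:
# print(mean_pairwise_overlap(moves_fast), mean_pairwise_overlap(moves_slow))
[/code]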
The test could be greatly improved by trying to cull out positions that have little impact on this measurement, but I fear that I would inadvertently be putting some kind of bias into the test.
A third critique involves self-validation, or perhaps more generally what could be called playing style. For instance, comparing Engine X at 0.1s to itself at 1.0s is said to be a way of showing that the utility detects not strength but style, as the correlation factor is still typically quite high. Whether or not this holds for a variety of engines (those deemed "tactical" versus "positional", or perhaps engines using MTD(f), which may simply change their mind more or less often than PVS engines) remains to be seen. I guess I am not so prone to agree with the statement: "I believed [...] it is far more difficult to make it play significantly different moves without making it weaker."
When I preface something with "I believe" I do it so that it is not taken as a statement of fact. It's merely an opinion. However, in this case I believe pretty strongly in this principle because both Larry and I have TRIED to change the playing style of Komodo and this makes it play weaker. It's intuitive to me that if I were to take the Robbo sources I would have a very difficult time making it play a lot differently without weakening it. It would be easy, of course, to make it play differently if I were willing to sacrifice ELO.
I think this concept makes a lot more sense to someone who actually has a lot of experience writing strong chess programs - it's probably not obvious to non-programmers or authors of weak programs, but it's one of those things that is much easier said than done.
This tool has shown me that playing style is almost completely about the evaluation function. Try turning off LMR or drastically changing the search, then run this test, and you will find that it does not have much impact on the results.
The ELO, as you already pointed out, DOES have a small impact, as there are surely some moves in the test that strong programs will prefer over weak programs - but the effect is amazingly small.
I provided a way to eliminate most of that bias and people are already using it. You can give weaker programs more time to equalize their rating. You can easily get within 50-100 ELO with some careful head-to-head tests to find out how much handicap is needed to equalize the ratings.
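A hedged sketch of that equalisation idea: convert the head-to-head score into an implied Elo gap with the standard logistic formula, and keep increasing the weaker engine's time until the gap is inside the 50-100 ELO band Don mentions. The play_match harness here is purely hypothetical - in practice you would run the games with something like cutechess-cli.
[code]
# Estimate the Elo gap from a head-to-head score and search for a time
# handicap that roughly equalizes the two engines.
import math

def elo_diff(score):
    """Elo difference implied by a match score in (0, 1), seen from the side scoring `score`."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def equalize_time(play_match, base_time=0.1, max_factor=8.0):
    """Double the weaker engine's time until the head-to-head gap is under ~50 Elo.

    play_match is a hypothetical harness returning the stronger engine's
    score fraction for a match at the given per-move times.
    """
    factor = 1.0
    while factor <= max_factor:
        score = play_match(strong_time=base_time, weak_time=base_time * factor)
        if abs(elo_diff(score)) < 50:   # within the 50-100 ELO band
            return factor
        factor *= 2.0
    return factor
[/code]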
Finally, as noted above, the question of "move selection" versus "similar ideas" (in the sense of intellectual property) is not really resolved, as one can use many of the "same ideas" with different numerology, and get notably different play. It all depends on how much weighting you give in your sense of "clone" to the concept of the "feature set" of an evaluation function as opposed to merely the specific numerical values therein.
Yes, this is possible and there may be something to it. I think it's your weakest criticism however.
I would challenge anyone to take the robbolito code and make it play substantially differently SOLELY by changing the evaluation weights and nothing else while also not making it play any weaker. I think this little exercise is likely to humble those who don't really understand the difficulty of engineering the evaluation function of a chess program.
The prospective difficulties of drawing conclusions from these methods are seen in:
It looks to me that after Rybka 1.0 the program changed very substantially. From this I would assume he completely rewrote the program, and certainly the evaluation function.
Au contraire, a disassembly of the Rybka 2.3.2a evaluation function will show much of it to be still quite Fruit-like in its framework, with only two or three minor variations in the features from Rybka 1.0 Beta. The PST is slightly more tweaked, but my impression is that almost all the substantive changes from Rybka 1.0 Beta until LK's work with Rybka 3 were in the search (and some tuning of eval weightings, PST, and material imbalances). [Perhaps the fact that Rybka 1.0 Beta used lazy eval way too often due to a mismatch with 3399 vs 100 scalings might also play a rôle here]. "
BB once again provides objective clarity.
It's not all that objective. In the end he gives his SUBJECTIVE opinion of the source code and presents as fact what are really just opinions. That is not "objective." For instance he says, "a disassembly of the Rybka 2.3.2a evaluation function will show much of it to be still quite Fruit-like in its framework ..." Is that actually a fact or is it his subjective impression? Are you taking his word for it or are you checking for yourself?
I also used highly subjective language in what he was quoting from me, but you will notice that I did not present it as factual like he did. I said things like "I believe" and "it looks to me like ..."; in other words, I am admitting to a subjective viewpoint.
One must be very careful about blindly accepting the analysis of others that you cannot, or are not willing to, check for yourself.
That is why I am presenting you with a concrete tool that YOU can check out for yourself to draw your own conclusions. I am trying to put the power in YOUR hands instead of just giving you something to read about.