Similarity Detector Available

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: clone tester available

Post by IWB »

No Graham, I won't!

The "clone-tool" shows a percentage of identical moves, nothing more, nothing less. The name "clone-tool" suggests that it can find clones ... it can't, but it will be used for witch hunting and I will not take part in this.

But regardless of any engine and the resulting percentage: some will say it's a clone at 70%, some at 65%, and some already at 60%. And why would it be a clone at 70% but not at 65%? The whole thing is useless for detecting clones.

What about the possibility that engines in a certain playing-strength range produce, by definition, a certain share of identical moves? If there is such a thing as a perfect game, two engines have to produce 100% identical moves (or very close to that). If so, it seems logical that with growing playing strength the moves become more identical. (And still the engines might differ a bit.) All this is unproven ... as the "clone-tool" does not identify clones for sure.
All I am saying is that you cannot be sure why there is a certain percentage of similar moves without looking at the sources. There are several possibilities and we are not able to distinguish between them.

So, in the end, the only use for this tool is to "prove" something to those who want to believe in something.

BTW: Its use is far from trivial. You have to be sure that all engines use the same amount of CPU time ... e.g., do not mix MP and single-CPU engines. I realized that I no longer own the single-CPU versions of some commercial engines (deleted), and some engines default to the maximum available cores while others are only available as single-CPU engines ...


Bye
Ingo
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: clone tester available

Post by Don »

stevenaaus wrote:
It does this by running 2000 positions from random games and noting how often the moves agree
Don, how do you select the 2000 positions ?

Just a statistical observation (as I haven't tried it out), but less complicated positions will generally have a "best move", which any strong engine will find. So would it be better to use 2000 hand-picked, complicated positions rather than random ones? Or even select the exact middle move from 2000 different games.
Larry and I selected several thousand positions 3 or 4 years ago from correspondence games. It was not for this tool and the positions were selected to be used as a tool to make the program play more like humans do.

I removed a subset of the positions (years ago) that many different programs clearly agreed was best - in other words, the "easy" ones.

The positions were randomized and I took a subset of them for this test.
Steve B
Posts: 3697
Joined: Tue Jul 31, 2007 4:26 pm

Re: clone tester available

Post by Steve B »

Moderation has removed several personal-attack posts and other abusive posts in this thread.
Certain members were contacted via PM.
Steve
AdminX
Posts: 6363
Joined: Mon Mar 13, 2006 2:34 pm
Location: Acworth, GA

Re: clone tester available

Post by AdminX »

Albert Silver wrote:Have you tried this method comparing the default Rybka 3 to the Dynamic and Human flavors?
"Good decisions come from experience, and experience comes from bad decisions."
__________________________________________________________________
Ted Summers
Robert Flesher
Posts: 1287
Joined: Tue Aug 18, 2009 3:06 am

Re: clone tester available

Post by Robert Flesher »

Don wrote:I created a utility called similar which measures how different one chess program is from others. It does this by running 2000 positions from random games, noting how often the moves agree, and as output returns the percentage of moves that match.

You can get it here: http://komodochess.com/pub/similar.zip

Here is some sample output, comparing Robbolito with a few other programs:

------ RobboLito version 0.084 (time: 100 ms) ------
69.25 Houdini 1.5 w32 (time: 100 ms)
66.90 Rybka 3 (time: 100 ms)
61.70 Stockfish 1.9.1 JA 64bit (time: 100 ms)
61.35 Stockfish 1.8 JA (time: 100 ms)
59.80 Komodo64 1.2 JA (time: 100 ms)
59.15 Komodo 1.0 (time: 100 ms)
58.95 Stockfish 1.7.1 64bit (time: 100 ms)
58.95 Stockfish 1.6 64bit (time: 100 ms)
57.00 Fruit 2.3.1 (time: 100 ms)
56.20 Fruit 2.1 (time: 100 ms)


I have not tested this on Windows so I'm hoping to get some feedback specific to Windows.

The similar.exe is designed to run on 64-bit Windows and is actually a Tcl script wrapped up with a Tcl runtime using tclkit technology. I am also including the "starkit", which is platform independent but requires a tclkit runtime for your platform. It is similar to a jar file and can be taken apart, inspected, and modified if you wish - assuming you know how to work with starkits and such. Google "starkit" and "sdx.kit" for more information.

Please let me know if you find this interesting or useful. Email me at drd@mit.edu

Don


I found this interesting. I quote BB:

From TalkChess (Don Dailey):
I created a utility called similar which measures how different one chess program is from others. It does this by running 2000 positions from random games, noting how often the moves agree, and as output returns the percentage of moves that match.

Since this has been in the works for some time, I've had ample time to prepare any criticism. I will try to leave the semantics aside (though calling it a "clone tester" cries out for nuance) and stick to scientific observations. I must say that I would find such a tool valuable if it is done in a scientifically proper manner, and its results parsed according to their proper scope.

Firstly, I would say that the utility measures how much the choice of best move output from one chess program differs from another. It is a different question to say how "similar" this makes them, which seems to be a word with many possible meanings. Indeed, it seems almost tautological to say that "clone testing" (or derivative, if you prefer) is better performed by an actual examination of the executables, though perhaps this is thought too time-consuming (or for the future "rental engines", maybe it is impossible). However, the utility does serve a useful purpose if its output has nonzero correlation with clones and/or derivatives.

The first problem I have with much of the discussion is that no sense of statistical error is ever mentioned. For instance, running a 1000-position suite should give a 95% confidence interval of only plus/minus 30 positions. This is fairly easily remedied simply by appending the additional maths. In particular, "false positives" should appear rather frequently in a large enough pool, and robust methods to minimise their impact should be used (the numbers seem largely to be in the 550-650 range for random data, and 650-700 for semi-correlated). I can't say I am particularly enamoured of the use of techniques seen in biometry to draw putative hierarchical relationships either.

Another problem is strength conflation, that is, two engines will play similar moves simply because there actually is a "best" move, and suitably strong engines will all agree. This effect is rather hard to measure, and always seems to be in the background. In contrast, for instance with Toby Tal, it was found to be a clone (or at least the move generator) by giving it a battery of ten mate-in-1 positions with multiple solutions, and seeing an exact match with RobboLito (or something in that family). Here is one possible way to take a first whack at the effect of strength. First test (say) 15 engines at 0.1s per move, getting 105 pairwise measurements. Then do the same at 1.0s per move. As engines should play stronger at 1s per move, presumably the typical overlap (among the 105 comparisons) should be greater. By how much is it? A little or a lot?

A third critique involves self-validation, or perhaps more generally what could be called playing style. For instance, comparing Engine X at 0.1s to itself at 1.0s is said to be a way of showing that the utility detects not strength but style, as the correlation factor is still typically quite high. Whether or not this holds for a variety of engines (those deemed "tactical" versus "positional", or perhaps those using MTD(f) simply change their mind more/less than PVS) remains to be seen. I guess I am not so prone to agree with the statement: "I believed [...] it is far more difficult to make it play significantly different moves without making it weaker."

Finally, as noted above, the question of "move selection" versus "similar ideas" (in the sense of intellectual property) is not really resolved, as one can use many of the "same ideas" with different numerology, and get notably different play. It all depends on how much weighting you give in your sense of "clone" to the concept of the "feature set" of an evaluation function as opposed to merely the specific numerical values therein.

The prospective difficulties of drawing conclusions from these methods are seen in:

It looks to me that after Rybka 1.0 the program changed very substantially. From this I would assume he completely rewrote the program, and certainly the evaluation function.

Au contraire, a disassembly of the Rybka 2.3.2a evaluation function will show much of it to be still quite Fruit-like in its framework, with only two or three minor variations in the features from Rybka 1.0 Beta. The PST is slightly more tweaked, but my impression is that almost all the substantive changes from Rybka 1.0 Beta until LK's work with Rybka 3 were in the search (and some tuning of eval weightings, PST, and material imbalances). [Perhaps the fact that Rybka 1.0 Beta used lazy eval far too often, due to a mismatch between the 3399 and 100 scalings, might also play a rôle here.] "


BB once again provides objective clarity.
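BB's plus/minus-30 figure is easy to check with the normal approximation to the binomial. Here is a minimal sketch, assuming independent positions and an agreement rate near the thread's typical 60% (both simplifications that BB himself flags):

```python
import math

def ci_halfwidth(p, n, z=1.96):
    """Half-width of the normal-approximation 95% confidence interval
    for a binomial proportion p observed over n independent positions."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.60  # agreement rates in this thread cluster around 60%

# BB's example: a 1000-position suite.
print(round(ci_halfwidth(p, 1000) * 1000))   # 30 positions either way

# The tool's actual 2000-position suite, as a percentage:
print(round(ci_halfwidth(p, 2000) * 100, 1)) # 2.1 percentage points
```

So two engine pairs whose scores differ by only a point or two are statistically indistinguishable, which is exactly BB's point about false positives in a large enough pool.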
gerold
Posts: 10121
Joined: Thu Mar 09, 2006 12:57 am
Location: van buren,missouri

Re: clone tester available

Post by gerold »

IWB wrote:No Graham, I won't!

The "clone-tool" shows a percentage of identical moves, nothing more, nothing less. The name "clone-tool" suggests that it can find clones ... it can't, but it will be used for witch hunting and I will not take part in this.

But regardless of any engine and the resulting percentage: some will say it's a clone at 70%, some at 65%, and some already at 60%. And why would it be a clone at 70% but not at 65%? The whole thing is useless for detecting clones.

What about the possibility that engines in a certain playing-strength range produce, by definition, a certain share of identical moves? If there is such a thing as a perfect game, two engines have to produce 100% identical moves (or very close to that). If so, it seems logical that with growing playing strength the moves become more identical. (And still the engines might differ a bit.) All this is unproven ... as the "clone-tool" does not identify clones for sure.
All I am saying is that you cannot be sure why there is a certain percentage of similar moves without looking at the sources. There are several possibilities and we are not able to distinguish between them.

So, in the end, the only use for this tool is to "prove" something to those who want to believe in something.

BTW: Its use is far from trivial. You have to be sure that all engines use the same amount of CPU time ... e.g., do not mix MP and single-CPU engines. I realized that I no longer own the single-CPU versions of some commercial engines (deleted), and some engines default to the maximum available cores while others are only available as single-CPU engines ...


Bye
Ingo
Thanks for your posts, Ingo. It looks like you're right that this is another witch hunt.
Let the kids have their fun. Nice joke.

Best,
Gerold.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: clone tester available

Post by Don »

George Tsavdaris wrote:
Don wrote:I created a utility called similar which measures how different one chess program is from others. It does this by running 2000 positions from random games, noting how often the moves agree, and as output returns the percentage of moves that match.
I guess the positions are kept secret to stop future clone engines from adapting and manipulating their output.
But it would be nice if there were a setting for supplying your own positions.
You can supply your own positions. In the distribution there is a file called similar.kit, which is a lot like a Java jar file, but for Tcl. With the right tools you can unpack it and change the code or the data file that contains the positions.

It would take too long to explain this, but the tools you need can be found here:

http://equi4.com/tclkit/

Does the tool also compare the evaluations of the engines, or only the selected move?
Just the moves. Evaluations can be faked.

"and noting how often the moves agree"
For different plies, or one count per position? I mean, does the tool finally divide by 2000 to get the percentage?
Yes. The test is run at 1/10 second per move, but you can change this for any given program.

I have discovered that altering the time does not change the move that often. It was argued that stronger programs play more alike, but this test shows that the effect is minor and it's mostly about playing style. Try it yourself: if you run Komodo 4x longer, the test is not fooled into thinking Komodo is a different program; it will still correlate very highly.


I have not tested this on Windows so I'm hoping to get some feedback specific to Windows.

The similar.exe is designed to run on 64-bit Windows
Does "designed to run there" mean it runs better there, or that it will run only there and not on 32-bit Windows?
The kit will run on any platform if you have the appropriate tclkit runtime. The exe will only run on 64-bit Windows, but I can make one that runs on 32-bit Windows or any other platform. The exe is just a packaging gimmick where the Tcl runtime, the source code, and the data are all packaged up together for easy distribution.


I think this tool is more evil than one can imagine ( :D ). If two programs have a high percentage of similar output, that is relatively strong evidence that one is a clone of the other; but if they have a low percentage of similar output, we can't say that one is not a clone of the other (because the clone author may have changed many things so the programs are no longer comparable). So we can't conclude anything specific, and we can't vindicate a suspicious engine.

So we can incriminate an engine but we can't vindicate it.
Actually, I see it as just the opposite. I would view the tool as a fairly accurate means to vindicate a program: if it scores low on the similarity test, it's probably not a clone. If it scores high, I would use a great deal of caution about proclaiming it to be a clone unless there is other strong evidence to go along with it.


What theory is this tool based on?
I mean, what kind of positions are these? I hope they are not best-move positions, right? Because then good programs would make similar selections.
They are random positions, but I culled out positions that several programs agreed upon. Actually, I did not prepare the positions for this particular test; I just happened to have done this 3 or 4 years ago for a completely different purpose, so I reused those positions.


I would like the positions to have many different possible good moves to play. Let's say the best 5 moves are all playable; semi-opening positions are good for this.
If there are 5 moves per position to choose from, then there are 5^2000 different combinations the engines can choose among, so the randomness is very good and the statistics at the end are strong and can be a good indicator.
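George's five-playable-moves scenario also gives a useful chance-agreement baseline. A stdlib-only sketch (this is not the actual tool, which is a Tcl starkit; the uniform choice among 5 moves is his hypothetical):

```python
import random

def agreement(moves_a, moves_b):
    """Percentage of positions on which two engines chose the same move."""
    matched = sum(1 for a, b in zip(moves_a, moves_b) if a == b)
    return 100.0 * matched / len(moves_a)

random.seed(1)
n = 2000
# Two unrelated "engines" that each pick uniformly among 5 playable moves:
a = [random.randrange(5) for _ in range(n)]
b = [random.randrange(5) for _ in range(n)]
print(agreement(a, b))  # close to the 20% chance baseline
```

The 55-70% figures quoted in this thread sit far above that 20% baseline, which is exactly why the strength-conflation question matters: how much of the excess comes from shared code, and how much from strong engines agreeing on genuinely best moves?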
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: clone tester available

Post by Don »

IWB wrote:Hello Don,

I do not feel very comfortable with such a tool!
I have a screwdriver in my toolbox at home. However I know that screwdrivers have been used as weapons to stab people with. This is an improper use of a screwdriver and it's not how I use my screwdriver.

It would be arrogant of me to hold back this tool simply because I don't approve of how you might use it. I don't burn books either, even if I don't agree with them.

It's my hope this tool is not used in the way that you predict, but I'm not inclined to be the moral cop and say you cannot have it because I have judged you incompetent to use it.

I played around a bit and found similarities between engines where I expected them, and similarities between engines where there is no way for them to be similar.
I found similarities of 68 to 70% (like you) between engines which must have something in common, while I found 65% identical moves between engines where it is impossible that they have anything in common. Sometimes 1, 2, or 3% will decide whether someone is called a "cloner" or not. This tool will be used as soon as it fits someone's goals, not to bring out the truth. For years we will see comparisons which are "the truth" because the "clone-tool" says so.
I consider this an instrument of inquisition.

It is out, and as with the Litos we have to live with it.

Bye
Ingo
Sean Evans
Posts: 1777
Joined: Thu Jun 05, 2008 10:58 pm
Location: Canada

Re: clone tester available

Post by Sean Evans »

Don wrote: I have a screwdriver in my toolbox at home. However I know that screwdrivers have been used as weapons to stab people with. This is an improper use of a screwdriver and it's not how I use my screwdriver.
Don, a better way to think of tool comparison:

Q: Is a hammer a tool or a weapon?

A: It can be a tool or a weapon, it depends on the mentality of the person using it!

Cordially,

Sean
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: clone tester available

Post by Don »

SzG wrote:
Don wrote: It's my hope this tool is not used in the way that you predict
My narrow imagination can't see what other reason it might or would be used for.
It could be used as strong evidence to exonerate a program. You don't really have much imagination, do you?


The tool is nothing more than what it is. It runs 2000 random positions and I have been completely transparent in explaining exactly what it does.

There is a significant amount of statistical error in this with only 2000 positions, but more than 2000 starts to take too long to test.
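For scale, a back-of-envelope sketch of that time/precision tradeoff, assuming the 0.1 s per move shown in the sample output (the idea of recording each engine's moves once and reusing them across comparisons is my assumption about how a pool would be run, not a description of the tool):

```python
from math import comb

positions = 2000
secs_per_move = 0.1

# One engine's pass over the whole suite:
per_engine = positions * secs_per_move
print(per_engine / 60)   # about 3.3 minutes per engine

# A 15-engine pool needs only one pass per engine if each engine's
# chosen moves are recorded, yet it yields
print(comb(15, 2))       # 105 pairwise comparisons
```

At BB's suggested second setting of 1.0 s per move the same suite already costs over half an hour per engine, and quadrupling the suite to 8000 positions (which only halves the statistical error) would push that past two hours per engine.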