Robert Flesher wrote:
Don wrote:
I created a utility called similar which measures how different one chess program is from others. It does this by running 2000 positions from random games, noting how often the moves agree, and reporting as output the percentage of moves that match.
You can get it here:
http://komodochess.com/pub/similar.zip
Here is some sample output, comparing Robbolito with a few other programs:
------ RobboLito version 0.084 (time: 100 ms) ------
69.25 Houdini 1.5 w32 (time: 100 ms)
66.90 Rybka 3 (time: 100 ms)
61.70 Stockfish 1.9.1 JA 64bit (time: 100 ms)
61.35 Stockfish 1.8 JA (time: 100 ms)
59.80 Komodo64 1.2 JA (time: 100 ms)
59.15 Komodo 1.0 (time: 100 ms)
58.95 Stockfish 1.7.1 64bit (time: 100 ms)
58.95 Stockfish 1.6 64bit (time: 100 ms)
57.00 Fruit 2.3.1 (time: 100 ms)
56.20 Fruit 2.1 (time: 100 ms)
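For readers who want the gist of the measurement without unpacking the starkit, here is a minimal sketch of the idea in Python. It assumes the third-party python-chess library and two UCI engine binaries; the actual tool is a Tcl script with its own fixed position set, so treat this as an illustration only, not the real implementation.
[code]
# Sketch (not the actual Tcl tool): query two UCI engines on the same
# positions and report the percentage of identical best-move choices.
# Requires the python-chess library and paths to two engine binaries.
import chess
import chess.engine

def similarity(fens, engine_path_a, engine_path_b, movetime=0.1):
    """Percentage of positions on which both engines pick the same move."""
    ea = chess.engine.SimpleEngine.popen_uci(engine_path_a)
    eb = chess.engine.SimpleEngine.popen_uci(engine_path_b)
    matches = 0
    try:
        for fen in fens:
            board = chess.Board(fen)
            move_a = ea.play(board, chess.engine.Limit(time=movetime)).move
            move_b = eb.play(board, chess.engine.Limit(time=movetime)).move
            if move_a == move_b:
                matches += 1
    finally:
        ea.quit()
        eb.quit()
    return 100.0 * matches / len(fens)
[/code]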
I have not tested this on Windows, so I'm hoping to get some feedback specific to Windows.
The similar.exe is designed to run on 64-bit Windows and is actually a Tcl script wrapped up with a Tcl runtime using tclkit technology. I am also including the "starkit", which is platform independent but requires a tclkit runtime for your platform. It is similar to a jar file and can be taken apart, inspected, and modified if you wish, assuming you know how to work with starkits and such. Google for starkit and sdx.kit for more information.
Please let me know if you find this interesting or useful. Email me at
drd@mit.edu
Don
I found this interesting. I quote BB: "
I looked at what he wrote and it all sounds very reasonable. What is your point? I agree with everything he said. He said nothing here that was not immediately obvious to me.
I will add a couple of clarifications to what he is saying below (I realize that I am not directly responding to BB.)
From TalkChess (Don Dailey):
I created a utility called similar which measures how different one chess program is from others. It does this by running 2000 positions from random games, noting how often the moves agree, and reporting as output the percentage of moves that match.
Since this has been in the works for some time, I've had ample time to prepare any criticism. I will try to leave the semantics aside (though calling it a "clone tester" cries out for nuance) and stick with scientific observations. I must say that I would find such a tool to be valuable if it is done in a scientifically proper manner and its results are parsed according to their proper scope.
The scope I have been urging is that it not be taken too seriously, as I am well aware of its limitations.
Firstly, I would say that the utility measures how much the choice of best move output from one chess program differs from another. It is a different question to say how "similar" this makes them, which seems to be a word with many possible meanings. Indeed, it seems almost tautological to say that "clone testing" (or derivative, if you prefer) is better performed by an actual examination of the executables, though perhaps this is thought too time-consuming (or for the future "rental engines", maybe it is impossible). However, the utility does serve a useful purpose if its output has nonzero correlation with clones and/or derivatives.
I would like to point out that I chose that initial subject line to draw attention to the post. I regret the trouble it caused as people took it WAY too seriously and it has already been cast as a tool of the devil to persecute good people with.
The first problem I have with much of the discussion is that no sense of statistical error is ever mentioned. For instance, running a 1000-position suite should give a 95% confidence interval of only about plus/minus 30 positions. This is fairly easily remedied simply by appending the additional maths.
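To make the error bar concrete, here is a quick back-of-the-envelope calculation (a sketch, assuming a simple binomial model and a match rate of around 60%) that reproduces the plus/minus 30 positions figure for a 1000-position suite.
[code]
# Rough binomial error bar for a match-percentage measurement.
import math

def match_ci(n_positions, match_rate, z=1.96):
    """95% confidence half-width, in positions and in percentage points."""
    sd = math.sqrt(n_positions * match_rate * (1.0 - match_rate))
    return z * sd, 100.0 * z * sd / n_positions

print(match_ci(1000, 0.60))  # roughly (30.4, 3.0): +/- ~30 positions, ~3 points
[/code]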
That is why I have been completely transparent about how many positions are in the test, exactly how it works, what it does, and why it should not be taken too seriously. I really did not have the time and energy to turn this into a highly polished project, but the source code is there and anyone is free to improve upon it.
In particular, "false positives" should appear rather frequently in a large enough pool, and robust methods to minimise their impact should be used (the numbers seem largely to be in the 550-650 range for random data, and 650-700 for semi-correlated). I can't say I am particularly enamoured by the use of techniques seen in biometry to draw putative hierarchal relationships either.
Another problem is strength conflation, that is, two engines will play similar moves simply because there actually is a "best" move, and suitably strong engines will all agree. This effect is rather hard to measure, and always seems to be in the background.
This is obviously a potential issue. There was some effort to remove obvious moves, but if you tamper with the data too much it becomes more suspect. For example, if I use 2 programs to measure what an obvious move is, I am "fixing" the test in a way - the test will be biased against considering those 2 programs similar.
What I actually did was use several programs, and if they all agreed on a given move even from low to high depths I considered it too easy. I did not do this to prepare the data for this application; it was done for totally different reasons and I simply used those positions for the similarity tester.
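A rough sketch of that kind of "too easy" filter, again assuming python-chess and a pool of already-opened reference engines; the depth pair used here is made up for illustration and is not the setting Don actually used.
[code]
# Drop positions where every reference engine picks the same move at both a
# shallow and a deeper search depth. reference_engines is a list of opened
# chess.engine.SimpleEngine instances.
import chess
import chess.engine

def is_too_easy(fen, reference_engines, depths=(6, 14)):
    """True if every reference engine chooses the same move at every depth."""
    board = chess.Board(fen)
    choices = set()
    for engine in reference_engines:
        for depth in depths:
            info = engine.analyse(board, chess.engine.Limit(depth=depth))
            choices.add(info["pv"][0])
    return len(choices) == 1

# positions = [fen for fen in candidate_fens if not is_too_easy(fen, refs)]
[/code]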
In contrast, Toby Tal, for instance, was found to be a clone (or at least its move generator was) by giving it a battery of ten mate-in-1 positions with multiple solutions and seeing an exact match with RobboLito (or something in that family). Here is one possible way to take a first whack at the effect of strength. First test (say) 15 engines at 0.1s per move, getting 105 pairwise measurements. Then do the same at 1.0s per move. As engines should play stronger at 1s per move, presumably the typical overlap (among the 105 comparisons) should be greater. By how much? A little or a lot?
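A sketch of how that two-time-control check might be scored, assuming the per-engine move lists have already been collected (for example with the earlier similarity sketch); the helper names here are illustrative, not part of the actual tool.
[code]
# Compare the typical pairwise overlap of a pool of engines at a fast time
# control against the same pool at a slower one. Inputs are plain dicts
# {engine_name: [move per position]} collected beforehand.
from itertools import combinations
from statistics import mean

def agreement(moves_a, moves_b):
    """Percentage of positions on which the two move lists coincide."""
    return 100.0 * sum(a == b for a, b in zip(moves_a, moves_b)) / len(moves_a)

def mean_pairwise_overlap(moves_by_engine):
    """Average agreement over all engine pairs (15 engines -> 105 pairs)."""
    return mean(agreement(moves_by_engine[x], moves_by_engine[y])
                for x, y in combinations(moves_by_engine, 2))

# If moves_fast and moves_slow hold the collected moves at 0.1 s and 1.0 s:
# print(mean_pairwise_overlap(moves_fast), mean_pairwise_overlap(moves_slow))
[/code]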
The test could be greatly improved by trying to cull out positions that have little impact on this measurement, but I fear that I would inadvertently be putting some kind of bias into the test.
A third critique involves self-validation, or perhaps more generally what could be called playing style. For instance, comparing Engine X at 0.1s to itself at 1.0s is said to be a way of showing that the utility detects not strength but style, as the correlation factor is still typically quite high. Whether or not this holds for a variety of engines (those deemed "tactical" versus "positional", or perhaps engines using MTD(f), which may simply change their mind more or less often than PVS engines) remains to be seen. I guess I am not so prone to agree with the statement: "I believed [...] it is far more difficult to make it play significantly different moves without making it weaker."
When I preface something with "I believe" I do it so that it is not taken as a statement of fact. It's merely an opinion. However, in this case I believe pretty strongly in this principle because both Larry and I have TRIED to change the playing style of Komodo and this makes it play weaker. It's intuitive to me that if I were to take the Robbo sources I would have a very difficult time making it play a lot differently without weakening it. It would be easy, of course, to make it play differently if I were willing to sacrifice ELO.
I think this concept makes a lot more sense to someone who actually has a lot of experience writing strong chess programs - it's probably not obvious to non-programmers or authors of weak programs, but it's one of those things that is much easier said than done.
This tool has shown me that playing style is almost completely about the evaluation function. Try turning off LMR or drastically changing the search, then run this test, and you will find that it does not have much impact on the results.
The ELO, as you already pointed out, DOES have a small impact, as there are surely some moves in the test that strong programs will prefer over weak programs - but the effect is amazingly small.
I provided a way to eliminate most of that bias and people are already using it. You can give weaker programs more time to equalize their rating. You can easily get within 50-100 ELO with some careful head-to-head tests to find out how much handicap is needed to equalize the ratings.
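A hedged sketch of that equalisation idea: convert the head-to-head score into an implied Elo gap with the standard logistic formula, and keep increasing the weaker engine's time until the gap is inside the 50-100 ELO band Don mentions. The play_match harness here is purely hypothetical - in practice you would run the games with something like cutechess-cli.
[code]
# Estimate the Elo gap from a head-to-head score and search for a time
# handicap that roughly equalizes the two engines.
import math

def elo_diff(score):
    """Elo difference implied by a match score in (0, 1), seen from the side scoring `score`."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def equalize_time(play_match, base_time=0.1, max_factor=8.0):
    """Double the weaker engine's time until the head-to-head gap is under ~50 Elo.

    play_match is a hypothetical harness returning the stronger engine's
    score fraction for a match at the given per-move times.
    """
    factor = 1.0
    while factor <= max_factor:
        score = play_match(strong_time=base_time, weak_time=base_time * factor)
        if abs(elo_diff(score)) < 50:   # within the 50-100 ELO band
            return factor
        factor *= 2.0
    return factor
[/code]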
Finally, as noted above, the question of "move selection" versus "similar ideas" (in the sense of intellectual property) is not really resolved, as one can use many of the "same ideas" with different numerology, and get notably different play. It all depends on how much weighting you give in your sense of "clone" to the concept of the "feature set" of an evaluation function as opposed to merely the specific numerical values therein.
Yes, this is possible and there may be something to it. I think it's your weakest criticism however.
I would challenge anyone to take the robbolito code and make it play substantially differently SOLELY by changing the evaluation weights and nothing else while also not making it play any weaker. I think this little exercise is likely to humble those who don't really understand the difficulty of engineering the evaluation function of a chess program.
The prospective difficulties of drawing conclusions from these methods are seen in:
It looks to me that after Rybka 1.0 the program changed very substantially. From this I would assume he completely rewrote the program, and certainly the evaluation function.
Au contraire, a disassembly of the Rybka 2.3.2a evaluation function will show much of it to be still quite Fruit-like in its framework, with only two or three minor variations in the features from Rybka 1.0 Beta. The PST is slightly more tweaked, but my impression is that almost all the substantive changes from Rybka 1.0 Beta until LK's work with Rybka 3 were in the search (and some tuning of eval weightings, PST, and material imbalances). [Perhaps the fact that Rybka 1.0 Beta used lazy eval way too often due to a mismatch with 3399 vs 100 scalings might also play a rôle here]. "
BB once again provides objective clarity.
It's not all that objective. In the end he gives his SUBJECTIVE opinion of the source code and presents as fact what are really just opinions. That is not "objective." For instance he says, "a disassembly of the Rybka 2.3.2a evaluation function will show much of it to be still quite Fruit-like in its framework ..." Is that actually a fact or is it his subjective impression? Are you taking his word for it or are you checking for yourself?
I also used highly subjective language in what he was quoting from me, but you will notice that I did not present it as factual like he did. I said things like "I believe" and "it looks to me like ..."; in other words, I am admitting to a subjective viewpoint.
One must be very careful about blindly accepting the analysis of others that you cannot, or are not willing to, check for yourself.
That is why I am presenting you with a concrete tool that YOU can check out for yourself to draw your own conclusions. I am trying to put the power in YOUR hands instead of just giving you something to read about.