Similarity Detector Available

perejaslav · Post by **perejaslav** » Mon Dec 27, 2010 7:31 pm

I would say the same about Rybka 3 and IPPOLIT...

Don · Post by **Don** » Mon Dec 27, 2010 7:32 pm

michiguel wrote:
perejaslav wrote:------ Naum 4.2 (time: 100 ms) ------
84.30 Naum 4.1 (time: 100 ms)
75.15 Rybka 2.2n2 32-bit (time: 100 ms)
70.45 Belka 1.8.22 (time: 100 ms)
69.00 Rybka 1.0 Beta 32-bit (time: 100 ms)
65.05 Rybka 4x64 (time: 100 ms)
64.45 Rybka 3 32-bit (time: 100 ms)
63.60 IPPOLIT 0.080b3 x64 (time: 100 ms)
63.10 IvanHoe 9.48b x64 (time: 100 ms)
62.80 Rybka 4w32 (time: 100 ms)
62.15 Fruit 2.1 (time: 100 ms)
61.65 Critter 0.90 64-bit (time: 100 ms)
61.45 Houdini 1.5 x64 (time: 100 ms)
61.25 Stockfish 1.9.1 JA 64bit (time: 100 ms)
60.80 Gull 1.1 x64 (time: 100 ms)
60.55 Hannibal 1.0a (time: 100 ms)
60.45 Crab 1.0 beta 64bit (time: 100 ms)
60.10 Tinapa 1.01 (time: 100 ms)
57.80 Deep Sjeng c't 2010 (time: 100 ms)
56.80 Deep Sjeng WC2008 x64 (time: 100 ms)
56.70 Shredder 12 UCI (time: 100 ms)
56.65 HIARCS 13.2 SP (time: 100 ms)
55.45 Jonny 4.00 (time: 100 ms)
12.20 spark-1.0 (time: 100 ms)

Naum 4.2 is a clone of Rybka 2. It' even more «clonish» than IPPOLIT-Rybka affair

------ IPPOLIT 0.080b3 x64 (time: 100 ms) ------
68.60 IvanHoe 9.48b x64 (time: 100 ms)
66.45 Rybka 3 32-bit (time: 100 ms)
64.75 Houdini 1.5 x64 (time: 100 ms)
64.15 Rybka 4w32 (time: 100 ms)
63.80 Belka 1.8.22 (time: 100 ms)
63.60 Naum 4.2 (time: 100 ms)
63.15 Rybka 2.2n2 32-bit (time: 100 ms)
62.85 Rybka 4x64 (time: 100 ms)
62.70 Naum 4.1 (time: 100 ms)
61.80 Critter 0.90 64-bit (time: 100 ms)
60.20 Hannibal 1.0a (time: 100 ms)
60.00 Gull 1.1 x64 (time: 100 ms)
59.90 Rybka 1.0 Beta 32-bit (time: 100 ms)
59.90 Fruit 2.1 (time: 100 ms)
59.40 Stockfish 1.9.1 JA 64bit (time: 100 ms)
59.20 Tinapa 1.01 (time: 100 ms)
59.00 Crab 1.0 beta 64bit (time: 100 ms)
58.55 Deep Sjeng c't 2010 (time: 100 ms)
58.35 Shredder 12 UCI (time: 100 ms)
57.25 Deep Sjeng WC2008 x64 (time: 100 ms)
54.65 Jonny 4.00 (time: 100 ms)
53.75 HIARCS 13.2 SP (time: 100 ms)
11.95 spark-1.0 (time: 100 ms)

Thank you for a great tool! I think Naum 4.2 should be banned from all chess fora!
We discussed this already almost a year ago. By private email to Michael Hart, the author claims it fitted the evaluation to Rybka 2.2.

Miguel

And the test easily picked up on this! I think that is amazingly impressive that a completely independent test "noticed" the playing style correlation due to this fitting!

michiguel · Post by **michiguel** » Mon Dec 27, 2010 7:36 pm

perejaslav wrote:
michiguel wrote:fitted the evaluation to Rybka 2.2.

Miguel
Can we consider this as an act of crime (engine cloning)?

Who said that???
Miguel

Otherwise I only see a double standard regarding free and commercial top-class engines!

michiguel · Post by **michiguel** » Mon Dec 27, 2010 7:41 pm

SzG wrote:I am quite slow at thinking. It has just occurred to me that we have already had a similar 'tool': ponder hit statistics.

2000 positions is about the same as 50 games, so ponder hit statistics of 1000 games seems to provide 20 times more data than SimDetect over 2000 positions.
Of course I am aware that ponder hits must be extracted from games and then a ratio must be calculated, so some tool is required there as well.

I don't remember how successful ponder hit ratio was at detecting clones but it seems to me we can't expect more accurate results from SimDetect.

I accept that for other purposes it may still be valuable.

It is not the same. With ponder statistics you do not have the same positions to compare between engine A, B, C etc. When you compare A vs B, you have a set of positions, when you have B vs C, you have a completely different set. This kills the whole thing.

Miguel

Uri Blass · Post by **Uri Blass** » Mon Dec 27, 2010 7:44 pm

SzG wrote:I am quite slow at thinking. It has just occurred to me that we have already had a similar 'tool': ponder hit statistics.

2000 positions is about the same as 50 games, so ponder hit statistics of 1000 games seems to provide 20 times more data than SimDetect over 2000 positions.
Of course I am aware that ponder hits must be extracted from games and then a ratio must be calculated, so some tool is required there as well.

I don't remember how successful ponder hit ratio was at detecting clones but it seems to me we can't expect more accurate results from SimDetect.

I accept that for other purposes it may still be valuable.

I think that ponder hit may have problems.

If a program have non symmetric evaluation then it may not be even similiar to itself based on ponder hit.

If a program want to open the game because it is designed to play against humans like pablo and not against other programs
then it may expect the opponent to close the game when the opponent(again the same program) may be happy to open the game so we do not get a ponder hit.

Edit:Another problem is that ponder hit statistics give set of different positions for different engines so comparison is not good and if an engine go to a line with lot of forced move you may get more ponder hits.

Robert Flesher · Post by **Robert Flesher** » Mon Dec 27, 2010 9:37 pm

Don wrote:
Robert Flesher wrote:
Don wrote:I created a utility called similar which measures how different one chess program is from others. It does this by running 2000 position from random games and noting how often the moves agree and as output returns the percentage of moves that match.

You can get it here: http://komodochess.com/pub/similar.zip

Here is some sample output, comparing Robbolito with a few other programs:

------ RobboLito version 0.084 (time: 100 ms) ------
69.25 Houdini 1.5 w32 (time: 100 ms)
66.90 Rybka 3 (time: 100 ms)
61.70 Stockfish 1.9.1 JA 64bit (time: 100 ms)
61.35 Stockfish 1.8 JA (time: 100 ms)
59.80 Komodo64 1.2 JA (time: 100 ms)
59.15 Komodo 1.0 (time: 100 ms)
58.95 Stockfish 1.7.1 64bit (time: 100 ms)
58.95 Stockfish 1.6 64bit (time: 100 ms)
57.00 Fruit 2.3.1 (time: 100 ms)
56.20 Fruit 2.1 (time: 100 ms)

I have not tested this on windows so I'm hoping to get some feedback specific to windows.

The similar.exe is designed to run on 64 bit windows and is actually a tcl script wrapped up with a tcl runtime using tclkit technology. I am also including the "starkit" which is platform independent, but requires a tclkit runtime for your platform. It is similar to a jar file and can be taken apart and inspected and modified if you wish - assuming you know how to work with starkit's and such. google for starkit and sdx.kit for more information.

Please let me know if you find this interesting or useful. Email me at drd@mit.edu

Don

I found this interesting, I quote BB "

I looked at what he wrote and it all sounds very reasonable. What is your point? I agree with everything he said. He said nothing here that was not immediately obvious to me.

I will add a couple of clarifications to what he is saying below (I realize that I am not directly responding to BB.)

From TalkChess (Don Dailey):
I created a utility called similar which measures how different one chess program is from others. It does this by running 2000 position from random games and noting how often the moves agree and as output returns the percentage of moves that match.Since this has been in the works for some time, I've had ample time to prepare any criticism. I will try to leave the semantics aside (though calling it a "clone tester" cries out for nuance), and stick with scientific observations. I must say that I would find such a tool to be valuable if it is done in a scientifically proper manner, and its results parsed according to their proper scope.
The scope I have been urging is that it not be taken too seriously as I am well aware of it's limitations.

Firstly, I would say that the utility measures how much the choice of best move output from one chess program differs from another. It is a different question to say how "similar" this makes them, which seems to be a word with many possible meanings. Indeed, it seems almost tautological to say that "clone testing" (or derivative, if you prefer) is better performed by an actual examination of the executables, though perhaps this is thought too time-consuming (or for the future "rental engines", maybe it is impossible). However, the utility does serve a useful purpose if its output has nonzero correlation with clones and/or derivatives.
I would like to point out that I chose that initial subject line to draw attention to the post. I regret the trouble it caused as people took it WAY too seriously and it has already been cast as a tool of the devil to persecute good people with.

The first problem I have with much of the discussion is that no sense of statistical error is ever mentioned. For instance, running a 1000 position suite should give a 95% confidence interval only of plus/minus 30 positions. This is fairly easily remedied simply by appending the additional maths.

That is why I have been completely transparent about how many problems are in the test and exactly how it works and what it does and why it should not be taken too seriously. I really did not have the time and energy to turn this into a highly polished project, but the source code is there and anyone is free to improve upon it.

In particular, "false positives" should appear rather frequently in a large enough pool, and robust methods to minimise their impact should be used (the numbers seem largely to be in the 550-650 range for random data, and 650-700 for semi-correlated). I can't say I am particularly enamoured by the use of techniques seen in biometry to draw putative hierarchal relationships either.

Another problem is strength conflation, that is, two engines will play similar moves simply because there actually is a "best" move, and suitably strong engines will all agree. This effect is rather hard to measure, and always seems to be in the background.
This is obviously a potential issue. There was some effort to remove obvious moves but if you tamper with the data too much it becomes more suspect. For example if I use 2 programs to measure what an obvious move is, I am "fixing" the test in a way - the test will be biased against considering these 2 program similar.

What I actually did was use several programs and if they all agreed even at low to high depths on a give move I considered it too easy. I did not do this to prepare the data for this application, it was done for totally different reasons and I simply used those positions for the similarity tester.

In contrast, for instance with Toby Tal, it was found to be a clone (or at least the move generator) by giving it a battery of ten mate-in-1 positions with multiple solutions, and seeing an exact match with RobboLito (or something in that family). Here is one possible way to take a first whack at the effect of strength. First test (say) 15 engines at 0.1s per move, getting 105 pairwise measurements. Then do the same at 1.0s per move. As engines should play stronger at 1s per move, presumably the typical overlap (among the 105 comparisons) should be greater. By how much is it? A little or a lot?
The test could be greatly improved by trying to cull out positions that have little impact on this measurement, but I fear that I would inadvertently being putting some kind of bias into the test.

A third critique involves self-validation, or perhaps more generally what could be called playing style. For instance, comparing Engine X at 0.1s to itself at 1.0s is said to be a way of showing that the utility detects not strength but style, as the correlation factor is still typically quite high. Whether or not this holds for a variety of engines (those deemed "tactical" versus "positional", or perhaps those using MTD(f) simply change their mind more/less than PVS) remains to be seen. I guess I am not so prone to agree with the statement: "I believed [...] it is far more difficult to make it play significantly different moves without making it weaker."
When I preface something with "I believe" I do it so that it is not taken as a statement of fact. It's merely an opinion. However, in this case I believe pretty strongly in this principle because both Larry and I have TRIED to change the playing style of Komodo and this makes it play weaker. It's intuitive to me that if I were to take Robbo sources I would have a very difficult time making it play a lot differently without weakening it. It wold be easy of course to make it play differently if I were willing to sacrifice ELO.

I think this concept makes a lot more sense to someone who actually has a lot of experience writing strong chess programs - it's probably not obvious to non-programmers or authors of weak programs but it's one of those things much easier said that done.

This tool has showed me that playing style is almost completely about the evaluation function. Try turning off LMR or drastically changing the search and run this test and you will find that it does not have much impact on the test.

The ELO, as you already pointed out DOES have a small impact as there are surely some moves in the test that strong programs will prefer over weak programs - but the effect is amazingly small.

I provided a way to eliminate most of that bias and people are already using it. You can give weaker programs more time to equalizing their rating. You can easily get within 50-100 ELO with some careful head to head tests to find out how much handicap is needed to equalize ratings.

Finally, as noted above, the question of "move selection" versus "similar ideas" (in the sense of intellectual property) is not really resolved, as one can use many of the "same ideas" with different numerology, and get notably different play. It all depends on how much weighting you give in your sense of "clone" to the concept of the "feature set" of an evaluation function as opposed to merely the specific numerical values therein.
Yes, this is possible and there may be something to it. I think it's your weakest criticism however.

I would challenge anyone to take the robbolito code and make it play substantially differently SOLELY by changing the evaluation weights and nothing else while also not making it play any weaker. I think this little exercise is likely to humble those who don't really understand the difficulty of engineering the evaluation function of a chess program.

The prospective difficulties of drawing conclusions from these methods are seen in:It looks to me that after Rybka 1.0 the program changed very substantially. From this I would assume he completely rewrote the program, and certainly the evaluation function.Au contraire, a disassembly of the Rybka 2.3.2a evaluation function will show much of it to be still quite Fruit-like in its framework, with only two or three minor variations in the features from Rybka 1.0 Beta. The PST is slightly more tweaked, but my impression is that almost all the substantive changes from Rybka 1.0 Beta until LK's work with Rybka 3 were in the search (and some tuning of eval weightings, PST, and material imbalances). [Perhaps the fact that Rybka 1.0 Beta used lazy eval way too often due to a mismatch with 3399 vs 100 scalings might also play a rôle here]. "

BB once again provides objective clarity.
It's not all that objective. In the end he gives his SUBJECTIVE opinion of the source code and claims facts that are just opinions. That is not "objective." For instance he says, "a disassembly of the Rybka 2.3.2a evaluation function will show much of it to be still quite Fruit-like in its framework ..." Is that actually a fact or is it his subjective impression? Are you taking his word for it or are you checking for yourself?

I also used highly subjective language in what he was quoting from me, but you will notice that I did not present it as factual like he did. I said things like "I believe" and "it looks to me like .." in other words I am admitting a subjective viewpoint.

One must be very careful when blindingly accepting the analysis of others that you cannot or are not willing to check for yourself.

That is why I am presenting you with a concrete tool that YOU can check out for yourself and draw your own conclusions, I am trying to put the power in YOUR hands instead of just something you can read about.

Don, I was stating nothing more than I found some points for BB very interesting. As you can see that people on this forum are often fanatical about there beliefs. On the contrast BB seems to always remain objective and logical.

However, others may surmise your software "clone detector" is nothing more than a clever ploy to re-activate the zealots. Thus starting the witch hunt of the bastard children once again. Already we saw a few ready to burn the fires. I won't mention any names.

Although, I am always puzzled at your obvious campaign against Houndini and family. It stinks of envy, and jealously, and this cannot be the case. Correct?

I think you have stated ( not verbatim) that you are not impressed that someone takes an already very strong source code and makes a super strong engine. Fair enough! But, some evidence suggests this is what Vas did with Rybka 1.0, it then evolved into Rybka 2.0- 3.0- 4.0. So why not witch hunt Vas ?

Roubert Houdart released something stronger than Rybka 4.0 with very few, if any bugs. If that was so easy, why has Vas not done it for us
paying customers ?

Ant_Gugdin · Post by **Ant_Gugdin** » Mon Dec 27, 2010 9:45 pm

Don't know if anyone's posted comparisons with Houdini 1.03a before...

------ Houdini 1.03a x64 1_CPU (time: 100 ms) ------
72.85 RobboLito 0.084 (time: 100 ms)
72.80 RobboLito 0.085g3 x64 (time: 100 ms)
71.75 Houdini 1.5 x64 (time: 100 ms)

------ Houdini 1.5 x64 (time: 100 ms) ------
71.75 Houdini 1.03a x64 1_CPU (time: 100ms)
68.45 RobboLito 0.085g3 x64 (time: 100 ms)
67.30 RobboLito 0.084 (time: 100 ms)

Whatever caveats Don attaches to his program, people are going to use it to test/judge how original engines are. Before this, I had believed that Robert Houdart's use of the Ippolit/ Robbolito source code wasn't any "worse" than Vas' use of the Fruit source code. However, the similarity detector appears to show that Vas' use of the Fruit source code was less extensive than Robert's use of the Robbolito source code. Assuming that Robbolito is a Rybka 3 derivative, perhaps there is some justification for treating Rybka as legitimate and Houdini as illegitimate? (I'm not in any way, shape or form a Rybka fanboy.)

Don · Post by **Don** » Mon Dec 27, 2010 10:08 pm

Robert Flesher wrote: Don, I was stating nothing more than I found some points for BB very interesting. As you can see that people on this forum are often fanatical about there beliefs. On the contrast BB seems to always remain objective and logical.

However, others may surmise your software "clone detector" is nothing more than a clever ploy to re-activate the zealots.

That is their problem, I don't really care what they think.

Thus starting the witch hunt of the bastard children once again. Already we saw a few ready to burn the fires. I won't mention any names.

It's funny how you invoke images of old superstitions to make it seem like you are on the moral high road and everyone else is backwards. It's cute but it's not appropriate.

Also, why do people keep calling this a clone detector? I have denounced that name and I only made the original post to draw people to this thread.

Although, I am always puzzled at your obvious campaign against Houndini and family. It stinks of envy, and jealously, and this cannot be the case. Correct?

I don't have bad feelings towards Houdart, he was relatively straightforward about the connection right from the start.

Neither before or after the cloning did I ever say a bad word against Rybka, a program much stronger than Komodo was back then. So you figure it out. If I'm being petty and jealous why not Rybka?

I presented a tool to help people try to analyze what is going on and I made no special claims. This obviously does not set very well with those who don't want to see actual data of any kind.

I think you have stated ( not verbatim) that you are not impressed that someone takes an already very strong source code and makes a super strong engine.

That's not how I feel - where did I say that? I said I WAS impressed if someone is able to take an already strong program and add a substantial amount of ELO to it. In fact I said this about Houdart, the person I am supposed to be envious of for improving Robbolito.

Fair enough! But, some evidence suggests this is what Vas did with Rybka 1.0, it then evolved into Rybka 2.0- 3.0- 4.0. So why not witch hunt Vas ?

What is your theory on this?

Roubert Houdart released something stronger than Rybka 4.0 with very few, if any bugs. If that was so easy, why has Vas not done it for us
paying customers ?

What is your point, that Robert is your hero?

Don · Post by **Don** » Mon Dec 27, 2010 10:26 pm

Ant_Gugdin wrote:Don't know if anyone's posted comparisons with Houdini 1.03a before...

------ Houdini 1.03a x64 1_CPU (time: 100 ms) ------
72.85 RobboLito 0.084 (time: 100 ms)
72.80 RobboLito 0.085g3 x64 (time: 100 ms)
71.75 Houdini 1.5 x64 (time: 100 ms)

------ Houdini 1.5 x64 (time: 100 ms) ------
71.75 Houdini 1.03a x64 1_CPU (time: 100ms)
68.45 RobboLito 0.085g3 x64 (time: 100 ms)
67.30 RobboLito 0.084 (time: 100 ms)

Whatever caveats Don attaches to his program, people are going to use it to test/judge how original engines are. Before this, I had believed that Robert Houdart's use of the Ippolit/ Robbolito source code wasn't any "worse" than Vas' use of the Fruit source code. However, the similarity detector appears to show that Vas' use of the Fruit source code was less extensive than Robert's use of the Robbolito source code. Assuming that Robbolito is a Rybka 3 derivative, perhaps there is some justification for treating Rybka as legitimate and Houdini as illegitimate? (I'm not in any way, shape or form a Rybka fanboy.)

I don't think either can be viewed as not legitimate, we are talking about open source code right? Houdart clearly did not do anything illegal.

Perhaps Vas did, but it's really hard for me to believe this is really what this is all about.

Will Singleton · Post by **Will Singleton** » Mon Dec 27, 2010 10:45 pm

Anyone get SD to work with wb2uci? My first try:

C:\chess\similar>similar -t wb2uci
program: amateur295x4 (time: 100 ms)

0.0 percent .error writing "file2537fb0": broken pipe
while executing
"puts $fh "stop""
("for" body line 15)
invoked from within
"for { set n $start } { $n < $e } { incr n } {

if { ($n % 50) == 0 } {
puts ""
set perc [expr ($cc * 100.0) / 2000.0]
puts -nonewline [format "..."
(file "C:/chess/similar/similar.exe/lib/app-clone/clone.tcl" line 141)
invoked from within
"source C:/chess/similar/similar.exe/lib/app-clone/clone.tcl"
("package ifneeded app-clone 1.0" script)
invoked from within
"package require app-clone"
(file "C:/chess/similar/similar.exe/main.tcl" line 4)

C:\chess\similar>

Similarity Detector Available

Re: Similarity Detector Available

Re: Similarity Detector Available

Re: Similarity Detector Available

Re: Similarity Detector Available

Re: Similarity Detector Available

Re: clone tester available

Re: Similarity Detector Available

Re: clone tester available

Re: Similarity Detector Available

Re: Similarity Detector & wb2uci