Similarity Detector Available

Will Singleton · Post by **Will Singleton** » Fri Dec 31, 2010 5:32 am

Been too busy to give it a go. I should probably use polyglot instead of wb2uci, since it's fairly up-to-date.

Will

Don wrote:Did you get this to work?

Will Singleton wrote:Anyone get SD to work with wb2uci? My first try:

C:\chess\similar>similar -t wb2uci
program: amateur295x4 (time: 100 ms)

0.0 percent .error writing "file2537fb0": broken pipe
while executing
"puts $fh "stop""
("for" body line 15)
invoked from within
"for { set n $start } { $n < $e } { incr n } {

if { ($n % 50) == 0 } {
puts ""
set perc [expr ($cc * 100.0) / 2000.0]
puts -nonewline [format "..."
(file "C:/chess/similar/similar.exe/lib/app-clone/clone.tcl" line 141)
invoked from within
"source C:/chess/similar/similar.exe/lib/app-clone/clone.tcl"
("package ifneeded app-clone 1.0" script)
invoked from within
"package require app-clone"
(file "C:/chess/similar/similar.exe/main.tcl" line 4)

C:\chess\similar>

Don · Post by **Don** » Fri Dec 31, 2010 6:45 am

bob wrote:
Don wrote:
bob wrote:My only comment here is that this is likely going to run afoul of the "birthday paradox" frequently. Given enough programs, a new program will frequently choose the same moves as another program, "just because". The more samples, the greater the probability this will happen. Lots of false positives are not going to help a thing...
In order to have a false positive you need context. All this utility does is count how many moves (out of approx. 8000) two programs play in common and return the percentage. How can that be a false positive? It will be whatever it will be for any two programs.
Simple. Someone is going to choose a number. Say 70%. If A matches B 70% of the time, it is likely a derivative.

Why are you picking numbers and talking about derivatives? The tool is not designed to determine what program is a derivative of some other program.

(replace 70% by any reasonable number you want). If you take program A and compare it to B, you might get 40%. If you compare it to C, you might get 50%. If you compare it to enough programs, you will get at least one 70% or higher. From unrelated programs...

This would only have meaning if the tool was designed to determine if programs are related, but it's not. The tester will be able to determine if 2 unrelated programs play a lot alike. That is not a false positive. It's only a false positive if the tester is rigged to say, "hey those programs are related!" or if someone like you comes along and tries to assign that meaning to it.

It's completely normal and expected that some pairs of unrelated programs will play more alike than others, that is what the tester is designed to test.

When you produce numbers, you have to expect _someone_ to use them to reach a conclusion. In this case, the conclusion might be right, wrong, or random.

The only conclusion this tool provides is how often two different programs play the same move. This is not a conclusion, it's a statistic. You are trying to make something out of it that it is not.
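A minimal sketch of that statistic, assuming each program's chosen moves are available as equal-length lists indexed by position. The function name and the sample moves are made up for illustration; this is not the tool's actual code (the error trace earlier in the thread shows the tool itself is a Tcl script).

```python
def similarity_percent(moves_a, moves_b):
    """Percentage of positions where both programs chose the same move."""
    if len(moves_a) != len(moves_b):
        raise ValueError("both programs must be tested on the same positions")
    if not moves_a:
        raise ValueError("need at least one position")
    same = sum(1 for a, b in zip(moves_a, moves_b) if a == b)
    return 100.0 * same / len(moves_a)


# Hypothetical moves for five positions (the real test uses roughly 8000).
engine_x = ["e2e4", "g1f3", "d2d4", "c2c4", "e1g1"]
engine_y = ["e2e4", "g1f3", "c2c4", "c2c4", "f1e1"]
print(similarity_percent(engine_x, engine_y))  # 3 of 5 positions match
```

The output is only a count turned into a percentage; it says nothing about why the moves agree.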

I continue to get comments over and over again from people who are assuming context which betrays a fundamental misunderstanding of what this tool does and how it works.

If you view this utility as a "clone tester", and you assign some arbitrary percentage value to signify that a program is a "clone", then you can have false positives. But that is not what this utility does and it's not what it's for.

For example: When I tested Robbolito and Houdini, I got a ridiculously high match rate, higher than most other pairs of programs and in many cases much higher than the match rate between two versions of the SAME chess program!

So is that a false positive? No, it's just a fact. The two programs play a lot of moves the same. It does not mean Robbolito is a clone of Houdini or a derivative or anything else, it just means they both play the same move a lot more than almost any other program.
All well and good. But the moment you produce numbers, you have to expect someone to take them at face value. I wouldn't consider such comparisons myself. But many will. And they will draw the wrong conclusion.

Taking it at face value is not the problem. I think what you really mean is that they will impute meaning and context that don't exist, just like you are doing.

I used this analogy earlier, but hammers are very useful objects. However every once in a while someone uses one improperly and hits someone over the head with one and kills them. This tool can be used improperly but it can be useful too.

Exactly what do you expect the numbers to show?

The numbers show how often 2 programs play the same move.

What does it mean when two programs match 70% of the time?

It means exactly what you said, they play the same move 70% of the time.

That they have the same search but different evals? Same evals but different search? A combination of both? It is pretty much meaningless.

I don't think it's meaningless, I have learned a LOT just from playing with it. But several people are doing their own research to learn more about this. On the OpenChess site BB has built a similar tool and is studying several aspects of it. I have also learned a lot about it and here is what I have found:

The program is uncanny in its ability to identify different versions of the same program, even when the program has evolved substantially. The closest matches to Stockfish 1.9 are SF 1.8, SF 1.7, SF 1.6 and SF 1.5. This represents significant changes and Elo gains. It's this way for EVERY program I have tested that has multiple versions. Those versions tend to be the closest matches.

The Elo rating of the programs in question has very little impact on similarity scores. For example if you run the test 10x longer for program X, the tool is not fooled into thinking it's a different program or that it is much more like a stronger program.

I can change the search of Komodo and the test is not fooled. For example LMR can be turned off and the tester is almost oblivious, although this single change is a major search change.

My tentative conclusion (and I'm still studying it) is that search does not have much to do with it. I think what makes each program play the way it does is more about the evaluation function than anything else by far. Every test I have done bears that out.

Perhaps a good way to compute some random numbers for a Zobrist hashing scheme... but there are less expensive ways to do that.

If you look at some of the results that the test is returning, you probably wouldn't see any humor in that as it implies that programs all play random moves and are not consistent about what they think is important.

My intent for the tool was as a diagnostic aid and a tool to examine the playing styles of programs. It returns some result and it's up to you to figure out what it means or doesn't mean and to use good sense and judgement, an increasingly rare commodity these days.

I actually got the idea for this from YOU and John Stanback. I was at a tournament where a version of Crafty was claimed to be heavily modified in the evaluation and was allowed in the tournament. However this program was doing unusually well and Vincent suspected something and you were contacted and consulted. From what I was told, you checked the moves of the game against Crafty and felt too many were the same.

John Stanback in another tournament noticed the same thing simply by watching the tournament games on line and comparing the moves to his own program.
For an isolated data point, that is a good place to start. But to compare a suspected clone against a huge suite of others? Again, false positives. Too many samples.

Why do you keep talking about clones and false positives? That has nothing to do with the tool.

But even if the tool WERE to be used improperly to test clones, I need to point something out to you. The birthday paradox is about determining the odds that ANY two people have the same birthday in a small population of people. The odds that YOU have the same birthday as someone else in that small room are FAR lower. In the context of using my tool (improperly) as a kind of "clone test", you are not checking every program ever written to see if any two match, you are interested in just one program, your own. For example if you suspect that Crafty is being cloned, you would test Crafty against the suspicious program along with several other control programs. If the suspected clone was by far the strongest correlated program, you would use this as circumstantial evidence to investigate further. You would NOT test every combination of 2 programs.

For some reason whenever the birthday paradox comes up even smart people get confused about it, I guess that's why it's called a "paradox."
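To put numbers on the difference between "ANY two people" and "someone matches YOU", here is a small sketch under the usual textbook assumptions (365 equally likely birthdays, independent people). The function names are just illustrative.

```python
def p_any_pair_shares(n, days=365):
    """Chance that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def p_someone_shares_mine(n, days=365):
    """Chance that at least one of the other n - 1 people shares MY birthday."""
    return 1.0 - ((days - 1) / days) ** (n - 1)


for n in (10, 23, 50):
    print(n, round(p_any_pair_shares(n), 3), round(p_someone_shares_mine(n), 3))
```

With 23 people the first probability is a little over one half, while the second stays below ten percent. Testing one program of interest against a set of controls corresponds to the second case.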

I think every good chess player who gets really familiar with chess programs agrees that each program has its own individual personality. Of course that can only be revealed through the moves it makes.

I understand what you are saying about the birthday paradox and agree, I just think it's not relevant without assuming the context of "clone testing." However, if you tested 1000 unique programs by different authors who did not share ideas, etc. you would surely find 2 programs that played very similar chess. The fact that they might play very similar is not a paradox or a lie, it's just how it is.
The danger is, as I said, that some will take these numbers to be something like a correlation coefficient, with some threshold beyond which clone is proven...

That sounds pretty silly to me. There is this idea floating around that computers are someday going to become self-aware, take over the world and make us their slaves. I think the scenario you are talking about is more paranoid than realistic.

tmokonen · Post by **tmokonen** » Fri Dec 31, 2010 7:52 am

Don wrote: That sounds pretty silly to me. There is this idea floating around that computers are someday going to become self-aware, take over the world and make us their slaves. I think the scenario you are talking about is more paranoid than realistic.

Judging by the way people walk around obliviously, with their noses buried in their stupid phones, I'd say it has already happened.

Don · Post by **Don** » Fri Dec 31, 2010 1:06 pm

tmokonen wrote:
Don wrote: That sounds pretty silly to me. There is this idea floating around that computers are someday going to become self-aware, take over the world and make us their slaves. I think the scenario you are talking about is more paranoid than realistic.
Judging by the way people walk around obliviously, with their noses buried in their stupid phones, I'd say it has already happened.

Yes, that is really annoying. You cannot have a conversation without someone constantly interrupting to answer their phone. My wife and I had a couple over to our house and they spent most of the evening on their cell phones talking to other people.

Adam Hair · Post by **Adam Hair** » Fri Dec 31, 2010 1:49 pm

bob wrote:
Don wrote:
bob wrote:My only comment here is that this is likely going to run afoul of the "birthday paradox" frequently. Given enough programs, a new program will frequently choose the same moves as another program, "just because". The more samples, the greater the probability this will happen. Lots of false positives are not going to help a thing...
In order to have a false positive you need context. All this utility does is count how many moves (out of approx. 8000) two programs play in common and return the percentage. How can that be a false positive? It will be whatever it will be for any two programs.
Simple. Someone is going to choose a number. Say 70%. If A matches B 70% of the time, it is likely a derivative. (replace 70% by any reasonable number you want). If you take program A and compare it to B, you might get 40%. If you compare it to C, you might get 50%. If you compare it to enough programs, you will get at least one 70% or higher. From unrelated programs...

When you produce numbers, you have to expect _someone_ to use them to reach a conclusion. In this case, the conclusion might be right, wrong, or random.

Well, let's say that 70% is the criterion for assuming an engine is a "clone"
of another engine. And let's say that the set of positions used is such that
even a completely unrelated engine has a 65% chance of choosing the
same move, for each position, as the engine we are comparing to.
If a small set of positions was being used, the probability of a false
positive would be high. Let's say we used 100 positions. If we compared
5 engines to Engine A, there would be a 55% chance that we
would "claim" one of those engines was a "clone" of engine A. For 10
engines, it would be 80% likely.

However, if we used 1,000 positions, the percentages become 0.23%
and 0.46%. We would need to use 112 engines to get a 5% chance of
a false positive. If we used ~8,000 positions, as Don is using right now,
there is virtually no chance of a false positive.
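The single-engine figures above can be reproduced, to within rounding, with a normal approximation to the binomial (no continuity correction), treating each comparison as independent. That this is how they were actually computed is a guess, not something stated in the post; the function names are illustrative, and the 65% and 70% values are the ones assumed in the text.

```python
from math import sqrt
from statistics import NormalDist


def match_tail(positions, p=0.65, threshold=0.70):
    """Approximate chance that one unrelated engine reaches the threshold.

    Normal approximation to Binomial(positions, p), without continuity
    correction (an assumption about how the quoted figures were derived).
    """
    sd = sqrt(positions * p * (1 - p))
    z = positions * (threshold - p) / sd
    return NormalDist().cdf(-z)


def any_false_positive(comparisons, positions, p=0.65, threshold=0.70):
    """Chance that at least one of several independent comparisons crosses it."""
    return 1 - (1 - match_tail(positions, p, threshold)) ** comparisons


for positions, engines in ((100, 5), (100, 10), (1000, 5), (1000, 10), (1000, 112)):
    prob = any_false_positive(engines, positions)
    print(f"{positions} positions, {engines} engines: {100 * prob:.2f}%")
```

An exact binomial tail would shift the 100-position figures somewhat, but the qualitative point is unchanged: the chance of crossing the threshold by luck collapses as the number of positions grows.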

That is for comparing engines to one engine. When looking at all possible
comparisions, the percentages increase. The probability of finding a
false positive for 10 engines and 1,000 positions is around 4%. For
100 engines (assuming they are all unrelated), the probability is 99%.
However, increase the number of positions to 2,000, the probability falls
to 1.4%. For 8,000 positions, virtually no chance of a false positive.
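As a rough check of the all-pairs case, the sketch below uses the same normal approximation (no continuity correction, 65% per-position agreement, 70% threshold) and treats every pairwise comparison as independent, which is a simplification. The quoted 4%, 99% and 1.4% appear to come out when each of the n(n-1) ordered pairs is counted as a separate comparison; that reading is an assumption. Counting each unordered pair once gives n(n-1)/2 comparisons and lower probabilities, so both are shown.

```python
from math import sqrt
from statistics import NormalDist


def tail(positions, p=0.65, threshold=0.70):
    """Normal-approximation chance that one unrelated pair reaches the threshold."""
    z = positions * (threshold - p) / sqrt(positions * p * (1 - p))
    return NormalDist().cdf(-z)


def p_any_pair(engines, positions, ordered=True):
    """Chance that at least one pair among `engines` crosses the threshold."""
    pairs = engines * (engines - 1)
    if not ordered:
        pairs //= 2  # count A-vs-B and B-vs-A as one comparison
    return 1 - (1 - tail(positions)) ** pairs


for engines, positions in ((10, 1000), (100, 1000), (100, 2000)):
    both = p_any_pair(engines, positions), p_any_pair(engines, positions, ordered=False)
    print(engines, positions, [f"{100 * x:.1f}%" for x in both])
```

Either way of counting leads to the same conclusion for the larger position sets: the probability falls off very quickly as positions are added.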

All of this is based on 65% chance that any two unrelated engines will
choose the same move per position. Don has culled 2,000 positions
where all engines tend to choose the exact same move. I don't think
Don's utility will unjustly relate one engine to another one. It just can't
tell you why they are related (shared ideas or copied code).

Laskos · Post by **Laskos** » Fri Dec 31, 2010 2:03 pm

Don wrote:
tmokonen wrote:
Don wrote: That sounds pretty silly to me. There is this idea floating around that computers are someday going to become self-aware, take over the world and make us their slaves. I think the scenario you are talking about is more paranoid than realistic.
Judging by the way people walk around obliviously, with their noses buried in their stupid phones, I'd say it has already happened.
Yes, that is really annoying. You cannot have a conversation without someone constantly interrupting to answer their phone. My wife and I had a couple over to our house and they spent most of the evening on their cell phones talking to other people.

I went to the length of buying a portable, small cell-phone jammer, it works nicely when meeting people, especially girls

It's completely normal and expected that some pairs of unrelated programs will play more alike than others, that is what the tester is designed to test.

Yes, but we are dealing here with a matrix, the whole row must match to get a false positive, which is unlikely. The birthday paradox is the wrong metaphor, it has nothing to do with what we get. On the contrary, what we get with this tool is pretty much intuitive to the point of becoming a tautology. In all my tests up to now, what I knew would happen actually happened. For example, as expected, increasing time control increases self-similarity. Going from multicore to one core increases similarity. Compensating with time for strength reduces the span of similarity results. Etc.

Kai

Don · Post by **Don** » Fri Dec 31, 2010 2:44 pm

Laskos wrote:
Don wrote:
tmokonen wrote:
Don wrote: That sounds pretty silly to me. There is this idea floating around that computers are someday going to become self-aware, take over the world and make us their slaves. I think the scenario you are talking about is more paranoid than realistic.
Judging by the way people walk around obliviously, with their noses buried in their stupid phones, I'd say it has already happened.
Yes, that is really annoying. You cannot have a conversation without someone constantly interrupting to answer their phone. My wife and I had a couple over to our house and they spent most of the evening on their cell phones talking to other people.
I went to the length of buying a portable, small cell-phone jammer, it works nicely when meeting people, especially girls

It's completely normal and expected that some pairs of unrelated programs will play more alike than others, that is what the tester is designed to test.
Yes, but we are dealing here with a matrix, the whole row must match to get a false positive, which is unlikely. The birthday paradox is the wrong metaphor, it has nothing to do with what we get. On the contrary, what we get with this tool is pretty much intuitive to the point of becoming a tautology. In all my tests up to now, what I knew would happen actually happened. For example, as expected, increasing time control increases self-similarity. Going from multicore to one core increases similarity. Compensating with time for strength reduces the span of similarity results. Etc.

I am taking the point of view that I am not qualified to interpret the result of the test - I will leave that for others. I know that people are going to take EXTREME points of view in both directions. We have already seen "it is totally meaningless and random" which of course is totally foolish. But the other extreme is that it can be used to prove any program is a clone of another. That is equally foolish. It tends to be the same people that take extreme points of view on things without using their reasoning ability.

The pragmatic way to use this if your primary interest is in "clone testing" is to view any output as circumstantial evidence for or against and not weigh it too heavily. If you are doing an investigation, the tool might guide you but shouldn't replace good common sense. For example if the tool seems to indicate a very low similarity between 2 programs, it is probably foolish to continue an investigation unless you have something else that is pretty convincing.


Allard Siemelink · Post by **Allard Siemelink** » Fri Dec 31, 2010 2:50 pm

Don wrote:You were not doing anything wrong - the whole thing was buggy.

It is fixed now - sorry about any inconvenience.

Get it at: http://komodochess.com

Don

Thanks for the update.
I suspect there is still some kind of synchronisation/buffering issue.
When applying the tool to Spark, it is still stalling and I get erratic results.

However, when I disable all uci output (except bestmove), Spark gets 100% cpu and passes the self similarity test with flying colors:

------ spark-dev (time: 100 ms scale: 1.0) ------
99.26 spark-dev (time: 99 ms scale: 1.0)
3.79 Komodo64 1.2 JA (time: 100 ms scale: 1.0)
3.74 Komodo64 1.2 JA (time: 99 ms scale: 1.0)

When I applied the tool to Komodo, I noticed that cpu utilisation was fluttering around only 75%. And the results show a rather poor self similarity:

------ Komodo64 1.2 JA (time: 100 ms scale: 1.0) -----
70.96 Komodo64 1.2 JA (time: 99 ms scale: 1.0)
3.79 spark-dev (time: 100 ms scale: 1.0)
3.78 spark-dev (time: 99 ms scale: 1.0)

I wonder what % of self-similarity you get for Komodo?
Are you able to get it close to 99% if you disable all uci output (except bestmove) in Komodo?

Laskos · Post by **Laskos** » Fri Dec 31, 2010 2:55 pm

Allard Siemelink wrote:
------ spark-dev (time: 100 ms scale: 1.0) ------
99.26 spark-dev (time: 99 ms scale: 1.0)
3.79 Komodo64 1.2 JA (time: 100 ms scale: 1.0)
3.74 Komodo64 1.2 JA (time: 99 ms scale: 1.0)

The Komodo numbers don't look right. Even Spark-copy seems too deterministic. Something must be wrong there.

Kai

Don · Post by **Don** » Fri Dec 31, 2010 3:01 pm

Allard Siemelink wrote:
Don wrote:You were not doing anything wrong - the whole thing was buggy.

It is fixed now - sorry about any inconvenience.

Get it at: http://komodochess.com

Don
Thanks for the update.
I suspect there is still some kind of synchronisation/buffering issue.
When applying the tool to Spark, it is still stalling and I get erratic results.

However, when I disable all uci output (except bestmove), Spark gets 100% cpu and passes the self similarity test with flying colors:
------ spark-dev (time: 100 ms scale: 1.0) ------
99.26 spark-dev (time: 99 ms scale: 1.0)
3.79 Komodo64 1.2 JA (time: 100 ms scale: 1.0)
3.74 Komodo64 1.2 JA (time: 99 ms scale: 1.0)
When I applied the tool to Komodo, I noticed that cpu utilisation was fluttering around only 75%. And the results show a rather poor self similarity:
------ Komodo64 1.2 JA (time: 100 ms scale: 1.0) -----
70.96 Komodo64 1.2 JA (time: 99 ms scale: 1.0)
3.79 spark-dev (time: 100 ms scale: 1.0)
3.78 spark-dev (time: 99 ms scale: 1.0)
I wonder what % of self-similarity you get for Komodo?
Are you able to get it close to 99% if you disable all uci output (except bestmove) in Komodo?

Most modern programs are not very deterministic. I noticed that Spike is much more deterministic than other much stronger programs. It would be remarkable if you can get any program close to 99%. What kind of self similarity do you get with Spark?

I did build a version of the tester that does much better at self-similarity but I don't think what I did is quite valid - it was just an experiment. I basically tested about 20 programs 3 times and removed positions where lots of different programs failed to agree with themselves on what move to play. I tested at 3 different time controls set close to each other (95, 100 and 105). One problem with that, and there may be more, is that it may only work well when testing at that specific time control.
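A rough sketch of that kind of filtering, assuming the moves from each run are stored as one list per run with one move per position. The data layout, the function name and the cutoff parameter are all hypothetical; the experimental version of the tester may work quite differently.

```python
def stable_positions(runs_by_program, max_unstable_programs):
    """Return the indices of positions worth keeping.

    runs_by_program maps a program name to a list of runs; each run is a
    list with one move per position. A program counts as "unstable" on a
    position if its runs did not all produce the same move there.
    """
    n_positions = len(next(iter(runs_by_program.values()))[0])
    keep = []
    for pos in range(n_positions):
        unstable = sum(
            1
            for runs in runs_by_program.values()
            if len({run[pos] for run in runs}) > 1
        )
        if unstable <= max_unstable_programs:
            keep.append(pos)
    return keep


# Hypothetical data: two programs, three runs each, four positions.
runs = {
    "prog_a": [["e4", "d4", "c4", "Nf3"], ["e4", "d4", "c4", "g3"], ["e4", "d4", "c4", "Nf3"]],
    "prog_b": [["e4", "c4", "c4", "Nf3"], ["e4", "c4", "c4", "b3"], ["e4", "c4", "Nc3", "Nf3"]],
}
print(stable_positions(runs, max_unstable_programs=0))
```

Because the runs that decide which positions survive are taken at particular time controls, the surviving set is tuned to those settings, which matches the concern about it only working well at that specific time control.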

I don't understand why Spark has trouble, what the program does is pretty straightforward - but I want to investigate.

I'm re-running many programs now and I don't have data yet for Komodo - but I will let you know when I do.
