Similarity Detector Available

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Similarity Detector Available

Post by michiguel »

bob wrote:
Don wrote:
bob wrote:
Don wrote:
bob wrote:my only comment here is that this is likely going to run afoul of the "birthday paradox" frequently. Given enough programs, a new program will frequently choose the same moves as another program, "just because". The more samples, the greater the probability this will happen. Lots of false positives are not going to help a thing...
In order to have a false positive you need context. All this utility does is count how many moves (out of approx 8000) two programs play in common and return the percentage. How can that be a false positive? It will be whatever it will be for any two programs.
Simple. Someone is going to choose a number. Say 70%. If A matches B 70% of the time, it is likely a derivative.
Why are you picking numbers and talking about derivatives? The tool is not designed to determine what program is a derivative of some other program.
Because of the title of this thread. "similarity". To the typical reader, that will be interpreted as "if two programs are similar, one is likely a clone of the other" with the certainty of that statement going up as the percentage of matches goes up.

You have to think about the "general audience". I know that the numbers have little relevance to anything. But I've been doing this a long time. The typical reader here will "just assume..." and there we go...

(replace 70% by any reasonable number you want). If you take program A and compare it to B, you might get 40%. If you compare it to C, you might get 50%. If you compare it to enough programs, you will get at least one 70% or higher. From unrelated programs...
This would only have meaning if the tool was designed to determine if programs are related, but it's not. The tester will be able to determine if 2 unrelated program play a lot alike. That is not a false positive. It's only a false positive if the tester is rigged to say, "hey those programs are related!" or if someone like you comes along and tries to assign that meaning to it.

It's completely normal and expected that some pairs of unrelated programs will play more alike than others; that is what the tester is designed to test.

When you produce numbers, you have to expect _someone_ to use them to reach a conclusion. In this case, the conclusion might be right, wrong, or random.
The only conclusion this tool provides is how often two different programs play the same move. This is not a conclusion, it's a statistic. You are trying to make something out of it that it is not.
When you write something, you have to consider the audience. To someone that understands parallel programming, I can use the term "atomic lock" and it will be perfectly clear what I am talking about without any explanation. To a casual reader, it will mean something completely different. Similar to the name change from "nuclear magnetic resonance imaging" to "magnetic resonance imaging" to get rid of that "nuclear" part where everyone assumed they were being exposed to atomic radiation when they were not.

Here, the audience is primarily non-programmers. And the majority are just casual computer chess users. They read this "similarity" stuff in a different light than others will. That was my point.

I continue to get comments from people who assume a context that betrays a fundamental misunderstanding of what this tool does and how it works.

If you view this utility as a "clone tester", and you assign some arbitrary percentage value to signify that a program is a "clone", then you can have false positives. But that is not what this utility does and it's not what it's for.

For example: When I tested Robbolito and Houdini, I got a ridiculously high match rate, higher than most other pairs of programs and in many cases much higher than the match rate between two versions of the SAME chess program!

So is that a false positive? No, it's just a fact. The two programs play a lot of moves the same. It does not mean Robbolito is a clone of Houdini or a derivative or anything else; it just means they play the same move far more often than almost any other pair of programs.
All well and good. But the moment you produce numbers, you have to expect someone to take them at face value. I wouldn't consider such comparisons myself. But many will. And they will draw the wrong conclusion.
Taking it at face value is not the problem. I think what you really mean is that they will impute meaning and context that don't exist, just like you are doing.

I used this analogy earlier, but hammers are very useful objects. However every once in a while someone uses one improperly and hits someone over the head with one and kills them. This tool can be used improperly but it can be useful too.


Exactly what do you expect the numbers to show?
The numbers show how often 2 programs play the same move.

What does it mean when two programs match 70% of the time?
It means exactly what you said, they play the same move 70% of the time.
That they have the same search but different evals? Same evals but different search? A combination of both? It is pretty much meaningless.
I don't think it's meaningless, I have learned a LOT just from playing with it. But several people are doing their own research to learn more about this. On the OpenChess site BB has built a similar tool and is studying several aspects of it. I have also learned a lot about it and here is what I have found:

The program is uncanny in its ability to identify different versions of the same program, even when the program has evolved substantially. The closest matches to Stockfish 1.9 are SF 1.8, SF 1.7, SF 1.6 and SF 1.5. This represents significant changes and ELO gains. It's this way for EVERY program I have tested that has multiple versions. Those versions tend to be the closest matches.
And unfortunately, that is bad. Because if two versions match 90%, and then two different programs match 90%, one might naturally conclude the obvious. Without thinking about the birthday paradox. General audiences interpret things differently than a technical group.


The ELO ratings of the programs in question have very little impact on similarity scores. For example, if you run the test 10x longer for program X, the tool is not fooled into thinking it's a different program or that it is much more like a stronger program.

I can change the search of Komodo and the test is not fooled. For example LMR can be turned off and the tester is almost oblivious, although this single change is a major search change.
That might be more troubling to me. Depending on what time control you are using. Very fast might not be so sensitive to reduced pruning. But if you choose the same moves with and without, that does say something....


My tentative conclusion (and I'm still studying it) is that search does not have much to do with it. I think what makes each program play the way it does is more about the evaluation function than anything else by far. Every test I have done bears that out.


Perhaps a good way to compute some random numbers for a Zobrist hashing scheme... but there are less expensive ways to do that.
If you look at some of the results that the test is returning, you probably wouldn't see any humor in that as it implies that programs all play random moves and are not consistent about what they think is important.
Did not say that at all. I simply said that a single program will match different programs with wildly different percentages, given enough samples...

My intent for the tool was as a diagnostic aid and a tool to examine the playing styles of programs. It returns some result and it's up to you to figure out what it means or doesn't mean and to use good sense and judgement, an increasingly rare commodity these days.

I actually got the idea for this from YOU and John Stanback. I was at a tournament where a version of Crafty was claimed to be heavily modified in the evaluation and was allowed in the tournament. However this program was doing unusually well and Vincent suspected something and you were contacted and consulted. From what I was told, you checked the moves of the game against Crafty and felt too many were the same.

John Stanback in another tournament noticed the same thing simply by watching the tournament games on line and comparing the moves to his own program.
For an isolated data point, that is a good place to start. But to compare a suspected clone against a huge suite of others? Again, false positives. Too many samples.
Why do you keep talking about clones and false positives? That has nothing to do with the tool.

But even if the tool WERE to be used improperly to test clones, I need to point something out to you. The birthday paradox is about the odds that ANY two people in a small population share a birthday. The odds that YOU share a birthday with someone else in that small room are FAR lower. In the context of using my tool (improperly) as a kind of "clone test", you are not checking every program ever written to see if any two match; you are interested in just one program, your own. For example, if you suspect that Crafty is being cloned, you would test Crafty against the suspicious program along with several other control programs. If the suspected clone were by far the most strongly correlated program, you would use this as circumstantial evidence to investigate further. You would NOT test every combination of 2 programs.
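A minimal sketch of the control-group protocol Don describes, in Python; every engine name and percentage below is hypothetical, made up for illustration rather than taken from the tool's output:

Code: Select all

# Hypothetical match percentages: the suspicious engine compared against
# the program you care about (Crafty) plus several unrelated controls.
match_rates = {
    "Crafty": 78.4,     # stands far above the controls -> investigate
    "Fruit": 55.1,
    "Glaurung": 53.7,
    "Stockfish": 57.9,
    "Zappa": 49.2,
}

# Rank them: a single outlier is circumstantial evidence, not proof,
# and no all-pairs sweep (birthday-paradox territory) is involved.
for name, pct in sorted(match_rates.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {pct:5.1f}%")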
The paradox still applies in two ways. First, for a big population it doesn't take many samples before a chance match becomes likely. Simple enough. But if you compare a single program against a group of others, then as the size of that group increases, the probability of a much better (or much worse) match goes up as well, because for each comparison you are looking at a single sample from a larger population...
That paradox does not apply because of the sheer size of the landscape to be explored. In that "paradox", we evaluate the probability of a match, which is as high as 1/365. Here, the probabilities of a "match" are waaaay smaller. Imagine you have two reasonable moves per position, and you have 1000 positions. What is the probability of a 70% match by random chance? Negligible.

That would be equivalent to having a gene of 500 bases (4 options per base). A gene of that size is enough to trace an organism and find (at least!) its taxonomic family, and there are many more species on planet Earth than chess engines.

This technique is not devoid of weaknesses, but the statistics are not one of them. If you are not happy with 1000 positions, you can increase it to 10,000, etc.

Miguel
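Miguel's back-of-the-envelope claim is easy to check with an exact binomial tail. Below is a minimal sketch, assuming his coin-flip model of two equally reasonable moves per position (the model is an assumption of the argument, not a property of the tool):

Code: Select all

from math import comb

def tail_prob(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more
    move agreements out of n positions under the coin-flip model."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability of a 70% match (700+ agreements out of 1000 positions)
# arising by pure chance: on the order of 1e-36, i.e. negligible.
print(tail_prob(1000, 700))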

I mentioned the paradox because the original implication was to play a match between a large number of programs and determine who matches whom best. That is a direct birthday paradox issue.

For some reason, whenever the birthday paradox comes up, even smart people get confused about it. I guess that's why it's called a "paradox."

I think every good chess player who gets really familiar with chess programs agrees that each program has its own individual personality. Of course that can only be revealed through the moves it makes.

I understand what you are saying about the birthday paradox and agree, I just think it's not relevant without assuming the context of "clone testing." However, if you tested 1000 unique programs by different authors who did not share ideas, etc., you would surely find 2 programs that played very similar chess. The fact that they might play very similarly is not a paradox or a lie, it's just how it is.
The danger is, as I said, that some will take these numbers to be something like a correlation coefficient, with some threshold beyond which clone is proven...
That sounds pretty silly to me. There is this idea floating around that computers are someday going to become self-aware, take over the world and make us their slaves. I think the scenario you are talking about is more paranoid than realistic.
I think it is quite a natural assumption for someone who is not directly involved in computer chess, which covers the majority of the people here...
Hood
Posts: 659
Joined: Mon Feb 08, 2010 12:52 pm
Location: Polska, Warszawa

Re: clone tester available

Post by Hood »

Don wrote: First of all, this tool is not a clone tester.
Then what is the purpose of that program? :-) You could have expected how it would be used.
You are getting really annoying because you keep trying to argue against a point nobody is trying to make.
If I am arguing against a point no one made, I am arguing with no one. :-)

There is no reason for the annoyance.

My method cannot even show that. You are not reading anything I am writing, are you?
Sometimes we have to read between the lines. If you are reading what I am writing, I read what you wrote.
I do however have my own theories. I believe that it cannot tell you much about the search of the program, but it can tell you a lot about the evaluation function and playing style of a program.
I am not sure you can split eval from search in this case.

Two different programs can evaluate the move the same based on different principal future positions.
The evaluation functions see different future main positions because of different search algorithms.

So two different evals and two different searches could give the same evaluation of the current move.

Rgds
Hood
Martin Thoresen
Posts: 1833
Joined: Thu Jun 22, 2006 12:07 am

Re: Similarity Detector Available

Post by Martin Thoresen »

Don wrote: I'll see if I can figure out what is happening with Spark. I think the value 70% is approximately correct for most engines. 99% is not correct.

If I can figure out what the problem is, I will see what I can do to fix it and warn the user of any issues.

P.S.
Does your program have an unusually long startup time between moves?
Not sure it's very related, but I remember that running Jonny in ChessGUI with blitz time controls (typically 2+1) was almost impossible since the engine had only about 1% CPU utilization. If I increased the time control, like the one officially used in TCEC, there was no problem. It definitely seemed as if Jonny was a "slow starter"; after some 10 seconds the speed and CPU utilization were correct.

Perhaps this is the same kind of "issue" you are seeing with Spark.
Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: Similarity Detector Available

Post by Allard Siemelink »

Don wrote:Please don't bother, there is indeed a bug, I'm not actually SENDING the options to the engine! I'm going to release a version 03 later, but first I want to look into the issue with Spark.
Don
Back again, and got some time to spend....
Great that you want to look into Spark's issue. To summarize, here are my current thoughts on this:

I. Spark's low correlation with other engines.
I suspect it is related to Spark, not to the similarity utility.
I'll do some more analysis on Spark's similarity data file, as a missing or duplicate move would explain the results.

II. Spark's high self correlation.
My working hypothesis is that this is related to some form of communication synchronisation/overhead between Spark and the similarity utility.
I only get the 99% self correlation if I turn off all UCI output (except bestmove) and run on 1 CPU only; otherwise it is much lower.
For deterministic engines, I still feel that 70% self correlation would be quite low. I would like to see whether the self correlation numbers go up for other single-CPU engines with PVs and info messages disabled, just bestmove. This should tell us whether the synchronisation/overhead hypothesis is correct. You could try the experiment with Komodo, and I suppose I could try it myself with some open source engine, e.g. Stockfish.

I'll let you know if I find anything...
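One way to run the experiment Allard proposes (same engine, same positions, fixed depth, run twice) is sketched below with the python-chess library. The engine path, position list, and depth are placeholders, and this is a sketch of the experiment, not the similarity utility itself:

Code: Select all

import chess
import chess.engine

def self_match(engine_path, fens, depth=10):
    """Run one engine over the same positions twice at a fixed depth
    and return the fraction of positions where it repeats its move.
    Hash is not cleared between the two searches, matching the
    'I do not clear hash between moves' setup discussed here."""
    agree = 0
    with chess.engine.SimpleEngine.popen_uci(engine_path) as eng:
        for fen in fens:
            board = chess.Board(fen)
            limit = chess.engine.Limit(depth=depth)
            first = eng.play(board, limit).move
            second = eng.play(board, limit).move
            agree += first == second
    return agree / len(fens)

# A truly deterministic single-CPU engine should land near 100%;
# a much lower figure points at timing/synchronisation effects.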
Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: Similarity Detector Available

Post by Allard Siemelink »

Don wrote: P.S.
Does your program have an unusually long startup time between moves?
Good suggestion. I shouldn't think so (I do not clear hash between moves), but I will check if this is the case.
Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: Similarity Detector Available

Post by Allard Siemelink »

Allard Siemelink wrote:
Don wrote: P.S.
Does your program have an unusually long startup time between moves?
Good suggestion. I shouldn't think so (I do not clear hash between moves), but I will check if this is the case.
That does not seem to be it; after receiving the UCI "go" command, the first move is found quickly.
(the numbers in the log are milliseconds from GetTickCount())
go depth 50
03590718: uci received: go depth 50
info depth 1 currmovenumber 1 currmove a2a4 time 0 nodes 0 hashfull 0
03590718: pv changed
info depth 1 seldepth 0 time 0 nodes 1 hashfull 0 score cp -9 pv a2a4
03590718: pv changed
info depth 1 seldepth 0 time 0 nodes 3 hashfull 0 score cp 10 pv b2b4
03590718: pv changed
info depth 1 seldepth 0 time 0 nodes 6 hashfull 0 score cp 43 pv d2d4
03590718: pv changed
info depth 1 seldepth 0 time 0 nodes 21 hashfull 0 score cp 46 pv b1c3
info depth 2 currmovenumber 1 currmove b1c3 time 0 nodes 23 hashfull 0
03590718: pv changed
info depth 2 seldepth 1 time 0 nodes 47 hashfull 0 score cp -8 pv b1c3 d7d5
03590718: pv changed
info depth 2 seldepth 1 time 16 nodes 90 hashfull 0 score cp 1 pv d2d4 g8f6
03590734: pv changed
info depth 2 seldepth 1 time 16 nodes 165 hashfull 0 score cp 3 pv e2e4 b8c6
03590734: pv changed
info depth 2 seldepth 1 time 16 nodes 234 hashfull 0 score cp 5 pv g1f3 b8c6
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Similarity Detector Available

Post by michiguel »

Allard Siemelink wrote:
Don wrote:Please don't bother, there is indeed a bug, I'm not actually SENDING the options to the engine! I'm going to release a version 03 later, but first I want to look into the issue with Spark.
Don
Back again, and got some time to spend....
Great that you want to look into Spark's issue. To summarize, here are my current thoughts on this:

I. Spark's low correlation with other engines.
I suspect it is related to Spark, not to the similarity utility.
That is not the case for Spark 0.3a. Michael Hart tested it months ago with a different set of positions; Spark looked different from the other engines, but at around the ~60% level. He tested at longer time controls, though.

This is a "dissimilarity matrix" (1 - similarity):

Code: Select all

bright_04a  0.000 0.411 0.404 0.390 0.407 0.414 0.401 0.400 0.409 0.404 0.402 0.408 0.396 0.399 0.410 0.403 0.399 0.403 0.423 0.402
doch64_1.3  0.411 0.000 0.355 0.400 0.407 0.262 0.374 0.372 0.397 0.356 0.371 0.351 0.381 0.390 0.385 0.383 0.358 0.404 0.440 0.403
firebird_1  0.404 0.355 0.000 0.397 0.383 0.343 0.348 0.336 0.385 0.196 0.339 0.283 0.376 0.362 0.348 0.375 0.330 0.396 0.431 0.386
fruit_21    0.390 0.400 0.397 0.000 0.372 0.405 0.367 0.363 0.362 0.395 0.367 0.400 0.389 0.376 0.392 0.352 0.372 0.299 0.391 0.400
glaurung_2  0.407 0.407 0.383 0.372 0.000 0.404 0.388 0.388 0.387 0.380 0.384 0.397 0.388 0.294 0.326 0.388 0.393 0.368 0.412 0.414
komodo_1.0  0.414 0.262 0.343 0.405 0.404 0.000 0.366 0.358 0.391 0.344 0.360 0.349 0.390 0.386 0.386 0.380 0.361 0.402 0.446 0.396
naum_4.0    0.401 0.374 0.348 0.367 0.388 0.366 0.000 0.198 0.392 0.345 0.255 0.344 0.376 0.378 0.384 0.295 0.358 0.372 0.428 0.395
naum_4.1    0.400 0.372 0.336 0.363 0.388 0.358 0.198 0.000 0.392 0.336 0.245 0.333 0.370 0.371 0.378 0.292 0.360 0.375 0.424 0.386
protector_  0.409 0.397 0.385 0.362 0.387 0.391 0.392 0.392 0.000 0.379 0.392 0.389 0.390 0.381 0.374 0.398 0.389 0.357 0.415 0.396
robbolito_  0.404 0.356 0.196 0.395 0.380 0.344 0.345 0.336 0.379 0.000 0.333 0.282 0.374 0.364 0.353 0.371 0.324 0.391 0.427 0.381
rybka_22n2  0.402 0.371 0.339 0.367 0.384 0.360 0.255 0.245 0.392 0.333 0.000 0.341 0.377 0.373 0.373 0.277 0.358 0.368 0.428 0.389
rybka_3     0.408 0.351 0.283 0.400 0.397 0.349 0.344 0.333 0.389 0.282 0.341 0.000 0.383 0.382 0.375 0.372 0.289 0.406 0.430 0.387
spark_0.3a  0.396 0.381 0.376 0.389 0.388 0.390 0.376 0.370 0.390 0.374 0.377 0.383 0.000 0.351 0.366 0.383 0.377 0.387 0.418 0.389
sf__1.4     0.399 0.390 0.362 0.376 0.294 0.386 0.378 0.371 0.381 0.364 0.373 0.382 0.351 0.000 0.296 0.379 0.370 0.384 0.413 0.394
sf__1.6     0.410 0.385 0.348 0.392 0.326 0.386 0.384 0.378 0.374 0.353 0.373 0.375 0.366 0.296 0.000 0.385 0.372 0.385 0.423 0.398
strelka_2   0.403 0.383 0.375 0.352 0.388 0.380 0.295 0.292 0.398 0.371 0.277 0.372 0.383 0.379 0.385 0.000 0.338 0.374 0.420 0.396
strelka_3r  0.399 0.358 0.330 0.372 0.393 0.361 0.358 0.360 0.389 0.324 0.358 0.289 0.377 0.370 0.372 0.338 0.000 0.389 0.411 0.382
toga_1.3.1  0.403 0.404 0.396 0.299 0.368 0.402 0.372 0.375 0.357 0.391 0.368 0.406 0.387 0.384 0.385 0.374 0.389 0.000 0.404 0.406
zappa_1.1   0.423 0.440 0.431 0.391 0.412 0.446 0.428 0.424 0.415 0.427 0.428 0.430 0.418 0.413 0.423 0.420 0.411 0.404 0.000 0.380
zappa_mexi  0.402 0.403 0.386 0.400 0.414 0.396 0.395 0.386 0.396 0.381 0.389 0.387 0.389 0.394 0.398 0.396 0.382 0.406 0.380 0.000
And the results were plotted and reported here:

http://www.talkchess.com/forum/viewtopi ... =&start=83

Miguel

I'll do some more analysis on Spark's similarity data file, as a missing or duplicate move would explain the results.

II. Spark's high self correlation.
My working hypothesis is that this is related to some form of communication synchronisation/overhead between Spark and the similarity utility.
I only get the 99% self correlation if I turn off all UCI output (except bestmove) and run on 1 CPU only; otherwise it is much lower.
For deterministic engines, I still feel that 70% self correlation would be quite low. I would like to see whether the self correlation numbers go up for other single-CPU engines with PVs and info messages disabled, just bestmove. This should tell us whether the synchronisation/overhead hypothesis is correct. You could try the experiment with Komodo, and I suppose I could try it myself with some open source engine, e.g. Stockfish.

I'll let you know if I find anything...
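As an aside on the matrix above: one plausible way a (1 - similarity) matrix like this gets clustered and plotted, sketched with scipy and matplotlib. The file name is an assumption, and this is not necessarily the pipeline behind the linked plot:

Code: Select all

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# First token of each row is the engine name, the rest are distances.
with open("dissimilarity.txt") as f:
    rows = [line.split() for line in f if line.strip()]
labels = [r[0] for r in rows]
D = np.array([[float(x) for x in r[1:]] for r in rows])

# squareform() condenses a symmetric, zero-diagonal matrix into the
# form linkage() expects; average linkage then builds the tree.
Z = linkage(squareform(D), method="average")
dendrogram(Z, labels=labels, orientation="right")
plt.show()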
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Similarity Detector Available

Post by bob »

My point was that with a reasonable number of programs, there is a good chance that some pair of programs will agree significantly. Chess is not a random game. Many positions will have a single best move anyway. The rest will have a small set of "best moves". The more programs you test, the greater the chance you will get a significant match between a pair of them.

Programs seem to differ in only a few key areas. Tactics, or pawn structure, or king safety. That doesn't offer a lot of different behaviours. Yes, the test might be made better with a lot more positions, but not positions chosen randomly. Positions that tend to each require a different piece of the puzzle. Deep tactics. Speculative sacrifices. Pawn structure. King safety. Piece coordination. Material imbalances. Etc. Random positions test general chess knowledge, where all programs are more than "just exceptional".
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Similarity Detector Available

Post by Don »

bob wrote:My point was that with a reasonable number of programs, there is a good chance that some pair of programs will agree significantly. Chess is not a random game. Many positions will have a single best move anyway. The rest will have a small set of "best moves". The more programs you test, the greater the chance you will get a significant match between a pair of them.
You act as if this is some kind of profound insight. Of COURSE you can find pairs of programs that have more similar scores, and the more you run, the more close pairs you will find. If this were not the case the tester would give all pairs of programs the same score! duhhh!

So you don't need to bring this up again because I never disagreed with you on this point.

Programs seem to differ in only a few key areas. Tactics, or pawn structure, or king safety. That doesn't offer a lot of different behaviours. Yes, the test might be made better with a lot more positions, but not positions chosen randomly. Positions that tend to each require a different piece of the puzzle. Deep tactics. Speculative sacrifices. Pawn structure. King safety. Piece coordination. Material imbalances. Etc. Random positions test general chess knowledge, where all programs are more than "just exceptional".
The test should not measure chess knowledge; I think you still completely misunderstand what it does and how it works. If it were designed to test chess skill then it would not be a similarity tester, it would measure chess strength. All strong programs would look like they play the same. I'm trying to get at positions where the moves are determined by personal preference, not by "choose the best move."

The positions I would like to remove are the ones where all chess programs would eventually play the same move if given enough thinking time. Ideally I want only positions where more than one good move is in play, with none obviously better than any other. Otherwise, it's just a performance test. The test you propose could be used to assign crude Elo ratings or something, but that is not a stylistic thing.
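A sketch of one way to implement the filter Don describes (keep only positions whose two best moves score within a small margin), using the python-chess engine API. The margin, depth, and helper name are illustrative placeholders; the actual tool's position selection may differ:

Code: Select all

import chess
import chess.engine

MARGIN_CP = 15  # "more than one good move": top-2 scores within 15 centipawns

def is_stylistic(board, eng, depth=14):
    """True if the engine's top two MultiPV moves score nearly the same,
    i.e. the choice between them is preference rather than necessity."""
    infos = eng.analyse(board, chess.engine.Limit(depth=depth), multipv=2)
    if len(infos) < 2:
        return False  # only one legal move: useless for a style test
    s1 = infos[0]["score"].relative.score(mate_score=100000)
    s2 = infos[1]["score"].relative.score(mate_score=100000)
    return abs(s1 - s2) <= MARGIN_CP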
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Similarity Detector Available

Post by Don »

Version 03 of the tester is now on the site. It fixes several bugs, adds a matrix view of all programs, and sorts the index of players.

Also, A J Siemelink figured out why Spark was not testing properly. It has to do with UCI issues where Spark has a minor bug - but the tester now works around that and thus will be more robust for programs with similar UCI bugs.

Don wrote: I created a utility called similar which measures how different one chess program is from others. It does this by running 2000 positions from random games, noting how often the moves agree, and returning as output the percentage of moves that match.

You can get it here: http://komodochess.com/pub/similar.zip
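For readers who just want the core idea without downloading the zip: the measurement itself reduces to a few lines. A minimal sketch, assuming each engine's chosen moves have already been collected into parallel lists, one move per test position (this mirrors the description above, not the utility's actual code):

Code: Select all

def similarity(moves_a, moves_b):
    """Percentage of test positions on which two engines chose the
    same move; moves_a and moves_b are parallel per-position lists."""
    assert moves_a and len(moves_a) == len(moves_b)
    matches = sum(a == b for a, b in zip(moves_a, moves_b))
    return 100.0 * matches / len(moves_a)

# similarity(["e2e4", "g1f3"], ["e2e4", "d2d4"]) -> 50.0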