Similarity tests

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, chrisw, Rebel

Sedat Canbaz
Posts: 3018
Joined: Thu Mar 09, 2006 11:58 am
Location: Antalya/Turkey

Re: Similarity tests

Post by Sedat Canbaz »

Btw,

Once more I'd like to point out that SCCT's current rule (55% + 100 Elo) is not perfect,
but it is probably the best one compared with the other available rating lists!

And soon I am going to open a new main thread:
- Regarding how to be an independent and professional tester!



sim version 3
------ Heron impossible 231113 X64 Normal mode (time: 100 ms scale: 1.0) ------
60.86 Stockfish 2.1 JA 64bit (time: 100 ms scale: 1.0)
60.39 Stockfish 2.1 JA 64bit (time: 50 ms scale: 1.0)
58.27 Stockfish 140614 64 SSE4.2 (time: 100 ms scale: 1.0)
58.06 Stockfish 1.7.1 JA (time: 100 ms scale: 1.0)
57.88 Stockfish 1.7 JA 64bit (time: 100 ms scale: 1.0)
55.85 Stockfish 1.5 JA 64bit (time: 100 ms scale: 1.0)
55.30 Protector 1.7.0 (time: 100 ms scale: 1.0)

53.63 Critter 0.90 64-bit SSE4 (time: 100 ms scale: 1.0)
53.24 Glaurung 2.2 JA (time: 100 ms scale: 1.0)
53.00 Houdini 4 x64 (time: 100 ms scale: 1.0)
52.80 Rybka 3 (time: 100 ms scale: 1.0)
52.55 Critter 1.6a 64-bit (time: 100 ms scale: 1.0)
52.14 Murka 3 x64 UCI (time: 100 ms scale: 1.0)
52.11 Senpai 1.0 (time: 100 ms scale: 1.0)
52.10 Fire 3.0 x64 (time: 100 ms scale: 1.0)
51.89 RobboLito 0.085g3 w32 (time: 100 ms scale: 1.0)
51.86 Equinox 3.20 x64mp (time: 100 ms scale: 1.0)
51.53 Elektro 1.0 (time: 100 ms scale: 1.0)
51.29 Gull 3 x64 (time: 100 ms scale: 1.0)
50.87 TwinFish 0.07 (time: 100 ms scale: 1.0)
50.78 Naum 4.6 (time: 100 ms scale: 1.0)
50.36 BlackMamba 2.0 x64 (time: 100 ms scale: 1.0)
50.32 Komodo64 2.03 DC (time: 100 ms scale: 1.0)
49.95 Komodo 8 64-bit (time: 100 ms scale: 1.0)
49.61 Octochess revision 5190 (time: 100 ms scale: 1.0)
49.43 spark-1.0 (time: 100 ms scale: 1.0)
49.34 MinkoChess 1.3 x64 (time: 100 ms scale: 1.0)
49.17 Rybka 1.0 Beta (time: 100 ms scale: 1.0)
48.41 Deep Shredder 12 x64 (time: 100 ms scale: 1.0)
48.40 Toga II 3.0 (time: 100 ms scale: 1.0)
48.20 Chiron 2 64bit (time: 100 ms scale: 1.0)
47.44 Rodent 1.4 (build 2) (time: 100 ms scale: 1.0)
47.23 Crafty 23.8 x64 (time: 100 ms scale: 1.0)
47.20 Strelka 2.0 B (time: 50 ms scale: 1.0)
47.06 Spike 1.4 (time: 100 ms scale: 1.0)
46.88 Fruit 090705 Test Beta (time: 100 ms scale: 1.0)
46.20 cheng4 0.36a (time: 100 ms scale: 1.0)
46.19 Bobcat 3.25 (time: 100 ms scale: 1.0)
45.95 Tornado 5.0 x64 SSE4 (time: 100 ms scale: 1.0)
45.91 Fruit 2.1 (time: 100 ms scale: 1.0)
44.71 Daydreamer 1.75 JA (time: 100 ms scale: 1.0)
44.70 Cyrano 0.6b17 (time: 100 ms scale: 1.0)
44.19 EXchess v7.31b x64 (time: 100 ms scale: 1.0)
44.10 Vajolet 2.48 (time: 100 ms scale: 1.0)
37.74 Igorrit 0.086v8_x64 (time: 100 ms scale: 1.0)
37.30 Booot 5.2.0(64) (time: 100 ms scale: 1.0)
26.07 Arasan 17.2 (time: 100 ms scale: 1.0)
Uri Blass
Posts: 10547
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Similarity tests

Post by Uri Blass »

I wonder what we are going to get if we use only the 55% rule, without the +100 Elo condition.

The way to do it is the following steps:
1) Take all the engines, ordered by release date.

Consider different versions of the same program as different engines, so Rybka 1.0 Beta, Rybka 1.1 and Rybka 1.2 are all separate entries at this stage.
I guess you may have many thousands of engines after completing step 1.

2) The first engine is of course considered to be an original engine.

3) Test every engine in the list, in order of release date, to see if it has more than 55% similarity to any engine that some author released earlier.

If it does not have more than 55% similarity, it can stay in the list.
If it has more than 55% similarity to a previous engine, then you need to drop it from the list, or drop the earlier version from the list in case both versions have the same author. For example, if you find that EngineX2 has more than 55% similarity only to EngineX1 by the same author, then you drop EngineX1 from the list; but if you find that EngineX2 has more than 55% similarity to EngineY by a different author, then you drop EngineX2 from the list, while you can keep both X1 and Y, assuming the similarity between them is below 55%.
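The filtering procedure described above can be sketched in code. This is only an illustration: the `similarity` function is a hypothetical stand-in for whatever the sim tool would report for a pair of engines, and the engine records are invented examples.

```python
def filter_original_engines(engines, similarity, threshold=55.0):
    """Keep only 'original' engines, following the procedure above.

    engines: list of (name, author) tuples, ordered by release date.
    similarity: hypothetical callable (name_a, name_b) -> percent of
        identical move choices, as a tool such as sim would report.
    """
    kept = []  # engines currently considered original
    for name, author in engines:
        # Earlier kept engines this candidate is too similar to.
        matches = [(n, a) for n, a in kept if similarity(name, n) > threshold]
        if not matches:
            kept.append((name, author))  # original enough: keep it
        elif all(a == author for _, a in matches):
            # Only matches the same author's earlier versions:
            # drop those earlier versions and keep the newer one.
            kept = [e for e in kept if e not in matches]
            kept.append((name, author))
        # Otherwise it matches another author's engine: drop the newcomer.
    return kept
```

With a stub similarity function where only EngineX1/EngineX2 score above 55%, the list [EngineY, EngineX1, EngineX2] would reduce to [EngineY, EngineX2], exactly as the example in the post describes.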
Uri Blass
Posts: 10547
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Similarity tests

Post by Uri Blass »

Note that I am surprised by the low numbers for Arasan, because I would guess that every pair of engines should score at least 30%, assuming we are not talking about weak engines. The question is whether there is a bug in Arasan 17.2 that causes this result (maybe in some of the cases it does not print the move that it is really going to play after the fixed time).
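For context, the similarity score being discussed is essentially the percentage of test positions on which two engines choose the same move at a fixed time per position. A minimal sketch, with the move lists as hypothetical placeholders for the tool's recorded output:

```python
def similarity_percent(moves_a, moves_b):
    """Percentage of positions where both engines chose the same move.

    moves_a, moves_b: moves each engine played on the same ordered
    set of test positions (e.g. at 100 ms per position).
    """
    if len(moves_a) != len(moves_b):
        raise ValueError("engines must be run on the same positions")
    same = sum(1 for ma, mb in zip(moves_a, moves_b) if ma == mb)
    return 100.0 * same / len(moves_a)
```

For example, two engines agreeing on 3 of 4 positions would score 75.0. This also shows why a timing or output bug matters so much: any position where one engine reports a different move than it actually searched drags the score down artificially.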
Frank Quisinsky
Posts: 6834
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: Similarity tests

Post by Frank Quisinsky »

STOP it, Sedat!

At the end of the day you will lose your interest!
Believe me...
I did the same for a while,
and today my enthusiasm is only at 40-50% again.

Do it for yourself; that means: STOP running this test.
Sedat Canbaz
Posts: 3018
Joined: Thu Mar 09, 2006 11:58 am
Location: Antalya/Turkey

Re: Similarity tests

Post by Sedat Canbaz »

Dear Uri,

Don't worry... your engine (Movei) is below 55% ;)
But with Pia's method it is expected to be above.
However, I think time: 100 ms, scale: 1.0 is better...

sim version 3
------ Movei00_8_438 (time: 100 ms scale: 1.0) ------
52.71 Fruit 2.1 (time: 100 ms scale: 1.0)
50.32 Cyrano 0.6b17 (time: 100 ms scale: 1.0)
49.72 Bobcat 3.25 (time: 100 ms scale: 1.0)
49.32 Strelka 2.0 B (time: 50 ms scale: 1.0)
47.69 Daydreamer 1.75 JA (time: 100 ms scale: 1.0)
47.67 Toga II 3.0 (time: 100 ms scale: 1.0)
47.66 Rybka 1.0 Beta (time: 100 ms scale: 1.0)
47.43 Naum 4.6 (time: 100 ms scale: 1.0)
47.39 Crafty 23.8 x64 (time: 100 ms scale: 1.0)
46.65 Stockfish 1.5 JA 64bit (time: 100 ms scale: 1.0)
46.52 cheng4 0.36a (time: 100 ms scale: 1.0)
46.03 Octochess revision 5190 (time: 100 ms scale: 1.0)
45.58 MinkoChess 1.3 x64 (time: 100 ms scale: 1.0)
45.51 Protector 1.7.0 (time: 100 ms scale: 1.0)
45.45 Glaurung 2.2 JA (time: 100 ms scale: 1.0)
45.39 Vajolet 2.48 (time: 100 ms scale: 1.0)
45.21 Murka 3 x64 UCI (time: 100 ms scale: 1.0)
44.95 Tornado 5.0 x64 SSE4 (time: 100 ms scale: 1.0)
44.79 Komodo64 2.03 DC (time: 100 ms scale: 1.0)
44.53 Deep Shredder 12 x64 (time: 100 ms scale: 1.0)
44.33 RobboLito 0.085g3 w32 (time: 100 ms scale: 1.0)
44.26 Rybka 3 (time: 100 ms scale: 1.0)
44.21 Senpai 1.0 (time: 100 ms scale: 1.0)
44.15 Elektro 1.0 (time: 100 ms scale: 1.0)
44.04 Stockfish 1.7.1 JA (time: 100 ms scale: 1.0)
43.91 BlackMamba 2.0 x64 (time: 100 ms scale: 1.0)
43.72 Spike 1.4 (time: 100 ms scale: 1.0)
43.68 EXchess v7.31b x64 (time: 100 ms scale: 1.0)
43.58 Critter 0.90 64-bit SSE4 (time: 100 ms scale: 1.0)
43.55 spark-1.0 (time: 100 ms scale: 1.0)
43.41 Fruit 090705 Test Beta (time: 100 ms scale: 1.0)
43.25 Stockfish 2.1 JA 64bit (time: 50 ms scale: 1.0)
43.14 Rodent 1.4 (build 2) (time: 100 ms scale: 1.0)
43.06 Heron impossible 231113 X64 Normal mode (time: 100 ms scale: 1.0)
43.04 Stockfish 1.7 JA 64bit (time: 100 ms scale: 1.0)
42.70 Stockfish 2.1 JA 64bit (time: 100 ms scale: 1.0)
42.39 Equinox 3.20 x64mp (time: 100 ms scale: 1.0)
42.36 Chiron 2 64bit (time: 100 ms scale: 1.0)
42.00 Critter 1.6a 64-bit (time: 100 ms scale: 1.0)
41.67 Fire 3.0 x64 (time: 100 ms scale: 1.0)
41.60 TwinFish 0.07 (time: 100 ms scale: 1.0)
41.55 Igorrit 0.086v8_x64 (time: 100 ms scale: 1.0)
41.36 Stockfish 140614 64 SSE4.2 (time: 100 ms scale: 1.0)
41.26 Komodo 8 64-bit (time: 100 ms scale: 1.0)
40.56 Houdini 4 x64 (time: 100 ms scale: 1.0)
40.56 Gull 3 x64 (time: 100 ms scale: 1.0)
40.29 Booot 5.2.0(64) (time: 100 ms scale: 1.0)
29.41 Arasan 17.2 (time: 100 ms scale: 1.0)
Last edited by Sedat Canbaz on Mon Oct 06, 2014 2:55 pm, edited 3 times in total.
Sedat Canbaz
Posts: 3018
Joined: Thu Mar 09, 2006 11:58 am
Location: Antalya/Turkey

Re: Similarity tests

Post by Sedat Canbaz »

Frank Quisinsky wrote:STOP it, Sedat!

At the end of the day you will lose your interest!
Believe me...
I did the same for a while,
and today my enthusiasm is only at 40-50% again.

Do it for yourself; that means: STOP running this test.

NEVER, NEVER, NEVER, dear Frank!
Modern Times
Posts: 3608
Joined: Thu Jun 07, 2012 11:02 pm

Re: Similarity tests

Post by Modern Times »

Sedat Canbaz wrote:Btw,

Once more I'd like to point out that SCCT's current rule (55% + 100 Elo) is not perfect,
but it is probably the best one compared with the other available rating lists!

Given the level of disagreement (and this thread is a prime example), my approach is to test anything and everything that I want to test. I'm not comfortable making a judgement myself on what might be a derivative or clone, and I can't rely on others, because there is a lot of disagreement. It is a no-win situation: you make a judgement one way and people disagree; you make it the other way, and someone else disagrees. So I avoid the issue altogether and have no restrictions.

Having said that, if you want to impose some restrictions, then you need some sort of objective (not subjective) criteria, and Sedat's is as good as any.
Guenther
Posts: 4718
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: Similarity tests

Post by Guenther »

Uri Blass wrote:Note that I am surprised by the low numbers for Arasan, because I would guess that every pair of engines should score at least 30%, assuming we are not talking about weak engines. The question is whether there is a bug in Arasan 17.2 that causes this result (maybe in some of the cases it does not print the move that it is really going to play after the fixed time).
This is no surprise, because this is not the way simtest should be used.
If you want to test a wider field with programs of different strength, it is suggested to scale the test time by a certain formula.
Actually, the 100 ms / scale 1.0 setting is only a default value for the lazy ones and can give wrong results.
(Sedat should calculate it by the given formula.)

Guenther
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Similarity tests

Post by Adam Hair »

Modern Times wrote:
Sedat Canbaz wrote:Btw,

Once more I'd like to point out that SCCT's current rule (55% + 100 Elo) is not perfect,
but it is probably the best one compared with the other available rating lists!

Given the level of disagreement (and this thread is a prime example), my approach is to test anything and everything that I want to test. I'm not comfortable making a judgement myself on what might be a derivative or clone, and I can't rely on others, because there is a lot of disagreement. It is a no-win situation: you make a judgement one way and people disagree; you make it the other way, and someone else disagrees. So I avoid the issue altogether and have no restrictions.

Having said that, if you want to impose some restrictions, then you need some sort of objective (not subjective) criteria, and Sedat's is as good as any.
There are several problems with Sedat's method:

1) There are engines that do not respond correctly to the similarity tool.
2) The similarity tool only sends UCI commands. In my experience, it can be difficult to properly test WB engines.
3) There is a positive correlation between engine strength and similarity scores. Since there is also a positive correlation between engine strength and processor speed, different people will have different similarity measurements.
4) Whatever threshold is used should have some statistical analysis to back it up. I do not suggest using my numbers, because they are relative to the computer I used.
Modern Times
Posts: 3608
Joined: Thu Jun 07, 2012 11:02 pm

Re: Similarity tests

Post by Modern Times »

Yes, which takes us back to:

Everyone can do what they are comfortable and happy doing.