Similarity Detector Available

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Similarity Detector Available

Post by bob »

Don wrote:
bob wrote:My point was that with a reasonable number of programs, there is a good chance that some pair of programs will agree significantly. Chess is not a random game. Many positions will have a single best move anyway. The rest will have a small set of "best moves". The more programs you test, the greater the chance you will get a significant match between a pair of them.
You act as if this is some kind of profound insight. Of COURSE you can find pairs of programs that have more similar scores, and the more programs you run, the more pairs you will find that are close. If this were not the case the tester would give all pairs of programs the same score! duhhh!

So you don't need to bring this up again because I never disagreed with you on this point.

Programs seem to differ in only a few key areas: tactics, pawn structure, or king safety. That doesn't offer a lot of different behaviours. Yes, the test might be made better with a lot more positions, but not positions chosen randomly. Positions that each tend to require a different piece of the puzzle. Deep tactics. Speculative sacrifices. Pawn structure. King safety. Piece coordination. Material imbalances. Etc. Random positions test general chess knowledge, where all programs are more than "just exceptional".
The test should not measure chess knowledge; I think you still completely misunderstand what it does and how it works. If it were designed to test chess skill then it would not be a similarity tester; it would measure chess strength. All strong programs would look like they play the same. I'm trying to get at the positions where the moves are determined by personal preference, not by "choose the best move."
I don't know what you are talking about. I have not said a _thing_ about measuring "chess strength". I said the set of positions needs to have several thematic characteristics, so that you can measure how well a program does tactically, how well it does on speculative ideas like a kingside attack that doesn't show a tactical win, just a speculative attack that leaves one side with what appears to be a significant advantage, and so on. The more two programs match with respect to different themes, the more similar they would be. And it would be less likely that you could find two programs that show up as similar based purely on a random set of positions. The STS idea comes to mind, in fact, where the positions are chosen to measure a program's skill in several different types of positions.

The positions I would like to remove are ones where every chess program would eventually play the same move if given enough thinking time. Ideally I want only positions where there is more than one good move in play, with none obviously better than any other. Otherwise, it's just a performance test. The test you propose could be used to assign crude Elo ratings or something, but that is not a stylistic thing.
I don't disagree with that at all. I was simply addressing the positions themselves. There is room for a set of positions with several playable moves. There is room for a set of positions with only one or two playable moves, if the program understands the underlying theme (a measure of a program's "knowledge" and how broad it is).
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Similarity Detector Available

Post by michiguel »

bob wrote:My point was that with a reasonable number of programs, there is a good chance that some pair of programs will agree significantly. Chess is not a random game. Many positions will have a single best move anyway. The rest will have a small set of "best moves". The more programs you test, the greater the chance you will get a significant match between a pair of them.
That is what I addressed in my previous post. Statistically, the difference between 500 and 700 matches among 1000 positions (assuming two reasonable options per position) is **HUGE**, and it cannot be attributed to random luck in the positions chosen. It could even be confirmed by resampling. We already checked this months ago.
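As a rough illustration of that resampling check (the two agreement vectors below are invented, 50% vs. 70% agreement over 1000 positions, so only the statistical idea is real, not the data):

Code: Select all

import random

N = 1000
random.seed(1)
# Hypothetical per-position agreement flags for two engine pairs:
# pair A agrees on ~50% of the positions, pair B on ~70%.
pair_a = [1 if random.random() < 0.50 else 0 for _ in range(N)]
pair_b = [1 if random.random() < 0.70 else 0 for _ in range(N)]

def bootstrap_gaps(a, b, trials=2000):
    """Resample the positions with replacement and collect the match-count gap."""
    gaps = []
    for _ in range(trials):
        idx = [random.randrange(N) for _ in range(N)]
        gaps.append(sum(b[i] for i in idx) - sum(a[i] for i in idx))
    return sorted(gaps)

gaps = bootstrap_gaps(pair_a, pair_b)
print("observed gap:", sum(pair_b) - sum(pair_a))
print("95% bootstrap interval:", gaps[50], "to", gaps[1949])
# The interval stays far above zero: a ~200-match gap is not luck of the positions.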

Programs seem to differ in only a few key areas: tactics, pawn structure, or king safety. That doesn't offer a lot of different behaviours. Yes, the test might be made better with a lot more positions, but not positions chosen randomly. Positions that each tend to require a different piece of the puzzle. Deep tactics. Speculative sacrifices. Pawn structure. King safety. Piece coordination. Material imbalances. Etc. Random positions test general chess knowledge, where all programs are more than "just exceptional".
They need to be random, otherwise you risk a bias. The only problem is positions where every program chooses the same move, but those do not hurt or help; they just burn useless CPU cycles.
Any position in which there is more than one reasonable move contributes.

Miguel
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Similarity Detector Available

Post by Adam Hair »

All that is needed is a large enough set of positions where there
is a set of at least two "best moves". Positions where the engines
tend to choose the same move have to be removed, as Don has been
doing. Also, positions where engines cannot make up their minds about
their selection (when repeatedly tested at that position) should be
removed. Those positions just add random noise to the test. If these
things are done, then you can reliably measure how *dissimilar two engines
are. Thematic suites are unnecessary. In fact, thematic suites may be
more prone to "false positives". You do have to make sure there is
more than one good choice for each position. If it is a large enough
set of random positions (suitably culled), then some of the thematic characteristics will be contained in the positions.

You threw out a number, 70%, as possibly being considered to indicate
that an engine is a clone of a given engine. Read what I wrote here:
http://talkchess.com/forum/viewtopic.ph ... 70&t=37308
As Miguel pointed out, the statistics of the test give us confidence
in the results, provided that the positions have suitable characteristics
(more than one good choice, and each engine has a move that it will tend
to make for each position).

*- I said dissimilar because that is what this test can answer. If we agree
that related engines will tend to choose the same moves, THEN this is also
true: if two engines tend not to choose the same moves, then they are unrelated.
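As a back-of-the-envelope illustration of that confidence (treating the positions as independent agree/disagree trials, which is a simplification), the error margin on a similarity percentage looks roughly like this:

Code: Select all

from math import sqrt

def margin_95(similarity_pct, n_positions):
    """Approximate 95% margin, in percentage points, on a similarity score."""
    p = similarity_pct / 100.0
    return 1.96 * sqrt(p * (1.0 - p) / n_positions) * 100.0

print(margin_95(70.0, 2000))   # about +/- 2 points over 2000 positions
print(margin_95(70.0, 8000))   # about +/- 1 point over ~8000 positions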
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Similarity Detector Available

Post by Don »

bob wrote:My point was that with a reasonable number of programs, there is a good chance that some pair of programs will agree significantly. Chess is not a random game. Many positions will have a single best move anyway. The rest will have a small set of "best moves". The more programs you test, the greater the chance you will get a significant match between a pair of them.
I think one thing needs to be clarified here. You are correct that if you test enough programs some will be much closer than others. But that is not the same thing as saying that 2 are likely to be indistinguishable from each other.

Let's assume that every position has exactly two equally playable moves, that which one a program chooses is not relevant, and that each of the two moves is equally likely to be chosen by any program. To keep this simple to understand, I am making these simplifying assumptions.

Then you can build a large bit string by setting or clearing one bit per position. A bit is off or on depending on which of the two moves was chosen. The bit string can be arbitrarily long, but let's say it's just 64 bits representing 64 positions. This can serve as a signature of sorts for a given program.

The similarity score is how many bits two programs have in common, so in this version of the test it can range from 0 to 64. You can expect any two arbitrarily chosen programs to match about half the bits, for a score of around 32.

The vast majority of pairs of programs are going to score near 32. If you do the math you will find that only about 1.6 percent will score 41 or more, and the percentage gets astonishingly small for each additional match beyond that.
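A quick sketch of that math, under the same simplifying assumptions (two equally likely moves per position, so an unrelated pair matches each bit with probability 1/2):

Code: Select all

from math import comb

def tail_probability(n_positions, min_matches):
    """P(two unrelated programs agree on at least min_matches of n_positions)."""
    total = 2 ** n_positions
    return sum(comb(n_positions, k)
               for k in range(min_matches, n_positions + 1)) / total

print(tail_probability(64, 41))     # ~0.016: a 64% match is already rare
print(tail_probability(64, 48))     # ~4e-5:  a 75% match almost never happens
print(tail_probability(1000, 600))  # ~1e-10: with more positions it collapses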

As for the "birthday paradox" you talk about: I don't know how many programs you expect to be testing, but if you wanted to find out which programs played the most like Crafty, how many would you test? 10,000 or more?

In my example we are only talking about 64 positions. There are almost 8000 in my set. I don't think you realize how the math changes: even if I add just a few positions to the 64, it becomes ridiculously unlikely that two unrelated programs will ever match or come close. I feel silly trying to explain the math to you, you should know this stuff.

In my set of 8000 positions not all of them fit the simplifying assumptions I mentioned, of course. Some positions have 2 or 3 likely moves, but one of them is played by most computers more often than the others. And since programs are not very deterministic under the conditions of my test, you do get some noise. But 8000 positions is a LOT and I can always add more.

My future plan for this test is to cull out the positions that return the least amount of information. I'm trying to consider how best to do that, and it's very tricky.
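For example, one option would be to score each position by the entropy of its move distribution over a pool of unrelated engines and cull from the low end; a rough sketch with invented data, just to illustrate the idea, nothing decided:

Code: Select all

from collections import Counter
from math import log2

def move_entropy(moves):
    """Shannon entropy (bits) of the moves a pool of engines chose here."""
    counts = Counter(moves)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# choices[position_id] -> moves picked by a pool of unrelated engines
choices = {
    "pos_0001": ["e2e4", "e2e4", "e2e4", "e2e4"],  # everyone agrees: 0 bits
    "pos_0002": ["g1f3", "d2d4", "g1f3", "c2c4"],  # informative: 1.5 bits
}
for pos in sorted(choices, key=lambda p: move_entropy(choices[p])):
    print(pos, round(move_entropy(choices[pos]), 2))  # lowest information first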
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Similarity Detector Available

Post by Adam Hair »

SzG wrote:
Adam Hair wrote:All that is needed is a large enough set of positions where there
is a set of at least two "best moves". Positions where the engines
tend to choose the same move have to be removed, as Don has been
doing. Also, positions where engines cannot make up their minds about
their selection (when repeatedly tested at that position) should be
removed. Those positions just add random noise to the test. If these
things are done, then you can reliably measure how *dissimilar two engines
are.
Hi Adam,

It seems to me that to create such a set of positions would take a year.
I believe that Don said the positions that were being used were an
offshoot of something else he and Larry Kaufman were doing.

But, the positions could be randomly chosen from any quality database.

Here is the blueprint I see for creating a good set of positions for Don's
utility:

Let's equate move selection to a loaded die, where each face of the die
corresponds to a good move available for a given position. We desire
positions where the die is loaded (favoring one face) for each engine.
When an individual engine is repeatedly given a position to think about,
it should tend to roll the same number with high probability (choose the
same move).

For each position, the dice should not all be loaded the same. That would
correspond to most/all of the engines selecting the same move. Those
positions can be found by testing a number of unrelated engines and
seeing for which positions they tend to choose the same move. It does
not have to be 100% agreement. Depending on the number of unrelated
engines used (more is better), statistics can be used to determine the
threshold percentage.

Also, we would need to see if the dice are loaded. It does no good to
have a large number of positions where each engine's move selection
is more or less random. This would violate an assumption of the test,
that each engine has a preferred move. Too many positions with this
characteristic render the test invalid. How can these positions be
found so that they can be removed? Choose a set of unrelated engines.
Test them a number of times (I think 20 to 30 times would be good; that
allows certain assumptions to be made about the distribution of the move
selections). Find the positions where a large fraction of the engines
do not have a definite move choice.
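A rough sketch of those two filters (the thresholds and data shapes below are illustrative assumptions, not a worked-out procedure):

Code: Select all

from collections import Counter

def top_fraction(moves):
    """Fraction of choices that go to the single most popular move."""
    counts = Counter(moves)
    return counts.most_common(1)[0][1] / len(moves)

def keep_position(move_per_engine, repeats_per_engine,
                  max_agreement=0.80, min_consistency=0.70):
    # Filter 1: all dice loaded the same way -> nearly every engine picks
    # the same move, so the position carries no style information.
    if top_fraction(list(move_per_engine.values())) > max_agreement:
        return False
    # Filter 2: dice not loaded at all -> too many engines show no
    # preferred move over their 20-30 repeated runs.
    unstable = sum(1 for runs in repeats_per_engine.values()
                   if top_fraction(runs) < min_consistency)
    return unstable <= len(repeats_per_engine) // 3

# move_per_engine:    {"EngineA": "e2e4", "EngineB": "d2d4", ...}    one run each
# repeats_per_engine: {"EngineA": ["e2e4", "e2e4", "c2c4", ...], ...}  20-30 runs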

It would take some time to collect the necessary number of positions.
However, the set of positions does not have to be perfect, as far as the
criteria above go. Miguel's analysis program can deal with some random
noise. And a total of 2000 positions (or fewer, depending on the number
of engines being compared) would be enough. I don't think it would
take an excessive amount of time to accumulate that many positions
from scratch, provided of course you are not trying to examine the move
choices by eye.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Similarity Detector Available

Post by Laskos »

Don wrote:Version 03 of the tester is now on the site. It fixes several bugs, has a matrix view of all programs, and the index of players is sorted.
Don, it seems to work fine, but it still has problems with some engines. Several versions of Zappa seem to use only 6-10% of the CPU, and Shredder 9 always hangs in the same place; I just skipped those, I don't know what happens with them.

All on one core, compensating for the strength of the two weakest, Ruffian and Fruit 2.1, with a factor of 3.

Code: Select all


C:\similar03>sim -m

sim version 3

  Key:

  1) DRybka3 (time: 100 ms  scale: 1.0)
  2) DRybka4 (time: 100 ms  scale: 1.0)
  3) DShredder10 (time: 100 ms  scale: 1.0)
  4) DShredder12 (time: 100 ms  scale: 1.0)
  5) Fruit 2.1 (time: 100 ms  scale: 3.0)
  6) Glaurung 2 (time: 100 ms  scale: 1.0)
  7) Houdini_15_a (time: 100 ms  scale: 1.0)
  8) Houdini_15_b (time: 100 ms  scale: 1.0)
  9) IvanHoe_B49jA (time: 100 ms  scale: 1.0)
 10) Robbo009a (time: 100 ms  scale: 1.0)
 11) Ruffian (time: 100 ms  scale: 3.0)
 12) Rybka 1,0 beta (time: 100 ms  scale: 1.0)
 13) Tiger2007 (time: 100 ms  scale: 1.0)

         1     2     3     4     5     6     7     8     9    10    11    12    13
  1.  ----- 58.25 45.75 48.43 52.05 48.05 56.24 55.67 57.87 59.65 46.56 52.51 44.80
  2.  58.25 ----- 44.48 47.35 50.51 46.66 54.60 54.39 55.06 56.07 46.22 52.23 44.89
  3.  45.75 44.48 ----- 51.93 45.93 44.87 42.60 42.78 43.53 44.60 46.02 44.61 42.18
  4.  48.43 47.35 51.93 ----- 46.73 43.23 45.59 45.55 46.70 47.78 46.37 45.45 44.68
  5.  52.05 50.51 45.93 46.73 ----- 53.68 47.67 47.45 48.92 49.77 48.05 56.41 43.02
  6.  48.05 46.66 44.87 43.23 53.68 ----- 43.41 43.70 45.05 45.92 46.98 50.00 41.71
  7.  56.24 54.60 42.60 45.59 47.67 43.41 ----- 67.75 61.80 62.70 43.01 48.74 43.28
  8.  55.67 54.39 42.78 45.55 47.45 43.70 67.75 ----- 60.72 62.50 42.68 47.97 43.47
  9.  57.87 55.06 43.53 46.70 48.92 45.05 61.80 60.72 ----- 66.90 44.53 49.53 44.08
 10.  59.65 56.07 44.60 47.78 49.77 45.92 62.70 62.50 66.90 ----- 45.21 50.97 44.74
 11.  46.56 46.22 46.02 46.37 48.05 46.98 43.01 42.68 44.53 45.21 ----- 45.42 44.08
 12.  52.51 52.23 44.61 45.45 56.41 50.00 48.74 47.97 49.53 50.97 45.42 ----- 41.95
 13.  44.80 44.89 42.18 44.68 43.02 41.71 43.28 43.47 44.08 44.74 44.08 41.95 -----

Code: Select all

------ DRybka3 (time: 100 ms  scale: 1.0) ------
 59.65  Robbo009a (time: 100 ms  scale: 1.0)
 58.25  DRybka4 (time: 100 ms  scale: 1.0)
 57.87  IvanHoe_B49jA (time: 100 ms  scale: 1.0)
 56.24  Houdini_15_a (time: 100 ms  scale: 1.0)
 55.67  Houdini_15_b (time: 100 ms  scale: 1.0)
 52.51  Rybka 1,0 beta (time: 100 ms  scale: 1.0)
 52.05  Fruit 2.1 (time: 100 ms  scale: 3.0)
 48.43  DShredder12 (time: 100 ms  scale: 1.0)
 48.05  Glaurung 2 (time: 100 ms  scale: 1.0)
 46.56  Ruffian (time: 100 ms  scale: 3.0)
 45.75  DShredder10 (time: 100 ms  scale: 1.0)
 44.80  Tiger2007 (time: 100 ms  scale: 1.0)


------ Fruit 2.1 (time: 100 ms  scale: 3.0) ------
 56.41  Rybka 1,0 beta (time: 100 ms  scale: 1.0)
 53.68  Glaurung 2 (time: 100 ms  scale: 1.0)
 52.05  DRybka3 (time: 100 ms  scale: 1.0)
 50.51  DRybka4 (time: 100 ms  scale: 1.0)
 49.77  Robbo009a (time: 100 ms  scale: 1.0)
 48.92  IvanHoe_B49jA (time: 100 ms  scale: 1.0)
 48.05  Ruffian (time: 100 ms  scale: 3.0)
 47.67  Houdini_15_a (time: 100 ms  scale: 1.0)
 47.45  Houdini_15_b (time: 100 ms  scale: 1.0)
 46.73  DShredder12 (time: 100 ms  scale: 1.0)
 45.93  DShredder10 (time: 100 ms  scale: 1.0)
 43.02  Tiger2007 (time: 100 ms  scale: 1.0)


------ Houdini_15_a (time: 100 ms  scale: 1.0) ------
 67.75  Houdini_15_b (time: 100 ms  scale: 1.0)
 62.70  Robbo009a (time: 100 ms  scale: 1.0)
 61.80  IvanHoe_B49jA (time: 100 ms  scale: 1.0)
 56.24  DRybka3 (time: 100 ms  scale: 1.0)
 54.60  DRybka4 (time: 100 ms  scale: 1.0)
 48.74  Rybka 1,0 beta (time: 100 ms  scale: 1.0)
 47.67  Fruit 2.1 (time: 100 ms  scale: 3.0)
 45.59  DShredder12 (time: 100 ms  scale: 1.0)
 43.41  Glaurung 2 (time: 100 ms  scale: 1.0)
 43.28  Tiger2007 (time: 100 ms  scale: 1.0)
 43.01  Ruffian (time: 100 ms  scale: 3.0)
 42.60  DShredder10 (time: 100 ms  scale: 1.0)
Houdini a,b are identical, just checking the self-similarity and error margins.
All the best and thanks,
Kai
Will Singleton
Posts: 128
Joined: Thu Mar 09, 2006 5:14 pm
Location: Los Angeles, CA

Re: Similarity Detector & wb2uci

Post by Will Singleton »

oops, polyglot swings the other way, forgot about that. Anyway, I finally got it to work with wb2uci. (Thx, Odd Gunnar.)

The problem turned out to be the fact that Similar sends the command Hard, or ponder, which caused problems with amateur due to some timing issues. There's no need to specify pondering with your test, it doesn't make sense in this context. Also, you send the sd command (set depth 50) which seems unnecessary.

And so, with a nice Oliva Serie V, some smooth Johnny Walker Blue, and a cool 60-day extension of msvc 10 pro, I've found my little old prog needs some serious eval changes. :)

sim version 3
------ RobboLito 0.09 x64 (time: 100 ms scale: 1.0) ------
51.23 Komodo64 1.3 JA (time: 100 ms scale: 1.0)
37.93 amateur 2.95x4 (time: 100 ms scale: 1.0)
37.44 amateur 2.86 (time: 100 ms scale: 1.0)
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Similarity Detector Available

Post by michiguel »

Laskos wrote:
Don wrote:Version 03 of the tester is now on the site. It fixes several bugs, has a matrix view of all programs, and the index of players is sorted.
Don, it seems to work fine, but it still has problems with some engines. Several versions of Zappa seem to use only 6-10% of the CPU, and Shredder 9 always hangs in the same place; I just skipped those, I don't know what happens with them.

All on one core, compensating for the strength of the two weakest, Ruffian and Fruit 2.1, with a factor of 3.

Code: Select all


C:\similar03>sim -m

sim version 3

  Key:

  1) DRybka3 (time: 100 ms  scale: 1.0)
  2) DRybka4 (time: 100 ms  scale: 1.0)
  3) DShredder10 (time: 100 ms  scale: 1.0)
  4) DShredder12 (time: 100 ms  scale: 1.0)
  5) Fruit 2.1 (time: 100 ms  scale: 3.0)
  6) Glaurung 2 (time: 100 ms  scale: 1.0)
  7) Houdini_15_a (time: 100 ms  scale: 1.0)
  8) Houdini_15_b (time: 100 ms  scale: 1.0)
  9) IvanHoe_B49jA (time: 100 ms  scale: 1.0)
 10) Robbo009a (time: 100 ms  scale: 1.0)
 11) Ruffian (time: 100 ms  scale: 3.0)
 12) Rybka 1,0 beta (time: 100 ms  scale: 1.0)
 13) Tiger2007 (time: 100 ms  scale: 1.0)

         1     2     3     4     5     6     7     8     9    10    11    12    13
  1.  ----- 58.25 45.75 48.43 52.05 48.05 56.24 55.67 57.87 59.65 46.56 52.51 44.80
  2.  58.25 ----- 44.48 47.35 50.51 46.66 54.60 54.39 55.06 56.07 46.22 52.23 44.89
  3.  45.75 44.48 ----- 51.93 45.93 44.87 42.60 42.78 43.53 44.60 46.02 44.61 42.18
  4.  48.43 47.35 51.93 ----- 46.73 43.23 45.59 45.55 46.70 47.78 46.37 45.45 44.68
  5.  52.05 50.51 45.93 46.73 ----- 53.68 47.67 47.45 48.92 49.77 48.05 56.41 43.02
  6.  48.05 46.66 44.87 43.23 53.68 ----- 43.41 43.70 45.05 45.92 46.98 50.00 41.71
  7.  56.24 54.60 42.60 45.59 47.67 43.41 ----- 67.75 61.80 62.70 43.01 48.74 43.28
  8.  55.67 54.39 42.78 45.55 47.45 43.70 67.75 ----- 60.72 62.50 42.68 47.97 43.47
  9.  57.87 55.06 43.53 46.70 48.92 45.05 61.80 60.72 ----- 66.90 44.53 49.53 44.08
 10.  59.65 56.07 44.60 47.78 49.77 45.92 62.70 62.50 66.90 ----- 45.21 50.97 44.74
 11.  46.56 46.22 46.02 46.37 48.05 46.98 43.01 42.68 44.53 45.21 ----- 45.42 44.08
 12.  52.51 52.23 44.61 45.45 56.41 50.00 48.74 47.97 49.53 50.97 45.42 ----- 41.95
 13.  44.80 44.89 42.18 44.68 43.02 41.71 43.28 43.47 44.08 44.74 44.08 41.95 -----

Code: Select all

------ DRybka3 (time: 100 ms  scale: 1.0) ------
 59.65  Robbo009a (time: 100 ms  scale: 1.0)
 58.25  DRybka4 (time: 100 ms  scale: 1.0)
 57.87  IvanHoe_B49jA (time: 100 ms  scale: 1.0)
 56.24  Houdini_15_a (time: 100 ms  scale: 1.0)
 55.67  Houdini_15_b (time: 100 ms  scale: 1.0)
 52.51  Rybka 1,0 beta (time: 100 ms  scale: 1.0)
 52.05  Fruit 2.1 (time: 100 ms  scale: 3.0)
 48.43  DShredder12 (time: 100 ms  scale: 1.0)
 48.05  Glaurung 2 (time: 100 ms  scale: 1.0)
 46.56  Ruffian (time: 100 ms  scale: 3.0)
 45.75  DShredder10 (time: 100 ms  scale: 1.0)
 44.80  Tiger2007 (time: 100 ms  scale: 1.0)


------ Fruit 2.1 (time: 100 ms  scale: 3.0) ------
 56.41  Rybka 1,0 beta (time: 100 ms  scale: 1.0)
 53.68  Glaurung 2 (time: 100 ms  scale: 1.0)
 52.05  DRybka3 (time: 100 ms  scale: 1.0)
 50.51  DRybka4 (time: 100 ms  scale: 1.0)
 49.77  Robbo009a (time: 100 ms  scale: 1.0)
 48.92  IvanHoe_B49jA (time: 100 ms  scale: 1.0)
 48.05  Ruffian (time: 100 ms  scale: 3.0)
 47.67  Houdini_15_a (time: 100 ms  scale: 1.0)
 47.45  Houdini_15_b (time: 100 ms  scale: 1.0)
 46.73  DShredder12 (time: 100 ms  scale: 1.0)
 45.93  DShredder10 (time: 100 ms  scale: 1.0)
 43.02  Tiger2007 (time: 100 ms  scale: 1.0)


------ Houdini_15_a (time: 100 ms  scale: 1.0) ------
 67.75  Houdini_15_b (time: 100 ms  scale: 1.0)
 62.70  Robbo009a (time: 100 ms  scale: 1.0)
 61.80  IvanHoe_B49jA (time: 100 ms  scale: 1.0)
 56.24  DRybka3 (time: 100 ms  scale: 1.0)
 54.60  DRybka4 (time: 100 ms  scale: 1.0)
 48.74  Rybka 1,0 beta (time: 100 ms  scale: 1.0)
 47.67  Fruit 2.1 (time: 100 ms  scale: 3.0)
 45.59  DShredder12 (time: 100 ms  scale: 1.0)
 43.41  Glaurung 2 (time: 100 ms  scale: 1.0)
 43.28  Tiger2007 (time: 100 ms  scale: 1.0)
 43.01  Ruffian (time: 100 ms  scale: 3.0)
 42.60  DShredder10 (time: 100 ms  scale: 1.0)
Houdini a,b are identical, just checking the self-similarity and error margins.
All the best and thanks,
Kai
The data looks noisy. I wonder whether it could be improved by testing at 1 second per move. 67% for identical engines is not so high.

http://sites.google.com/site/gaviotache ... &width=924
Miguel
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Similarity Detector & wb2uci

Post by Don »

Will Singleton wrote:oops, polyglot swings the other way, forgot about that. Anyway, I finally got it to work with wb2uci. (Thx, Odd Gunnar.)

The problem turned out to be the fact that Similar sends the command Hard, or ponder, which caused problems with amateur due to some timing issues. There's no need to specify pondering with your test, it doesn't make sense in this context. Also, you send the sd command (set depth 50) which seems unnecessary.
The way the tester works is UCI specific, so it may or may not translate well to winboard, I don't really know for sure. I am thinking about making it also support winboard protocol directly.

It first issues a "go depth 50" command and then sleeps for 100 ms, or for whatever time is specified. Then I issue the "stop" command which is defined to make the search stop immediately and return a move.

I have found that it's more reliable to do this than to send a fixed time command to the engine. Not all UCI engines honor "go movetime 100", including Komodo, and some engines make their own decisions about when to stop and may stop early even if you specify nodes or movetime.

The issue with Spark, by the way, was that it would find a checkmate in one of the early positions and then return the checkmate move. But then, after the stop command was sent, it would send another move. This was causing Spark to be out of sync, and thus it would return nonsense moves (moves that applied to a different position).

The fix was to send the uci "isready" command and wait for "readyok" as suggested by the Spark author so that the engine would always be in sync with the tester. Of course this probably was triggered by a Spark bug but it's possible and desirable for the interface to catch this one.
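Roughly, the interaction looks like this (a bare-bones Python sketch, not the tester's actual code; the engine path and FEN are placeholders):

Code: Select all

import subprocess
import time

def probe_move(engine_path, fen, think_ms=100):
    eng = subprocess.Popen([engine_path], stdin=subprocess.PIPE,
                           stdout=subprocess.PIPE, text=True, bufsize=1)
    def send(cmd):
        eng.stdin.write(cmd + "\n")
        eng.stdin.flush()

    send("uci")
    send("isready")                 # re-sync: wait for readyok before the position
    while eng.stdout.readline().strip() != "readyok":
        pass
    send("position fen " + fen)
    send("go depth 50")             # effectively "search until told to stop"
    time.sleep(think_ms / 1000.0)
    send("stop")                    # engine must answer with a bestmove line
    best = None
    for line in eng.stdout:
        if line.startswith("bestmove"):
            best = line.split()[1]
            break
    send("quit")
    return best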

And so, with a nice Oliva Serie V, some smooth Johnny Walker Blue, and a cool 60-day extension of msvc 10 pro, I've found my little old prog needs some serious eval changes. :)

sim version 3
------ RobboLito 0.09 x64 (time: 100 ms scale: 1.0) ------
51.23 Komodo64 1.3 JA (time: 100 ms scale: 1.0)
37.93 amateur 2.95x4 (time: 100 ms scale: 1.0)
37.44 amateur 2.86 (time: 100 ms scale: 1.0)
Will Singleton
Posts: 128
Joined: Thu Mar 09, 2006 5:14 pm
Location: Los Angeles, CA

Re: Similarity Detector & wb2uci

Post by Will Singleton »

Don wrote:
Will Singleton wrote:oops, polyglot swings the other way, forgot about that. Anyway, I finally got it to work with wb2uci. (Thx, Odd Gunnar.)

The problem turned out to be the fact that Similar sends the command Hard, or ponder, which caused problems with amateur due to some timing issues. There's no need to specify pondering with your test, it doesn't make sense in this context. Also, you send the sd command (set depth 50) which seems unnecessary.
The way the tester works is UCI specific, so it may or may not translate well to winboard, I don't really know for sure. I am thinking about making it also support winboard protocol directly.

It first issues a "go depth 50" command and then sleeps for 100 ms, or for whatever time is specified. Then I issue the "stop" command which is defined to make the search stop immediately and return a move.

I have found that it's more reliable to do this than to send a fixed time command to the engine. Not all UCI engines honor "go movetime 100", including Komodo, and some engines make their own decisions about when to stop and may stop early even if you specify nodes or movetime.

The issue with Spark, by the way, was that it would find a checkmate in one of the early positions and then return the checkmate move. But then, after the stop command was sent, it would send another move. This was causing Spark to be out of sync, and thus it would return nonsense moves (moves that applied to a different position).

The fix was to send the uci "isready" command and wait for "readyok" as suggested by the Spark author so that the engine would always be in sync with the tester. Of course this probably was triggered by a Spark bug but it's possible and desirable for the interface to catch this one.
I was kind of hoping other winboard engine authors would join in the discussion; that's why I'm posting here instead of by email. Anyway, thanks again for the tool.

I saw the "isready" and "readyok" in the wb2uci log, apparently that's not sent to my engine as it doesn't seem to respond. I don't implement the sd command, so I always return an error code, which is neither here nor there. Winboard engines only need a "level" command sent once (of sufficient time), then a "new" and "go" and it will search until "stop" (the "?" command).

More importantly, why do you send ponder on? My old compile should work if ponder off is specified. I tried a couple of other winboard engines which also did not work; it would be interesting to see if ponder is the issue.

Will