Similarity Detector Available


Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Similarity Detector & wb2uci

Post by Don »

Will Singleton wrote:
Don wrote:
Will Singleton wrote:Oops, polyglot swings the other way, forgot about that. Anyway, I finally got it to work with wb2uci. (Thx, Odd Gunnar.)

The problem turned out to be that Similar sends the command Hard (ponder on), which caused problems with Amateur due to some timing issues. There's no need to specify pondering in your test; it doesn't make sense in this context. Also, you send the sd command (set depth 50), which seems unnecessary.
The way the tester works is UCI-specific, so it may or may not translate well to winboard; I don't really know for sure. I am thinking about making it support winboard protocol directly as well.

It first issues a "go depth 50" command and then sleeps for 100 ms, or for whatever time is specified. Then I issue the "stop" command, which is defined to make the search stop immediately and return a move.

I have found that it's more reliable to do this than to send a fixed-time command to the engine. Not all UCI engines honor "go movetime 100" (Komodo included), and some engines make their own decisions about when to stop and may stop early even if you specify nodes or movetime.

The issue with Spark, by the way, was that it would find a checkmate in one of the early positions and return the checkmate move, but then, after receiving the stop command, it would send another move. This caused Spark to fall out of sync, so it would return nonsense moves (moves that applied to a different position).

The fix was to send the UCI "isready" command and wait for "readyok", as suggested by the Spark author, so that the engine would always be in sync with the tester. This was probably triggered by a Spark bug, but it's possible and desirable for the interface to catch this case.
I was kind of hoping other winboard engine authors would join in the discussion; that's why I'm posting here instead of emailing. Anyway, thanks again for the tool.

I saw the "isready" and "readyok" in the wb2uci log; apparently that's not forwarded to my engine, as it doesn't seem to respond. I don't implement the sd command, so I always return an error code, which is neither here nor there. Winboard engines only need a "level" command sent once (with sufficient time), then "new" and "go", and the engine will search until told to stop (the "?" command).

More importantly, why do you send ponder on? My old compile should work if ponder off is specified. I tried a couple of other winboard engines which also did not work; it would be interesting to see if ponder is the issue.

Will
I don't send any ponder-on commands; it must be wb2uci that does that. However, I can send a command to turn ponder OFF, and that might work. I could also send the command to search infinitely; does your engine support that?
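To make the sequence concrete, here is a minimal sketch of that probe loop (hypothetical Python, not the actual sim source; the engine path and position list are placeholders):

Code: Select all

import subprocess
import time

ENGINE_PATH = "./engine"   # placeholder
FENS = ["rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"]

eng = subprocess.Popen([ENGINE_PATH], stdin=subprocess.PIPE,
                       stdout=subprocess.PIPE, text=True, bufsize=1)

def send(cmd):
    eng.stdin.write(cmd + "\n")

def wait_for(prefix):
    # Read engine output until a line starting with 'prefix' arrives;
    # anything before it (e.g. a stale bestmove) is discarded.
    while True:
        line = eng.stdout.readline().strip()
        if line.startswith(prefix):
            return line

send("uci")
wait_for("uciok")
for fen in FENS:
    send("isready")                 # re-sync before every probe
    wait_for("readyok")
    send("position fen " + fen)
    send("go depth 50")             # effectively "search until stopped"
    time.sleep(0.1)                 # probe time: 100 ms
    send("stop")                    # engine must reply with "bestmove ..."
    print(fen, "->", wait_for("bestmove").split()[1])
send("quit")

The isready/readyok handshake before each position is what keeps a stray duplicate bestmove, like Spark's, from being mistaken for the answer to the next probe.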
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Similarity Detector Available

Post by jwes »

One idea is to run the program to be tested twice and then only use the positions where both runs chose the same move to compare with the other programs.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Similarity Detector Available

Post by Don »

jwes wrote:One idea is to run the program to be tested twice and then only use the positions where both runs chose the same move to compare with the other programs.
I have already considered that and variations of that including running all the programs multiple times. I have not rejected any approach yet as I'm still trying to get a handle on it.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Similarity Detector Available

Post by michiguel »

jwes wrote:One idea is to run the program to be tested twice and then only use the positions where both runs chose the same move to compare with the other programs.
It is not possible exactly as you propose, because you will have disjoint sets. For instance, engine A will have moves for positions 1, 2, 5, 6, 9, 10 and engine B for positions 2, 3, 6, 7, 8, 9; they will only overlap in positions 2, 6, and 9. The more engines you run, the smaller the overlapping set. There is a way to deal with this, but it would introduce a humongous mess of aligning the sets and dealing with gaps. The only advantage of the engine-similarity approach compared to DNA phylogenetics is that there are no gaps. The reason we can deal with gaps in DNA is that they are rare, which facilitates the alignment. Here, they would not be.

The easiest way to deal with this noise (I speculate) is to bite the bullet and run the test for a longer time.

However, in line with your idea, it would be better to run the test three times and take the move that was selected two or three times. If all three moves are different, choose one at random. Not perfect, but I bet it would reduce the noise somewhat.
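A toy sketch of that two-out-of-three vote (hypothetical Python; the move data is illustrative only):

Code: Select all

from collections import Counter
import random

def consensus_move(three_runs):
    # three_runs holds one engine's choice for the same position
    # across three runs, e.g. ["e2e4", "e2e4", "d2d4"].
    move, count = Counter(three_runs).most_common(1)[0]
    return move if count >= 2 else random.choice(three_runs)

# Per-position choices over three runs of the same engine.
runs = [["e2e4", "e2e4", "d2d4"],   # majority -> e2e4
        ["g1f3", "d2d4", "e2e4"]]   # no majority -> random pick
print([consensus_move(r) for r in runs])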

Miguel
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Similarity Detector Available

Post by Laskos »

michiguel wrote:
The data looks noisy. I wonder whether it could be improved if it is tested at 1 second. 67% for identical engines is not so high.

http://sites.google.com/site/gaviotache ... &width=924
Miguel
Thanks Miguel. Self-similarity is indeed low for Houdini15. I tried 1000 ms and it increases to 74%, but I couldn't do 1000 ms for all engines; it would have taken me several days. I tried Fruit 2.1 self-similarity at 300 ms and it is 97.5%, so the figure varies from engine to engine.

Several things about the graph:
While it's undoubtedly correct close to the root, its finer details on the branches could have been distorted (by noise or insufficient statistics). Also, I think that strength does matter when assembling the branches into the individual branchlets and their nodes. Besides that, could you use some a priori knowledge, for example a time hierarchy? It would probably require Bayesian statistics; I frankly don't know how to do that.

Anyway, I present here a new matrix; maybe you will do the same with it :D I equalized the strength of the engines as far as I could (all 1-core, 64-bit). Also, for now I took just one main branch from the previous tree (adding three new engines), whose finer details I want to observe; taking the whole tree adjusted for strength would take me more than a week. I separated the engines into strength categories. This one is the upper weight class; the second interesting part would be the weaker Fruit 2.1 category. There are no doubts at all about engines like Shredder, Tiger, Zappa, and Ruffian; they are just useful to set the lower boundary of the results.

So, the result:

Code: Select all


C:\similar03>sim -m

sim version 3

  Key:

  1) Critter0.90 (time: 100 ms  scale: 3.65)
  2) DRybka3 (time: 100 ms  scale: 3.0)
  3) DRybka4 (time: 100 ms  scale: 1.65)
  4) Houdini15 (time: 100 ms  scale: 1.0)
  5) Ippolit80a (time: 100 ms  scale: 1.9)
  6) IvanHoe_B49jA (time: 100 ms  scale: 1.65)
  7) Robbo009 (time: 100 ms  scale: 1.8)
  8) Stockfish2.0ja (time: 100 ms  scale: 2.2)

         1     2     3     4     5     6     7     8
  1.  ----- 55.79 52.52 55.47 57.76 57.08 57.76 52.90
  2.  55.79 ----- 59.44 58.49 61.74 59.76 61.93 52.29
  3.  52.52 59.44 ----- 55.75 57.94 56.02 57.06 52.12
  4.  55.47 58.49 55.75 ----- 62.76 62.41 63.87 51.06
  5.  57.76 61.74 57.94 62.76 ----- 66.78 70.36 52.49
  6.  57.08 59.76 56.02 62.41 66.78 ----- 68.54 52.04
  7.  57.76 61.93 57.06 63.87 70.36 68.54 ----- 52.35
  8.  52.90 52.29 52.12 51.06 52.49 52.04 52.35 -----

Thanks,
Kai
Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: Similarity Detector Available

Post by Allard Siemelink »

michiguel wrote:
Allard Siemelink wrote:
Back again, and got some time to spend....
Great that you want to look into Spark's issue. To summarize, here are my current thoughts on this:

I. Spark's low correlation with other engines.
I suspect it is related to Spark, not to the similarity utility.
That is not the case for Spark 0.3a. Michael Hart tested it months ago with a different set of positions; Spark looked different from the other engines, but at around the 60% level. He tested it at longer times, though.

[snip]
Sorry for responding so late; my account was deactivated for the past four days after I updated my e-mail address :?

Yes, I remember the old thread, great stuff.
Of course, Spark does have a low similarity with other engines, but assuming two good moves per position, one would still expect matching moves 50% of the time for completely dissimilar strong engines.

With sim02, Spark had a reported similarity of only 3%, which indicated that something had gone wrong. (My hunch about the duplicate/missing move was right on target.)

Thankfully, sim03 is able to handle Spark's quirk, and the similarity with other engines is now within the expected range.
I'll post some numbers later on.
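For reference, a minimal sketch of a score of this kind, assuming the similarity is simply the percentage of positions on which two engines pick the same move (the exact formula sim uses is not shown in this thread):

Code: Select all

def similarity(moves_a, moves_b):
    # Percentage of test positions where both engines chose the same move;
    # moves_a and moves_b are equal-length lists of best moves.
    matches = sum(a == b for a, b in zip(moves_a, moves_b))
    return 100.0 * matches / len(moves_a)

# Two engines agreeing on 2 of 4 positions score 50%.
print(similarity(["e2e4", "d2d4", "g1f3", "c2c4"],
                 ["e2e4", "d2d4", "b1c3", "g2g3"]))

On that reading, two strong but unrelated engines picking from the same couple of good moves per position would land near the 50% baseline mentioned above.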
Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: Similarity Detector Available

Post by Allard Siemelink »

Don wrote:
jwes wrote:One idea is to run the program to be tested twice and then only use the positions where both runs chose the same move to compare with the other programs.
I have already considered that and variations of that including running all the programs multiple times. I have not rejected any approach yet as I'm still trying to get a handle on it.
You might consider using three sets of positions: openings, middlegames, and endgames.

I suppose a good set of opening positions could be derived by looking at the statistics of a big set of quality games (or a high-quality book). Qualifying positions could be identified automatically by looking at the frequency and expected game results of the moves; e.g. the starting position should qualify, since it has more than one move that is often played with comparable game results in practice.

This approach could probably also be used to identify middlegame positions.
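A rough sketch of that qualification test (hypothetical Python using the python-chess library; min_count, max_score_gap, and the 16-ply opening cutoff are made-up knobs):

Code: Select all

import chess.pgn
from collections import defaultdict

def candidate_positions(pgn_path, min_count=50, max_score_gap=0.05, max_ply=16):
    # position key -> move -> [times played, points scored by the mover]
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))
    with open(pgn_path) as f:
        while (game := chess.pgn.read_game(f)) is not None:
            white = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}.get(
                game.headers.get("Result", "*"))
            if white is None:
                continue
            board = game.board()
            for ply, move in enumerate(game.mainline_moves()):
                if ply >= max_ply:      # opening phase only
                    break
                key = board.board_fen() + (" w" if board.turn else " b")
                entry = stats[key][move.uci()]
                entry[0] += 1
                entry[1] += white if board.turn else 1.0 - white
                board.push(move)
    picked = []
    for key, moves in stats.items():
        # Average score of each move that is played often enough.
        scores = [pts / n for n, pts in moves.values() if n >= min_count]
        if len(scores) >= 2 and max(scores) - min(scores) <= max_score_gap:
            picked.append(key)
    return picked

print(len(candidate_positions("quality_games.pgn")))  # hypothetical input file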
Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: Similarity Detector Available

Post by Allard Siemelink »

Here are some numbers for Spark.

I am glad to see that with sim03, Spark 1.0 shows good self-similarity (single core) and a suitably low correlation with other engines:
Code: Select all

------ spark-1.0 (time: 100 ms scale: 1.0) ------
99.08 spark-1.0 (time: 99 ms scale: 1.0)
50.42 Stockfish 1.9.1 JA 64bit (time: 100 ms scale: 1.0)
50.06 Stockfish 1.9.1 JA 64bit (time: 50 ms scale: 1.0)
49.02 Stockfish 1.7.1 64bit (time: 99 ms scale: 1.0)
48.96 Stockfish 1.7.1 64bit (time: 100 ms scale: 1.0)
48.69 bright-0.5c (time: 100 ms scale: 1.0)
48.63 bright-0.5c (time: 99 ms scale: 1.0)
48.19 Critter 0.70 64-bit (time: 100 ms scale: 1.0)
47.60 Komodo64 1.2 JA (time: 99 ms scale: 1.0)
47.39 Komodo64 1.1 JA (time: 100 ms scale: 1.0)
46.94 Komodo64 1.2 JA (time: 100 ms scale: 1.0)
Also, it is interesting to see how self-similarity varies with time:
Code: Select all

------ spark-dev (time: 100 ms scale: 1.0) ------
98.65 spark-dev (time: 99 ms scale: 1.0)
83.88 spark-dev (time: 50 ms scale: 1.0)
83.79 spark-dev (time: 49 ms scale: 1.0)
82.91 spark-dev (time: 200 ms scale: 1.0)
71.80 spark-dev (time: 25 ms scale: 1.0)
62.19 spark-dev (time: 12 ms scale: 1.0)
Will Singleton
Posts: 128
Joined: Thu Mar 09, 2006 5:14 pm
Location: Los Angeles, CA

Re: Similarity Detector & wb2uci

Post by Will Singleton »

Don wrote:
[snip]
I don't send any ponder-on commands; it must be wb2uci that does that. However, I can send a command to turn ponder OFF, and that might work. I could also send the command to search infinitely; does your engine support that?
The winboard commands for ponder off and on are Easy and Hard, respectively. Here is the logfile from a wb2uci run:

Code: Select all

3.447: S< new
3.447: S< post
3.447: S< hard
3.447: S< level 0 1440 0
3.447: S< force
3.447: S< usermove d2d4
3.447: S< usermove g8f6
3.447: S< usermove g1f3
3.447: S< usermove g7g6
3.447: S< usermove g2g3
3.447: S< usermove f8g7
3.447: S< usermove f1g2
3.447: S< usermove e8g8
3.447: S< usermove e1g1
3.447: S< usermove d7d6
3.447: S< usermove f1e1
3.447: S< usermove b8c6
3.447: S< usermove e2e4
3.447: S< usermove e7e5
3.447: S< usermove c2c3
3.447: S< usermove c8d7
3.447: S< sd 50
3.447: S< go
3.557: C> stop
3.557: S< ?

You can see that the command Hard is issued along with all the other setup commands, followed by the stop request ("?") at 100 ms. Wb2uci might send Hard by itself; it is likely the default. So if you specify ponder off, it will probably send the Easy command instead.

I wouldn't send the command to search infinitely, if by that you mean Analyze. Most engines support analyze mode, but some don't, including mine. So the best plan, I think, would be to compile a version that specifies ponder off and see if that works with older winboard programs like mine.
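In the meantime, the same effect can be approximated from the tester side. Here is a hypothetical Python sketch mirroring the log above, with easy sent in place of hard (engine path and move list are placeholders):

Code: Select all

import subprocess
import time

eng = subprocess.Popen(["./engine"], stdin=subprocess.PIPE,   # placeholder path
                       stdout=subprocess.PIPE, text=True, bufsize=1)

def send(cmd):
    eng.stdin.write(cmd + "\n")

# Same setup as in the log, but with "easy" (ponder off) instead of "hard".
for cmd in ["xboard", "new", "post", "easy", "level 0 1440 0", "force"]:
    send(cmd)
for mv in ["d2d4", "g8f6", "g1f3", "g7g6"]:   # abbreviated move list
    send("usermove " + mv)
send("go")
time.sleep(0.1)   # probe time: 100 ms
send("?")         # move now: stop searching and answer with a move
for line in eng.stdout:
    if line.startswith("move "):
        print("engine chose", line.split()[1])
        break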

Thanks.

Will
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Similarity Detector Available

Post by Laskos »

Since Miguel didn't produce the clustering diagrams for my results, I am posting the dendrograms I made in SPSS.

1. For several engines we know to be unrelated, and some we suspect are related. Two identical Houdinis are included as a check on the noise. The distance on the x-axis is what matters: empirically, a distance of more than 18-20 means unrelated, while less than 18-20 is suspicious of relatedness. The noise is about 1.0-1.2. All engines at 100 ms, independently of strength.

[Image: SPSS dendrogram, all engines at 100 ms]

2. Top engines, adjusted for strength. Houdini is at 100 ms; the rest are at longer times, adjusted for strength. The same pattern holds on the x-axis: a distance of more than 18-20 means unrelated.

[Image: SPSS dendrogram, top engines adjusted for strength]
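For anyone without SPSS, the same kind of dendrogram can be produced from a sim matrix with scipy. Here is a sketch using the upper-weight matrix posted earlier; the conversion distance = 100 - similarity and the 100% self-similarity on the diagonal are assumptions, not necessarily what SPSS was fed:

Code: Select all

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

names = ["Critter0.90", "DRybka3", "DRybka4", "Houdini15",
         "Ippolit80a", "IvanHoe_B49jA", "Robbo009", "Stockfish2.0ja"]
sim = np.array([
    [100.0, 55.79, 52.52, 55.47, 57.76, 57.08, 57.76, 52.90],
    [55.79, 100.0, 59.44, 58.49, 61.74, 59.76, 61.93, 52.29],
    [52.52, 59.44, 100.0, 55.75, 57.94, 56.02, 57.06, 52.12],
    [55.47, 58.49, 55.75, 100.0, 62.76, 62.41, 63.87, 51.06],
    [57.76, 61.74, 57.94, 62.76, 100.0, 66.78, 70.36, 52.49],
    [57.08, 59.76, 56.02, 62.41, 66.78, 100.0, 68.54, 52.04],
    [57.76, 61.93, 57.06, 63.87, 70.36, 68.54, 100.0, 52.35],
    [52.90, 52.29, 52.12, 51.06, 52.49, 52.04, 52.35, 100.0],
])
dist = 100.0 - sim                                # similarity -> distance
Z = linkage(squareform(dist), method="average")   # UPGMA clustering
dendrogram(Z, labels=names, orientation="right")
plt.tight_layout()
plt.show()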

Kai