comparing match results

flok · Post by **flok** » Sun Dec 07, 2014 8:55 pm

Hi,

I run a continuous tournament on 2 computers. That way I can hopefully see which version of my program performs best.
I was expecting that, given enough games played, the results on both systems would look almost, maybe even exactly, the same. In reality they are far from that:

system a:

Rank Name                                                   Elo    +    - games score oppo. draws
   1 HoiChess 0.10.3                                       1023   25   24  4803   97%   -44    5% 
   2 Fairy-Max 4.8Q                                         958   24   23  4678   95%   -24    6% 
   3 DeepBrutePos/2.4/Mon Sep 15 22:58:17 2014/1306/nm-wb   275   12   12  4678   73%    20   21%
   4 QueenBee-turbo-488-bestval-history                     263   12   12  4712   71%    20   20% 
   5 QueenBee-MonteCarlo -MC-MM-g001fb                      175   11   11  4668   63%    43   18% 
   6 QueenBee-MonteCarlo -MC-MM-g002f-merge                 171   11   11  4681   62%    44   18% 
   7 QueenBee-MonteCarlo -MC-MM-g001h                       165   11   11  4664   62%    40   17% 
   8 QueenBee-MonteCarlo -MC-MM-g001f                       162   12   11  4605   62%    44   17% 
   9 QueenBee-MonteCarlo -MC-MM-g001db                     -185   11   11  4615   34%    67   24% 
  10 QueenBee-MonteCarlo -MC-MM-g001d                      -202   11   11  4595   33%    69   23% 
  11 QueenBee-MonteCarlo -MC-MM-g001b                      -203   12   12  4632   33%    63   11% 
  12 QueenBee-MonteCarlo -MC-g007                          -264   12   12  4725   29%    63    8% 
  13 ParisHilton-bf                                        -473   20   21  1572   16%    42   24% 
  14 QueenBee-MonteCarlo -gnu-MC25                         -483   13   13  4688   13%    87   19% 
  15 ParisHilton                                           -543   14   14  4531    8%   121   13% 
  16 POS v1.21 / Fri May 15 17:29:14 2009                  -838   35   39  2691    2%   173    1%

system b:

Rank Name                                                   Elo    +    - games score oppo. draws
   1 HoiChess 0.10.3                                        960   24   23  5361   97%   -64    5%
   2 Fairy-Max 4.8T                                         902   23   23  5360   96%   -60    5%
   3 QueenBee-turbo-488-bestval-history                     206   12   11  5346   74%   -13   19%
   4 QueenBee-MonteCarlo -MC-MM-g001f                       174   11   11  5361   71%   -11   14%
   5 QueenBee-MonteCarlo -MC-MM-g001fb                      161   11   11  5330   70%   -10   14%
   6 DeepBrutePos/2.4/Mon Sep 15 22:58:17 2014/1306/nm-wb   154   11   11  5360   71%   -10   23%
   7 QueenBee-MonteCarlo -MC-MM-g001g                      -136   10   10  5346   43%     9   21%
   8 QueenBee-MonteCarlo -MC-MM-g001c                      -156    9    9  5361   42%    11   35%
   9 QueenBee-MonteCarlo -MC-MM-g001db                     -160    9    9  5361   41%    11   34%
  10 QueenBee-MonteCarlo -MC-MM-g001d                      -163    9    9  5361   41%    11   35%
  11 QueenBee-MonteCarlo -MC-MM-g001b                      -172   10   10  5360   39%    12   17%
  12 POS v1.21 / Fri May 15 17:29:14 2009                  -185    9    9  5361   37%    13   39%
  13 QueenBee-MonteCarlo -MC-g007                          -215   10   10  5331   35%    15   13%
  14 QueenBee-MonteCarlo -gnu-MC25                         -422   10   10  5331   15%    28   30%
  15 ParisHilton-bf                                        -451   11   11  5339   15%    30   21%
  16 ParisHilton                                           -497   11   11  5359   12%    33   17%

While typing this post I realized that not only does the CPU clock differ (though that affects all engines on a system equally), but the cache size differs too (A: 12MB for 6 cores with 2 threads each, B: 4MB for 4 cores). That probably explains it.
If not and you have an idea why: I'd like to hear from you!

Sven · Post by **Sven** » Mon Dec 08, 2014 12:02 am

Hi Folkert,

even the set of players is not identical on the two systems. Only system A has seen g001h and g002f-merge, which belong to the slightly stronger versions of QueenBee-MonteCarlo, while only system B has seen g001c and g001g, which are a bit weaker. So comparing both rating pools requires at least ignoring all games of those four engine versions, so that you have 14 players present in both pools. Furthermore, you have the option of setting the rating of one engine to a fixed value, say HoiChess := Elo 2100, to simplify any comparison of the two pools.

Apart from that, of course you may expect different results for all SMP engines in your list (e.g. DBP) due to the different number of cores, i.e. an SMP engine should (might) perform better against a single-core engine on a 6-core machine compared to a 4-core machine.
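
A minimal sketch of the comparison Sven describes: keep only the engines present in both pools and shift each list so that the anchor engine sits at the same value. The ratings are copied from the two tables above (a subset only); the function names and the choice of Python are assumptions made for illustration. Note that shifting is only cosmetic: actually ignoring the games of the four extra versions means rerunning the rating tool on filtered PGN files, which would change the numbers.

```python
# Ratings copied from the system A and system B tables (subset for brevity).
system_a = {
    "HoiChess 0.10.3": 1023,
    "QueenBee-turbo-488-bestval-history": 263,
    "QueenBee-MonteCarlo -MC-MM-g001f": 162,
    "QueenBee-MonteCarlo -MC-MM-g001h": 165,   # only on system A
    "ParisHilton": -543,
}
system_b = {
    "HoiChess 0.10.3": 960,
    "QueenBee-turbo-488-bestval-history": 206,
    "QueenBee-MonteCarlo -MC-MM-g001f": 174,
    "QueenBee-MonteCarlo -MC-MM-g001g": -136,  # only on system B
    "ParisHilton": -497,
}


def anchored(ratings, anchor, anchor_elo=2100):
    """Shift all ratings by a constant so that `anchor` ends up at `anchor_elo`."""
    offset = anchor_elo - ratings[anchor]
    return {name: elo + offset for name, elo in ratings.items()}


def compare_pools(a, b, anchor, anchor_elo=2100):
    """Return {name: (elo_a, elo_b, elo_b - elo_a)} for engines in both pools."""
    a2 = anchored(a, anchor, anchor_elo)
    b2 = anchored(b, anchor, anchor_elo)
    common = sorted(set(a) & set(b))
    return {name: (a2[name], b2[name], b2[name] - a2[name]) for name in common}


table = compare_pools(system_a, system_b, anchor="HoiChess 0.10.3")
for name, (elo_a, elo_b, diff) in table.items():
    print(f"{name:40s} {elo_a:6d} {elo_b:6d} {diff:+5d}")
```

With both lists anchored this way, the remaining differences per engine are easier to read than in the raw tables, where each pool has its own zero point.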

Ferdy · Post by **Ferdy** » Tue Dec 09, 2014 3:21 pm

Folkert wrote:I run a continuous tournament on 2 computers. That way I can hopefully see which version of my program performs best.

Combine the PGN files and let's see the result.

flok · Post by **flok** » Wed Dec 10, 2014 7:54 pm

I've combined all pgns, 92k games, and set the strength of Fairy-Max 4.8Q.

Rank Name                                                   Elo    +    - games score oppo. draws 
   1 HoiChess 0.10.3                                       1982   16   16 12124   97%   948    5% 
   2 Fairy-Max 4.8Q                                        1928   22   21  5665   95%   993    6% 
   3 Fairy-Max 4.8T                                        1919   22   21  6319   96%   920    5% 
   4 QueenBee-turbo-488-bestval-history                    1218    8    8 12025   73%  1002   20% 
   5 DeepBrutePos/2.4/Mon Sep 15 22:58:17 2014/1306/nm-wb  1194    8    8 11986   72%  1004   22%
   6 QueenBee-MonteCarlo -MC-MM-g001f                      1152    8    8 11892   67%  1009   15% 
   7 QueenBee-MonteCarlo -MC-MM-g001fb                     1151    8    8 11948   67%  1009   16% 
   8 QueenBee-MonteCarlo -MC-MM-g002f-merge                1148   10   10  5650   62%  1053   18% 
   9 QueenBee-MonteCarlo -MC-MM-g001h                      1147   10   10  5636   62%  1052   17% 
  10 ./pos.sh                                              1012  371  562     1    0%  1151    0% 
  11 QueenBee-MonteCarlo -MC-MM-g001g                       833    9    9  6305   43%   992   21% 
  12 QueenBee-MonteCarlo -MC-MM-g001db                      817    7    7 11911   38%  1032   29% 
  13 QueenBee-MonteCarlo -MC-MM-g001c                       812    9    8  6320   42%   993   35% 
  14 QueenBee-MonteCarlo -MC-MM-g001d                       810    7    7 11894   37%  1033   29% 
  15 QueenBee-MonteCarlo -MC-MM-g001b                       804    7    7 11941   36%  1032   14% 
  16 QueenBee-MonteCarlo -MC-g007                           757    7    7 12003   32%  1035   11% 
  17 POS v1.21 / Fri May 15 17:29:14 2009                   706    8    8  9542   25%  1053   26%
  18 QueenBee-MonteCarlo -gnu-MC25                          550    8    7 11970   14%  1050   25% 
  19 ParisHilton-bf                                         529    9    9  7868   15%  1021   22% 
  20 ParisHilton                                            475    9    9 11814   10%  1060   15%

Ferdy · Post by **Ferdy** » Thu Dec 11, 2014 7:33 am

flok wrote:I've combined all pgns, 92k games, and set the strength of Fairy-Max 4.8Q.

The best version is clearer to see than in the previous two lists.
Generate or get the head-to-head results; let's see who is better between
turbo and nm-wb.

Then remove all games of players rated more than 400 below turbo's rating of 1218.
Cutoff = 1218 - 400 = 818,
so remove all games of the players placed 12th and lower. Then run bayeselo again and let's see how the lower-rated players affect the remaining engines' ratings.
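
The filtering step could be sketched roughly as follows. This is a hypothetical helper, not part of bayeselo or any tool mentioned in this thread: it assumes every game in the combined PGN starts with an `[Event "..."]` tag and carries `[White "..."]` and `[Black "..."]` tags, and it drops any game in which an excluded engine took part.

```python
import re

# One PGN tag pair per line, e.g. [White "ParisHilton"]
TAG = re.compile(r'^\[(\w+) "(.*)"\]\s*$')


def split_games(pgn_text):
    """Split PGN text into games; a new game is assumed to start at [Event."""
    games, current = [], []
    for line in pgn_text.splitlines():
        if line.startswith("[Event ") and current:
            games.append(current)
            current = []
        current.append(line)
    games.append(current)
    # Skip chunks that contain only blank lines.
    return ["\n".join(g).strip() + "\n" for g in games if any(l.strip() for l in g)]


def players(game):
    """Return the (White, Black) names of a single game."""
    tags = dict(m.groups() for m in map(TAG.match, game.splitlines()) if m)
    return tags.get("White"), tags.get("Black")


def drop_players(pgn_text, excluded):
    """Keep only the games in which no excluded player took part."""
    excluded = set(excluded)
    kept = [g for g in split_games(pgn_text) if not (set(players(g)) & excluded)]
    return "\n".join(kept)
```

For the cutoff of 818, `excluded` would be the names placed 12th and lower in the combined list; the filtered text can then be written to a new file and fed to the rating tool again.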

hgm · Post by **hgm** » Thu Dec 11, 2014 10:40 am

Are HoiChess and Fairy-Max losing any points at all to the others, or only against each other? If the latter is the case, it is not really useful to include them.

It might be more meaningful to include N.E.G. 0.3d in the gauntlet. Or micro-Max 1.6 (which has no hash table or QS). The two Fairy-Max versions you play should be practically identical for normal Chess (the difference being that the later one supports more variants). Also useful would be to include micro-Max 1.6 versions limited to 2, 3, 4... ply. That would tell you much more than running engines that score 99+%, except against each other.

The most shocking of all this is that HoiChess seems to rank above Fairy-Max, which is not as it should be. Is this because you are playing at fixed depth or fixed time per move, rather than classical TC (for which the rating lists are made)? Are there any time forfeits?

flok · Post by **flok** » Thu Dec 11, 2014 9:46 pm

hgm wrote:Are HoiChess and Fairy-Max losing any points at all to the others, or only against each other? If the latter is the case, it is not really useful to include them.

Unfortunately too much data to filter that out. Unless there's a Linux command line tool I don't know of yet which can do so.
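
In the absence of a ready-made tool, a short script over the PGN tags may be enough to answer the question. The sketch below is an illustration under assumptions: it expects each game to begin with an `[Event "..."]` tag and to carry `White`, `Black` and `Result` tags, and the function names are made up for this example. It tallies points per ordered pair of players and then lists against whom a given engine dropped points.

```python
import re
from collections import defaultdict

TAG = re.compile(r'\[(White|Black|Result) "([^"]*)"\]')
# Points for (white, black) per result string; other results are skipped.
POINTS = {"1-0": (1.0, 0.0), "0-1": (0.0, 1.0), "1/2-1/2": (0.5, 0.5)}


def head_to_head(pgn_text):
    """Return {(player, opponent): [points_scored, games_played]}."""
    table = defaultdict(lambda: [0.0, 0])
    # Split in front of every [Event tag at the start of a line.
    for chunk in re.split(r"^(?=\[Event )", pgn_text, flags=re.MULTILINE):
        tags = dict(TAG.findall(chunk))
        points = POINTS.get(tags.get("Result"))
        if points is None or "White" not in tags or "Black" not in tags:
            continue  # unfinished game or missing tags
        white, black = tags["White"], tags["Black"]
        for player, opponent, p in ((white, black, points[0]),
                                    (black, white, points[1])):
            table[(player, opponent)][0] += p
            table[(player, opponent)][1] += 1
    return dict(table)


def points_lost(table, player):
    """Return {opponent: points dropped} for every opponent that took points."""
    return {opp: games - pts
            for (p, opp), (pts, games) in table.items()
            if p == player and games - pts > 0}
```

Calling `points_lost(head_to_head(text), "HoiChess 0.10.3")` on the combined PGN text would show whether HoiChess loses points to anyone other than the Fairy-Max versions.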

It might be more meaningful to include N.E.G. 0.3d in the gauntlet. Or micro-Max 1.6 (which has no hash table or QS). The two Fairy-Max versions you play should be practically identical for normal Chess (the difference being that the later one supports more variants). Also useful would be to include micro-Max 1.6 versions limited to 2, 3, 4... ply. That would tell you much more than running engines that score 99+%, except against each other.

Yes but I use my own tournament software which requires "setboard" in xboard (or uci) which is not often supported unfortunately.
The reason I use my own software is because I find that other (cutecli or so?) too nonintuitive.
With my program I add every engine with:
--puppet exe-xboard,./myxboardprogram.sh
or
--puppet exe-uci,/usr/local/bin/redqueen
Instead of exe I can also use tcp/udp in case the engine is on a remote system.

The most shocking of all this is that HoiChess seems to rank above Fairy-Max, which is not as it should be. Is this because you are playing at fixed depth or fixed time per move, rather than classical TC (for which the rating lists are made)? Are there any time forfeits?

Each game has a limit of 50 seconds in total with a maximum of 50 moves per side.

flok · Post by **flok** » Thu Dec 11, 2014 9:50 pm

Ferdy wrote:
flok wrote:I've combined all pgns, 92k games, and set the strength of Fairy-Max 4.8Q.

The best version is clearer to see than in the previous two lists.
Generate or get the head-to-head results; let's see who is better between
turbo and nm-wb.

Then remove all games of players rated more than 400 below turbo's rating of 1218.
Cutoff = 1218 - 400 = 818,
so remove all games of the players placed 12th and lower. Then run bayeselo again and let's see how the lower-rated players affect the remaining engines' ratings.

Hi Ferdinand,

QB-turbo is basically deepbrutepos in a c++ version. Not very much tuned and with a broken tt etc.
Personally I'm mostly interested to know which of these plays best:

   5 DeepBrutePos/2.4/Mon Sep 15 22:58:17 2014/1306/nm-wb  1194    8    8 11986   72%  1004   22%
   6 QueenBee-MonteCarlo -MC-MM-g001f                      1152    8    8 11892   67%  1009   15% 
   7 QueenBee-MonteCarlo -MC-MM-g001fb                     1151    8    8 11948   67%  1009   16% 
   8 QueenBee-MonteCarlo -MC-MM-g002f-merge                1148   10   10  5650   62%  1053   18% 
   9 QueenBee-MonteCarlo -MC-MM-g001h                      1147   10   10  5636   62%  1052   17%

That is: which MonteCarlo version plays best. It looks like it'll be one of these, I think.
So I think I should purge (backup that is) the current results and start fresh with only these and look at their results after a while.
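
As a rough screen for whether the current data can already tell these versions apart, one can check whether their rating intervals overlap. The numbers below are the Elo, + and - columns copied from the combined list; the short keys are abbreviations introduced here, and treating non-overlapping intervals as "clearly separated" is a simplifying assumption, not a formal significance test.

```python
# (Elo, plus, minus) copied from the combined rating list.
candidates = {
    "nm-wb":       (1194, 8, 8),
    "g001f":       (1152, 8, 8),
    "g001fb":      (1151, 8, 8),
    "g002f-merge": (1148, 10, 10),
    "g001h":       (1147, 10, 10),
}


def interval(entry):
    """Return the (low, high) rating bounds for an (elo, plus, minus) entry."""
    elo, plus, minus = entry
    return elo - minus, elo + plus


def clearly_separated(a, b):
    """True if the two rating intervals do not overlap at all."""
    lo_a, hi_a = interval(a)
    lo_b, hi_b = interval(b)
    return hi_a < lo_b or hi_b < lo_a


names = list(candidates)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        verdict = ("separated"
                   if clearly_separated(candidates[first], candidates[second])
                   else "overlap")
        print(f"{first:12s} vs {second:12s}: {verdict}")
```

By this screen nm-wb stands apart from the four MonteCarlo versions, while the intervals of g001f, g001fb, g002f-merge and g001h all overlap each other, which fits the plan of playing more games between just these.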

hgm · Post by **hgm** » Thu Dec 11, 2014 11:12 pm

flok wrote:The reason I use my own software is because I find that other (cutecli or so?) too nonintuitive.

So what is wrong with XBoard? If you think the tournament manager is too complicated, you can always run a few old-fashioned two-player matches, through

xboard -fcp "QueenBee-MonteCarlo -MC-MM-g001f" -scp "wine NEG.exe" -mg 100

flok · Post by **flok** » Fri Dec 12, 2014 10:33 am

hgm wrote:
flok wrote:The reason I use my own software is because I find that other (cutecli or so?) too nonintuitive.
So what is wrong with XBoard? If you think the tournament manager is too complicated, you can always run a few old-fashioned two-player matches, through

xboard -fcp "QueenBee-MonteCarlo -MC-MM-g001f" -scp "wine NEG.exe" -mg 100

- it is not headless; the systems I use cannot be accessed using any x11/vnc/whatever protocol, just ssh
- no uci support requiring extra effort to create polyglot scripts

Don't get me wrong: I love xboard, it is the best, but it may not be the best fit for my needs in this.
