Looking for automatic Engine Testing Software

chrisw · Post by **chrisw** » Fri Jul 24, 2020 10:28 am

OliverBr wrote: ↑Thu Jul 23, 2020 10:18 pm
brianr wrote: ↑Thu Jul 23, 2020 9:41 pm For 4 and 8 CPU matches I use 3 sec inc.
This is quite a lot.
I like to play 40/90 which is more than 2 seconds per move. But this is very slow and I am not a very patient person.
Even with a 32-core machine this needs several hours to play 2000 games in order to get reliable results.

I am still undecided whether to focus on my engine playing against itself or on a tournament with other engines (e.g. Fruit and Glaurung).
The problem with other engines is that the results become much more unpredictable. Even after 1800 games the Elo difference varies from 180 to 210 (to Fruit).

Another question: On a 8 Core Cpu, do you play 8 concurrent games or 16 (because there are 16 threads)?

With Windows Task Manager you can monitor CPU usage for each core, and with GPU-X the temperature of your CPU. Pushing either of those to the limit is probably not a good idea. You could also send nps to a log file and monitor that. If your 20-core machine runs 40 threads at, let's say, 50% of the nps it gets when running 20 threads, then you have defeated the purpose of using 2x as many threads (you may as well have just halved the game time).
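
The nps comparison above can be sketched in a few lines of Python. This is only an illustration: it assumes the log contains standard UCI `info` lines with an `nps` token, and the helper names `mean_nps` and `throughput_ratio` are made up for this example.

```python
import re
from statistics import mean

# Matches the "nps <number>" token of a UCI "info" line.
NPS_RE = re.compile(r"\bnps (\d+)")

def mean_nps(log_lines):
    """Average the nps values found in the given lines; None if there are none."""
    values = [int(m.group(1)) for line in log_lines if (m := NPS_RE.search(line))]
    return mean(values) if values else None

def throughput_ratio(threads_a, nps_a, threads_b, nps_b):
    """Total nodes per second of setup B relative to setup A."""
    return (threads_b * nps_b) / (threads_a * nps_a)

# Made-up sample lines, not taken from a real engine log.
sample_log = [
    "info depth 12 score cp 21 nodes 1500000 nps 1000000 pv e2e4",
    "info depth 13 score cp 18 nodes 3300000 nps 1100000 pv e2e4",
    "bestmove e2e4",
]
print(mean_nps(sample_log))

# 40 threads at half the per-thread nps of 20 threads: no gain in total work.
print(throughput_ratio(20, 1_000_000, 40, 500_000))  # 1.0
```

A ratio at or below 1.0 means the extra threads bought nothing; anything clearly above 1.0 suggests hyperthreading is still paying off on that machine.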

OliverBr · Post by **OliverBr** » Fri Jul 24, 2020 3:16 pm

chrisw wrote: ↑Fri Jul 24, 2020 10:28 am With Windows Task Manager you can monitor CPU usage for each core, and with GPU-X the temperature of your CPU. Pushing either of those to the limit is probably not a good idea. You could also send nps to a log file and monitor that. If your 20-core machine runs 40 threads at, let's say, 50% of the nps it gets when running 20 threads, then you have defeated the purpose of using 2x as many threads (you may as well have just halved the game time).

I am using a remote Linux server with 32 cores and 64 threads. I have yet to find out how to monitor the CPU temperature. So far I run with concurrency = 32 and get a fine, stable load of 32.
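
One possible way to read temperatures on Linux is the kernel's sysfs thermal interface, which reports values in millidegrees Celsius. The sketch below assumes the server exposes `/sys/class/thermal/thermal_zone*`; not every machine does (some only expose sensors through hwmon, where the `sensors` tool from lm-sensors is the usual route), so treat it as a starting point rather than a guaranteed answer.

```python
from pathlib import Path

def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"<zone name>:<zone type>": degrees Celsius} for readable zones."""
    temps = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone is missing or unreadable on this machine
        temps[f"{zone.name}:{kind}"] = millideg / 1000.0
    return temps

if __name__ == "__main__":
    for name, celsius in read_thermal_zones().items():
        print(f"{name}: {celsius:.1f} C")
```

Run from cron or a loop alongside the match, this gives a rough temperature trace to compare against the load.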

jdart · Post by **jdart** » Fri Jul 24, 2020 4:06 pm

OliverBr wrote: ↑Thu Jul 23, 2020 10:18 pm I like to play 40/90 which is more than 2 seconds per move. But this is very slow and I am not a very patient person.
Even with a 32-core machine this needs several hours to play 2000 games in order to get reliable results.

I am still undecided whether to focus on my engine playing against itself or on a tournament with other engines (e.g. Fruit and Glaurung).
The problem with other engines is that the results become much more unpredictable. Even after 1800 games the Elo difference varies from 180 to 210 (to Fruit).

Another question: On a 8 Core Cpu, do you play 8 concurrent games or 16 (because there are 16 threads)?

Standard time control used by Stockfish for testing is 1:0+0.6 (this is their "slow" time control). I use something similar. You should scale this by machine speed, though. I use 1:0+0.6 on my dual Xeon 2690x3, and I scale it up on the other slower machines I have, according to the NPS on the machine (this is also what OpenBench does).
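
As a rough illustration of scaling by machine speed, here is a minimal Python sketch. It assumes simple linear scaling by the nps ratio and reads 1:0+0.6 as 60 seconds plus 0.6 seconds increment; the exact formula OpenBench uses may differ, and the function names and nps figures are invented for the example.

```python
def scale_time_control(base_seconds, increment_seconds, reference_nps, machine_nps):
    """Scale a base+increment time control so a slower machine gets more time."""
    if reference_nps <= 0 or machine_nps <= 0:
        raise ValueError("nps values must be positive")
    factor = reference_nps / machine_nps  # >1 on a slower machine
    return round(base_seconds * factor, 2), round(increment_seconds * factor, 2)

def format_tc(base_seconds, increment_seconds):
    """Format as 'base+inc' in seconds, e.g. '60+0.6'."""
    return f"{base_seconds:g}+{increment_seconds:g}"

# Hypothetical numbers: the test machine is half as fast as the reference.
base, inc = scale_time_control(60, 0.6, reference_nps=2_000_000, machine_nps=1_000_000)
print(format_tc(base, inc))  # 120+1.2
```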

Standard practice for most of the strong engines, including Stockfish, is to use matches of the new candidate version against the previous commit. I do sometimes run gauntlet matches against other engines for gauging progress, but I no longer routinely use these for testing changes.

My practice is to not run more concurrent matches than there are physical cores on a machine.
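
A sketch of how that rule could be applied when launching a match with cutechess-cli from Python. Everything specific here is an assumption for illustration: the engine paths, the engine names, the UCI protocol setting, the output file name, and the guess of two hardware threads per physical core. Check the options against the help output of your own cutechess-cli version.

```python
import os

def physical_cores(threads_per_core=2):
    """Estimate physical cores from logical CPUs (assumes uniform SMT)."""
    logical = os.cpu_count() or 1
    return max(1, logical // threads_per_core)

def build_match_command(new_engine, old_engine, games, tc, concurrency):
    """Return an argument list for a candidate-vs-baseline match."""
    return [
        "cutechess-cli",
        "-engine", f"cmd={new_engine}", "name=candidate",
        "-engine", f"cmd={old_engine}", "name=baseline",
        "-each", "proto=uci", f"tc={tc}",   # protocol is an assumption
        "-games", str(games),
        "-concurrency", str(concurrency),   # one game per physical core
        "-pgnout", "match.pgn",
    ]

# Hypothetical engine paths; the command is only printed, not executed.
cmd = build_match_command("./engine_new", "./engine_old", 2000, "60+0.6", physical_cores())
print(" ".join(cmd))
```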

--Jon

chrisw · Post by **chrisw** » Fri Jul 24, 2020 5:54 pm

OliverBr wrote: ↑Fri Jul 24, 2020 3:16 pm
chrisw wrote: ↑Fri Jul 24, 2020 10:28 am With Windows Task Manager you can monitor CPU usage for each core, and with GPU-X the temperature of your CPU. Pushing either of those to the limit is probably not a good idea. You could also send nps to a log file and monitor that. If your 20-core machine runs 40 threads at, let's say, 50% of the nps it gets when running 20 threads, then you have defeated the purpose of using 2x as many threads (you may as well have just halved the game time).
I am using a remote Linux server with 32 cores and 64 threads. I have yet to find out how to monitor the CPU temperature. So far I run with concurrency = 32 and get a fine, stable load of 32.

Ah, well, if it’s remote and rented by time, let them worry about overheating. You only need to worry about nps, and logging is probably the answer. It all depends on how the system handles your core demands. What can happen if you push the limits is that one of your (32?) nps rates gets hammered while the others are okay.
I have two six-cores and one four-core. The two sixes are fine with six concurrent matches; the four isn’t, so I dropped it to three, and now it’s fine. Jealous of your 32. Been looking at possibilities, but it’s a maze out there.
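
The "one nps rate gets hammered" case can be spotted automatically once per-game average nps values are logged. A minimal sketch, assuming you already have one average per concurrent game slot; the 80% threshold and the slot names are arbitrary choices for the example.

```python
from statistics import median

def find_starved(nps_by_slot, threshold=0.8):
    """Return slots whose average nps is below threshold * median nps."""
    if not nps_by_slot:
        return []
    typical = median(nps_by_slot.values())
    return sorted(slot for slot, nps in nps_by_slot.items() if nps < threshold * typical)

# Made-up averages for four concurrent games; the last one is being starved.
rates = {"game-1": 1_000_000, "game-2": 980_000, "game-3": 1_020_000, "game-4": 450_000}
print(find_starved(rates))  # ['game-4']
```

If a slot keeps showing up here, dropping the concurrency by one (as with the four-core above) is the obvious thing to try.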

OliverBr · Post by **OliverBr** » Fri Jul 24, 2020 10:32 pm

chrisw wrote: ↑Fri Jul 24, 2020 5:54 pm Jealous of your 32. Been looking at possibilities, but it’s a maze out there.

Actually, it's less expensive than I thought before, about 140 EUR/month. If you consider that the CPU alone costs 2500 EUR, I find the price quite good.
The only drawback, though not for me: you have to use Linux, because with Windows it gets really expensive.

https://www.hetzner.com/dedicated-rootserver/ax161

PS: One reason I gave up developing OliThink around 2010 was that I didn't have the means to run large tests. It was nearly impossible to say whether a new version was better than the previous one until it had played in a large, time-consuming tournament. I am not the most patient person, and it's still going too slowly for me even with 32 cores...

OliverBr · Post by **OliverBr** » Sun Jul 26, 2020 4:35 pm

jdart wrote: ↑Fri Jul 24, 2020 4:06 pm Standard practice for most of the strong engines, including Stockfish, is to use matches of the new candidate version against the previous commit. I do sometimes run gauntlet matches against other engines for gauging progress, but I no longer routinely use these for testing changes.
--Jon

Hi Jon,
I have just installed Arasan 22.1 on my test machine and I have to say, I am really impressed!
Arasan humiliates OliThink with something like 200-1-8. The last time I checked, around 2009, it wasn't that strong (version 11.7).
How did you gain 700 Elo points? What were the most significant steps?

PS: Here is the only game OliThink won against Arasan 22.1, despite the fact that it was trailing with RB against RR for most of the time.

[pgn]
[Event "?"]
[Site "?"]
[Date "2020.07.26"]
[Round "14"]
[White "OliThink 5.5.9c"]
[Black "Arasan 22.1"]
[Result "1-0"]

1. e3 a6 2. Be2 c6 3. Nc3 e6 4. f4 d5 5. Nf3 Nf6 6. O-O Bd6 7. Ne5 b5 8. a4 b4
9. Na2 O-O 10. b3 a5 11. Bb2 Bb7 12. Kh1 Nbd7 13. Nc1 Ne4 14. Nxd7 Qxd7 15. Nd3
Qc7 16. Nf2 Nxf2+ 17. Rxf2 f6 18. Bd3 e5 19. Qh5 e4 20. Be2 Qf7 21. Qxf7+ Kxf7
22. g4 Ba6 23. Bxa6 Rxa6 24. Rg1 c5 25. g5 f5 26. h4 Bc7 27. h5 Rg8 28. Rfg2 Ke7
29. g6 h6 30. Rd1 c4 31. Bd4 Ke6 32. Rb1 Bb6 33. Be5 Bd8 34. c3 Rb6 35. Rgg1 Bf6
36. Bc7 Ra6 37. cxb4 axb4 38. a5 c3 39. dxc3 Rc6 40. Bb6 bxc3 41. Rbc1 c2
42. Rg2 Rgc8 43. Bd4 Be7 44. Bb2 d4 45. Bxd4 Ba3 46. Rgg1 Ra8 47. Bb6 Rc3
48. Bd4 Rc6 49. Bb6 Rb8 50. Kg2 Ke7 51. Kh3 Ke6 52. Kg2 Bb4 53. Kf1 Rbc8 54. Bd4
R8c7 55. Be5 Rd7 56. Bd4 Bd2 57. Ke2 Bxa5 58. Rg2 Bb4 59. Kf1 Rdc7 60. Bb2 Kd5
61. Kg1 Ke6 62. Kf1 Rc8 63. Kg1 R6c7 64. Re2 Ke7 65. Kh2 Rc5 66. Bd4 R5c7
67. Bb2 Ke8 68. Rf2 Kf8 69. Be5 Rc6 70. Bb2 Ke8 71. Rg2 R8c7 72. Re2 Kf8 73. Rg2
Ke8 74. Re2 Kf8 75. Rg2 Kg8 76. Re2 Rc8 77. Rg2 Kf8 78. Rf2 Ke7 79. Rg2 Kd7
80. Re2 Ke7 81. Rg2 Bd6 82. Rd2 Kf8 83. Rd5 Kg8 84. Rxf5 Rb6 85. Rf7 Bf8 86. b4
Rxb4 87. Rxf8+ Rxf8 88. Rxc2 Rbb8 89. Be5 Rfc8 90. Ra2 Rb3 91. Bd4 Rb1 92. Kh3
Rb7 93. Ra4 Kh8 94. Be5 Rf8 95. Rxe4 Ra8 96. Rc4 Rd7 97. e4 Kg8 98. Kg4 Rda7
99. Rd4 Re8 100. Rd3 Rea8 101. Rd2 Re7 102. Rd6 Rae8 103. Rd3 Kh8 104. Kg3 Rb7
105. Kf2 Kg8 106. Ke3 Ra7 107. Kf3 Kh8 108. Bd4 Ra2 109. Rb3 Ra4 110. Be5 Ra7
111. Bb2 Rf8 112. f5 Rc7 113. Ke3 Kg8 114. Be5 Ra7 115. Bd4 Rc7 116. Rb4 Rd7
117. Ra4 Rb7 118. Rc4 Re8 119. e5 Rb1 120. Kf2 Rd1 121. f6 1-0
[/pgn]

jdart · Post by **jdart** » Sun Jul 26, 2020 10:26 pm

How did you gain 700 Elo points? What were the most significant steps?

11.7 was a long time ago. There were a lot of steps. There is actually a lengthy changelog if you want to see: https://github.com/jdart1/arasan-chess/ ... oc/CHANGES

--Jon

OliverBr · Post by **OliverBr** » Sun Jul 26, 2020 11:29 pm

jdart wrote: ↑Sun Jul 26, 2020 10:26 pm
11.7 was a long time ago. There were a lot of steps. There is actually a lengthy changelog if you want to see: https://github.com/jdart1/arasan-chess/ ... oc/CHANGES

--Jon

It would be interesting to know which steps gained the most strength.

PS: At this very moment I am letting Leela (with a GTX 1080 Ti) analyze the game I posted. She sees only one blunder and losing move in Arasan's play, and that is 86...Rxb4??. OliThink did not see a win here; it just exchanged material and got rid of the nasty passed pawn.

Here is the result of a 1000-game match between OliThink 5.5.9d and Arasan 22.1:

   # PLAYER             :  RATING  ERROR  POINTS  PLAYED   (%)    W    D    L  D(%)  CFS(%)
   1 Arasan 22.1        :     788     94   989.0    1000  98.9  984   10    6   1.0     100
   2 OliThink 5.5.9d    :       0   ----    11.0    1000   1.1    6   10  984   1.0     ---

White advantage = -2.80 +/- 2.02
Draw rate (equal opponents) = 6.11 % +/- 4.91
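
For a rough cross-check of a table like this, the Elo difference can be estimated directly from the W/D/L counts with the standard logistic formula. The sketch below is a simplification: it uses a normal approximation for the error margin and does not reproduce the fitting the rating tool performs, so the numbers land near the reported rating and error rather than matching them exactly.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction strictly between 0 and 1."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_with_interval(wins, draws, losses, z=1.96):
    """Return (elo, low, high) using a normal approximation on the mean score."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (wins * (1 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0 - score) ** 2) / n
    margin = z * math.sqrt(variance / n)
    low = max(score - margin, 1e-9)      # clamp to keep the logarithm defined
    high = min(score + margin, 1 - 1e-9)
    return elo_from_score(score), elo_from_score(low), elo_from_score(high)

# W/D/L for Arasan 22.1 from the table above.
elo, low, high = elo_with_interval(984, 10, 6)
print(f"{elo:.0f} ({low:.0f} .. {high:.0f})")
```

At a 98.9% score the interval is wide and lopsided, because near the ends of the scale a tiny change in score maps to a large change in Elo.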

PS2:
This may be a mistake: Arasan 22.1 resigned while winning. It looks like it happened one other time; the other wins were correct.
[pgn]
[Event "?"]
[Site "?"]
[Date "2020.07.26"]
[Round "479"]
[White "Arasan 22.1"]
[Black "OliThink 5.5.9d"]
[Result "0-1"]

1. d4 Nf6 2. a3 g6 3. e3 Bh6 4. Nf3 d6 5. Bd3 c5 6. Nc3 Nc6 7. d5 Nb8 8. e4 Bxc1
9. Qxc1 h6 10. O-O Na6 11. Qe3 Nc7 12. b4 Ng4 13. Qd2 cxb4 14. axb4 Bd7 15. e5
dxe5 16. Rfe1 Kf8 17. Nxe5 Nxe5 18. Rxe5 Bc8 19. h4 Qd6 20. Rae1 Qxb4 21. Rxe7
Ne6 22. dxe6 Bxe6 23. R7xe6 fxe6 24. Bxg6 Qxh4 25. Rxe6 Rd8 26. Nd5 Rh7 27. Bxh7
Kf7 28. Re7+ Qxe7 29. Qf4+ Ke8 30. Bg6+ Kd7 0-1
[/pgn]

jdart · Post by **jdart** » Mon Jul 27, 2020 2:58 am

OliverBr wrote: ↑Sun Jul 26, 2020 11:29 pm
This may be a mistake: Arasan 22.1 resigned while winning. It looks like it happened one other time; the other wins were correct.

Could be a bug. I'll see if I can reproduce. I usually run matches with the "-t" flag to Arasan on the command line, which causes it to put out a lot of debug information. But without that log it is hard for me to tell what might be happening.

--Jon

OliverBr · Post by **OliverBr** » Mon Jul 27, 2020 6:01 pm

jdart wrote: ↑Mon Jul 27, 2020 2:58 am
OliverBr wrote: ↑Sun Jul 26, 2020 11:29 pm
This may be a mistake: Arasan 22.1 resigned while winning. It looks like it happened one other time; the other wins were correct.
Could be a bug. I'll see if I can reproduce. I usually run matches with the "-t" flag to Arasan on the command line, which causes it to put out a lot of debug information. But without that log it is hard for me to tell what might be happening.

--Jon

If you want, the next time I run a 1000-game match against ArasanX I can run it with the "-t" flag (and the same flag for cutechess-cli). If there is another incident I can provide you with the log file, even though I expect it to be huge. It will get a little messy with 32 concurrent games, that much I already know.
