Arasan test suite update


BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Arasan test suite update

Post by BubbaTough »

I guess that human judgement is wrong about as often as computer judgement is wrong. And that is a surprisingly large number. In other words, old test sets that have not been carefully scrutinized are always full of bugs.
Note that I never recommended relying on naked human judgement, but on human judgement aided by computer tools. I would say that the judgement of humans who get to use computers as tools is significantly better than computer (or human) judgement on its own. Those old test set errors are situations where, if a human looks at the computer analysis, the reaction is "of course!", which means the human using a computer is not wrong at all about those positions. It would be relatively easy for me to come up with a dozen positions computers will get wildly wrong, but hard for me (or computers) to come up with positions that a human using a computer as a tool would get wrong. Someday this may change, but in my opinion we are many years away from it.

-Sam
Dann Corbit
Posts: 12777
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Arasan test suite update

Post by Dann Corbit »

BubbaTough wrote:
I guess that human judgement is wrong about as often as computer judgement is wrong. And that is a surprisingly large number. In other words, old test sets that have not been carefully scrutinized are always full of bugs.
Note that I never recommended relying on naked human judgement, but on human judgement aided by computer tools. I would say that the judgement of humans who get to use computers as tools is significantly better than computer (or human) judgement on its own. Those old test set errors are situations where, if a human looks at the computer analysis, the reaction is "of course!", which means the human using a computer is not wrong at all about those positions. It would be relatively easy for me to come up with a dozen positions computers will get wildly wrong, but hard for me (or computers) to come up with positions that a human using a computer as a tool would get wrong. Someday this may change, but in my opinion we are many years away from it.

-Sam
Consider a super GM playing against a top level program at decent time control (e.g. G/90 or slower).
Every position that they face can be considered as an EPD test problem.
The computer will fail to understand some of the GM moves.
The GM will fail to understand some of the computer moves.
On the average, the computer will understand better than the GM.
At correspondence chess, the GM may put the computer in a bind more often than vice versa, but those days, too, shall end.

When it comes to judgement passed by top programs and top-level GMs, I tend to trust neither as correct until there is absolute proof of the judgement. Of course, both know better than I do, and their suggestion is better than mine. But I would not consider either sort of judgement as 'cast in stone', because both sources are known to have errors in judgement.
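
As an illustration of the idea that every game position can be treated as a candidate EPD problem, here is a minimal Python sketch. It assumes the python-chess package and a PGN file named game.pgn (both assumptions on my part, not anything posted in this thread). It writes one EPD record per position with the move actually played as the bm candidate, skipping captures and checks so that mostly quiet moves remain; the output would still need the human-plus-computer vetting discussed above.

# Sketch: turn the moves of one PGN game into candidate EPD test records.
# Assumes the python-chess package and a file named "game.pgn" (illustrative).
import chess
import chess.pgn

with open("game.pgn") as f:
    game = chess.pgn.read_game(f)

board = game.board()
for ply, move in enumerate(game.mainline_moves(), start=1):
    # Prefer quiet candidates: skip captures and checks, which tend to be
    # trivial recaptures or tactics that engines find easily.
    if not board.is_capture(move) and not board.gives_check(move):
        print(board.epd(bm=move, id=f"game-ply-{ply}"))
    board.push(move)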
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Arasan test suite update

Post by BubbaTough »

Consider a super GM playing against a top level program at decent time control (e.g. G/90 or slower).
Every position that they face can be considered as an EPD test problem.
Using my definitions, games do not really contain an EPD test position at every move, because in most positions there are many moves that lead to the same theoretical result (and thus there is no single best move). Relatively few positions contain only a single move that preserves the theoretical result, and many of those are trivial (a piece recapture, say). Those few positions that do contain a single best move which is also difficult to find are not always easy to identify, for human or computer. Clearly there are positions humans are better at, and positions computers are better at...but it is quite unclear whether there are ANY positions where humans using computers are at a disadvantage. Certainly centaur matches, and the fact that some postal players almost always win against others who are also armed with computers, indicate that for now strong humans using computers are your best bet for generating trustworthy judgements.
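
One way to screen mechanically for those single-best-move positions, before any human look, is a MultiPV gap check: analyse with two PVs and keep the position only if the top move scores well clear of the second. A rough Python sketch follows; the python-chess package, the engine path, and the 100-centipawn gap are all my own illustrative assumptions, and passing the check only flags a candidate rather than proving anything about the theoretical result.

import chess
import chess.engine

def single_best_move(fen, engine_path="/usr/local/bin/engine",
                     depth=20, gap_cp=100):
    """Return the top move if it scores at least gap_cp centipawns above
    the second-best move at the given depth, otherwise None."""
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        infos = engine.analyse(board, chess.engine.Limit(depth=depth),
                               multipv=2)
    finally:
        engine.quit()
    if len(infos) < 2:
        return None  # only one legal move: trivial, not a useful test
    best = infos[0]["score"].pov(board.turn).score(mate_score=100000)
    second = infos[1]["score"].pov(board.turn).score(mate_score=100000)
    if best - second >= gap_cp:
        return infos[0]["pv"][0]
    return None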
jdart
Posts: 4402
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Arasan test suite update

Post by jdart »

I guess that 98% of moves where the score is +300 centipawns (given a long search with a strong engine) win the game.
Also, as a practical matter, the majority of the test positions I have been posting come from games - and in many cases the test move was actually played by the winning side. (Some positions are from analyzed sidelines, are better alternatives a player failed to find, or are "avoid moves" where a blunder occurred.)
I guess that in 25% of the cases where one strong program chooses move X and another strong program chooses move Y after a long search, the move with the lower score is really better.
There was a study showing that the frequency of PV switches (having the "best move" change) didn't really decrease with increasing search depth. So it's always possible (until you hit TBs or find a mate) that more search will give you a different result - it's even probable. But it is possible to say with some certainty that, for realistic search depths on current hardware, a certain move is best - what that really means is that a better alternative is outside the range of the search hardware/software we currently have.

--Jon
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Arasan test suite update

Post by BubbaTough »

There was a study showing that the frequency of PV switches (having the "best move" change) didn't really decrease with increasing search depth.
Yes, I have heard that, but it's one of those studies whose conclusion seems unlikely enough that it makes me doubt the method. It's hard to imagine that, if you turned off hashing (which can artificially mess with search depth), the PV at depth 1 would not change more often than at depth 10. The results would certainly be explained by leaving hashing on, which makes lower-depth nodes change their PV a lot less (because they are not really lower depth).
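
The experiment is simple enough to run. Below is a rough Python sketch of it; python-chess and the engine path are my own assumptions, and restarting the engine for every depth is a blunt way to guarantee each search starts with an empty hash table. Running it over a batch of positions and counting switches per depth would show whether the rate really stays flat as depth grows.

import chess
import chess.engine

def best_move_per_depth(fen, max_depth=12,
                        engine_path="/usr/local/bin/engine"):
    """Search the position once per depth, each time with a freshly
    started engine (so the hash table is empty), and report where the
    best move changes from the previous depth."""
    board = chess.Board(fen)
    previous = None
    for depth in range(1, max_depth + 1):
        engine = chess.engine.SimpleEngine.popen_uci(engine_path)
        try:
            info = engine.analyse(board, chess.engine.Limit(depth=depth))
        finally:
            engine.quit()
        move = info["pv"][0]
        switched = previous is not None and move != previous
        print(f"depth {depth:2d}: {board.san(move)}"
              + ("  <-- switch" if switched else ""))
        previous = move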

-Sam
jdart
Posts: 4402
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Arasan test suite update

Post by jdart »

I don't think they were making this observation measuring from depth 1. At depth 1 you always have lousy move ordering. By the time you've done several iterations it's better, so you'll get a "normal" range of fail highs or lows over the move list.

--Jon
jdart
Posts: 4402
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: 10.15 and 10.108

Post by jdart »

Re 10.15: I agree Kf6 does not save Black, although it may be the best of a set of bad alternatives. Maybe not a good test if you like to see a winning move.

Re 10.108: I don't really understand why Rybka doesn't get this, because I think the sac is sound. Output at 45 min/move:

found 5-man tablebases in directory c:\chess\tb
"arasan10.15" bm Kf6
result: Kf6 score: -1.96 ++ solved in 8.44 sec. (25.32M nodes)
Kf6 Kxf4 Ke6 h4 Kd6 Kg4 Kc7 Kh5 Nc5 Kxh6 a4 h5 a3 Bb1 Ne6 Kh7 Kxb7 h6 Kc7 Kg6 Kd7 Kf6
result(2): a4 score: -3.04 ** not solved in 2400.13 secs. (10198.93M nodes)
a4 Bxa4 Kf5 h4 Nb8 Bc2+ Ke5 Kg4 Nd7 Ba4 Nb8 Bd1 Na6 Bf3 Kf6 Kxf4 Nb8 Be2
result(3): h5 score: -2.61 ** not solved in 2400.02 secs. (10177.85M nodes)
h5 h4+ Kxh4 Kxf4 Kh3 Ke5 Kg4 Bd3 Nb8 Kd6 a4 Kc7 a3 Bc4 Na6+ Bxa6 a2 b8=Q a1=Q Qb4+ Kg5 Qd2+ Kg6 Bd3+ Kf7 Bc4+ Kg6 Qd3+ Kh6 Qe3+ Kg6 Qg3+ Kh7 Qg8+
"arasan10.108" "Crafty-Arasan, ACCA Americas' Ch 2007" bm Bxh6
result: Bxh6 score: +3.40 ++ solved in 50.91 sec. (148.71M nodes)
Bxh6 gxh6 Qxh6 Ng7 Nf4 Qxd4 Ng4 Rd8 Nxf6+ Bxf6 Bh7+ Kf8 Rxd4 Bxd4 Qg5 Ke8 Ng6 f6 Qh6 Rd7 Nf4 Bxb2 Nxe6 Nxe6 Rxe6+ Kd8
result(2): Bc2 score: +1.61 ** not solved in 2400.03 secs. (6809.98M nodes)
Bc2 Nh7 Bxh6 gxh6 Qxh6 f5 Rd3 Bf6 Rh3 Qc7 Rg3+ Bg7 Bb3 Qe7 Nf4 Rf6 Qh5 Nc7 Nfg6 Qd8 Rh3
result(3): g4 score: +1.53 ** not solved in 2400.14 secs. (6742.34M nodes)
g4 Nh7 Bxh6 gxh6 Qxh6 f5 gxf5 Rf6 Qe3 Ng7 Ng4
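
For anyone who wants to reproduce this kind of run, a bare-bones EPD runner looks roughly like the sketch below. The python-chess package, the engine path, and a file of one-record-per-line EPDs are all assumptions on my part; a real harness would also report scores, node counts, and results for alternative moves, as in the output above.

import chess
import chess.engine

def run_epd_suite(epd_file, engine_path="/usr/local/bin/engine",
                  seconds_per_move=60):
    """Play each EPD position for a fixed time and count it as solved
    if the engine's chosen move is one of the listed bm moves."""
    solved = total = 0
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        with open(epd_file) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                board, ops = chess.Board.from_epd(line)
                if "bm" not in ops:
                    continue  # only score records that specify a best move
                total += 1
                played = engine.play(board,
                                     chess.engine.Limit(time=seconds_per_move))
                ok = played.move in ops["bm"]
                solved += ok
                print(ops.get("id", "?"), "solved" if ok else "not solved")
    finally:
        engine.quit()
    print(f"{solved}/{total} solved")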
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Arasan test suite update

Post by jwes »

jdart wrote:This problem (10.1) has a "shortest mate" solution but there are other moves that are forced mates, too, just longer. So I agree in this case (and for similar mate problems) that you shouldn't just count the shortest mate as correct. Some engines will hit on a longer mate first.

But I disagree in general that all "winning" moves should be equivalent solutions, if you mean by "win" less than a mate score. I think if a move gives a superior eval (you can pick your number but I generally like to see +1 pawn at least over alternatives) then it can be considered best.

--Jon
I think a good problem should have only one move that is a clear win (or draw). I don't think that an absolute eval difference is the right criterion: if one move gives +11 and another gives +10, I would count them both as solutions, but if one move gives +1.5 and another gives +0.5, I would count only the first as a solution, even if the second could actually be nursed to a win too deep to see.
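
One way to put that intuition into a rule a test harness could apply is to compare outcome bands rather than raw eval gaps. The band thresholds in the Python sketch below are purely my own illustrative choices, not anything proposed in this thread.

def same_outcome_band(best_cp, alt_cp, win_cp=300, draw_cp=50):
    """Count an alternative move as an equivalent solution only if it
    falls in the same rough outcome band as the best move.  Bands (in
    centipawns) are illustrative: winning >= win_cp, losing <= -win_cp,
    drawish within +/- draw_cp, unclear otherwise."""
    def band(cp):
        if cp >= win_cp:
            return "winning"
        if cp <= -win_cp:
            return "losing"
        if abs(cp) <= draw_cp:
            return "drawish"
        return "unclear"
    return band(best_cp) == band(alt_cp)

# The examples above: same_outcome_band(1100, 1000) -> True (both clearly winning).
#                     same_outcome_band(150, 50)    -> False (only the first counts).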
Dann Corbit
Posts: 12777
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Arasan test suite update

Post by Dann Corbit »

BubbaTough wrote:
Consider a super GM playing against a top level program at decent time control (e.g. G/90 or slower).
Every position that they face can be considered as an EPD test problem.
Using my definitions, games do not really contain an EPD test position at every move, because in most positions there are many moves that lead to the same theoretical result (and thus there is no single best move). Relatively few positions contain only a single move that preserves the theoretical result, and many of those are trivial (a piece recapture, say). Those few positions that do contain a single best move which is also difficult to find are not always easy to identify, for human or computer.
I guess that on average there are about 3 good moves for most positions. For most of these, one will be a little better than the others.

I think that the most important test moves are the quiet moves and not the tactical shots. Computers are really good at finding the tactical shots but typically not that great at finding the good quiet moves.
Clearly there are positions humans are better at, and positions computers are better at...but it is quite unclear whether there are ANY positions where humans using computers are at a disadvantage. Certainly centaur matches, and the fact that some postal players almost always win against others who are also armed with computers, indicate that for now strong humans using computers are your best bet for generating trustworthy judgements.
At some point, due to the inevitable progress of hardware, humans won't be able to contribute anything to the analysis of computers. For instance, if Rybka 3.0 thought for a year about any position, I guess Rybka's decision about that position would be superior to any human's, even today. So when we get computers that are 365.24 * 86400 times (one year's worth of seconds) faster than today's, they will out-think us in one second. That factor is roughly 2^25, so at one hardware doubling per year the wait may seem distant, but it is only about 25 years.
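
The arithmetic behind that estimate, with the doubling-per-year assumption spelled out (the assumption is mine; the post itself only gives the factor and the 25-year figure):

import math

seconds_per_year = 365.24 * 86400        # about 31.6 million seconds
doublings = math.log2(seconds_per_year)  # about 24.9
print(f"speedup factor: {seconds_per_year:,.0f} (about 2^{doublings:.1f})")
print(f"at one hardware doubling per year: roughly {round(doublings)} years")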
jdart
Posts: 4402
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: 10.192

Post by jdart »

I think this one is very hard but Ke4 is best:

"arasan10.192" "Taborov-Vovk, Kiev 1993" bm Ke4
result: Ke4 score: +4.24 ++ solved in 409.75 sec. (1643.27M nodes)
Ke4 Ke6 Kd4 Ke7 Kc3 a5 Kd4 h6 h3 h5 h4 c3 Kxc3 Ke6 Kd4 b4 axb4 axb4 Kc4 b3 Kxb3 Ke7 c7 Kd7
result(2): Kd4 score: +2.20 ** not solved in 2400.03 secs. (9271.69M nodes)
Kd4 a5 h4 h5 a4 c3 Kxc3 b4+ Kd3 Ke6 Kd4 Ke7 Kc4 Ke6 Kd3 Ke7 Kd4 Ke6 Kc4 Ke7 Kb5 b3 Kxa5 b2 c7 b1=Q c8=Q Qe1+ Ka6 Qxh4
result(3): h4 score: +2.20 ** not solved in 2400.08 secs. (8998.25M nodes)
h4 a5 Kd4 h5 a4 c3 Kxc3 b4+ Kc2 Ke6 Kb2 Ke7 Kb1 Ke6 Kc2 Ke7 Kd3 Ke6