I guess that human judgement is wrong about as often as computer judgement is wrong. And that is a surprisingly large number. In other words, old test sets that have not been carefully scrutinized are always full of bugs.
Note that I never recommended relying on naked human judgement, but human judgement aided by the use of computer tools. I would say that the judgement of humans who get to use computers as tools is significantly better than computer (or human) judgement on its own. Those old test set errors are situations where, if a human looks at the computer analysis, they say "of course!", which means that the human using a computer is not wrong at all about those positions. It would be relatively easy for me to come up with a dozen positions computers will get wildly wrong, but hard for me (or computers) to come up with positions that a human using a computer as a tool would get wrong. Someday this may change, but in my opinion we are many years away from it.
-Sam
Consider a super GM playing against a top level program at a decent time control (e.g. G/90 or slower).
Every position that they face can be considered as an EPD test problem.
The computer will fail to understand some of the GM moves.
The GM will fail to understand some of the computer moves.
On average, the computer will understand better than the GM.
In correspondence chess, the GM may put the computer in a bind more often than vice versa, but those days, too, shall end.
When it comes to judgement passed by top programs and top level GMs, I tend to trust neither as being correct until there is absolute proof of the judgement. Of course, both know better than I do and their suggestion is better than mine. But I would not consider either sort of judgement as 'cast in stone', because both sources are known to have errors in judgement.
Consider a super GM playing against a top level program at a decent time control (e.g. G/90 or slower).
Every position that they face can be considered as an EPD test problem.
Using my definitions, games do not really contain an EPD test at every move, because in most positions there are many moves that lead to the same theoretical result (and thus have no best move). Relatively few of the positions contain only a single move that preserves a theoretical result, and many of those are trivial (piece recapture). Those few positions that do contain single best moves where those moves are difficult to find are not always easy to identify, for human or computer. Clearly there are positions humans are better at, and positions computers are better at...but it is quite unclear whether there are ANY positions where humans using computers are at a disadvantage. Certainly centaur matches, and the fact that some postal players almost always win against others armed with computers, indicate that for now strong humans using computers are your best bet in terms of generating trustworthy judgments.
I guess that 98% of moves where the score is +300 centipawns {given a long search with a strong engine} win the game.
Also, as a practical matter, the majority of the test positions I have been posting come from games - and in many cases the test move was actually played by the winning side. (Some positions are from analyzed sidelines, are better alternatives a player failed to find, or are "avoid moves" where a blunder occurred).
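To make the bm/am terminology concrete, here is a minimal sketch of how such test records can be built and checked. It assumes the python-chess library; the positions, moves and ids below are illustrative placeholders, not entries from the posted test sets.

```python
# Minimal sketch using the python-chess library; the positions, moves and ids
# are illustrative placeholders, not entries from the posted test sets.
import chess

# A "bm" (best move) record: a position from a game where the winning side's
# move is taken as the solution.
board = chess.Board()
for san in ["e4", "c5"]:
    board.push_san(san)
bm_epd = board.epd(bm=board.parse_san("Nf3"), id="example.001")

# An "am" (avoid move) record: the position just before a blunder, with the
# blunder marked as the move to avoid (3...Nf6?? here allows 4.Qxf7#).
board2 = chess.Board()
for san in ["e4", "e5", "Qh5", "Nc6", "Bc4"]:
    board2.push_san(san)
am_epd = board2.epd(am=board2.parse_san("Nf6"), id="example.002")

def solves(epd_line, candidate_san):
    """True if candidate_san satisfies the bm/am opcodes of one EPD record."""
    test_board = chess.Board()
    ops = test_board.set_epd(epd_line)      # bm/am come back as lists of Move objects
    move = test_board.parse_san(candidate_san)
    if "bm" in ops and move not in ops["bm"]:
        return False
    if "am" in ops and move in ops["am"]:
        return False
    return True

print(bm_epd)                    # the position plus its bm and id opcodes
print(solves(am_epd, "Nf6"))     # False: Nf6 is the move to avoid
```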
I guess that in 25% of the cases where one strong program chooses move X and another strong program chooses move Y after a long search, the move with the lower score is really better.
There was a study showing that the frequency of pv switches (having the "best move" change) didn't really decrease with increasing search depth. So it's always possible (until you hit TBs or find a mate) that more search will give you a different result - it's even probable. But it is possible to say with some certainty that for realistic search depths on current hardware a certain move is best - what that really means is a better alternative is outside the range of search hardware/software we currently have.
there was a study showing that the frequency of pv switches (having the "best move" change) didn't really decrease with increasing search depth.
Yes, I have heard that, but it's one of those studies whose conclusion seems unlikely enough that it makes me doubt the method. It's hard to imagine that, if you turned off hashing (which can artificially mess with search depth), the depth 1 pv would not change more than the depth 10 pv. The results would certainly be explained by leaving hashing on, which causes lower depth nodes to change their pv a lot less (because they are not really lower depth).
I don't think they were making this observation measuring from depth 1. At depth 1 you always have lousy move ordering. By the time you've done several iterations it's better, so you'll get a "normal" range of fail highs or lows over the move list.
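One rough way to probe this on current hardware is to re-analyse the same position at successive depths and count how often the first move of the pv changes. The sketch below assumes the python-chess library and some UCI engine at a placeholder path; it is only an illustration, not the method of the study mentioned above.

```python
# Rough sketch: count pv switches across iteration depths for one position.
# Assumes the python-chess library and a UCI engine at a placeholder path.
import chess
import chess.engine

ENGINE_PATH = "/path/to/uci-engine"   # placeholder

def count_pv_switches(fen, max_depth=20):
    """Count how often the first pv move changes as the search depth increases."""
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    switches = 0
    previous_best = None
    try:
        for depth in range(1, max_depth + 1):
            info = engine.analyse(board, chess.engine.Limit(depth=depth))
            pv = info.get("pv")
            if not pv:
                continue                  # some engines report no pv at very low depth
            best = pv[0]
            if previous_best is not None and best != previous_best:
                switches += 1
            previous_best = best
    finally:
        engine.quit()
    return switches

# Note: the hash table persists between analyse() calls here, so to test the
# "hashing on vs. off" question you would want to restart the engine (or clear
# its hash, if the engine supports that) between depths.
```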
Re 10.15: I agree Kf6 does not save Black, although it may be the best of a set of bad alternatives. Maybe not a good test if you like to see a winning move. Re 10.108: I don't really understand why Rybka doesn't get this, because I think the sac is sound. Output at 45min/move:
jdart wrote: This problem (10.1) has a "shortest mate" solution but there are other moves that are forced mates, too, just longer. So I agree in this case (and for similar mate problems) that you shouldn't just count the shortest mate as correct. Some engines will hit on a longer mate first.
But I disagree in general that all "winning" moves should be equivalent solutions, if you mean by "win" less than a mate score. I think if a move gives a superior eval (you can pick your number but I generally like to see +1 pawn at least over alternatives) then it can be considered best.
--Jon
I think a good problem should have only one move that is a clear win (or draw). I don't think that an absolute eval difference is the right criterion, e.g. if one move gives +11 and another is +10, I would count them both as solutions, but if one move gives 1.5 and another move is 0.5, I would count only the first as a solution, even if the second actually can be nursed to a win too deep to see.
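As a toy illustration of that idea (both moves count when both are clearly winning, otherwise the solution must beat the alternatives by a clear margin), here is a sketch; the 3.00 "winning" cutoff and the 1.00-pawn gap are arbitrary example numbers, not anyone's stated criteria.

```python
# Toy sketch of the acceptance rule discussed above; the thresholds are
# arbitrary example numbers, not anyone's stated criteria.
WIN_THRESHOLD = 3.00   # scores (in pawns) above this are treated as "winning anyway"
MIN_GAP = 1.00         # otherwise the best move must beat alternatives by this much

def accepted_solutions(scored_moves):
    """scored_moves: list of (san, score_in_pawns) pairs, best score first.
    Returns the moves that would be counted as correct solutions."""
    best_san, best_score = scored_moves[0]
    solutions = [best_san]
    for san, score in scored_moves[1:]:
        if score >= WIN_THRESHOLD:
            # +11 vs +10: both moves are trivially winning, so both count.
            solutions.append(san)
        elif best_score - score < MIN_GAP:
            # No clear gap, so the alternative cannot be ruled out either.
            solutions.append(san)
    return solutions

print(accepted_solutions([("Qxf7+", 11.0), ("Rxe8+", 10.0)]))  # both moves accepted
print(accepted_solutions([("Nd5", 1.5), ("h3", 0.5)]))         # only Nd5 accepted
```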
Consider a super GM playing against a top level program at a decent time control (e.g. G/90 or slower).
Every position that they face can be considered as an EPD test problem.
Using my definitions, games do not really contain an EPD test at every move, because in most positions there are many moves that lead to the same theoretical result (and thus have no best move). Relatively few of the positions contain only a single move that preserves a theoretical result, and many of those are trivial (piece recapture). Those few positions that do contain single best moves where those moves are difficult to find are not always easy to identify, for human or computer.
I guess that on average there are about 3 good moves for most positions. For most of these, one will be a little better than the others.
I think that the most important test moves are the quiet moves and not the tactical shots. Computers are really good at finding the tactical shots but typically not that great at finding the good quiet moves.
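One way to hunt for such positions semi-automatically is to keep only candidates where a deep multipv search shows a single move clearly ahead of the runner-up, and then let a human vet the survivors. A sketch, assuming the python-chess library, a UCI engine at a placeholder path, and an arbitrary 75-centipawn margin:

```python
# Sketch: keep only positions where one move is clearly best according to a
# deep multipv search.  Assumes python-chess, a UCI engine at a placeholder
# path, and an arbitrary 75-centipawn margin.
import chess
import chess.engine

ENGINE_PATH = "/path/to/uci-engine"   # placeholder

def has_unique_best_move(fen, depth=20, margin_cp=75):
    """True if the top move scores at least margin_cp above the second-best move."""
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    try:
        infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=2)
    finally:
        engine.quit()
    if len(infos) < 2:
        return True                      # only one legal move: trivially unique
    scores = [info["score"].relative.score(mate_score=100000) for info in infos]
    return scores[0] - scores[1] >= margin_cp
```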
Clearly there are positions humans are better at, and positions computers are better at...but it is quite unclear whether there are ANY positions where humans using computers are at a disadvantage. Certainly centaur matches, and the fact that some postal players almost always win against others armed with computers, indicate that for now strong humans using computers are your best bet in terms of generating trustworthy judgments.
At some point, due to the inevitable progress of hardware, humans won't be able to contribute anything to the analysis of computers. For instance, if Rybka 3.0 thought for a year about any position, I guess Rybka's decision about that position would be superior to any human's, even today. So when we get computers that are 365.24 * 86400 times faster than today, they will outthink us in one second. That may seem distant, but 365.24 * 86400 is roughly 2^25, so if hardware speed keeps doubling about once a year it is only about 25 years away.