Laskos wrote:
syzygy wrote:
Laskos wrote:
syzygy wrote:(Maybe Kai should have taken the Elo gain per core observed in other threads, rounded to one decimal, and based on that reconstructed the times to, say, 3 decimals? Would have made a fine scientific paper. However, this is just a forum thread.)
The problem is that the times are not universal; they depend on TC or depth. (...)
Please know that what I wrote between parentheses was purposeful nonsense.
The way you measured it was the proper way to do it. Bob did it in exactly the same way not long ago. See here, here and here:
bob wrote:To measure this "the extra search nodes can help" one can do a very simple test.
Play a series of matches to fixed depth using 1 cpu, then 2 cpus, then 4. Since the depth is fixed, the time to complete is irrelevant, and the only change will be the search overhead. If it does help, the 4 cpu fixed depth program should perform better than the 1 cpu fixed depth program.
I can run that test although I already am almost 100% certain about the outcome, thanks to past testing...
bob wrote:This is going to only answer one question, "Is the parallel search overhead all wasted effort or is there some accuracy/quality gained from it?"
bob wrote:After the first two-thread match is almost complete, zero change in Elo from the two 1-thread runs. Got another 2 thread run for verification queued up, and then 4 thread runs. Should be able to answer this subject definitively tomorrow if nothing goes wrong overnight.
Thanks Ronald, I was unaware of these posts. So, Bob was not so out of his depth as he seemed to be in this thread. The result in my first post contradicted his "no gain" rule (belief?), and he started to post some preposterous arguments. A bit strange to deny and distort hard facts on the very shaky grounds that it cannot happen. "That can not possibly be, because it could never possibly be." (Chekhov)
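For anyone who wants to try the quoted procedure, here is a minimal sketch built on the python-chess library. This is my own illustration, not anyone's actual test harness; the engine paths, the fixed depth of 12 and the 100-game match length are placeholder assumptions, not values from the thread.

# Minimal sketch of the fixed-depth test described in the quoted posts,
# using the python-chess library.  Paths, depth and game count are
# placeholders; a serious run needs an opening book and many more games.
import chess
import chess.engine

ENGINE_PATH   = "./engine_under_test"   # placeholder path
OPPONENT_PATH = "./opponent_engine"     # placeholder path
DEPTH = 12                              # fixed search depth for both sides
GAMES = 100                             # games per thread count

def play_game(white, black):
    """Play one fixed-depth game and return the PGN result string."""
    board = chess.Board()
    while not board.is_game_over():
        engine = white if board.turn == chess.WHITE else black
        board.push(engine.play(board, chess.engine.Limit(depth=DEPTH)).move)
    return board.result()

for threads in (1, 2, 4):
    test = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    opp = chess.engine.SimpleEngine.popen_uci(OPPONENT_PATH)
    test.configure({"Threads": threads})    # only the tested engine scales
    score = 0.0
    for g in range(GAMES):
        # Alternate colours; score is counted from the tested engine's side.
        if g % 2 == 0:
            score += {"1-0": 1.0, "0-1": 0.0}.get(play_game(test, opp), 0.5)
        else:
            score += {"0-1": 1.0, "1-0": 0.0}.get(play_game(opp, test), 0.5)
    test.quit()
    opp.quit()
    print(f"{threads} thread(s): {score}/{GAMES} vs the same opponent")

If the fixed-depth scores with 1, 2 and 4 threads come out about equal, the extra nodes searched in parallel bought nothing; if the multi-threaded runs score measurably higher, the wider tree is adding real strength.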
Here's some context, dating back to the 70s, including names such as Marsland, Schaeffer and Newborn in the beginning, and ending with names like Hsu, Hyatt, etc. in recent years.
EVERYONE has used parallel search to increase depth, up to the point where the search depth doesn't improve any further. Schaeffer with Sun Phoenix comes to mind, using part of the nodes to run a normal search and part to run a material-only search. The ONLY reason anyone has done anything similar is that they reach a point (Schaeffer's program ran on a local network of Suns interconnected with 10 Mbit/sec Ethernet, so communication was a big issue) where additional processors don't help with the depth at all and the speedup flattens out (or even turns south in some cases).
So, depth is key, and always has been. For this "widening" to be realistic, we first assume that Komodo's parallel search simply can't produce a reasonable speedup with a paltry 4 cpus. Where EVERYONE is getting something in the range of 3.1-3.3, Komodo falls short. That leaves one possibility: searching wider (less pruning) to make the tree bigger, where it is easier to search in parallel, but with less depth increase. But will the width help enough to offset the loss in depth? That has not happened to date, particularly with just 4 cores. Hence I consider this to be beyond unlikely...
ANY parallel search searches a wider tree. Most know that YBW is the common "trigger" criterion. Once you have searched one move at any particular node, you can reasonably assume that it is an ALL node that needs every branch searched. A split there is perfect. The fly in the ointment is that we are only right about 90% of the time (at least in Crafty, 90+% of the time when a fail high occurs, it happens on the first move). That means that 10% of the splits incur some overhead. If you use 4 cores, in the worst case you waste 3 of the cores because the second move causes the fail high. In other cases there is less overhead, when the fail high comes on the 3rd, and finally no overhead if it comes on the 5th. That makes the tree wider. But to date, no experimental results have shown that such overhead improves the quality. I have run several such tests, solely to verify that the parallel search is not broken (a search to fixed depth with 1, 2 or 4 cores should produce about the same Elo against a common set of opponents; if it drops with > 1 core, I know I have a bug).
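To put rough numbers on that overhead argument, here is a toy Python illustration (my own sketch, not Crafty code): split the node after the first move, hand the next four moves to four cores, and count how many of those parallel searches a serial search would never have reached.

# Toy accounting of YBW split overhead with 4 cores, as described above.
def wasted_searches(fail_high_move, threads=4):
    # Moves 2 .. 1+threads are searched in parallel once the serial
    # first move has failed to produce a cutoff.
    parallel_moves = range(2, 2 + threads)
    # A serial search would stop at fail_high_move, so any parallel move
    # beyond it is pure overhead (extra width, no extra depth).
    return sum(1 for m in parallel_moves if m > fail_high_move)

for m in (2, 3, 4, 5):
    print(f"fail high on move {m}: {wasted_searches(m)} of 4 searches wasted")
# Worst case (fail high on move 2): 3 of the 4 cores' work is wasted.
# Fail high on move 5: nothing is wasted, the serial search needed it all.

Per the 90% figure above, only about one split in ten lands in this table at all; the rest are genuine ALL nodes where every parallel search was needed anyway.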
So experience, and lots of it, by MANY people, has shown that going for more depth offers the greatest return from additional cores. For a while. I have run thousands of tests with 8-12 cores, quite a few with 16, and just a few (ignoring Cray Blitz) with 32. And even some with Eugene Nalimov using a 64-cpu Itanium box Microsoft had at the time. And in every case, additional cores added additional depth. For 1-12 there is no doubt this is the way to go for every program I have seen that uses a parallel search (reasonable one, forget the hash table nonsense and such).
The fact that Komodo searches a wider tree just means it searches a wider tree. Why? No idea. I can think of several programming bugs that would cause this. One simple one as an example: suppose you use classic LMR where you sort the moves and, as you step through the move list, you increase the reduction every few moves. Now suppose you split after searching the first move, and each thread counts just the moves it searches. Since each thread searches 1/4 of the moves on average, no thread searches as many moves as a single thread would, so the moves-searched counter never gets nearly as large and the reductions don't reach the same level. That's the sort of bug I have had in the past, and one I would look for if the size of my fixed-depth parallel search grew faster than about 30% per core added. Is that the case here? Absolutely unknown, and there is no data to suggest any sort of explanation other than "it happens."
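Here is a hypothetical illustration of that kind of bookkeeping bug (the reduction schedule below is invented for the example; the code is not from Komodo, Crafty or any other engine):

# Hypothetical LMR bookkeeping bug: a per-thread move counter after a
# split makes every helper thread reduce less than the serial search.
def reduction(moves_searched):
    # Classic LMR-style schedule: later moves get bigger reductions.
    if moves_searched < 4:
        return 0
    return 1 + moves_searched // 8

serial_count = 32                      # one thread counts all 32 moves
per_thread_count = serial_count // 4   # each of 4 threads counts ~1/4

print(reduction(serial_count))       # 5 plies of reduction for late moves
print(reduction(per_thread_count))   # only 2 plies with the buggy counter

With smaller reductions on every helper thread, the fixed-depth tree grows faster than the roughly 30% per core mentioned above. One plausible fix (an assumption on my part, not anything stated in the thread) is to keep the moves-searched counter at the shared split point, so each helper reduces just as the serial search would.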
My quibble with the discussion has been that the term "SMP efficiency" is well-defined and is purely based on parallel performance improvements. If your parallel search intentionally does NOT search the same tree as the serial search, then the term is not applicable. If it unintentionally does not search the same tree, that just reduces the SMP efficiency, because you are searching nodes that the serial search does not NEED to search (because of alpha/beta backward pruning or normal forward pruning or reductions). Here we can't even discuss SMP efficiency, since we have no way to compute the speedup t(1 core)/t(4 cores): no times were provided.
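For reference, the usual definitions (standard parallel-computing usage, nothing engine-specific) are, for N cores,

\[
S(N) = \frac{t(1\ \text{core})}{t(N\ \text{cores})}, \qquad E(N) = \frac{S(N)}{N},
\]

so the 3.1-3.3 speedups mentioned above correspond to an efficiency of roughly 0.8 on 4 cores, and without the actual times there is simply nothing to plug into either formula.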
Those points summarize what I have repeated several times by now. Nothing new. Nothing controversial. IF Komodo intentionally changes the shape of the tree, then that would be interesting. If the parallel search can't search the same tree at least 3x faster, that is also interesting since many others are doing so.
My judgement, based on doing parallel search for 35 years now (the first parallel version of Blitz was in 1978, on a dual-cpu Univac 1100/20 machine), is that depth comes first, until you begin to reach asymptotic behavior, at which time you might try to alter the asymptote, or else use the remaining cpu power to do something else, such as the previously mentioned material-only tactical verification search and the like...
I don't claim to know more about parallel search than anyone else, as applied to alpha/beta and chess. But I would certainly claim to not know any less than anyone else currently (or previously) involved. Experience is a necessary tool. And that comes from actually doing it, as opposed to guessing about it.
That's my take on the discussion to date...