Laskos wrote:
bob wrote:
Where do "I claim 1.9x speedup from 16 cores to 32 cores?" Never have, never will. Unless you talk about my simple approximation, which has NEVER been a claim of anything, just a simple estimate. You misuse the word "claim" badly here. I claim that is just a rough estimate. and I've made the claim much weaker for 16-32 and beyond as well.
So, you say that formula doesn't apply anywhere near 16 or 32 threads. Couldn't you just remember 3 numbers for doublings? "Approximation" a la Hyatt.
Grow up. When was that formula developed? Back when most parallel PCs were dual-CPU machines and I had a quad Pentium Pro box. The 8-core Opteron then fit those three data points (1, 2 and 4) quite accurately as well. And 12 is right in there too; I have done a lot of that testing, since I have a cluster of 70 12-core nodes.
It is old. I have not tried to re-calibrate it as of yet, although I suppose that now that we have 16 cores and up, even on a single chip, it is about time. But it is not particularly easy, since a more accurate formula would need N different sets of coefficients; not all parallel machines are created equal.
The number is highly accurate for most machines; certainly through 8 cores it is even a bit pessimistic, as previous data has shown. Beyond 16 I have not claimed any sort of error estimate. If you choose to use it beyond where it was really intended, then don't blame me for something I have never claimed.
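For illustration only: the formula itself is not quoted anywhere in this thread, so the sketch below assumes a simple linear fit of the kind described, calibrated on the 1-, 2- and 4-processor data points mentioned above. The 0.7 coefficient is an assumption for the sake of the example, not a figure taken from this post.

```python
# A minimal sketch, NOT the exact formula from this thread: an assumed linear
# time-to-depth speedup approximation of the form 1 + coeff * (ncpus - 1).
def approx_speedup(ncpus, coeff=0.7):
    """Rough speedup estimate; the 0.7 coefficient is an assumed fit parameter."""
    return 1.0 + coeff * (ncpus - 1)

for n in (1, 2, 4, 8, 12, 16, 32, 64):
    print(f"{n:3d} cores -> ~{approx_speedup(n):.1f}x")
```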
Now guy, you said that you got a 32x effective speedup (time to strength) on a 64-core Itanium box. Then you smirked when I said that IS a very high effective speed-up, citing some papers from the '80s on DTS for a completely different architecture. Do you know what a 32x _effective_ speed-up means in terms of the _average_ effective speed-up per doubling, out to 64 threads?
I claimed a 32x parallel time-to-depth speedup on 64 cores. Nothing about time to strength. I have not personally tried to quantify speedup-to-Elo in my testing, because 99% of my testing does not use parallel search anyway. I also stated that 32x does not particularly impress me. We saw some numbers by Feldman quite a few years ago showing a speedup of something like 256 on a 512-node hypercube-type box. None of us considered THAT to be particularly interesting. I don't have any of my journals here at home or I would try to provide a citation. The 256 might be off a bit up or down; I do not recall. But none of us (those working on parallel search) consider a 50% hardware loss acceptable.
1.78 per doubling, over 6 doublings.
That is a VERY HIGH effective speed-up for 6 doublings. Either the hardware is very different, or Crafty is an exceptionally, almost linearly, scaling engine. What V. Rajlich reported years ago for the effective speed-up of Rybka was a series of doublings up to 16 cores looking like this:
1.8*1.7*1.6*1.5
The product is about 7.3, already below half of the 16 cores used. That's why I consider 32x for Crafty on a 64-core box very high.
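For concreteness, here is the arithmetic behind those two figures, using only the numbers quoted above:

```python
# 32x effective speedup over 6 doublings (1 -> 64 threads) implies a
# geometric-mean gain per doubling of 32 ** (1/6).
per_doubling = 32 ** (1 / 6)
print(f"per-doubling gain for 32x over 6 doublings: {per_doubling:.2f}")  # ~1.78

# The Rybka series quoted above: four doublings from 1 to 16 cores.
total = 1.0
for gain in (1.8, 1.7, 1.6, 1.5):
    total *= gain
print(f"Rybka effective speedup on 16 cores: {total:.1f}")  # ~7.3, below 16/2
```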
Would you take this bet:
If Andreas bothers to run a 32-core test on his dual Opteron, I claim Komodo will gain a 1.34 effective speed-up from 16 to 32 threads, or 38 Elo points with a log model; your claim (from your 64-core Itanium tests with the claimed 32x effective speed-up) is that it will gain 1.78 in effective speed-up, or 75 Elo points, on that hardware at Andreas' 60''+0.05'' time control. Whoever is closer in numbers wins the bet; whoever loses has to post here "I AM AN IDIOT".
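A minimal sketch of the log-model conversion behind those Elo figures. The roughly 90-Elo-per-doubling constant is an assumption: it is not stated anywhere in this thread, it is simply the value that reproduces the 38 and 75 Elo numbers quoted above.

```python
import math

# Assumed Elo gained per doubling of effective speed at 60''+0.05'';
# chosen only so the quoted 38 and 75 Elo figures come out right.
ELO_PER_DOUBLING = 90

def elo_gain(effective_speedup, elo_per_doubling=ELO_PER_DOUBLING):
    # Log model: Elo gain proportional to log2 of the effective speed-up.
    return elo_per_doubling * math.log2(effective_speedup)

print(f"1.34x effective speed-up -> {elo_gain(1.34):.0f} Elo")  # ~38
print(f"1.78x effective speed-up -> {elo_gain(1.78):.0f} Elo")  # ~75
```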
I may open a plea thread to ask Andreas to perform such a test, either Komodo 32 threads against Komodo 1 thread, or even better, Komodo 32 threads against Komodo 16 threads.
Agreed on such a bet?
Why would I bet anything on hardware I have not used? You said "dual Opteron", which has 16 cores per CPU? Not a very good machine; it has a huge cache and memory bottleneck. My original approximation was tested in the Fierz thread I mentioned, back at least 15 years ago if not more, on an 8-CPU Opteron box that was 8 individual chips, not one chip with 8 cores. Hence no cache bottleneck, just memory. Today there are a ton of variables. I have a machine that is driving me nuts, as I have mentioned: capable of 60M nodes per second (2 x 6 cores), but I can only reach 45-50M for this same problem. Can it be fixed? I don't know as of yet. So betting on hardware, particularly hardware that comes with a built-in bottleneck, doesn't make a lot of sense and wouldn't show a lot of judgement. I have run on a few odd machines where I could not get a 50% speedup. A particular quad-core laptop comes to mind that had a horrible memory interface.
I'm not a kid. I don't gamble. I bet on sure things where I know the outcome before the bet is made. Betting on unknown hardware is not exactly in that class, sorry.