OK, since this is going pretty far into never-never land, here is a question:
Let's just suppose that you have a parallel program you have worked on, and the best parallel speedup you can get is 4x on your 8-core box. That means your tree is roughly twice as big as it ought to be, which limits you to a speedup of 4 at best. You can't figure out any way to get rid of that 2x overhead, so you decide to punt on getting any faster. You now decide: "OK, I can't go any deeper, I simply can't figure out a way to improve the search overhead enough, so I am going to use those 'wasted' 4 processors to search less selectively."
Explain HOW you are going to pull that off. You are already searching 2x the minimal tree for this position, most of which is not helping you at all. HOW are you going to redirect this effort into searching the stuff you might be pruning or reducing erroneously? Whatever you do to the tree, it is already going to be 2x larger than the minimal tree, and you couldn't figure out a way to stop that. So how do you magically reduce that overhead so that you can redirect it into a part of the tree that you might be pruning too aggressively? ANYTHING you do suffers the 2x overhead of your current parallel search. If you can eliminate part of that 2x to add some other nodes, why didn't you just eliminate that part of the 2x to start with, which would let your parallel search go deeper?
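For readers keeping score, the arithmetic behind "minimal tree" and the 4x-on-8-cores scenario can be written out as a short sketch. The Knuth/Moore leaf-count formula is the standard one; the b = 35, d = 10 values are purely illustrative, not from the post:

```python
import math

# Knuth/Moore: the minimal alpha-beta tree (perfect move ordering) for
# branching factor b and depth d has b^ceil(d/2) + b^floor(d/2) - 1 leaves.
def minimal_tree_leaves(b: int, d: int) -> int:
    return b ** math.ceil(d / 2) + b ** math.floor(d / 2) - 1

# If the parallel search visits 'overhead' times the minimal tree, its
# time-to-depth speedup on 'cores' processors is at best cores / overhead.
def speedup(cores: int, overhead: float) -> float:
    return cores / overhead

print(f"{minimal_tree_leaves(35, 10):.3e}")  # ~1.050e+08 leaves at b=35, d=10
print(speedup(8, 2.0))                       # 4.0 -- the scenario above
```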
Now turn this around. Suppose you already have a way to search just the minimal tree, which means no overhead, an 8x speedup on 8 cores. WHY would you consider giving up part of that 8x speedup to reduce the pruning or reductions, when you won't do the same thing in the 1-cpu version? If you had a single CPU that was 8x faster, would you adjust your pruning/reductions to be less aggressive? Because that is EXACTLY the decision you just made with the 8-core parallel search.
This entire discussion is so far out in left field it is out of the stadium. As I have repeatedly stated, if you would intentionally reduce the pruning/reductions in the N-core search, you should do the SAME thing in the 1-core search. You do NOT get to just arbitrarily say "I am going to redirect those overhead nodes (beyond the minimal tree) into what _I_ want." The tree's move ordering is already the best you could do, so the overhead will always be there, spread throughout the tree, as nothing but extra nodes whose locations you cannot pick and choose.
If you can give me a viable method to replace those overhead nodes with nodes you pick and choose at execution time, I'll concede this could work. I know that _I_ can't reduce overhead beyond a certain level, because perfect move ordering is impossible. And more importantly, it is impossible to predict where the move ordering will be better and where it will be worse. It is what it is, where it is, and when it is.
If you can convince me of some reason why you would search with one set of pruning/reduction policies with a processor that can search X nodes per second, but you would use a different set of policies if that processor could search N * X nodes per second, I'll concede this could work.
I do know that no current program I have looked at, mine included, changes its behavior just because the hardware is faster. Yes we might reduce more or use a larger R value for null-move as the search goes deeper, but that is an algorithmic decision based on depth, not on speed. And the parallel search will see that same benefit when it too goes deeper with more cores.
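To make that distinction concrete, here is a sketch of the kind of depth-based (not speed-based) selectivity rule described above. The constants are invented for this illustration; no real engine's values are implied. Faster hardware reaches greater depth, and the same rule then reduces more, which is exactly the point: the behavior keys off depth, not node rate.

```python
import math

# Illustrative only: selectivity decisions driven by depth, not speed.
def null_move_R(depth: int) -> int:
    # Deeper searches can afford a larger null-move reduction.
    return 2 if depth < 7 else 3

def lmr_reduction(depth: int, move_number: int) -> int:
    # Reduce late moves more as both depth and move index grow.
    if depth < 3 or move_number < 4:   # no reduction near leaves or
        return 0                       # for early moves
    return int(math.log(depth) * math.log(move_number) / 2.0)
```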
Parallel search is about NOTHING but speed/depth, NOT searching extra nodes. That searching extra nodes (the 1 vs. N core fixed-depth testing) makes the N-core program a little stronger than the 1-core program is not that surprising. You can do the same by reducing the LMR reductions or the null-move R value. Fixed depth will see the less selective program win more games. But who cares? When they play using a normal timed search, the reduced-selectivity program will lose more games, unless the reductions and such are simply too aggressive. Personally, that is how I tune my LMR/null-search/etc.: timed games with the two versions competing against a gauntlet. The best version wins. I can guarantee you that if I COULD re-direct my overhead nodes however I want, I would NOT redirect them into reducing selectivity, I would redirect them into more depth. That would always be better. If "widening" works in a parallel search, it works in a sequential search just as well. If it doesn't work in one, it doesn't work in the other; it is simply "accidental".
Threads test incl. Stockfish 5 and Komodo 8
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Threads test incl. Stockfish 5 and Komodo 8
bob wrote: I'm not "clinging" to anything. Super-linear speedup and effective "widening" are just two mythical ideas that point out a flaw in the basic search algorithm, not really related to the parallel search at all.

syzygy wrote: Super-linear speedup is not a topic of this thread, EXCEPT that you are trying to move the thread towards it.

No I am not.
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Threads test incl. Stockfish 5 and Komodo 8
I just realized that a couple of paragraphs were deleted by accident, probably via "thumb taps" on my touchpad.
In the part where I mentioned an 8-core box and a 4x speedup, the ONLY thing you can influence is the 4x speedup part. The overhead is NOT going away, otherwise you would have already reduced it. So ALL you have to play with is that 4x. If you take 2x of that to make the tree "wider", you are now down to a 2x speedup, because the original 2x overhead is not going anywhere and you just added more. But that is not the end. The "widening" you just did is going to cost you that SAME 2x overhead the original search had (8 cpus = 4x speedup, because the tree is twice as big as it should be). So your brand-new 2x speedup is going to drop to closer to 1x, because the widening you added is going to pick up another 50% search overhead as well.
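Written out as arithmetic, this is a literal transcription of the claim above, using the same numbers:

```python
# The compounding claim, step by step.
cores = 8
speedup = cores / 2.0    # 2x parallel search overhead      -> 4x
speedup /= 2.0           # spend 2x of it on "widening"     -> 2x
speedup /= 1.5           # widened nodes pick up ~50% more parallel overhead
print(round(speedup, 2)) # 1.33 -- "closer to 1x"
```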
As I mentioned, it is simply impossible to re-direct work done on overhead nodes to something else. If we could do that, we would re-direct that into the toilet and get rid of the overhead and take the 8x speedup instantly.
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Threads test incl. Stockfish 5 and Komodo 8
bob wrote: I'm not a kid. I don't gamble. I bet on sure things where I know the outcome before the bet is made. Betting on unknown hardware is not exactly in that class, sorry.

Sure, you are not a bettor, and bettors are kids. Simply put, you cannot stand by your arguments. You claim that the 32x effective speed-up on the 64-core Itanium box was "bad", probably because it needed to be 45x to fit your linear formula. My model is not even mine: the series 1.8, 1.7, 1.6, 1.5 per doubling up to 16 cores on Xeons and Opterons are V. Rajlich's numbers for Rybka, which clearly show a logarithmic effective speed-up. He never claimed it's bad. Continuing the series to 64 cores would give a 13.4x effective speed-up from 1 to 64 cores. He never said it's good or bad.
Assuming your "bad" scaling of 32x on a 64-core box, i.e. 1.78 per doubling from 16 to 32 threads, or a 75 Elo point gain at the 60''+0.05'' time control, versus mine of 1.34 per doubling from 16 to 32 threads, or a 38 Elo point gain, Andreas' next test will show who is right. I do stand by my claims. You just stand by your phraseology: "I know everything, and when I don't know, I am correct anyway".
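Both Elo figures here are consistent with roughly 90 Elo per effective doubling of speed at this time control. A minimal sketch of the implied conversion (the 90 Elo/doubling constant is inferred from the two numbers in the post, not stated in it):

```python
import math

# Elo gain from a per-doubling effective speedup ratio, given an assumed
# Elo value for one doubling of speed at the time control in question.
def elo_gain(per_doubling_ratio: float, elo_per_doubling: float = 90.0) -> float:
    return math.log2(per_doubling_ratio) * elo_per_doubling

print(round(elo_gain(1.78)))   # 75 -- the 32x-on-64-cores scaling
print(round(elo_gain(1.34)))   # 38 -- the logarithmic model
```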
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Threads test incl. Stockfish 5 and Komodo 8
Laskos wrote: Sure, you are not a bettor, and bettors are kids. Simply put, you cannot stand by your arguments. You claim that the 32x effective speed-up on the 64-core Itanium box was "bad", probably because it needed to be 45x to fit your linear formula. My model is not even mine: the series 1.8, 1.7, 1.6, 1.5 per doubling up to 16 cores on Xeons and Opterons are V. Rajlich's numbers for Rybka, which clearly show a logarithmic effective speed-up. He never claimed it's bad. Continuing the series to 64 cores would give a 13.4x effective speed-up from 1 to 64 cores. He never said it's good or bad. Assuming your "bad" scaling of 32x on a 64-core box, i.e. 1.78 per doubling from 16 to 32 threads, or a 75 Elo point gain at the 60''+0.05'' time control, versus mine of 1.34 per doubling, or a 38 Elo point gain, Andreas' next test will show who is right. I do stand by my claims. You just stand by your phraseology: "I know everything, and when I don't know, I am correct anyway".

How about sticking to what I write as opposed to what you wish I had written? I didn't mention my linear formula. I simply stated the obvious: 32x on 64 cores is bad, period. Why don't you look at some of the "grand challenge" stuff in computer science? One example was a speedup of over 1000x on a 1024-node cluster (not chess). THAT is considered acceptable. 512x on 1024 cores is not. Wasting 50% of the hardware, when the hardware is not exactly cheap, is not so good.
My formula was quite accurate when it was developed, when 8 or 16 processors were the most anybody had access to. I've not tested much or modified the approximation to fit larger numbers of cores. You simply want to nit-pick and apply that approximation where it was never intended to be applied, and then whine and complain when it is not as accurate as you want. It IS an approximation. It DOES work very accurately through 8, and is pretty accurate through 16. Beyond that, caveat emptor. You don't HAVE to use that formula. I don't. I simply run the tests, since each hardware platform is different anyway, and there is no longer any "one size fits all" speedup approximation. Too many questions: NUMA or UMA? How many cores share an L2 or L3 cache? Is memory interleaved or not? Etc. All of those affect the speedup.
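The formula in question appears to be the linear approximation Hyatt has quoted elsewhere for Crafty, speedup ≈ 1 + 0.7·(ncpus − 1); that exact form is an assumption here, since this post does not restate it, but it does reproduce the "needed to be 45x" figure mentioned upthread:

```python
# Assumed form of the linear approximation (not restated in this post):
# each CPU beyond the first contributes ~0.7x of a core.
def approx_speedup(ncpus: int) -> float:
    return 1 + 0.7 * (ncpus - 1)

for n in (4, 8, 16, 64):
    print(n, round(approx_speedup(n), 1))   # 3.1, 5.9, 11.5, 45.1
# 45.1 at 64 cores matches the "needed to be 45x" remark upthread.
```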
-
syzygy
- Posts: 5792
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Threads test incl. Stockfish 5 and Komodo 8
bob wrote: Explain HOW you are going to pull that off. You are already searching 2x the minimal tree for this position. Most of which is not helping you at all.

Note that if some of it is helping, we're already done. The part of the extra nodes that is helping is what is referred to as "widening".
This is the thing, nothing more, nothing less.
Widening == not all "extra" nodes are worthless.
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Threads test incl. Stockfish 5 and Komodo 8
If you "repair some of the overlooked lines" why don't you do that in the serial search? More importantly, how can you somehow cause that "search overhead" the parallel search incurs to be applied exactly how/where you want it?mjlef wrote:I do not want to get too involved in this conversation, since whatever I say would just help others figure out what we have figured out. But here is what I think. Recent programs do not find the optimal path since they no longer examine the full tree needed by alpha-beta to prove the best move has been found. They find a "good enough and generally right" move. In a world without selectivity, a lot of what Bob says is true. But that world ended a long while ago. In many ways, strong chess programs are now way closer to Shannon's Type B than Type A searchers. So given that the search is Type B, and flawed, how do we make it less flawed? What Komodo does seems to be working well, and it repairs some of the overlooked lines in a cost/time effective way. You can either look at the data, whihc I think mos will agree is pretty convincing, or stick with Type A searches.bob wrote:All well and good, but with one key omission. "minimal tree" means EXACTLY what it says. The tree a single thread would search to reach depth D. The goal then becomes to search that same exact tree, no extra nodes, to depth D, but in D/N time (N = number of processors).mvk wrote:These statements make sense if you view traversal and tree shaping as separate processes: first you define a tree, and then you search it as efficiently as possible. The search, in essence, doesn't impact the effective tree. This is "depth thinking". But that is not how modern programs work. Modern programs dynamically shape the tree through iteration-to-iteration interaction. Their parallel counterparts accept that splitting is going to be imperfect and inevitably generates excess nodes. But then they make an extra step and try to influence where these excess nodes will go, and change the effective tree shape, so at least the can get something in return. The tree shaping vs. traversal interaction is not new, it is what propelled programs such as Shredder and followers ahead of the pack. Newton vs quantum mechanics. Both view are right, in their own domain. (But here we are less interested in the domain of 1980s style programs of course.)syzygy wrote:Btw, note how Bob now keeps clinging to "super-linear speedup = bug" when nobody has made a claim of super-linear speedup for Komodo or any other "widening" smp implementation.
NOBODY wants to search a bushier tree. I don't believe that. I don't think anybody here believes that. It is just another point to argue about. If bushier is better, why not make the serial search bushier?
This "well, we can't get ALL of the performance from the N cores to search deeper, so why don't we use part of that performance to make the tree bushier?" just does not pass any "smell test." That the effect is observed does NOT imply that it is intentional. It suggests what we already know, namely that the parallel search is bushier for the very reasons I laid out in the Journal of Parallel Computing a long while back. It is NOT bushier because that is desirable, it is bushier because that is a basic property of alpha/beta that can not be avoided, until such time as we can produce perfect move ordering. By then, we will obviously have solved chess anyway.
I'm not talking about Shannon Type A at all. I've never used that; in fact, everything I have ever done has had a Shannon Type B look and feel. But that doesn't change my basic point. Parallel search incurs overhead that can only be fixed with something that is impossible to obtain: perfect move ordering. The overhead happens at completely unpredictable points in the search (otherwise we would predict where, and just not split at such positions).
Yes, you can widen things up. But if it works in the parallel search, why won't it work in the sequential search? You have no control over how either behaves, independently.
A non-alpha/beta searcher I might buy. But alpha/beta is well-known as are the problems with searching such a tree in parallel. I simply believe that if you can gain by doing less selectivity in the parallel search, you can get a similar gain by doing less selectivity in the sequential search, since there is no way to simply "redirect the overhead nodes" into the places where you want less selectivity. Those nodes will show up where they show up, with no way to predict where they will be nor how many there will be at one particular point.
In this specific area of tree search, I am not exactly a "dinosaur" from the 80's. That old argument gets a bit tired and worn out, and is usually a "last resort" argument.
-
michiguel
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: Threads test incl. Stockfish 5 and Komodo 8
bob wrote: ...if that kind of super-linear speedup happens consistently.

syzygy wrote: For the Nth time, nobody is claiming a super-linear speedup. Your whole argument is a non-starter to begin with.

bob wrote: Super-linear and this "widening" are members of the same set: "urban legend".

michiguel wrote: Do you realize your logic is totally flawed? It is a gigantic straw man. You bring in something unrelated, say the two are the same, just to bring the other down.

bob wrote: No, but intentional widening rather than searching deeper is certainly a legend.

It has never been discussed, so it is not a legend. By the way, it is irrelevant whether it is intentional or not.
And I am sure that Komodo is not an urban legend.
Miguel
-
michiguel
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: Threads test incl. Stockfish 5 and Komodo 8
Adam Hair wrote: Since Andreas has shared his results for Crafty, I will add Crafty's plot to my graph. Again, this is a plot of the Wilo (as opposed to Elo) differences from 1 core for each engine as the number of cores increases. This allows us to compare each engine's SMP performance on roughly the same scale.

Laskos wrote: Wilos will revolutionize chess engine ratings. They get rid of strength differences, the time control, and hardware.

We have done some exploratory work with Adam, and it has very nice properties. Quite soon I will modify Ordo to produce ratings of this kind (draws will have to be treated differently). We believe that the strength is truly given by W/L, and the draws are a side effect that becomes larger at higher levels. That is one of the reasons why it looks tougher to make progress at the K, SF and H levels: the high draw rate. But they are making more progress than it looks if you observe the W/L ratios.
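For readers unfamiliar with Wilo: one plausible minimal reading is ordinary logistic Elo computed over decisive games only, so draws drop out entirely. The thread describes Wilo only loosely, so this exact formula is an assumption:

```python
import math

# Assumed Wilo-style difference: logistic Elo over decisive games only.
def wilo_difference(wins: int, losses: int) -> float:
    return 400.0 * math.log10(wins / losses)

print(round(wilo_difference(120, 80)))  # hypothetical 120-80 record -> 70
```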
Anyway, I fit the data with an equation and I got this
[attached image: plot of the fit]
The equations are:

speedup = n / (f * (n^2 - n) + 1)
log2(W/L) = k * log2(speedup)

where k is a proportionality constant and f is a factor representing the "apparent" amount of extra (non-parallelizable) work that each additional CPU introduces.
For:
K8: f = 0.0011 (0.11%), k = 1.82
K7: f = 0.0016 (0.16%), k = 1.62
SF: f = 0.0087 (0.87%), k = 2.09 (starts great)
Za: f = 0.0100 (1.00%), k = 1.84
Cr: f = 0.0160 (1.60%), k = 1.82
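A direct transcription of the two-equation model into code, evaluated at the fitted (f, k) pairs listed above (the 16-core sample point is arbitrary, chosen only for illustration):

```python
import math

# speedup(n) = n / (f*(n^2 - n) + 1); log2(W/L) = k * log2(speedup)
def speedup(n: int, f: float) -> float:
    return n / (f * (n * n - n) + 1)

def log2_wl(n: int, f: float, k: float) -> float:
    return k * math.log2(speedup(n, f))

fits = {"K8": (0.0011, 1.82), "K7": (0.0016, 1.62), "SF": (0.0087, 2.09),
        "Za": (0.0100, 1.84), "Cr": (0.0160, 1.82)}

for name, (f, k) in fits.items():
    print(name, round(speedup(16, f), 2), round(log2_wl(16, f, k), 2))
# e.g. K8 at 16 cores: speedup ~12.66, log2(W/L) ~6.66
```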
Miguel
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Threads test incl. Stockfish 5 and Komodo 8
bob wrote: Explain HOW you are going to pull that off. You are already searching 2x the minimal tree for this position. Most of which is not helping you at all.

syzygy wrote: Note that if some of it is helping, we're already done. The part of the extra nodes that is helping is what is referred to as "widening". This is the thing, nothing more, nothing less. Widening == not all "extra" nodes are worthless.

But you CAN reduce the total with effort, and you can NOT reduce just the ones that are no good.
And IF those extra nodes help, search 'em in the regular search also, they will help there too.