Threads test incl. Stockfish 5 and Komodo 8

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Laskos »

michiguel wrote:
Laskos wrote:
michiguel wrote: An image at a decent size (I could not edit it after 15 min, arghh).

Image

The equation is

speed up = n / (f * (n^2 - n) + 1)

log2 (W/L) = k * log2 (Speedup)

k is a proportionality constant and f is a factor that represents the "apparent" amount of extra (non-parallelizable) work that each extra CPU introduces.

For
K8, f = 0.0011 (0.11%) k = 1.82
K7, f = 0.0016 (0.16%) k = 1.62
SF, f = 0.0087 (0.87%) k = 2.09 (starts great)
Za, f = 0.0100 (1.00%) k = 1.84
Cr, f = 0.0160 (1.60%) k = 1.82
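
A minimal Python sketch (it just evaluates the two formulas above with the quoted constants; it is not the actual fitting code) of what they predict for 2-16 threads:

Code: Select all

# Evaluate speed_up = n / (f*(n^2 - n) + 1) and the predicted Win/Loss
# ratio from log2(W/L) = k * log2(speed_up), using the fitted constants above.
def speed_up(n, f):
    return n / (f * (n * n - n) + 1.0)

def win_loss_ratio(n, f, k):
    return speed_up(n, f) ** k

engines = {"K8": (0.0011, 1.82), "SF": (0.0087, 2.09)}
for name, (f, k) in engines.items():
    for n in (2, 4, 8, 16):
        print(f"{name} n={n:2d}  SU={speed_up(n, f):5.2f}  W/L={win_loss_ratio(n, f, k):6.2f}")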

Miguel
Very interesting. Can you give a hint of how you derived f?
Did you assume an Amdahl-like behavior,
S = (R + 1) * n / (R + n), as in this model?
http://www.talkchess.com/forum/viewtopi ... 29&t=48780

About Wilos: several years ago, on another forum, I speculated that the Win/Loss ratio remains pretty constant beyond ultra-fast controls. More than a year ago I speculated the same here too:
http://www.talkchess.com/forum/viewtopi ... 27&t=48733
With some data. Inverting from the draw rate and Elo gain, I even saw a very slow increase of the Win/Loss ratio with time control, but it was almost flat. Wilos will get rid of all these strength, time control, hardware, etc. issues present in Elos. From your graphs and interpolation, it seems that Zappa's and Crafty's apparently good Elo gain up to 8 cores occurs just because they are weaker and have fewer draws. In Wilos, their gain is worse than that of the top engines, and at 16 cores Komodo 8 stands out.
That was a good observation, Kai. We were brainstorming about how to introduce draws and deal with the fact that draw rates change with increasing strength. If you look at the physical-chemical "model" of Ordo, it becomes obvious that TRUE strength is related to the ratio W/L (which is related to the difference in "energetic" levels between W and L).

As you point out, Z and C seem to have a better scaling at the beginning because they are weaker than SF and K. Exactly.

One thing that Andreas could do is to test Engine_X vs Engine_X with twice the time (rather than twice the cores). That is the theoretical 100% speedup for two cores. From the measured ratio we would then get an "effective" speedup.
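
A rough sketch of how that conversion could work (my own reading, assuming the relation log2(W/L) = k * log2(speedup) also holds for a time handicap), with purely hypothetical W/L numbers:

Code: Select all

import math

def k_from_doubled_time(wl_ratio_2x_time):
    # A 2x-time self-match is a true speedup of 2, so log2(W/L) = k * log2(2) = k.
    return math.log2(wl_ratio_2x_time)

def effective_speedup(wl_ratio_2x_cores, k):
    # Invert log2(W/L) = k * log2(SU) to recover the effective speedup SU.
    return 2.0 ** (math.log2(wl_ratio_2x_cores) / k)

k = k_from_doubled_time(3.5)        # hypothetical: W/L = 3.5 at doubled time
print(effective_speedup(3.0, k))    # hypothetical: W/L = 3.0 at 2 cores -> SU ~ 1.8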

Sorry, I was going to give the derivation but I got trapped with work.

Code: Select all

Typical Amdahl can be written as

      S + P 
SU = ---------
      S + P/n

Numerator is the work with one thread, denominator is the work with "n" threads
Where
SU = speed up
S = serial fraction of the work that cannot be parallelized
P = work that can be parallelized
n = threads
Of course, S + P = 1

But Amdahl's law cannot explain why the speedup may go down beyond a certain number of threads.

Then I assumed that the work with n threads starts to accumulate spurious work for each thread added, and that this is work that cannot be parallelized. That is, for each thread added, every thread slows down a bit in areas where it is doing nothing useful. So I add a term f*(n-1) in the denominator and place 1 in the numerator, since S + P = 1.

Code: Select all

             1
SU = -------------------
      S + P/n + f (n-1)

I could stop here, but I assumed that S is near zero because the fitting hinted to me that this was reasonable (that is, the Amdahl term S is negligible compared to the new term f*(n-1)). Then
P = 1 and the term S disappears:

Code: Select all

             1
SU = -------------------
        1/n + f (n-1)
Multiplying numerator and denominator by n:

Code: Select all

             n
SU = -------------------
        1 + f (n^2-n)
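
As an illustration (a sketch only, not the actual fitting procedure), f can be fitted to measured effective speedups with an ordinary least-squares fit; note that this model peaks at n = 1/sqrt(f) and declines beyond that:

Code: Select all

import numpy as np
from scipy.optimize import curve_fit

def su_model(n, f):
    # SU = n / (1 + f*(n^2 - n))
    return n / (1.0 + f * (n ** 2 - n))

# Hypothetical measurements: (threads, observed effective speedup).
threads  = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
speedups = np.array([1.0, 1.9, 3.5, 6.0, 9.0])

(f_fit,), _ = curve_fit(su_model, threads, speedups, p0=[0.01])
print(f"fitted f = {f_fit:.4f}")
print(f"predicted peak near n = {1.0 / np.sqrt(f_fit):.1f} threads")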

Miguel
Thanks Miguel, makes a lot of sense. My only concern is that your SU will inevitably decline with more cores at some point. I am not sure this is the general asymptotic behavior, but it could be.

EDIT: maybe adding a term P*f*(n-1) instead of f*(n-1) is more reasonable? So each core wastes something proportional to the parallelizable part? Maybe it won't fit the data as well, but it would look asymptotically more reasonable?
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Laskos »

Laskos wrote: Thanks Miguel, makes a lot of sense. My only concern is that your SU will inevitably decline with more cores at some point. I am not sure this is the general asymptotic behavior, but it could be.

EDIT: maybe adding a term P*f*(n-1) instead of f*(n-1) is more reasonable? So each core wastes something proportional to the parallelizable part? Maybe it won't fit the data as well, but it would look asymptotically more reasonable?
EDIT 2: I mean what remains parallelizable after each new core is introduced, so the term would be (1 - f*(n-1))*f*(n-1) or something like that.
Ajedrecista
Posts: 1979
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Threads test including Stockfish 5 and Komodo 8.

Post by Ajedrecista »

Hello:

Needless to say, I have no idea about parallel computing. However, I would like to offer my two cents if possible.

I found this article, which may be slightly correlated with the topic of more cores ---> slowdown:

Resource Allocation and the Law of Diminishing Returns

The interesting example is example 3 at the bottom of page 4. This paper introduces a fixed overhead called o and a variable overhead called k, which scales linearly with the number of resources R, that is, kR. Here, R could be the number of threads. I think the idea is similar to Miguel's proposal SU = n/[1 + n(n - 1)f]. Using the same notation as the article:

Code: Select all

I(R) = (1 + o + k*R)/(1 + o + k*R - P + P/R)
Where P is the parallelizable part. The easiest thing that came to my mind was to set dI/dR = 0; according to Derive 6:

Code: Select all

k*R*(R - 2) - o - 1 = 0

The following operations are mine:

R² - 2*R - (1 + o)/k = 0

R = 1 ± sqrt[1 + (1 + o)/k]

But since R > 0:

R = R_critical = 1 + sqrt[1 + (1 + o)/k]
Since R is a natural number, it would make sense to use ceiling or floor functions. Another possibility would be to replace kR with k(R - 1) (then R_critical = 1 + sqrt[(1 + o)/k], if I am not wrong). The strange thing is that I got R_critical(kR) > R_critical[k(R - 1)] when I expected the opposite. I hope there are no typos.
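
A quick symbolic check of that critical point (my own, using sympy rather than Derive 6):

Code: Select all

import sympy as sp

R = sp.symbols('R')
o, k, P = sp.symbols('o k P', positive=True)
I_R = (1 + o + k*R) / (1 + o + k*R - P + P/R)

# dI/dR = 0 should reduce to k*R*(R - 2) - o - 1 = 0, independently of P,
# giving R = 1 +/- sqrt(1 + (1 + o)/k).
print([sp.simplify(r) for r in sp.solve(sp.Eq(sp.diff(I_R, R), 0), R)])

# Numerical example with made-up overheads o = 0.05, k = 0.01:
print(float((1 + sp.sqrt(1 + (1 + o)/k)).subs({o: 0.05, k: 0.01})))  # ~11.3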

The idea is that the model with overheads helps to explain the slowdown with a high number of cores. Please notice that I do not claim that this model is correct.

Regards from Spain.

Ajedrecista.
syzygy
Posts: 5569
Joined: Tue Feb 28, 2012 11:56 pm

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:Explain HOW you are going to pull that off. You are already searching 2x the minimal tree for this position. Most of which is not helping you at all.
Note that if some of it is helping, we're already done. That some part of the extra nodes is helping is exactly what is referred to as "widening".

This is the thing, nothing more, nothing less.

Widening == not all "extra" nodes are worthless.
But you CAN reduce the total with effort, and you can NOT reduce just the ones that are no good.
With effort you can make Crafty beat Stockfish...
And IF those extra nodes help, search 'em in the regular search also, they will help there too.
Nope. Firstly, you cannot duplicate the 4-core tree with a single-threaded search without a time penalty. Secondly, even if you could, it would only help in the case of a super-linear speedup, which nobody is claiming.
Come back to the discussion when you have some basic idea about what is being discussed. As of right now, you are arguing without a clue about what is happening inside a tree.
Wow, you finally got the point?
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Threads test including Stockfish 5 and Komodo 8.

Post by michiguel »

Ajedrecista wrote:Hello:

Needless to say, I have no idea about parallel computing. However, I would like to offer my two cents if possible.

I found this article, which may be slightly correlated with the topic of more cores ---> slowdown:

Resource Allocation and the Law of Diminishing Returns

The interesting example is example 3 at the bottom of page 4. This paper introduces a fixed overhead called o and a variable overhead called k, which scales linearly with the number of resources R, that is, kR. Here, R could be the number of threads. I think the idea is similar to Miguel's proposal SU = n/[1 + n(n - 1)f]. Using the same notation as the article:

Code: Select all

I(R) = (1 + o + k*R)/(1 + o + k*R - P + P/R)
Where P is the parallelizable part. The easiest thing that came to my mind was to set dI/dR = 0; according to Derive 6:

Code: Select all

k*R*(R - 2) - o - 1 = 0

The following operations are mine:

R² - 2*R - (1 + o)/k = 0

R = 1 ± sqrt[1 + (1 + o)/k]

But since R > 0:

R = R_critical = 1 + sqrt[1 + (1 + o)/k]
Since R is a natural number, it would make sense to use ceiling or floor functions. Another possibility would be to replace kR with k(R - 1) (then R_critical = 1 + sqrt[(1 + o)/k], if I am not wrong). The strange thing is that I got R_critical(kR) > R_critical[k(R - 1)] when I expected the opposite. I hope there are no typos.

The idea is that the model with overheads helps to explain the slowdown with a high number of cores. Please notice that I do not claim that this model is correct.

Regards from Spain.

Ajedrecista.
The concept is identical, except that they have a flaw in the derivation, if I understand correctly: they are adding the overhead o to the numerator. The numerator represents the time for 1 thread, so it should have no extra overhead.

Miguel
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by michiguel »

Laskos wrote:
Laskos wrote: Thanks Miguel, makes a lot of sense. My only concern is that your SU will inevitably decline with more cores at some point. I am not sure this is the general asymptotic behavior, but it could be.

EDIT: maybe adding a term P*f*(n-1) instead of f*(n-1) is more reasonable? So each core wastes something proportional to the parallelizable part? Maybe it won't fit the data as well, but it would look asymptotically more reasonable?
EDIT 2: I mean what remains parallelizable after each new core is introduced, so the term would be (1 - f*(n-1))*f*(n-1) or something like that.
If we do not discard S like I did above, the asymptotic behavior could be OK, I think.

Code: Select all


             1
SU = -------------------
      S + P/n + f (n-1) 
      
P = 1 - S
      
             1
SU = -----------------------
      S + (1-S)/n + f (n-1) 

rearrange

             n
SU = --------------------------
      1 + S (n-1) + f (n^2-n)
So,
f (n^2 - n) describes the serial work that increases with each added thread, and
S (n - 1) describes the serial work coming from the fixed serial fraction S.

If f = 0, the equation reduces to Amdahl's law, which plateaus at 1/S rather than peaking and then decreasing.
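
A small numerical illustration of that difference in limiting behavior, with made-up S and f (just a sketch, not fitted values):

Code: Select all

def su(n, S, f):
    # SU = n / (1 + S*(n - 1) + f*(n^2 - n))
    return n / (1.0 + S * (n - 1) + f * (n * n - n))

S, f = 0.02, 0.001   # hypothetical serial fraction and per-thread overhead
for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"n={n:3d}   Amdahl (f=0): {su(n, S, 0.0):6.2f}   with f > 0: {su(n, S, f):6.2f}")
# The f = 0 column climbs toward the 1/S = 50 plateau; the f > 0 column
# peaks around n = sqrt((1 - S)/f), about 31 here, and then falls off.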

Miguel
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Laskos »

michiguel wrote:
Laskos wrote:
Laskos wrote: Thanks Miguel, makes a lot of sense. My only concern is that your SU will inevitably decline with more cores at some point. I am not sure this is the general asymptotic behavior, but it could be.

EDIT: maybe adding a term P*f*(n-1) instead of f*(n-1) is more reasonable? So each core wastes something proportional to the parallelizable part? Maybe it won't fit the data as well, but it would look asymptotically more reasonable?
EDIT 2: I mean what remains parallelizable after each new core is introduced, so the term would be (1 - f*(n-1))*f*(n-1) or something like that.
If we do not discard S like I did above, the asymptotic behavior could be OK, I think.

Code: Select all


             1
SU = -------------------
      S + P/n + f (n-1) 
      
P = 1 - S
      
             1
SU = -----------------------
      S + (1-S)/n + f (n-1) 

rearrange

             n
SU = --------------------------
      1 + S (n-1) + f (n^2-n)
So,
f (n^2 - n) describes the serial work that increases with each added thread, and
S (n - 1) describes the serial work coming from the fixed serial fraction S.

If f = 0, the equation reduces to Amdahl's law, which plateaus at 1/S rather than peaking and then decreasing.

Miguel
Right, maybe you could fit the data with that one. Thanks.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

Laskos wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:Explain HOW you are going to pull that off. You are already searching 2x the minimal tree for this position. Most of which is not helping you at all.
Note that if some of it is helping, we're already done. That some part of the extra nodes is helping is exactly what is referred to as "widening".

This is the thing, nothing more, nothing less.

Widening == not all "extra" nodes are worthless.
But you CAN reduce the total with effort, and you can NOT reduce just the ones that are no good.
With effort you can make Crafty beat Stockfish...
And IF those extra nodes help, search 'em in the regular search also, they will help there too.
Nope. Firstly, you cannot duplicate the 4-core tree with a single-threaded search without a time penalty. Secondly, even if you could, it would only help in the case of a super-linear speedup, which nobody is claiming.
Come back to the discussion when you have some basic idea about what is being discussed. As of right now, you are arguing without a clue about what is happening inside a tree.
You seem not to know how a single-core tree grows. An example is this 3-year-old thread:
http://www.talkchess.com/forum/viewtopi ... 83&t=38808
The pictures are gone, but it can still be understood, and Marco Costalba surely understood it:
Laskos wrote:
Tree shape is a completely different matter, and as you can see, it is widening going to _target_ depths of 5, 10, 15, 20, 25. I can clearly separate EBF into two components: EBF_widening, which is ~1.4, and EBF_deepening, which is ~1.6, for a total EBF of 1.4*1.6 ~ 2.2. As I already wrote, going from target N to target N+1 requires a comparable effort in both deepening and widening.
mcostalba wrote:
This is very interesting. If I have understood correctly, you say that the extra effort needed to move the search to depth N+1 is due, for a good 45% (1.4 vs 1.6), to additional search at the inner nodes, and only for 55% to additional search at the new, deeper leaf nodes.
And Zach Wegner had some plots of the tree behavior:
Zach Wegner wrote:
Laskos wrote:
Try telling that to Bob; he doesn't even know how to read a plot, where the slopes (first derivatives) give EBF(target depth, ply) on the tree, not the absolute values, and those slopes are both larger and smaller for different plies than the overall EBF(target depth), giving a total, pretty constant EBF(target depth), i.e. from target depth N to N+1 (different trees). I am really eager to see someone plot the tree shape of Stockfish for different target depths; there is no way it is not widening compared to the purely exponential one. His "widening" is not a widening at all (he seems to be confused about what widening of the tree is). Bob's theoretical tree shape is the purely exponential, non-widening W^D; how silly is that? I don't think he knows the tree shape even of his own Crafty.

Kai
I did plot some Stockfish trees and, as expected since it uses logarithmic reductions, the trees have a huge amount of widening. In fact, the widening accounts for far more of the tree growth than deepening.
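
A toy numerical illustration of that widening/deepening split (made-up node counts, not the measured trees):

Code: Select all

# Split the per-iteration effective branching factor into a widening and a
# deepening component: EBF_total = EBF_widening * EBF_deepening.
nodes_depth_n       = 100_000   # total nodes of the depth-N search
nodes_overlap       = 140_000   # nodes at plies <= N inside the depth-(N+1) tree
nodes_depth_n_plus1 = 220_000   # total nodes of the depth-(N+1) search

ebf_total     = nodes_depth_n_plus1 / nodes_depth_n   # ~2.2
ebf_widening  = nodes_overlap / nodes_depth_n         # ~1.4
ebf_deepening = ebf_total / ebf_widening              # ~1.6
print(ebf_total, ebf_widening, ebf_deepening)
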
You, Bob, missed the whole point, IIRC, and you have no idea how the tree grows, iteration after iteration, even in single-core Crafty.

The term "widening" was simply wrong. ANY tree is wider near the tips than near the root. Big surprise. But it doesn't grow EXTRA wider as it goes deeper, just a fairly constant proportion.

BTW I knew how an alpha/beta tree grows before you knew how to spell the same.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

Laskos wrote:
syzygy wrote:
Laskos wrote:Then smirked when I said it IS very high effective speed-up, with you citing some papers from 80es for DTS on completely different architecture.
Those numbers were based on 24 positions searched up to 6 ply or so, so they have to be taken cum grano salis. (Not to mention that the node counts listed are not real but reconstructed.)
Ok.
I have a feeling Bob will not accept the bet, and will divagate about superlinearity and the '80s DTS papers for the Cray.
What "completely different architecture"? Cray Blitz used a parallel machine, shared memory, 2 to 32 cores. Quite similar to today's machines in fact, particularly a 16 core chip. So what exactly are you talking about?

Those positions were NOT searched to "6 plies". It would be nice to see syzygy actually get some technical details correct. Those positions came from a real game, were searched to depths of 10-12 plies give or take, and they were repeated a bunch of times to get a good average speedup.

He's quite good at criticizing, quite poor at actually doing anything useful on the topic of parallel search.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:Explain HOW you are going to pull that off. You are already searching 2x the minimal tree for this position. Most of which is not helping you at all.
Note that if some of it is helping, we're already done. That some part of the extra nodes is helping is exactly what is referred to as "widening".

This is the thing, nothing more, nothing less.

Widening == not all "extra" nodes are worthless.
But you CAN reduce the total with effort, and you can NOT reduce just the ones that are no good.
With effort you can make Crafty beat Stockfish...
And IF those extra nodes help, search 'em in the regular search also, they will help there too.
Nope. Firstly, you cannot duplicate the 4-core tree with a single-threaded search without a time penalty. Secondly, even if you could, it would only help in the case of a super-linear speedup, which nobody is claiming.
Come back to the discussion when you have some basic idea about what is being discussed. As of right now, you are arguing without a clue about what is happening inside a tree.
Wow, you finally got the point?
I have known for a LONG time that you didn't understand parallel search, if that is the "point" you are talking about. I suspect MOST here got THAT point quite a while back...