Parallelization questions, ABDADA or DTS?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Parallelization questions, ABDADA or DTS?

Post by Daniel Shawul »

I think Vincent answered your question thoroughly :) I would also like to know whether Intel will push Cilk to be the language of choice for that hardware. How is the performance of the scheduler for chess? It was used before by CilkChess, so it should not be bad, but that tree may not have been very selective compared to today's heavily pruned trees.

I have implemented a cluster YBW which I think is on par with cluster-Toga performance-wise. But I didn't test it well, because the cluster I had access to used a Fast Ethernet connection (which means slow), and it really did not scale well even on 32 processors. But it works as long as all processors are kept active.

One thing I did uniquely is to use a combined SMP-cluster search, where the search takes advantage of "fat nodes" (nodes that are 8-core SMP machines) by starting an SMP search on them. It helps with the speedup, but it can introduce load-balancing inefficiencies, since some workers become extra powerful. It also complicates the implementation, since you can no longer just use MPI to start a process for every core, that is, do message passing even when you know it is an SMP machine. MPI actually optimizes that case, but it will not be as good as a true SMP algorithm.
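To illustrate the two-level idea in miniature, here is a toy Python sketch, not real MPI: an outer pool stands in for one message-passing worker per "fat node", and each worker fans its chunk out over a local thread pool, as a shared-memory SMP search would over the cores of that node. All function names and the work being done are hypothetical stand-ins.

```python
# Toy two-level work split: outer level = one worker per cluster node
# (would be an MPI rank per machine in the real thing), inner level =
# threads over the cores of that node sharing memory.
from concurrent.futures import ThreadPoolExecutor

CORES_PER_NODE = 4  # hypothetical core count of a "fat node"

def search_subtree(position):
    # Stand-in for searching one subtree; here just a toy computation.
    return position * position

def node_worker(chunk):
    # On a fat node, split the chunk over local cores with shared memory
    # (threads) instead of paying message-passing cost per core.
    with ThreadPoolExecutor(max_workers=CORES_PER_NODE) as cores:
        return sum(cores.map(search_subtree, chunk))

def cluster_search(positions, num_nodes=2):
    # One outer worker per cluster node; with real MPI this map would be
    # a scatter/gather over ranks, one rank per machine, not per core.
    chunks = [positions[i::num_nodes] for i in range(num_nodes)]
    with ThreadPoolExecutor(max_workers=num_nodes) as nodes:
        return sum(nodes.map(node_worker, chunks))

print(cluster_search(list(range(10))))  # same total as a serial loop
```

The load-balancing problem mentioned above shows up here too: if one "node" gets the expensive subtrees, the other outer workers sit idle until it finishes.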
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Parallelization questions, ABDADA or DTS?

Post by Daniel Shawul »

Trying to figure out the difference between YBW and Jamboree. It looks very similar, but there is a "wait for all children" at J12 that may be different. Does anyone know the details of Jamboree? http://supertech.csail.mit.edu/papers/t ... szmaul.pdf

Code: Select all


(J1)  Define jamboree(n, α, β) as
(J2)    If n is a leaf then return static_eval(n).
(J3)    Let ~c ← the children of n, and
(J4)        b ← -jamboree(c0, -β, -α).
(J5)    If b ≥ β then return b.
(J6)    If b > α then set α ← b.
(J7)    In Parallel: For i from 1 below |~c| do:
(J8)      Let s ← -jamboree(ci, -α - 1, -α).
(J9)      If s > b then set b ← s.
(J10)     If s ≥ β then abort-and-return s.
(J11)     If s > α then
(J12)       Wait for the completion of all previous iterations
(J13)         of the parallel loop.
(J14)       Set s ← -jamboree(ci, -β, -α).  ;; Research for value
(J15)       If s ≥ β then abort-and-return s.
(J16)       If s > α then set α ← s.
(J17)     If s > b then set b ← s.
(J18)     Note the completion of the ith iteration of the parallel loop.
(J19)  enddo
(J20)  return b.
Figure 4-5: Algorithm jamboree
User avatar
Dragulic
Posts: 53
Joined: Tue Mar 06, 2012 3:28 pm

Re: Parallelization questions, ABDADA or DTS?

Post by Dragulic »

diep wrote:Can you give a link
http://ieeexplore.ieee.org/Xplore/login ... %3D1383246
From there, many google searches can be inspired. Like, for gigascale integration.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Parallelization questions, ABDADA or DTS?

Post by diep »

Dragulic wrote:
diep wrote:Can you give a link
http://ieeexplore.ieee.org/Xplore/login ... %3D1383246
From there, many google searches can be inspired. Like, for gigascale integration.
I see some VLSI stuff from 2005.
We live in 2012 now, so hardware predictions from 2005 are a tad outdated :)
User avatar
Dragulic
Posts: 53
Joined: Tue Mar 06, 2012 3:28 pm

Re: Parallelization questions, ABDADA or DTS?

Post by Dragulic »

diep wrote:outdated :)
Maybe, maybe not. The cycle from the theorisation of a new approach to end-user delivery of a practical production item can be 5-10 years even in this fast-moving field. For example, 3D transistors were postulated there in 2004/5. Delivery has yet to occur, but the next generation (22nm) from Intel is expected to use them.

I stand by my belief that I will see 10^6-thread systems become common. But then I am young and in good health with a long life expectancy. :)
BeRo

Re: Parallelization questions, ABDADA or DTS?

Post by BeRo »

Okay, so I've implemented YBWC (including the helpful-master concept and a shared global transposition table) in my engine now.

With the parallel search I'm now getting between 3000k and 4000k nodes per second on my Intel i7 2630QM 2 GHz quad-core notebook (8 CPU threads with hyperthreading, and 2.9 GHz TurboBoost if only one core is used), versus between 1000k and 2000k nodes per second single-threaded.

I'll also test it tomorrow or so on my AMD Phenom II 1090T 3.2 GHz hexacore desktop (no hyperthreading, so 6 real cores, with 3.8 GHz TurboBoost if only one core is used).

It's a small but already nice performance increase for my engine, but I think I still need to profile my code and optimize the biggest bottlenecks.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Parallelization questions, ABDADA or DTS?

Post by diep »

BeRo wrote:Okay, so I've implemented YBWC (including the helpful-master concept and a shared global transposition table) in my engine now.

With the parallel search I'm now getting between 3000k and 4000k nodes per second on my Intel i7 2630QM 2 GHz quad-core notebook (8 CPU threads with hyperthreading, and 2.9 GHz TurboBoost if only one core is used), versus between 1000k and 2000k nodes per second single-threaded.

I'll also test it tomorrow or so on my AMD Phenom II 1090T 3.2 GHz hexacore desktop (no hyperthreading, so 6 real cores, with 3.8 GHz TurboBoost if only one core is used).

It's a small but already nice performance increase for my engine, but I think I still need to profile my code and optimize the biggest bottlenecks.
What speedup do you see of 8 over 1?
(both scaling as well as speedup)

You may want to turn off TurboBoost to measure that...
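The scaling-versus-speedup distinction matters here: the NPS ratio only tells you the cores are being fed, while time-to-depth speedup also pays for search overhead (the extra nodes a parallel search visits to reach the same depth). A quick sketch of the arithmetic, using made-up numbers in the rough range of the post above (the 40% overhead figure is purely illustrative):

```python
# "Scaling" = parallel NPS / serial NPS (raw throughput ratio).
# "Speedup" = serial time-to-depth / parallel time-to-depth, which is
# what the user actually feels; it is lower whenever the parallel search
# must visit extra nodes to reach the same depth.

def scaling(nps_parallel, nps_serial):
    return nps_parallel / nps_serial

def speedup(time_serial, time_parallel):
    return time_serial / time_parallel

nps_1, nps_8 = 1_500_000, 3_500_000   # hypothetical 1-thread vs 8-thread NPS
print(scaling(nps_8, nps_1))          # ≈ 2.33: how well the cores are fed

# Suppose the parallel search needs 40% more nodes for the same depth:
overhead = 1.40
nodes = 10**9                         # nodes a serial search needs
t1 = nodes / nps_1
t8 = nodes * overhead / nps_8
print(speedup(t1, t8))                # ≈ 1.67: the effective speedup
```

TurboBoost muddies both numbers, since the single-threaded baseline runs at a higher clock than the all-cores run, hence the suggestion to switch it off when measuring.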