asynchronous search

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: asynchronous search

Post by bob »

Daniel Shawul wrote:Well it is a bottleneck from the perspective of asynchronous search, which has none of that.
I am using large split depths for the cluster search (d = 8 - 12) due to huge communication
overhead. For d=12 I see only a few splits in a 5 min search, and if I don't split enough
it won't scale well. Maybe I am barking up the wrong tree here, as the cause of all this is the slow network.
Synchronization is a price paid for low search overhead. The question is whether this price is paid too early.
Here is what the APHID authors say about this. Not everyone will agree with it, but it was at least
enlightening to have another viewpoint besides the regular serial alpha-beta.
This class of algorithms cannot achieve a linear speedup, primarily due to synchronization overhead;
the search tree may have thousands of synchronization points, and there are numerous occasions where
the processes are starved for work. The algorithms have low search overhead, but this is primarily due
to the implementation of a globally shared transposition table.
Oh, I know exactly how "easy" it is :)

I have to admit that I so far have failed in making my own cluster design simple and elegant.

What's the problem here? You can protect against the spurious message and ignore it in the slave. This kind of asynchronicity (let's call it "in transit") is quite manageable and won't kill speedups.

If the master gets the result from the slave, it *will* know whether it already sent the HELP to the slave or not, and adjust accordingly.
I agree it is doable. In fact it is done, but I just can't figure out a simple way to do it.
The problem is that the owner of that node has to keep the old information while it is searching elsewhere.
Then it has to decide whether to temporarily pause what it is doing now and finish the old split point,
or continue with its current search and go back to finish the split block afterwards.
This is the kind of thing that smells like Italian spaghetti ;)
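The stale-reply bookkeeping described above can be sketched with a simple scheme: tag every split point with a monotonically increasing id, and drop any helper reply whose id no longer refers to an active split. This is an illustrative sketch only; the class and method names (`SplitPoint`, `Master`) are hypothetical, not taken from any engine discussed here.

```python
# Sketch: tag each split point with a monotonically increasing id so a
# reply for an abandoned split can be recognized and ignored on arrival.

class SplitPoint:
    def __init__(self, sp_id, moves):
        self.sp_id = sp_id
        self.moves = list(moves)      # work still to hand out
        self.results = []             # scores returned by helpers

class Master:
    def __init__(self):
        self.next_id = 0
        self.active = {}              # sp_id -> SplitPoint

    def open_split(self, moves):
        sp = SplitPoint(self.next_id, moves)
        self.active[sp.sp_id] = sp
        self.next_id += 1
        return sp.sp_id

    def close_split(self, sp_id):
        # e.g. after a beta cutoff: any reply still in transit
        # carries this id and will be dropped when it arrives
        self.active.pop(sp_id, None)

    def on_reply(self, sp_id, score):
        sp = self.active.get(sp_id)
        if sp is None:
            return False              # stale reply, ignore
        sp.results.append(score)
        return True
```

With this, the owner never has to pause its current search just to reconcile an old split: a late reply either matches an active id and is recorded, or it is silently dropped.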

But anyway, thank you all for sharing your experiences. Now I will have to go back and fix the thing
to get some scaling from it, because in its current state it is just junk code.

cheers
Most know I play the guitar (hardly professionally, but I enjoy playing). One of my guitar heroes, Chet Atkins, was once asked "how does one become a player with your skill level?" He replied, "Practice and perseverance is the way I teach others to play. Overnight stardom is another way, but I don't know how to teach that..."

Moral... this is not a quick and easy thing to do, otherwise it would already be in the "solved" column of computer chess problems. It is not going to be solved quickly. It is not going to be solved "simply" either, any more than a traditional YBW search that works reasonably well is "simple".
CRoberson
Posts: 2094
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: asynchronous search

Post by CRoberson »

My distributed implementation is as follows:

My own YBW coding. It splits in a depth band: there is a range of depths at which it can split, as long as the current node has enough draft. I use MPI, and all CPUs are nodes. The coding is a manager/worker relationship. The network is connected by a gigabit switch. I am using one i7 920 OC'd to 4.0 GHz and one Q6600 OC'd to 3 GHz.

Lack of shared transposition tables produces a 3x performance loss for plies below 13. Around 17 ply, it produces a 5x loss. So, it takes a number of processors to overcome the initial loss. With 4 processors on the same machine, it matches or slightly betters the single proc performance up to about 15 ply. At 17 plies, 4 procs is slower (4<5).

Using the i7 920 as the single proc base, I haven't gotten a speedup over the single proc i7. With the Q6600 as the base, I do get a speedup by using both machines.

The current performance is modeled by the following equation:
speedup = Num CPUs/5 assuming I am searching plies > 16. I'd like to get more machines to test the scalability of the model.
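The empirical model quoted above (speedup = Num CPUs / 5 for searches deeper than 16 ply) implies a concrete break-even point: you need more than 5 CPUs before the cluster beats a single processor at all. A small sketch of that arithmetic (the function names are mine, for illustration):

```python
def modeled_speedup(num_cpus):
    """CRoberson's empirical model for deep (>16 ply) searches:
    speedup = Num CPUs / 5."""
    return num_cpus / 5.0

def break_even_cpus():
    """Smallest CPU count whose modeled speedup exceeds 1.0,
    i.e. the point where the cluster first beats one processor."""
    n = 1
    while modeled_speedup(n) <= 1.0:
        n += 1
    return n
```

So under this model, 6 CPUs barely win (1.2x) and 16 CPUs give 3.2x, which is consistent with the 3x-to-5x loss from not sharing the transposition table that was quoted earlier.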
Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: asynchronous search

Post by Gian-Carlo Pascutto »

bob wrote: You are talking about design _decisions_, not architectural shortcomings.

When a node goes idle, you could let it do a broadcast
Honestly, I don't particularly like broadcasting in a cluster because of the potential scaling issues it brings.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: asynchronous search

Post by Daniel Shawul »

Thank you for sharing Charles. Some questions.
My own YBW coding. It splits in a depth band: there is a range of depths at which it can split, as long as the current node has enough draft. I use MPI, and all CPUs are nodes. The coding is a manager/worker relationship. The network is connected by a gigabit switch. I am using one i7 920 OC'd to 4.0 GHz and one Q6600 OC'd to 3 GHz.

Lack of shared transposition tables produces a 3x performance loss for plies below 13. Around 17 ply, it produces a 5x loss. So, it takes a number of processors to overcome the initial loss. With 4 processors on the same machine, it matches or slightly betters the single proc performance up to about 15 ply. At 17 plies, 4 procs is slower (4<5).
Any particular reason for treating each CPU as a node? I originally thought that using all processors at a node via SMP could be better, due to the possibility of using a shared TT and also simply because SMP is better. However, later on I encountered load balancing issues, since each node will have different processing power. I plan to do a pre-initialization to calculate the individual split depths for each node, because sending some powerful nodes a tiny job could be a total waste.

I haven't particularly waited for a very long search to complete (maximum 5-10 mins). What are the typical cluster split depths you used at tournament-level TC?
Using the i7 920 as the single proc base, I haven't gotten a speedup over the single proc i7. With the Q6600 as the base, I do get a speedup by using both machines.

The current performance is modeled by the following equation:
speedup = Num CPUs/5 assuming I am searching plies > 16. I'd like to get more machines to test the scalability of the model.
That is actually pretty close to what I got. For 16 processors it scaled like a 3.5-processor SMP engine. But that is only NPS-wise; the real scaling could be, say, 3.0.
I understand the i7 920 is super fast and it is difficult to get a speedup over it by distributing work over the net. Each node I have is a dual core.
If I use two nodes, I can somehow adjust some parameters to get a no-loss situation (speedup of 1). If I treat each processor as a node, though, it becomes more difficult.
Thanks again
Daniel
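The pre-initialization Daniel describes, benchmarking each heterogeneous node once and assigning it an individual minimum split depth, could be sketched as below. This is purely illustrative (the function name and the effective-branching-factor assumption of roughly 2 are mine, not from Scorpio or any other engine): a node k times faster than the slowest gets about log2(k) extra plies, so fast nodes are never handed tiny jobs.

```python
import math

def calibrate_split_depths(node_nps, base_depth=8, reference_nps=None):
    """node_nps: {node_id: nodes/sec measured on a fixed benchmark
    position}. Returns {node_id: minimum split depth for that node}.
    Assumes an effective branching factor of about 2, so doubling a
    node's speed earns it one extra ply of minimum job size."""
    slowest = min(node_nps.values()) if reference_nps is None else reference_nps
    depths = {}
    for node, nps in node_nps.items():
        extra = int(round(math.log2(nps / slowest)))
        depths[node] = base_depth + extra
    return depths
```

For example, a node benchmarking 4x faster than the slowest would only accept splits two plies deeper than the baseline, keeping per-job work roughly balanced across the cluster.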
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: asynchronous search

Post by bob »

Gian-Carlo Pascutto wrote:
bob wrote: You are talking about design _decisions_, not architectural shortcomings.

When a node goes idle, you could let it do a broadcast
Honestly, I don't particularly like broadcasting in a cluster because of the potential scaling issues it brings.
I don't like broadcasting to a large cluster. But one can play tricks with how you broadcast so that only a few see a broadcast from one particular node. There are also lots of good research papers on using random "I am idle" message targets to prevent cluster hot-spots, etc...
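The random-target idea bob mentions, advertising idleness to a few random peers rather than broadcasting to everyone, is simple to sketch. This is an illustration of the general technique from the work-stealing literature, not any particular engine's code; the function name and fanout parameter are hypothetical.

```python
import random

def pick_idle_targets(my_rank, world_size, fanout=3, rng=None):
    """Choose a few random peers to receive an 'I am idle' message,
    instead of broadcasting to the whole cluster. Randomizing the
    targets spreads the load and avoids hot-spots where many idle
    nodes all hammer the same busy node for work."""
    rng = rng or random.Random()
    peers = [r for r in range(world_size) if r != my_rank]
    return rng.sample(peers, min(fanout, len(peers)))
```

With a fanout of 3 on a 16-node cluster, each idle announcement touches under a fifth of the machines, so message traffic stays flat as the cluster grows.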
CRoberson
Posts: 2094
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: asynchronous search

Post by CRoberson »

Daniel Shawul wrote:Thank you for sharing Charles. Some questions.

Any particular reason for treating each CPU as a node? I originally thought that using all processors at a node via SMP could be better, due to the possibility of using a shared TT and also simply because SMP is better. However, later on I encountered load balancing issues, since each node will have different processing power. I plan to do a pre-initialization to calculate the individual split depths for each node, because sending some powerful nodes a tiny job could be a total waste.

I haven't particularly waited for a very long search to complete (maximum 5-10 mins). What are the typical cluster split depths you used at tournament-level TC?
On benchmark test positions, I test starting at 10 ply and go 1 ply at a time to 20 or 21 ply.

I currently use each cpu as a node, because MPI uses shared mem for comm on the same machine. That way I can test how much the network overhead is degrading performance. Also, I have several ideas to test. My current implementation is just the starting point.
That is actually pretty close to what I got. For 16 processors it scaled like a 3.5-processor SMP engine. But that is only NPS-wise; the real scaling could be, say, 3.0.
I understand the i7 920 is super fast and it is difficult to get a speedup over it by distributing work over the net. Each node I have is a dual core.
If I use two nodes, I can somehow adjust some parameters to get a no-loss situation (speedup of 1). If I treat each processor as a node, though, it becomes more difficult.
Thanks again
Daniel
All my performance quotes are time-to-ply instead of NPS. TTP is more important than NPS. Also, NPS and TTP are not perfectly correlated when comparing various distributed algorithms.

I don't see how you can get a no-loss situation with just two machines. Are you sharing transposition tables across machines? If not, I would assume you get a 3x to 5x performance penalty by not, thus requiring more than 2 machines to make up the difference. OTOH, if you only get a 2x speed up from transposition tables then that makes sense.
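The time-to-ply metric CRoberson prefers over NPS is easy to make concrete: compare how long each search takes to finish the same iteration depth. A minimal sketch (the function name is mine, for illustration):

```python
def ttp_speedup(serial_times, parallel_times):
    """Time-to-ply speedup per depth: how much faster the parallel
    search completes the same iteration. Both arguments map
    depth -> seconds to finish that iteration. Unlike NPS, this
    captures search overhead: extra nodes searched in parallel
    raise NPS but do not reach the next ply any sooner."""
    return {d: serial_times[d] / parallel_times[d]
            for d in serial_times if d in parallel_times}
```

A cluster showing a 4x NPS gain but only a 1.25x TTP gain at some depth is mostly burning its extra nodes on search overhead, which is exactly why the two numbers decorrelate across different distributed algorithms.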
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: asynchronous search

Post by Daniel Shawul »

All my performance quotes are time-to-ply instead of NPS. TTP is more important than NPS. Also, NPS and TTP are not perfectly correlated when comparing various distributed algorithms.
I agree. But my NPS scaling is so low I didn't even bother to measure the TTP scaling. I guess that for an algorithm like YBW, with low search overhead, the TTP scaling can safely be approximated as a fraction of the NPS scaling.
I don't see how you can get a no-loss situation with just two machines. Are you sharing transposition tables across machines? If not, I would assume you get a 3x to 5x performance penalty by not, thus requiring more than 2 machines to make up the difference. OTOH, if you only get a 2x speed up from transposition tables then that makes sense.
When I say a no-loss situation, I mean that with two machines connected by ethernet I can get the performance to be the same as a single machine working alone; NOT 2x more, which would be 100% scaling.
For 2 machines the scaling is 1 - 1.2. For 16 machines the scaling is around 3x.

If a node has 2 or more processors I use SMP, and the TT is shared between threads. Other than that, TTs are not shared between nodes, i.e. no distributed TT.

Daniel