Implementing parallel search

matthewlai · Post by **matthewlai** » Wed Sep 03, 2014 1:18 am

jdart wrote:Feel free to look at my implementation, which is doing what you describe:

https://github.com/jdart1/arasan-chess/tree/master/src

see Search::searchSMP and the ThreadPool class (threadp.h/.cpp).

Implementing the "helpful master" concept will increase performance but it is not a huge gain, so if you want you can defer implementing this.

--Jon

Thanks. Will definitely take a look.

matthewlai · Post by **matthewlai** » Wed Sep 03, 2014 1:20 am

syzygy wrote:
jdart wrote:Implementing the "helpful master" concept will increase performance but it is not a huge gain, so if you want you can defer implementing this.
I don't think the OP is there, yet.

The way I understand the question is: should the master be one of the search threads that search moves at the split node, or should it only somehow loop around waiting for available slaves in order to serve them with moves.

Yeah I don't even have negamax yet (though this is my second engine, and I do have a completed engine already - without parallel search).

I am just making the high level design decisions now, to not make my life harder later on.

ymatioun · Post by **ymatioun** » Wed Sep 03, 2014 2:49 am

On my 4-core computer this algorithm gets average CPU utilization of around 75%, so on average 3 out of 4 cores are working.
This is not perfect, but good enough for 4 cores. But this would not scale well beyond 4 cores; if i get a bigger computer, i'll have to do something more complicated.

Youri.

matthewlai · Post by **matthewlai** » Wed Sep 03, 2014 2:55 am

ymatioun wrote:On my 4-core computer this algorithm gets average CPU utilization of around 75%, so on average 3 out of 4 cores are working.
This is not perfect, but good enough for 4 cores. But this would not scale well beyond 4 cores; if i get a bigger computer, i'll have to do something more complicated.

Youri.

Ah i see. BTW, a cheap way to test on a 16-core computer is using Amazon EC2. You can get a 16-core Sandy Bridge Xeon Linux virtual machine for about $0.25/hr with spot instances (basically using Amazon's spare cycles, and they can shut you down any time, though they rarely do). If you don't like to be shut down it's $2/hr.

jdart · Post by **jdart** » Wed Sep 03, 2014 3:21 am

Well, the original question was:

The problem is, what should the master do, then? Should it also pick a node and start searching? If not, it looks like it would be difficult to manage number of actively-searching threads, and there will be many sleeping threads. If so, what if one of the children it spawned finishes early, and needs more work to do?

In my implementation, the master will search the first node at the root. At a point where the search can be split, one or more child threads will be started. The master just keeps going from that point, searching more moves. When it gets to the end of the search loop, it can just wait for any children that were started at that node and are not yet completed. But with the "helpful master" enabled, it will enter an idle loop from which it can be allocated to help with the searching at a lower node. The master exits that loop when it can't help any more and all child threads are done.

The decision to split is made at multiple depths so if a thread becomes idle it does not necessarily stay there but can be re-employed at another split point.

But the limitation is imposed that a thread is not allocated to an unrelated node - once it is employed searching it must complete the work initially allocated, or work at a sub-node below that point. That way, when the master is done and all allocated children from the same node are done, then the search is known to be complete.

I may well have mis-remembered some of this but I think this is correct.

The main preparation work for this is an appropriate set of data structures.

--Jon

matthewlai · Post by **matthewlai** » Wed Sep 03, 2014 4:04 am

jdart wrote:Well, the original question was:

The problem is, what should the master do, then? Should it also pick a node and start searching? If not, it looks like it would be difficult to manage number of actively-searching threads, and there will be many sleeping threads. If so, what if one of the children it spawned finishes early, and needs more work to do?
In my implementation, the master will search the first node at the root. At a point where the search can be split, one or more child threads will be started. The master just keeps going from that point, searching more moves. When it gets to the end of the search loop, it can just wait for any children that were started at that node and are not yet completed. But with the "helpful master" enabled, it will enter an idle loop from which it can be allocated to help with the searching at a lower node. The master exits that loop when it can't help any more and all child threads are done.

The decision to split is made at multiple depths so if a thread becomes idle it does not necessarily stay there but can be re-employed at another split point.

But the limitation is imposed that a thread is not allocated to an unrelated node - once it is employed searching it must complete the work initially allocated, or work at a sub-node below that point. That way, when the master is done and all allocated children from the same node are done, then the search is known to be complete.

I may well have mis-remembered some of this but I think this is correct.

The main preparation work for this is an appropriate set of data structures.

--Jon

That makes sense.

I guess if a thread gets a fail high, it will signal sibling threads to stop, too? I wonder if it will be possible/beneficial to also update siblings' alpha while they are still searching. Theoretically speaking it obviously helps, but I'm not sure if it will help enough for the massive increase in complexity.

jdart · Post by **jdart** » Wed Sep 03, 2014 4:23 am

I guess if a thread gets a fail high, it will signal sibling threads to stop, too?

Yes. I just do this by keeping a "cutoff" flag in the parent node - if that gets set then all searches exit.

I wonder if it will be possible/beneficial to also update siblings' alpha while they are still searching.

You can do this. The common case though is a non-PV node where beta = alpha+1.

--Jon

matthewlai · Post by **matthewlai** » Wed Sep 03, 2014 4:26 am

jdart wrote: You can do this. The common case though is a non-PV node where beta = alpha+1.

Of course. I totally forgot about that.

jdart · Post by **jdart** » Fri Sep 12, 2014 5:26 am

Amazon gives you instances sized in terms of virtual cores, not physical cores. I think too hyperthreading is on by default so you are not necessarily getting the core count they specify.

AWS might be somewhat useful for testing. but generally even the cluster instances are not really high-performance, or designed for near-100% utilitization.

--Jon

matthewlai · Post by **matthewlai** » Fri Sep 12, 2014 8:34 am

jdart wrote:Amazon gives you instances sized in terms of virtual cores, not physical cores. I think too hyperthreading is on by default so you are not necessarily getting the core count they specify.

AWS might be somewhat useful for testing. but generally even the cluster instances are not really high-performance, or designed for near-100% utilitization.

--Jon

For the bigger instance types, the definition of "core" is 1 hyper-thread on a Xeon E5-2670 v2.
http://aws.amazon.com/ec2/instance-types/
"Each vCPU is a hyperthread of an Intel Xeon core for M3, C3, R3, HS1, G2, and I2."

And if you look up that CPU, it's a 2.5GHz IvyBridge 10-core (20 hyper-threads), supporting a maximum 2S (2 sockets) configuration.

If you look at the biggest compute instance (c3.8xlarge), you get 32 vCPUs, which corresponds to 16 physical cores.

So if you get that, you'll only be sharing with at most 1 small neighbour with 4 cores. However, I suspect they give you the whole machine, and the reason they only give you 32 vCPUs is to maintain consistency with the previous generation instance, which is based on the first gen E5-2670, with only 8 cores and 16 threads.

With the previous generation instance (cc2.8xlarge), you get the entire machine.

The fact that they have hyper-threading enabled is inconsequential if you only spawn N/2 compute threads. The Linux scheduler is smart enough to spread them out to physical cores in that case.

I would say 2.5 GHz Ivy Bridge Xeons are fairly high-performance. Many people run them with near-100% utilization. What's wrong with that?

Implementing parallel search

Re: Implementing parallel search

Re: Implementing parallel search

Re: Implementing parallel search

Re: Implementing parallel search

Re: Implementing parallel search

Re: Implementing parallel search

Re: Implementing parallel search

Re: Implementing parallel search

Re: Implementing parallel search

Re: Implementing parallel search